CompVis / taming-transformers

Taming Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2012.09841
MIT License

small question about the vq-vae paper #221

Open ghost opened 1 year ago

ghost commented 1 year ago

hello! thank you for your great work!

I have a question about the loss function in the paper: L = log p(x|z_q(x)) + ||sg[z_e(x)] − e||² + β||z_e(x) − sg[e]||²

The authors mention that the third term exists because e can grow arbitrarily if the embeddings don't train as fast as the encoder parameters, but as far as I can tell that term only helps the encoder train.

Will it help e train faster too? I assume sg[e] means that e won't be updated by that term. I hope this isn't a silly question ;) Thanks in advance.
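
For concreteness, this is how I currently read the two codebook-related terms in PyTorch, with detach() standing in for sg[·]. It's just my own sketch, not code from this repo or the paper, and the shapes, codebook size, and β value below are placeholders I picked:

```python
import torch
import torch.nn.functional as F

# detach() plays the role of sg[.]; shapes, codebook size and beta are placeholders.
z_e = torch.randn(8, 64, requires_grad=True)      # encoder output z_e(x)
codebook = torch.nn.Embedding(512, 64)            # embedding vectors e_k
indices = torch.randint(0, 512, (8,))             # nearest-code indices (assumed given)
e = codebook(indices)                             # selected embedding e

beta = 0.25
codebook_term   = F.mse_loss(z_e.detach(), e)     # ||sg[z_e(x)] - e||^2 : only e receives gradients
commitment_term = F.mse_loss(z_e, e.detach())     # ||z_e(x) - sg[e]||^2 : only the encoder receives gradients
loss = codebook_term + beta * commitment_term     # reconstruction term log p(x|z_q(x)) omitted here
```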

Parisa-Boodaghi commented 2 months ago

Hi, your explanation is correct: the last term is there to help with optimization and loss reduction. Without it, the gap between the encoder output and the embedding could grow arbitrarily, because the embeddings are optimized more slowly than the encoder, which is not what we want. Since sg[e] blocks the gradient, the third term does not update e at all; instead it penalizes the encoder output for drifting away from the chosen embedding, so the encoder "commits" to the codebook and compensates for the slow optimization of the second (embedding) term. I hope this helps!
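
To make the gradient-flow point concrete, here is a quick sanity check I wrote, with random tensors standing in for the encoder output and the selected codebook vectors (not the repository's implementation): the β term produces a gradient for the encoder output but none for e, because sg[e] (detach() here) cuts the path to the codebook.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the encoder output z_e(x) and the selected codebook vector e;
# detach() plays the role of sg[.]. Illustration only, not the repo's code.
z_e = torch.randn(4, 16, requires_grad=True)
e = torch.randn(4, 16, requires_grad=True)

# Third term: beta * ||z_e(x) - sg[e]||^2
commitment = 0.25 * F.mse_loss(z_e, e.detach())
commitment.backward()
print(z_e.grad is not None)  # True:  the encoder output is pulled toward e
print(e.grad is not None)    # False: sg[e] blocks any update to the codebook

# It is the second term, ||sg[z_e(x)] - e||^2, that moves the codebook instead.
```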