CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

About decode_first_stage in sampling. #575

Open LoveU3tHousand2 opened 1 year ago

LoveU3tHousand2 commented 1 year ago

I've noticed that the 'decode_to_img' function in taming-transformers and VQ-VAE uses 'decode_code' or 'get_codebook_entry', but in ldm, 'decode_first_stage' does quantize -> decode unless predict_cid = True is set. Why is this? What is the difference between quantize -> decode and get_codebook_entry -> decode?

huddyyeo commented 1 year ago

I'm guessing it's because in a regular VQ-VAE, the quantisation happens in the encoder, and the decoder simply takes the embedding and decodes it. In latent diffusion, the diffusion is done in z space, and the result is later passed to the decoder. That output is continuous, so it has to be quantised first before being decoded.
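To make the distinction concrete, here is a minimal sketch (hypothetical names, not the actual CompVis or taming-transformers code) of the two lookup paths: 'quantize' does a nearest-neighbour search over the codebook because a diffusion sample may lie slightly off it, while 'get_codebook_entry' is a direct index lookup because the encoder already produced discrete code indices.

```python
import numpy as np

# Toy codebook: 8 codes of dimension 4 (illustrative sizes only).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))

def quantize(z):
    """Nearest-neighbour lookup: snap each continuous vector z to the
    closest codebook row. Decoding a diffusion sample needs this step,
    since z is continuous and generally not exactly on the codebook."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def get_codebook_entry(indices):
    """Direct index lookup: no search needed, because the indices are
    already discrete (the VQ-VAE encoder produced them)."""
    return codebook[indices]

# A diffusion sample is roughly "a codebook vector plus small noise":
z = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))
zq, idx = quantize(z)

# Once quantised, both paths hand the decoder identical embeddings.
assert (idx == np.array([2, 5])).all()
assert np.allclose(zq, get_codebook_entry(idx))
```

So the two paths agree whenever z already sits on (or is snapped to) the codebook; the quantize step in decode_first_stage is just the snapping.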

LoveU3tHousand2 commented 1 year ago

> I'm guessing its because in regular vqvae, the quantisation happens in the encoder and the decoder simply takes the embedding and decodes it. In latent diffusion, the diffusion is done in z space, and the output is later used in the decoder. This output has to be first quantised before being decoded.

So would it also work if I quantise z before LDM training and then decode directly after sampling?