kebijuelun opened 9 months ago
I've noticed the same issue. It appears that the author has simply reused the encoder and decoder network architecture from movqgan and retrained the model as a plain VAE. I'm not sure whether my understanding is correct, so I would appreciate clarification from the author.
After training, each token of the encoder's output is very close to a particular vector in the codebook. I tried adding the VQ step back to refine the encoder's output and found that the MSE between the features before and after VQ is small, so decoding with and without VQ produces nearly the same results.
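The check I ran can be sketched roughly as follows. This is a minimal NumPy illustration of nearest-neighbor VQ and the pre-/post-VQ MSE, not the actual Kandinsky/MoVQ code; `vq_lookup` and the toy data are hypothetical:

```python
import numpy as np

def vq_lookup(z, codebook):
    """Nearest-neighbor vector quantization.

    z: (N, D) encoder features; codebook: (K, D) learned codes.
    Returns the quantized features and the chosen codebook indices.
    """
    # Squared L2 distance between every feature and every code: shape (N, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

# Toy setup mirroring the observation above: if the encoder output already
# sits very close to codebook entries, the pre-/post-VQ MSE is tiny, so
# decoding with or without VQ should give nearly identical results.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
z = codebook[rng.integers(0, 16, size=8)] + 1e-3 * rng.normal(size=(8, 4))

z_q, idx = vq_lookup(z, codebook)
mse = np.mean((z - z_q) ** 2)
print(mse)  # small, since z was constructed near codebook entries
```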
I have some questions regarding the implementation of MoVQ and would appreciate your clarification.
The original MoVQ paper mentions that a multi-channel VQ is adopted.
However, the implementation of kandinsky3 does not involve any vector quantization operation.
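For concreteness, my reading of the paper's multi-channel quantization is roughly the following: split the feature channels into groups and quantize each group independently by nearest neighbor against a shared codebook. This is only a sketch of my understanding (`multichannel_vq` is a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def multichannel_vq(z, codebook, groups=4):
    """Multi-channel VQ sketch: split the channel dim into `groups` chunks
    and quantize each chunk independently against a shared codebook.

    z: (N, D) features with D divisible by `groups`;
    codebook: (K, D // groups) shared codes for every chunk.
    """
    N, D = z.shape
    assert D % groups == 0
    d = D // groups
    chunks = z.reshape(N * groups, d)
    # Nearest codebook entry per chunk: distances have shape (N*groups, K)
    dists = ((chunks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    z_q = codebook[dists.argmin(axis=1)]
    return z_q.reshape(N, D)

# Usage with made-up shapes: 8 channels split into 4 groups of 2.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(32, 2))
z = rng.normal(size=(5, 8))
z_q = multichannel_vq(z, codebook, groups=4)
print(z_q.shape)  # (5, 8)
```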
May I ask whether this is a misunderstanding of MoVQ on my part, or whether Kandinsky has modified the MoVQ implementation?