Open kabachuha opened 1 week ago
Hi,
Thanks for your interest. For the training code and beyond, we are still working through the internal approval process. In the meantime, feel free to let me know if you have any questions about technical details and I am more than happy to address them :)
Awesome work! May I ask which VQGAN implementation was used for the proxy codes?
Thanks for your interest! For the proxy codes in warm-up training, we used MaskGIT-VQGAN. The original implementation is in JAX and can be found at https://github.com/google-research/maskgit We used the PyTorch version from Hugging Face's open-muse, which provides a PyTorch reimplementation with weights ported from JAX.
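For anyone trying to reproduce this: the role of the frozen VQGAN here is just to turn an image into a grid of discrete code indices. A minimal numpy sketch of that quantization step is below (illustrative only; it is not the open-muse API, and the shapes assume a 16x16 feature grid with a 1024-entry codebook, which are assumptions, not values confirmed in this thread):

```python
import numpy as np

def quantize_to_codes(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (num_tokens, dim) continuous encoder outputs
    codebook: (vocab_size, dim) VQGAN codebook
    returns:  (num_tokens,) integer proxy codes
    """
    # Squared Euclidean distance from every feature to every codebook entry
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 16))   # hypothetical 1024-entry codebook
features = rng.normal(size=(256, 16))    # hypothetical 16x16 grid of features
codes = quantize_to_codes(features, codebook)
print(codes.shape)  # one discrete proxy code per grid position
```

In the actual pipeline the features would come from the frozen MaskGIT-VQGAN encoder, and the resulting `codes` are what the warm-up stage trains against.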
Thanks for the great work!
I have a detailed question about the proxy codes. The Maskgit VQGAN provides a fixed length set of codes (256 or 1024). How do you distill that into 32 or 64 codes during the warm-up procedure for the smaller models? Perhaps I'm misunderstanding the paper. Thanks!
The MaskGIT-VQGAN's codes are used to supervise the output of TiTok's de-tokenizer, similar to BEiT. Since we use a masked token sequence (BERT/MAE style) to reconstruct the target sequence, it does not matter how many tokens we use or they use. We do not apply any distillation loss between TiTok's codebook/embeddings and MaskGIT-VQGAN's codebook/embeddings. Hope this addresses your question.
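The key point, that the latent token count never has to match the proxy-code count, can be seen from the loss alone. A hedged numpy sketch (the 256-position target grid and 1024-way codebook are assumptions for illustration): the de-tokenizer emits one code distribution per mask token, so the cross-entropy only ever sees a `(256, vocab)` logit grid, and the number of latent tokens (32, 64, ...) does not appear in it at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proxy codes from the frozen MaskGIT-VQGAN encoder: one discrete code per
# position of a 16x16 grid (256 targets), codebook size 1024 (assumed).
proxy_codes = rng.integers(0, 1024, size=256)

# The de-tokenizer appends 256 mask tokens to the K latent tokens and
# predicts logits for each mask token. Only this (256, 1024) grid reaches
# the loss; K never does.
logits = rng.normal(size=(256, 1024))

# Standard cross-entropy against the proxy codes (BEiT-style supervision),
# written out with a numerically stable log-softmax.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(256), proxy_codes].mean()
print(loss > 0)
```

Nothing here compares TiTok's codebook to the VQGAN's codebook; the VQGAN indices act purely as classification targets.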
Thank you for the reply. I think I get it, but please correct me if I am wrong: during the warm-up stage, Eq. (4) in the main paper changes in two ways: 1) it produces the codes of MaskGIT-VQGAN instead of pixels directly; and 2) those codes are then decoded into pixels by the MaskGIT-VQGAN decoder.
I would appreciate it if you could update the main paper with an equation clarifying these differences if this is the case. That would help readers a lot!
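For concreteness, a sketch of how that warm-up variant might be written out. The notation here is assumed, not taken from the paper (I am writing $\mathrm{Enc}$, $\mathrm{Dec}$ for TiTok's tokenizer/de-tokenizer and $q_{\text{VQ}}$, $\mathrm{Dec}_{\text{VQ}}$ for the frozen MaskGIT-VQGAN quantizer and decoder), so treat it as an illustration of the two changes described above rather than the paper's actual equation:

$$
\hat{z} = \mathrm{Dec}\big(\mathrm{Enc}(x)\big), \qquad
\mathcal{L}_{\text{warm-up}} = \mathrm{CE}\big(\hat{z},\, q_{\text{VQ}}(x)\big), \qquad
\hat{x} = \mathrm{Dec}_{\text{VQ}}(\hat{z}),
$$

i.e. the de-tokenizer predicts the frozen VQGAN's proxy codes $q_{\text{VQ}}(x)$ under a cross-entropy loss, and pixels $\hat{x}$ are only recovered by passing the predicted codes through the frozen VQGAN decoder.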
Hi!
It's extremely nice work. Do you have plans to release the training code?