bytedance / 1d-tokenizer

This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Apache License 2.0

Training code? #1

Open kabachuha opened 1 week ago

kabachuha commented 1 week ago

Hi!

This is extremely nice work. Do you have plans to release the training code?

cornettoyu commented 1 week ago

Hi,

Thanks for your interest. For the training code and beyond, we are still working on the internal approval process. Until then, feel free to let me know if you have any questions regarding technical details, and I am more than happy to address them :)

reedscot commented 1 week ago

Awesome work! May I ask which VQGAN implementation was used for the proxy codes?

cornettoyu commented 1 week ago

> Awesome work! May I ask which VQGAN implementation was used for the proxy codes?

Thanks for your interest! For the proxy codes in warm-up training, we used MaskGIT-VQGAN. The original implementation is in JAX and can be found at https://github.com/google-research/maskgit. We used the PyTorch version from Hugging Face's open-muse, which provides a PyTorch reimplementation with weights ported from JAX.
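
Roughly, extracting the proxy codes looks like the following sketch. Treat it as illustrative rather than the exact pipeline: the `MaskGitVQGAN` class name, the checkpoint id, and the `encode()` return signature are assumptions based on the open-muse repo and may differ slightly.

```python
import torch
from muse import MaskGitVQGAN  # from the open-muse repo; module/class names assumed

# Checkpoint id is an assumption based on open-muse's released VQGAN weights.
vqgan = MaskGitVQGAN.from_pretrained("openMUSE/maskgit-vqgan-imagenet-f16-256").eval()

@torch.no_grad()
def proxy_codes(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 256, 256) in [0, 1] -> (B, 256) discrete code indices."""
    # encode() is assumed to return (quantized_states, codebook_indices); only the
    # indices are needed, since they serve as classification targets during warm-up.
    _, codes = vqgan.encode(images)
    return codes.reshape(images.shape[0], -1)  # 16x16 latent grid -> 256 tokens
```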

vkramanuj commented 1 week ago

Thanks for the great work!

I have a detailed question about the proxy codes. The MaskGIT-VQGAN provides a fixed-length set of codes (256 or 1024). How do you distill that into 32 or 64 codes during the warm-up procedure for the smaller models? Perhaps I'm misunderstanding the paper. Thanks!

cornettoyu commented 1 week ago

> Thanks for the great work!
>
> I have a detailed question about the proxy codes. The MaskGIT-VQGAN provides a fixed-length set of codes (256 or 1024). How do you distill that into 32 or 64 codes during the warm-up procedure for the smaller models? Perhaps I'm misunderstanding the paper. Thanks!

MaskGIT-VQGAN's codes are used to supervise the output of TiTok's de-tokenizer, similar to BEiT. Since we use a mask token sequence (BERT/MAE style) to reconstruct the target sequence, it does not matter how many tokens we use or how many they use. We do not apply any distillation loss between TiTok's codebook/embeddings and MaskGIT-VQGAN's codebook/embeddings. Hope this addresses your question.
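
To make the point about token counts concrete, here is a small, self-contained sketch. The module and names are hypothetical, not this repo's API: the de-tokenizer always emits one prediction per mask token in the fixed 16x16 grid, regardless of how many latent tokens (32, 64, ...) it conditions on, and the warm-up loss is plain cross-entropy against the proxy code indices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeTokenizer(nn.Module):
    """Hypothetical illustration: a fixed grid of mask tokens attends to K latent tokens."""
    def __init__(self, grid_tokens=256, dim=256, codebook_size=1024):
        super().__init__()
        # Learned mask queries, one per position of the 16x16 proxy-code grid (BERT/MAE style).
        self.mask_tokens = nn.Parameter(torch.zeros(1, grid_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = nn.Linear(dim, codebook_size)  # logits over the proxy codebook

    def forward(self, latent_tokens):  # latent_tokens: (B, K, dim); K may be 32, 64, ...
        b = latent_tokens.shape[0]
        x = torch.cat([self.mask_tokens.expand(b, -1, -1), latent_tokens], dim=1)
        x = self.blocks(x)  # mask tokens and latent tokens attend to each other
        n = self.mask_tokens.shape[1]
        return self.to_logits(x[:, :n])  # (B, 256, codebook_size), independent of K

def warmup_loss(de_tokenizer, latent_tokens, proxy_codes):
    """Cross-entropy against 256 proxy code indices; no codebook-to-codebook distillation."""
    logits = de_tokenizer(latent_tokens)  # (B, 256, V)
    return F.cross_entropy(logits.flatten(0, 1), proxy_codes.flatten())

# Example: 32 latent tokens on the TiTok side, 256 proxy codes on the MaskGIT-VQGAN side.
de_tok = ToyDeTokenizer()
loss = warmup_loss(de_tok, torch.randn(2, 32, 256), torch.randint(0, 1024, (2, 256)))
```

Changing K from 32 to 64 only changes the shape of `latent_tokens`; the supervision target stays a 256-way sequence of code indices.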

jeasinema commented 1 week ago

Thank you for the reply. I think I get it, but please correct me if I am wrong here -- during the warm-up stage, Eq. (4) in the main paper changes in two ways: 1) the de-tokenizer produces the codes of MaskGIT-VQGAN instead of pixels directly; 2) those codes are then decoded into pixels by the MaskGIT-VQGAN decoder.
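
In symbols, my (possibly incorrect) reading of the warm-up objective would be something like the following, where $K \in \{32, 64\}$ is the number of TiTok latent tokens, $V$ is the MaskGIT-VQGAN codebook size, and $\mathrm{Dec}$ is TiTok's de-tokenizer producing per-position logits over that codebook:

$$
\mathcal{L}_{\text{warm-up}} = \sum_{i=1}^{256} \mathrm{CE}\!\left(\mathrm{Dec}(\mathbf{z})_i,\; y_i\right),
\qquad
\mathbf{z} = \mathrm{Enc}_{\text{TiTok}}(\mathbf{x}) \in \mathbb{R}^{K \times d},
\qquad
y = \mathrm{Enc}_{\text{MaskGIT-VQGAN}}(\mathbf{x}) \in \{1,\dots,V\}^{256},
$$

with the pixel image recovered as $\hat{\mathbf{x}} = \mathrm{Dec}_{\text{MaskGIT-VQGAN}}(\hat{y})$, where $\hat{y}_i = \arg\max \mathrm{Dec}(\mathbf{z})_i$.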

I would appreciate it if you could update the main paper with an equation clarifying these differences, if that is the case. That would help readers a lot!