bytedance / 1d-tokenizer

This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Apache License 2.0

About the proxy code during training #6

Open jeasinema opened 1 week ago

jeasinema commented 1 week ago

Hi,

Congrats on this remarkable achievement -- I am quite fascinated by the idea of query-based image compression. Now that the code has been partially released, I have a question regarding the "proxy code" you mentioned in the main paper. I quote the text below:

Specifically, in the first “warm-up” stage, instead of directly regressing the RGB values and employing
a variety of loss functions (as in existing methods), we propose to train 1D VQ models with the discrete
codes generated by an off-the-shelf MaskGIT-VQGAN model, which we refer to as proxy codes.

Based on my understanding, the number of codes produced by TiTok should be significantly smaller than that of MaskGIT-VQGAN, no? In the 256^2 setting, MaskGIT uses a fixed /16 downsampling factor, resulting in 256 tokens. However, TiTok allows K=32/64/128 at this resolution, so how did you warm up the quantized encoder with a teacher whose output has more tokens than your own?
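To make the question concrete, here is a minimal toy sketch of how such a warm-up stage *could* work without any token-count mismatch: the K-token bottleneck sits in the encoder, while the decoder expands back to the teacher's 256 positions and is supervised with cross-entropy against the frozen MaskGIT-VQGAN code indices. This is purely my own illustrative reading, not the authors' implementation; all names, dimensions, and the codebook size are assumptions.

```python
# Hypothetical sketch of the "proxy code" warm-up stage -- NOT the authors'
# actual implementation. All module names, dimensions, and the codebook size
# are illustrative assumptions.
import torch
import torch.nn as nn

K = 32            # number of 1D latent tokens produced by TiTok
N_PROXY = 256     # number of 2D tokens from the frozen MaskGIT-VQGAN teacher (16x16)
VOCAB = 1024      # teacher codebook size (assumed)
D = 64            # embedding width (kept small for the sketch)

class TinyTiTok(nn.Module):
    """Toy stand-in: an encoder that compresses an image into K latent tokens
    and a decoder that expands them back to N_PROXY code logits."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(3 * 32 * 32, K * D)       # toy "encoder" to K tokens
        self.decode = nn.Linear(K * D, N_PROXY * VOCAB)   # toy "decoder" to 256 positions

    def forward(self, img):
        b = img.shape[0]
        latent = self.encode(img.flatten(1))   # (B, K*D): the K-token bottleneck
        logits = self.decode(latent)           # expand back to the teacher's 256 positions
        return logits.view(b, N_PROXY, VOCAB)

model = TinyTiTok()
img = torch.randn(2, 3, 32, 32)
# Discrete indices that would come from the frozen MaskGIT-VQGAN teacher:
proxy_codes = torch.randint(0, VOCAB, (2, N_PROXY))
logits = model(img)
# Cross-entropy against the teacher's codes: the supervision has 256 positions
# even though the bottleneck only has K tokens, so the counts need not match.
loss = nn.functional.cross_entropy(logits.transpose(1, 2), proxy_codes)
```

Under this reading, the teacher having more output tokens than the bottleneck is not a problem, because the loss is applied at the decoder's output rather than at the latent.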

Thank you, and I think releasing the training code for this part could also help a lot! Again, congrats on this breakthrough!

Best, XM

cornettoyu commented 1 week ago

Hi, does this answer your question?

https://github.com/bytedance/1d-tokenizer/issues/1#issuecomment-2184198667