bytedance / 1d-tokenizer

This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Apache License 2.0

About the proxy code during training #6

Open jeasinema opened 1 week ago

jeasinema commented 1 week ago

Hi,

Congrats on this remarkable achievement -- I am quite fascinated by the idea of query-based image compression. Now that the code has been partially released, I have a question regarding the "proxy code" you mentioned in the main paper. I quote the text below:

Specifically, in the first “warm-up” stage, instead of directly regressing the RGB values and employing
a variety of loss functions (as in existing methods), we propose to train 1D VQ models with the discrete
codes generated by an off-the-shelf MaskGIT-VQGAN model, which we refer to as proxy codes.

Based on my understanding, the number of codes produced by TiTok should be significantly smaller than that of MaskGIT-VQGAN, no? In the 256^2 setting, MaskGIT uses a fixed /16 downsampling factor, resulting in 256 tokens. However, TiTok allows K=32/64/128 at this resolution, so how did you warm up the quantized encoder with a teacher whose output has more tokens than your own?
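To make the question concrete, here is a minimal toy sketch of how such a warm-up stage *could* work without any token-count mismatch: the K-token bottleneck sits in the encoder, while the decoder expands back to the teacher's 256 positions and is supervised with cross-entropy against the frozen MaskGIT-VQGAN code indices. This is purely my own illustrative reading, not the authors' implementation; all names, dimensions, and the codebook size are assumptions.

```python
# Hypothetical sketch of the "proxy code" warm-up stage -- NOT the authors'
# actual implementation. All module names, dimensions, and the codebook size
# are illustrative assumptions.
import torch
import torch.nn as nn

K = 32            # number of 1D latent tokens produced by TiTok
N_PROXY = 256     # number of 2D tokens from the frozen MaskGIT-VQGAN teacher (16x16)
VOCAB = 1024      # teacher codebook size (assumed)
D = 64            # embedding width (kept small for the sketch)

class TinyTiTok(nn.Module):
    """Toy stand-in: an encoder that compresses an image into K latent tokens
    and a decoder that expands them back to N_PROXY code logits."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(3 * 32 * 32, K * D)       # toy "encoder" to K tokens
        self.decode = nn.Linear(K * D, N_PROXY * VOCAB)   # toy "decoder" to 256 positions

    def forward(self, img):
        b = img.shape[0]
        latent = self.encode(img.flatten(1))   # (B, K*D): the K-token bottleneck
        logits = self.decode(latent)           # expand back to the teacher's 256 positions
        return logits.view(b, N_PROXY, VOCAB)

model = TinyTiTok()
img = torch.randn(2, 3, 32, 32)
# Discrete indices that would come from the frozen MaskGIT-VQGAN teacher:
proxy_codes = torch.randint(0, VOCAB, (2, N_PROXY))
logits = model(img)
# Cross-entropy against the teacher's codes: the supervision has 256 positions
# even though the bottleneck only has K tokens, so the counts need not match.
loss = nn.functional.cross_entropy(logits.transpose(1, 2), proxy_codes)
```

Under this reading, the teacher having more output tokens than the bottleneck is not a problem, because the loss is applied at the decoder's output rather than at the latent.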

Thank you, and I think releasing the training code for this part could also help a lot! Again, congrats on this breakthrough!

Best, XM

cornettoyu commented 1 week ago

Hi, does this answer your question?

https://github.com/bytedance/1d-tokenizer/issues/1#issuecomment-2184198667