Thanks for your interest in our work.
Yes, we use different training data: their models are trained on Open Images, while we train only on ImageNet (we plan to scale up the training data). However, when both are evaluated on ImageNet, we achieve much better reconstruction results.
Specifically,
When choosing a tokenizer for your own project, we recommend taking both reconstruction capability and the downsampling rate into consideration. A smaller downsampling rate generally leads to better reconstruction; however, the resulting larger number of tokens may make the Transformer harder to train.
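To make the trade-off concrete, here is a small sketch (illustrative numbers only, not tied to any specific checkpoint) of how the token count grows quadratically as the downsampling rate shrinks:

```python
def num_tokens(image_size: int, downsample: int) -> int:
    """Tokens produced by a tokenizer that downsamples a square image
    by `downsample` in each spatial dimension."""
    side = image_size // downsample
    return side * side

# A 256x256 image at common downsampling rates:
for d in (32, 16, 8):
    print(f"{d}x downsampling -> {num_tokens(256, d)} tokens")
# 32x -> 64 tokens, 16x -> 256 tokens, 8x -> 1024 tokens
```

Halving the downsampling rate quadruples the sequence length the Transformer must model, which is why better reconstruction can come at the cost of harder generation.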
We hope this provides some insights for you.
We have added their results to our comparison table (https://github.com/TencentARC/Open-MAGVIT2?tab=readme-ov-file#-quantitative-comparison). Thank you again for the valuable suggestion!
Hi @ShiFengyuan1999 @yxgeee,
Thank you for the answer; your ckpts are so powerful!
When you apply Open-MAGVIT2 in your autoregressive transformer, you don't have to generate tokens one by one: you can generate image tokens in a single forward pass while generating language tokens one by one. For example, take an approach similar to the original MAGVIT2 or ByteDance's VAR.
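The contrast between the two decoding modes can be sketched as follows. This is a toy illustration, not Open-MAGVIT2's actual API: `fake_logits` is a stand-in for a real transformer forward pass, and the function names are hypothetical.

```python
import random

VOCAB = 16  # toy codebook size

def fake_logits(context, n_positions):
    """Stand-in for a transformer forward: one logit vector per position."""
    rng = random.Random(len(context))
    return [[rng.random() for _ in range(VOCAB)] for _ in range(n_positions)]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def generate_image_tokens(prompt, n_image_tokens):
    # One forward pass predicts every image-token position at once,
    # in the spirit of MAGVIT2-style parallel decoding.
    logits = fake_logits(prompt, n_image_tokens)
    return [argmax(l) for l in logits]

def generate_text_tokens(prompt, n_text_tokens):
    # Standard autoregressive loop: one token per forward pass.
    tokens = list(prompt)
    for _ in range(n_text_tokens):
        logits = fake_logits(tokens, 1)
        tokens.append(argmax(logits[0]))
    return tokens[len(prompt):]

img = generate_image_tokens([1, 2, 3], 8)  # 8 tokens from a single pass
txt = generate_text_tokens([1, 2, 3], 4)   # 4 tokens from 4 passes
```

The point of the sketch is simply that the image branch calls the model once regardless of how many tokens it emits, while the language branch pays one forward pass per token.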
Looking forward to your future progress! It's really a great work!
Best, StarCycle
Hi @StarCycle , Thank you for the suggestion. You are welcome to contribute to our repo if you have impressive results using Open-MAGVIT2 tokenizers.
Hi,
Do you compare your ckpt with the VAE/VQ-VAE here?
If a direct comparison is not reasonable because their ckpts are trained with more data, do they have a version trained only on ImageNet data but with the same config? Or could you evaluate their ckpt on ImageNet even though it was trained on more data? I just want to know which ckpt is most suitable for my current application.
Best, StarCycle