TencentARC / Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation

Do you compare your ckpt with stability's? #3

Closed StarCycle closed 2 weeks ago

StarCycle commented 2 weeks ago

Hi,

Do you compare your ckpt with the VAE/VQVAE checkpoints here?

[screenshot: Stability AI's VAE/VQVAE checkpoints]

If a direct comparison is not reasonable because their ckpts are trained on more data, do they have a version trained only on ImageNet with the same config? Or could you evaluate their ckpt on ImageNet even though it was trained on more data? I just want to know which ckpt is most suitable for my current application.

Best, StarCycle

ShiFengyuan1999 commented 2 weeks ago

Thanks for your interest in our work.

Yes, we use different training data. Their models are trained on Open-Images, while we only use ImageNet for training (we are planning to scale up the training data). However, when both are tested on ImageNet, our model achieves much better reconstruction results.

Specifically,

  1. They provide VQVAE checkpoints at downsampling rates of 4/8/16, and we provide 8/16. Their best result is 0.58 rFID at 4x, while we obtain 0.39 rFID at 8x. Note that a smaller downsampling ratio produces more tokens and leads to better reconstruction, so we may achieve even better results if we also use 4x downsampling.
  2. They also provide non-VQ versions for diffusion model training. However, only their 4x downsampling version achieves better results than our VQ model, which means we achieve better results even with visual quantization at the same downsampling rate 😆. We do not plan to train non-VQ models (plain VAEs), since our main focus is autoregressive visual generation, which needs discrete visual tokens for training.

When choosing a tokenizer for your own project, we recommend taking both the reconstruction capability and the downsampling rate into consideration. A smaller downsampling rate generally leads to better reconstruction; however, the larger number of tokens may make the Transformer harder to train (see the rough illustration below).
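
To make the trade-off concrete, here is a rough back-of-the-envelope illustration (not code from this repo) of how the downsampling factor sets the number of tokens the Transformer has to model for a 256x256 image:

```python
# Rough illustration only: token count for a 256x256 image at different
# tokenizer downsampling factors. A smaller factor means a larger latent
# grid, i.e. more tokens per image for the autoregressive model to predict.
image_size = 256

for factor in (4, 8, 16):
    side = image_size // factor   # spatial size of the latent token grid
    num_tokens = side * side      # sequence length the Transformer sees
    print(f"{factor}x downsampling -> {side}x{side} grid = {num_tokens} tokens")

# 4x  -> 64x64 = 4096 tokens
# 8x  -> 32x32 = 1024 tokens
# 16x -> 16x16 = 256 tokens
```

So moving from 16x to 4x improves reconstruction but multiplies the sequence length by 16, which is the main cost to weigh.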

Hope this provides some insights for you.

yxgeee commented 2 weeks ago

We have added their results to our comparison table (https://github.com/TencentARC/Open-MAGVIT2?tab=readme-ov-file#-quantitative-comparison). Thank you again for the valuable suggestion!

StarCycle commented 2 weeks ago

Hi @ShiFengyuan1999 @yxgeee,

Thank you for the answer, and your ckpts are so powerful!

When you apply Open-MAGVIT2 to your autoregressive transformer, you don't have to generate tokens strictly one by one. You can generate the image tokens in parallel over a handful of decoding steps while still generating language tokens one by one, for example by taking an approach similar to the original MAGVIT-2 or ByteDance's VAR.
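
For a sense of scale, here is a purely illustrative sketch (the 32x32 grid and the 12-step schedule are assumptions, not numbers from this repo) comparing the forward passes needed by token-by-token decoding versus MaskGIT/MAGVIT-2-style parallel masked decoding:

```python
# Illustrative sketch only, not Open-MAGVIT2 code: forward passes needed to
# fill an assumed 32x32 grid of image tokens (1024 tokens).

num_image_tokens = 32 * 32

# Plain autoregressive decoding: one forward pass per token.
ar_forward_passes = num_image_tokens          # 1024 passes

# MaskGIT / MAGVIT-2-style parallel decoding: each step predicts all masked
# tokens at once, keeps the most confident ones, and re-masks the rest.
# A typical schedule uses on the order of 8-16 steps (12 assumed here).
parallel_forward_passes = 12

print(f"token-by-token AR decoding: {ar_forward_passes} forward passes")
print(f"parallel masked decoding:   {parallel_forward_passes} forward passes")
```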

Looking forward to your future progress! It's really great work!

Best, StarCycle

yxgeee commented 2 weeks ago

Hi @StarCycle, thank you for the suggestion. You are welcome to contribute to our repo if you have impressive results using the Open-MAGVIT2 tokenizers.