Open Tom-zgt opened 2 weeks ago
Thanks for your interest! Note that KL-16 and KL-8 here denote the downsampling stride of the tokenizer, not the feature dimension (KL-16 downsamples a 256x256x3 image into 16x16x16 tokens, and KL-8 downsamples it into 32x32x4 tokens).
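As a rough illustration of the shapes mentioned above, here is a minimal sketch that derives the latent grid from the image size, stride, and channel count. The helper `latent_shape` is hypothetical and not part of the MAR codebase; the stride and channel values are the ones quoted in this reply.

```python
def latent_shape(image_size: int, stride: int, channels: int) -> tuple[int, int, int]:
    """Return (height, width, channels) of the latent grid for a square image.

    Hypothetical helper for illustration only; not taken from the MAR repository.
    """
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return (side, side, channels)


# Stride and channel count for each tokenizer, as stated in the reply above.
for name, stride, channels in [("KL-16", 16, 16), ("KL-8", 8, 4)]:
    h, w, c = latent_shape(256, stride, channels)
    print(f"{name}: {h}x{w}x{c} latent, {h * w} spatial positions, {h * w * c} values")
```

Under these numbers, both tokenizers end up with the same total number of latent values (4096) for a 256x256 image; they differ in how those values are split between spatial positions and channels.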
We don't have an ablation on this feature dimension in the paper. A higher VAE dimension typically improves reconstruction performance. However, we also found that the higher the VAE feature dimension, the harder it is for the simple DiffLoss to model it, so it is a trade-off.
I'm currently following your excellent work, MAR. I would like to know how the VAE feature dimension affects model performance. I saw that you experimented with 16- and 8-dimensional VAE features in the paper. Have you tried 32 or more dimensions? @LTH14