LTH14 / mar

PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
MIT License

The influence of VAE feature dim #53

Open Tom-zgt opened 2 weeks ago

Tom-zgt commented 2 weeks ago

I'm currently following your excellent work, MAR, and would like to understand how the VAE feature dimension affects model performance. I saw that the paper reports experiments with 16- and 8-dimensional VAE features. Have you tried 32 dimensions or more? @LTH14 [screenshot attached]

LTH14 commented 2 weeks ago

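Thanks for your interest! Note that KL-16 and KL-8 here denote the downsampling stride of the tokenizer, not the feature dimension: KL-16 downsamples a 256x256x3 image into 16x16x16 tokens (a 16x16 grid of 16-dimensional tokens), and KL-8 downsamples it into 32x32x4 tokens (a 32x32 grid of 4-dimensional tokens).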
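To make the stride-versus-dimension distinction concrete, here is a minimal sketch of the shape arithmetic. The helper `latent_shape` is a hypothetical name for illustration only and is not part of the MAR codebase; the strides and channel counts are the ones quoted in this reply.

```python
def latent_shape(image_size: int, stride: int, channels: int) -> tuple[int, int, int]:
    """Return (height, width, channels) of the latent grid.

    Hypothetical helper for illustration; assumes a square image whose
    side length is divisible by the tokenizer's downsampling stride.
    """
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return (side, side, channels)


def num_values(shape: tuple[int, int, int]) -> int:
    """Total number of scalar values in a (height, width, channels) tensor."""
    h, w, c = shape
    return h * w * c


# KL-16: stride 16, 16-dimensional tokens (numbers from the reply above)
kl16 = latent_shape(256, stride=16, channels=16)
# KL-8: stride 8, 4-dimensional tokens (numbers from the reply above)
kl8 = latent_shape(256, stride=8, channels=4)

print(kl16, num_values(kl16))  # (16, 16, 16) 4096
print(kl8, num_values(kl8))    # (32, 32, 4) 4096
print(num_values((256, 256, 3)) // num_values(kl16))  # 48
```

By this arithmetic, both tokenizers produce the same total number of latent values (4096) for a 256x256x3 image. They differ in how those values are split between the spatial grid and the per-token feature dimension.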

We don't have an ablation on this feature dimension in the paper. A higher VAE dimension typically improves reconstruction performance. However, we also found that the higher the VAE feature dimension, the harder it is for the simple DiffLoss to model it, so it is a trade-off.