Thanks for sharing your excellent work.
I'm trying to replicate the results but ran into something puzzling. According to your code, the MaskedTransformer predicts only the base tokens encoded by the VQVAE and is independent of the residual layers. However, my VQVAE without the residual layer only achieves an FID of 0.2, while my MaskedTransformer achieves 0.09. This is confusing, since the MaskedTransformer learns from the VQVAE's tokens yet shows better performance.
Could you explain why this happens?