Thanks for sharing your excellent work.
I'm trying to replicate the results but ran into something puzzling. According to your code, the MaskedTransformer predicts only the base tokens encoded by the VQVAE and is independent of the residual layers. However, my VQVAE without the residual layer only achieves an FID of 0.2, while my MaskedTransformer achieves 0.09. This is confusing, since the MaskedTransformer learns from the VQVAE's tokens yet shows better performance.
Could you explain why this happens?