Hi Hongwei @RalphHan! Thanks for your question!
I would like to point out that a fair comparison is essential in the machine learning community. Our RVQ uses 2 codebooks; in other words, the residual is quantized only once after the first VQ stage. Therefore, it should be compared with MoMask (V, 1). Admittedly, 63 is not 54.6, but I think the difference is minor, and the reason might be complex: (1) different codebases: we extend the metric calculation from MLD; (2) different VQ implementations (I am not entirely sure how our TOMATO model differs from MoMask in this respect).
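To make "2 codebooks, i.e., quantized by RVQ once" concrete, here is a minimal pure-Python sketch of residual vector quantization. It is an illustration under toy assumptions, not the TOMATO or MoMask implementation: the helpers `nearest` and `rvq_encode` and the codebook values are hypothetical. The first codebook does a plain VQ step, and the second quantizes only what the first one left over.

```python
def nearest(codebook, x):
    """Return (index, code) of the entry in `codebook` closest to `x` (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )
    return best, codebook[best]


def rvq_encode(x, codebooks):
    """Residual VQ: codebook k quantizes the residual left by codebooks 0..k-1.

    With two codebooks this is one plain VQ step plus a single residual step.
    Returns the list of code indices and the summed reconstruction.
    """
    residual = list(x)
    recon = [0.0] * len(x)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon


# Hypothetical toy codebooks for 2-D latents.
base_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
residual_codebook = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]]

indices, recon = rvq_encode([1.2, 0.9], [base_codebook, residual_codebook])
print(indices, recon)  # [1, 1] [1.25, 1.0]
```

Dropping `residual_codebook` from the list reduces this to plain VQ, which reconstructs `[1.0, 1.0]`; the single residual step moves the reconstruction closer to the input.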
Besides, the fair VQ baseline for our RVQ is VQ(1024), which has the same number of codebook parameters. BTW, as I recall, the FID of our RVQ is almost equal to that of MoMask (V, 1). However, FID is computed with a learned evaluator model, and the values here are very small (the curve jitters a lot at that scale). Therefore, we do not report FID for the VQ comparison, for fear of evaluator-model bias.
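The parameter-matching argument can be checked with a bit of arithmetic. This sketch assumes each of the two RVQ codebooks holds 512 entries (the split implied by matching VQ(1024)) and uses a hypothetical code dimension of 512, which the thread does not state. Both quantizers store the same number of scalars, but RVQ can emit far more distinct index combinations per latent vector.

```python
def codebook_params(num_codebooks, codes_per_codebook, code_dim):
    """Number of learnable scalars stored across all codebooks."""
    return num_codebooks * codes_per_codebook * code_dim


def distinct_code_tuples(num_codebooks, codes_per_codebook):
    """How many different index tuples the quantizer can emit per latent vector."""
    return codes_per_codebook ** num_codebooks


CODE_DIM = 512  # hypothetical latent width; not given in the thread

rvq = codebook_params(num_codebooks=2, codes_per_codebook=512, code_dim=CODE_DIM)
vq = codebook_params(num_codebooks=1, codes_per_codebook=1024, code_dim=CODE_DIM)
print(rvq, vq)  # 524288 524288
print(distinct_code_tuples(2, 512), distinct_code_tuples(1, 1024))  # 262144 1024
```

Matching storage rather than the number of representable codes is what makes VQ(1024) the like-for-like baseline here.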
Finally, I would like to point out that scaling up the number of quantizers is not necessary on HumanML3D. Recent models already approach the performance upper bound on this dataset, so it is sometimes hard for us to tell whether an improvement comes from the algorithm actually working. Moreover, a large number of codebooks slows the algorithm down significantly. Therefore, we use 2 codebooks.
I hope this reply resolves your question.
Hi Linghao @LinghaoChan,
Thank you for your quick reply! It does resolve my question. It also gives me an idea: Holistic Hierarchical Residual Vector Quantization (H2RVQ), i.e., replacing VQ with RVQ in Figure (a), might be a further improvement.
Yep, it really works! For more on it, please refer to the appendix. BTW, I promise you will learn something from every section of the appendix. (^-^)
The MPJPE of RVQ on HumanML3D reported by MoMask is 29.5,
but the figure in your paper is 63.1.
Maybe RVQ is not that bad? Could you provide more details about Table 2 of your paper?
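When comparing MPJPE across papers, it helps to pin down exactly what is being averaged. Below is a minimal sketch of the usual definition (mean Euclidean distance between predicted and ground-truth joints over all frames and joints). The function name and the `[frames][joints][3]` list layout are hypothetical; details such as whether a codebase converts meters to millimeters or aligns the root joint before measuring are assumptions here and can shift the reported number.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: the Euclidean distance between predicted and
    ground-truth joints, averaged over every frame and joint.

    pred, gt: nested lists shaped [frames][joints][3] in the same units.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        if len(pred_frame) != len(gt_frame):
            raise ValueError("frames must have the same number of joints")
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # per-joint Euclidean error (Python 3.8+)
            count += 1
    return total / count


# One frame, two joints: errors of 5 and 0 units, so MPJPE = 2.5.
pred = [[[0.0, 3.0, 4.0], [1.0, 1.0, 1.0]]]
gt = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
print(mpjpe(pred, gt))  # 2.5
```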