Mael-zys / T2M-GPT

(CVPR 2023) Pytorch implementation of “T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations”
https://mael-zys.github.io/T2M-GPT/
Apache License 2.0

About Sampling and Calculating FID Score #69

Closed RohollahHS closed 2 months ago

RohollahHS commented 3 months ago

Hi, thanks for the great work.

I developed an autoregressive model that is somewhat similar to T2M-GPT. However, during sampling, I get better results with if_categorical=False than with if_categorical=True, on both the validation and test sets. Do I have to use if_categorical=True if I want to report the FID score in my paper?

https://github.com/Mael-zys/T2M-GPT/blob/7db71a28b2117abd9fc0dd402b91df72f1bc6ace/models/t2m_trans.py#L33
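For context, the linked line is where decoding switches between stochastic and greedy selection of the next motion token. A minimal sketch of the two modes (illustrative only, not the repository's exact code; the function name and shapes are assumptions):

```python
import torch
from torch.distributions import Categorical

def sample_next_index(logits, if_categorical=True):
    # logits: (batch, vocab) scores for the next motion token
    probs = torch.softmax(logits, dim=-1)
    if if_categorical:
        # Stochastic decoding: draw from the full categorical distribution.
        idx = Categorical(probs).sample()   # shape (batch,)
    else:
        # Deterministic (greedy) decoding: always take the most likely token.
        idx = torch.argmax(probs, dim=-1)   # shape (batch,)
    return idx
```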

Thanks

Mael-zys commented 2 months ago

Hello, we had similar results: if we set if_categorical=False, the FID score is better. However, for text-to-motion generation, the diversity of the generated motion is also important. When if_categorical=False during sampling, the "MModality" metric will be zero ("MModality" measures the diversity of human motions generated from the same text description). Therefore, I think it is better to report the results with if_categorical=True.
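Roughly speaking, MModality averages the distance between motion features generated repeatedly from the same text, so deterministic decoding forces it to zero. A minimal sketch of the idea (simplified; the standard protocol samples multiple pairs per text and averages distances over a feature extractor's embeddings):

```python
import torch

def mmodality(feats_a, feats_b):
    # feats_a, feats_b: motion features from two independent generations
    # for the SAME text prompts, shape (num_texts, feat_dim).
    # MModality is roughly the mean Euclidean distance between the pairs.
    return torch.norm(feats_a - feats_b, dim=-1).mean()

# With if_categorical=False decoding is deterministic, so both passes produce
# identical motions, feats_a == feats_b, and the metric collapses to zero.
```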

RohollahHS commented 2 months ago

> Hello, we had similar results: if we set if_categorical=False, the FID score is better. However, for text-to-motion generation, the diversity of the generated motion is also important. When if_categorical=False during sampling, the "MModality" metric will be zero ("MModality" measures the diversity of human motions generated from the same text description). Therefore, I think it is better to report the results with if_categorical=True.

Thanks for the great explanation. MMM (https://github.com/exitudio/MMM/tree/main), which is a mask-based generative model, achieves better FID scores by using random sampling. It uses Gumbel sampling with temperature=1, which I think is similar to Categorical() sampling without top_k.
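That intuition checks out: by the Gumbel-max trick, adding Gumbel(0, 1) noise to the logits and taking the argmax draws exactly from Categorical(softmax(logits)), while temperature 0 reduces to plain argmax. A small sketch (MMM's actual implementation may differ in details):

```python
import torch

def gumbel_noise(shape):
    # Standard Gumbel(0, 1) noise via inverse-CDF sampling.
    u = torch.rand(shape).clamp(1e-10, 1 - 1e-10)
    return -torch.log(-torch.log(u))

def gumbel_sample(logits, temperature=1.0):
    # temperature == 0 -> greedy argmax decoding.
    # temperature == 1 -> exact sample from Categorical(softmax(logits)).
    if temperature == 0.0:
        return logits.argmax(dim=-1)
    return (logits / temperature + gumbel_noise(logits.shape)).argmax(dim=-1)
```

So gumbel_sample(logits, temperature=1.0) is distributed the same as Categorical(probs).sample() with no top_k filtering.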

I think it might be better to report the FID score alone without random sampling, and to use random sampling when reporting diversity and the other metrics.

For example, the MMM model has an FID of about 0.12 on the test set without random sampling (Gumbel-softmax with temperature=0), but with random sampling (Gumbel-softmax with temperature=1) its FID improves to about 0.08. On the other hand, my model works extremely well compared to MMM during training, and it also reaches an FID of 0.08 on the test set without random sampling. However, when I use random sampling (Gumbel with temperature=1), its FID worsens significantly, to about 0.5. I also tried top_k and Categorical sampling, and in every case my model gets a worse FID than without random sampling.

Mael-zys commented 2 months ago

I'm not sure what causes such a huge FID difference between the two sampling modes in your case. You could try printing the final probabilities right before sampling to check whether, most of the time, the probability of the top index is much higher than the others:

  1. If not, there may be a bug in the random-sampling part (since the FID is much better without random sampling, it looks like the maximum probability is close to the others, so the sampler easily picks a different index).
  2. If the maximum probability is already much higher than the others, then the model may be sensitive to noise. I guess this comes from the discrepancy between training and inference, since you are also using an autoregressive model; a corruption training strategy might help (e.g., during training, replacing some tokens with random ones). A rough sketch of the probability check and the token-corruption idea follows this list.
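Here is a rough sketch of both suggestions, assuming next-token logits of shape (batch, vocab) and integer token sequences; the function names and the 10% corruption rate are illustrative, not taken from this repository:

```python
import torch

# 1) Diagnostic: how peaked is the next-token distribution before sampling?
def print_prob_stats(logits):
    probs = torch.softmax(logits, dim=-1)      # (batch, vocab)
    top2 = probs.topk(2, dim=-1).values        # two largest probs per row
    gap = top2[:, 0] - top2[:, 1]
    print(f"max prob {top2[:, 0].mean():.3f} | "
          f"runner-up {top2[:, 1].mean():.3f} | "
          f"gap {gap.mean():.3f}")

# 2) Training-time corruption: replace a fraction of ground-truth tokens with
#    random codebook indices so the model learns to recover from imperfect
#    contexts, reducing the train/inference gap of autoregressive decoding.
def corrupt_tokens(tokens, vocab_size, corrupt_prob=0.1):
    mask = torch.rand(tokens.shape, device=tokens.device) < corrupt_prob
    random_tokens = torch.randint_like(tokens, vocab_size)
    return torch.where(mask, random_tokens, tokens)
```
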
RohollahHS commented 2 months ago

> I'm not sure what causes such a huge FID difference between the two sampling modes in your case. You could try printing the final probabilities right before sampling to check whether, most of the time, the probability of the top index is much higher than the others:
>
> 1. If not, there may be a bug in the random-sampling part (since the FID is much better without random sampling, it looks like the maximum probability is close to the others, so the sampler easily picks a different index).
> 2. If the maximum probability is already much higher than the others, then the model may be sensitive to noise. I guess this comes from the discrepancy between training and inference, since you are also using an autoregressive model; a corruption training strategy might help (e.g., during training, replacing some tokens with random ones).

Thanks for your great suggestions, especially the second one. I will try that.