cure-lab / MotionCraft

Official repo for paper "MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls"
Apache License 2.0

Discrepancy Between Evaluation Metrics and Paper Results After Running Provided Model #2

Open chenhaoqcdyq opened 1 month ago

chenhaoqcdyq commented 1 month ago

[Attached images: motioncraft, gt, paper]

Dear Author,

First of all, I would like to thank you for open-sourcing your code. Your work is excellent and truly worth learning from.

However, I encountered a minor issue while running your code. After downloading the model file you provided and running the evaluation code, I found that the metrics I obtained do not match those reported in your paper. I'm wondering if there might have been an issue with the model I downloaded.

In the attached images, Figure 1 shows the results of evaluating the model you provided, while Figure 2 shows the evaluation results on the ground truth. Both differ from the numbers reported in your paper, which makes me suspect a problem with the model download. Additionally, I'm puzzled as to why the trained model's results are higher than the ground truth (GT).

Thank you for your time and assistance!

yxbian23 commented 3 days ago


Because we use a pre-trained motion encoder and a pre-trained text encoder to compute the feature-extraction metrics, those metrics can only reflect the relative strength of models: the encoders come from motion-text optimization and are not fully equivalent to GT, so it is possible for the GT metrics to score worse than the model metrics. The same phenomenon has been observed on other well-known motion datasets such as HumanML3D and KIT. However, I have not encountered the discrepancy you report here, so I will double-check it~
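To illustrate the point about learned encoders (not MotionCraft's actual evaluation code): a toy sketch below, with made-up Gaussian "features" standing in for encoder outputs, shows how a retrieval-style metric such as top-1 R-precision, computed in the encoders' embedding space, can rank generated motions above GT. If a model's outputs happen to align with the text features more tightly than the (imperfectly encoded) real motions do, the model scores higher. All names and numbers here are hypothetical.

```python
import numpy as np

def r_precision_top1(motion_emb, text_emb):
    """Fraction of motions whose nearest text embedding (Euclidean) is the paired one."""
    # pairwise distance matrix, shape (N, N): row i = motion i vs. every text
    d = np.linalg.norm(motion_emb[:, None, :] - text_emb[None, :, :], axis=-1)
    return float((d.argmin(axis=1) == np.arange(len(d))).mean())

rng = np.random.default_rng(0)
text = rng.normal(size=(32, 8))                         # text-encoder features
# "GT" motions seen through a learned, lossy motion encoder (large noise)
gt = text + rng.normal(scale=1.0, size=text.shape)
# a model optimized against these same encoders can align more tightly
model = text + rng.normal(scale=0.3, size=text.shape)

print("GT R-precision:   ", r_precision_top1(gt, text))
print("Model R-precision:", r_precision_top1(model, text))
```

Since the metric measures distance in the encoders' feature space rather than to the raw ground-truth motions, "better than GT" only means "closer to the text features under these encoders", which is the relative-strength caveat above.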