Garfield-kh / TM2D

[ICCV 2023] TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Questions about MPD and Freezing Score (PFF and AUC) and FID in the paper. #7

Closed MingCongSu closed 2 months ago

MingCongSu commented 7 months ago

Hi @Garfield-kh, thanks for the great work. I am curious about the metrics you proposed and have three questions.

  1. How do you evaluate the MPD and the Freezing Score (PFF and AUC) in your code?
  2. What do you mean by the following description in "4.2. Evaluation on Music-text Conditioned Dance Generation" in the paper?

    We use the past 25 motion frames to predict the future 30 frames, and calculate the MPD from future frame (ft) = 10 to ft = 30, respectively.

    i.e., how do you calculate the MPD from future frame (ft) = 10 to ft = 30 and present the result in Table 2? I cannot map this to the definition of MPD.

  3. How do you evaluate the FID for in-the-wild music in Table 1? As you mentioned in "4.3. Evaluation on Music Conditioned Dance Generation":

    This is because FACT [30] and Bailando [47] requires seed motion, however, there is no ground-truth for in-the-wild scenario.

It would be of great help if you could reply. Thanks.

Garfield-kh commented 6 months ago

Hi, thank you for your interest, and sorry for the late reply.

  1. How do you evaluate the MPD and the Freezing Score (PFF and AUC) in your code?
    • MPD is evaluated using the code of DLow, which is trained on a subset of our training data.
    • The Freezing Score is evaluated by checking the velocity of each joint; when it falls below a threshold, we count that frame as a freezing frame (see the sketch after this list).
  2. i.e., how do you calculate the MPD from future frame (ft) = 10 to ft = 30 and present the result in Table 2? I cannot map this to the definition of MPD.
    • In summary, DLow is a motion prediction model which takes 25 frames of history motion as input and outputs the future 30 frames. In the case of "future frame (ft) = 10 to ft = 30", we evaluate the prediction accuracy at different future ranges. You can refer to the DLow paper for more details.
  3. How do you evaluate the FID for in-the-wild music in Table 1?
    • The FID in Bailando's evaluation code is actually computed between the generated dance and the whole training dance data, which means it does not require ground-truth data.
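A minimal sketch of such a velocity-threshold check (my own illustration, not the actual TM2D evaluation code; the threshold value, fps, and the all-joints aggregation are assumptions):

```python
import numpy as np

def freezing_stats(joints, fps=60, freeze_thresh=0.01):
    """Count freezing frames in a motion clip.

    joints: (T, J, 3) array of 3D joint positions.
    fps: frame rate, used to turn per-frame displacement into velocity.
    freeze_thresh: velocity threshold (hypothetical value).
    """
    # Per-joint speed between consecutive frames, shape (T-1, J).
    speed = np.linalg.norm(joints[1:] - joints[:-1], axis=-1) * fps
    # A frame counts as "freezing" if every joint is slower than the threshold
    # (the exact aggregation used in the paper may differ).
    frozen = (speed < freeze_thresh).all(axis=-1)
    return int(frozen.sum()), float(frozen.mean())  # count and fraction of freezing frames
```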
MingCongSu commented 6 months ago

Hi @Garfield-kh thanks for replying to me.😊

Can I ask how you computed the FID for text2motion? When I evaluated my own trained model, the value was always extremely large, e.g., 9.004776248194507e+29, even though I followed TM2T's setting.

Thanks.

Garfield-kh commented 6 months ago

Hi @MingCongSu, thank you for the question. :)

For music2dance, we follow Bailando's 60 fps setting for the evaluation metrics; see Table 1 (Music Conditioned Dance Generation) of the TM2D paper. For text2motion, we follow TM2T's 20 fps setting for the evaluation metrics; see Table 3 (The evaluation of text2motion on the HumanML3D dataset) of the TM2D paper.

So the pipeline is: use the original TM2T code, add dance data (20 fps) when training the VQ-VAE so that it covers the dancing patterns, and train the mixed TM2D at 20 fps. Did you use the 60 fps model in the evaluation?
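As a side note, bringing the 60 fps dance data down to the 20 fps text2motion setting can be done with simple stride-3 subsampling; this is an illustration on my part, and the actual TM2D preprocessing may differ:

```python
import numpy as np

motion_60fps = np.load("example_dance_60fps.npy")  # hypothetical (T, D) feature array at 60 fps
motion_20fps = motion_60fps[::3]                    # keep every 3rd frame: 60 fps -> 20 fps
```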

MingCongSu commented 6 months ago

Hi @Garfield-kh, thanks for replying to me.

Actually, I used the 20 fps model in the evaluation. I searched for the FID bug, and someone suggested downgrading the scipy version to 1.11.1. I don't know why, but it seems to work for me :)
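For context, the huge FID values usually come from the matrix square root in the Fréchet distance being numerically unstable, which is why the scipy version can matter. A minimal generic sketch (not necessarily identical to TM2T's exact implementation), where the mu/cov pairs are the mean and covariance of the motion-feature activations for generated and ground-truth motions:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2, eps=1e-6):
    """Fréchet distance between two Gaussians fitted to feature activations."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; can be numerically unstable.
    covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)
    if not np.isfinite(covmean).all():
        # Common workaround: add a small offset to the diagonals and retry.
        offset = np.eye(cov1.shape[0]) * eps
        covmean, _ = linalg.sqrtm((cov1 + offset) @ (cov2 + offset), disp=False)
    if np.iscomplexobj(covmean):
        # Discard tiny imaginary components introduced by numerical error.
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * np.trace(covmean))
```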

MingCongSu commented 6 months ago

But now I have another problem.😂

I am trying to follow your method to convert the AIST++ data into the HumanML3D format. After I ran the motion_representation process, the skeletons of the data in /new_joints looked weird when I plotted the animation videos: the arms do not rotate properly with the body.

It looks like this: [animation of AIST++ sequence gJB_sBM_cAll_d08_mJB5_ch02]

How can I reproduce your data? Thank you.

MingCongSu commented 6 months ago

Hi @Garfield-kh. I noticed that there are differences between (t2m_kinematic_chain, t2m_raw_offsets) and (smpl24_kinematic_chain, smpl24_raw_offsets) in HumanML3D_24joint_60fps/paramUtil.py from your prepared dataset file.

[screenshots comparing (t2m_kinematic_chain, t2m_raw_offsets) with (smpl24_kinematic_chain, smpl24_raw_offsets)]

Is this the cause of the problem? Should I use (smpl24_kinematic_chain, smpl24_raw_offsets) instead of (t2m_kinematic_chain, t2m_raw_offsets) in the motion_representation process?

Thank you

Garfield-kh commented 6 months ago

Hi, sorry for the late reply. I have been really busy with work stuff (in 995 mode).

    Hi @Garfield-kh, thanks for replying to me.

    Actually, I used the 20 fps model in the evaluation. I searched for the FID bug, and someone suggested downgrading the scipy version to 1.11.1. I don't know why, but it seems to work for me :)

This one is weird.

    Should I use (smpl24_kinematic_chain, smpl24_raw_offsets) instead of (t2m_kinematic_chain, t2m_raw_offsets) in the motion_representation process?

The difference between these two is that one is 24-joint and the other is 22-joint. For the dancing task, we use 60 fps with 24 joints, following Bailando. For the t2m evaluation, we use 20 fps with 22 joints, following TM2T. If you are using the 60 fps code, it should be the 24-joint one.
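For reference, the 22-joint HumanML3D/T2M skeleton is commonly obtained from the 24-joint SMPL skeleton by dropping the two hand joints (indices 22 and 23); a tiny sketch under that assumption (the file name is hypothetical):

```python
import numpy as np

joints24 = np.load("example_motion_24joints.npy")  # (T, 24, 3) SMPL joints, 60 fps dance setting
joints22 = joints24[:, :22, :]                      # first 22 joints for the 20 fps t2m setting
```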

For the arm bug, @EricGuo5513, could you help?

MingCongSu commented 6 months ago

@Garfield-kh Thanks for your kind reply. Sorry for asking so many questions, but I really want to continue this research following your method.😂

I noticed there are also differences between t2m_raw_offsets and smpl_raw_offsets.

If I want to reproduce aistppml3d_24joints_60fps, can I process the data with the following steps:

  1. Run raw_pose_processing for both the HumanML3D and AIST++ datasets (with 24 joints, 60 fps) and save the results in the /joints dir.
  2. Run motion_representation for all joint data in the /joints dir with smpl_kinematic_chain and smpl_raw_offsets to produce new_joint_vecs and new_joints.
  3. Compute the Mean and Std statistics (Mean.py and Std.py).

@Garfield-kh @EricGuo5513, Am I doing everything right?
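For step 3 above, a minimal sketch of computing the Mean/Std statistics over the produced new_joint_vecs features, assuming per-sequence .npy arrays of shape (T, D); the paths and output names are assumptions, not the repository's actual script:

```python
import glob
import numpy as np

# Stack all motion-feature frames from the new_joint_vecs directory.
files = glob.glob("new_joint_vecs/*.npy")                         # hypothetical path
all_frames = np.concatenate([np.load(f) for f in files], axis=0)  # (sum of T, D)

np.save("Mean.npy", all_frames.mean(axis=0))                      # assumed output names
np.save("Std.npy", all_frames.std(axis=0))
```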

Garfield-kh commented 6 months ago

yes yes

MingCongSu commented 1 month ago

    Hi @MingCongSu, thank you for the question. :)

    For music2dance, we follow Bailando's 60 fps setting for the evaluation metrics; see Table 1 (Music Conditioned Dance Generation) of the TM2D paper. For text2motion, we follow TM2T's 20 fps setting for the evaluation metrics; see Table 3 (The evaluation of text2motion on the HumanML3D dataset) of the TM2D paper.

    So the pipeline is: use the original TM2T code, add dance data (20 fps) when training the VQ-VAE so that it covers the dancing patterns, and train the mixed TM2D at 20 fps. Did you use the 60 fps model in the evaluation?

Hi @Garfield-kh @EricGuo5513. Sorry, I may need to reopen this issue.

You mentioned that you used 60 fps to train tm2d with only dance data and got the music2dance results in Table 1 (rows in the blue box), but you used 20 fps to train the mixed tm2d and got the text2motion results in Table 3.

So my question is: was the music2dance result of the mixed tm2d in Table 1 (rows in the red box) also trained at 20 fps, or at 60 fps? [screenshot of Table 1 with highlighted rows]