Closed · MingCongSu closed this issue 2 months ago
Hi, thank you for the interest and sorry for the late reply.
Hi @Garfield-kh, thanks for replying to me. 😊
Can I ask how you computed the FID for text2motion? When I evaluated my own trained model, the value was always extremely large, like 9.004776248194507e+29, even though I also followed TM2T's setting.
Thanks.
Hi @MingCongSu , Thank you for the question. :)
For music2dance, we follow Bailando's 60fps setting in the evaluation metrics, as referred to in Table 1 (Music Conditioned Dance Generation) of the tm2d paper. For text2motion, we follow TM2T's 20fps setting in the evaluation metrics, as referred to in Table 3 (The evaluation of text2motion on the HumanML3D dataset) of the tm2d paper.
So the setup is: using the original TM2T code, we add dance data (20fps) when training the VQ-VAE so that it can cover the dancing patterns, then train the mixed tm2d at 20fps. Did you use the 60fps model in the evaluation?
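For reference, going from the 60fps dance data to the 20fps text2motion setting is plain frame subsampling. A minimal sketch, assuming motion is stored as a `(frames, joints, 3)` array (the layout is an assumption, not necessarily the repo's actual loader):

```python
import numpy as np

def downsample_motion(motion: np.ndarray, src_fps: int = 60, dst_fps: int = 20) -> np.ndarray:
    """Subsample a motion sequence (frames, joints, 3) to a lower frame rate.

    Assumes dst_fps divides src_fps evenly (60 -> 20 keeps every 3rd frame).
    """
    if src_fps % dst_fps != 0:
        raise ValueError("src_fps must be a multiple of dst_fps")
    step = src_fps // dst_fps
    return motion[::step]

# Example: 2 seconds of 60fps motion with 24 joints -> 40 frames at 20fps
motion_60 = np.zeros((120, 24, 3))
print(downsample_motion(motion_60).shape)  # (40, 24, 3)
```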
Hi @Garfield-kh, thanks for replying to me.
Actually, I used the 20fps model in the evaluation. I searched for the FID bug and someone suggested downgrading the scipy version to 1.11.1. I don't know how, but it seems to work for me : )
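For anyone hitting the same blow-up: FID involves a matrix square root (typically `scipy.linalg.sqrtm` in TM2T-style evaluators), and tiny negative or complex eigenvalues caused by floating-point noise can make the trace term explode into astronomical values. A numpy-only sketch of a generic FID computation that clips that noise (this is not the repo's exact code):

```python
import numpy as np

def fid(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians.

    Tr(sqrtm(cov1 @ cov2)) equals the sum of square roots of the
    eigenvalues of cov1 @ cov2; clipping tiny negative/complex
    eigenvalues (pure float noise) avoids the huge-FID blow-up.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    # Discard imaginary noise and clip small negatives before sqrt.
    eigvals = np.clip(eigvals.real, 0.0, None)
    tr_sqrt = np.sqrt(eigvals).sum()
    return diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt

# Identical distributions -> FID ~ 0
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))
mu, cov = feats.mean(0), np.cov(feats, rowvar=False)
print(fid(mu, cov, mu, cov))  # ~0 (up to float error)
```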
But now I have another problem.😂
I am trying to follow your method to convert AIST++ data into HumanML3D format. After I ran the motion_representation process, the skeletons of the data in /new_joints looked weird when I plotted the animation videos. The arms do not rotate properly with the body.
It's like this:
How can I reproduce your data? Thank you.
Hi @Garfield-kh. I noticed that there are differences between (t2m_kinematic_chain, t2m_raw_offsets) and (smpl24_kinematic_chain, smpl24_raw_offsets) in HumanML3D_24joint_60fps/paramUtil.py from your prepared dataset file.
[Side-by-side comparison of (t2m_kinematic_chain, t2m_raw_offsets) and (smpl24_kinematic_chain, smpl24_raw_offsets)]
Is this the cause of the problem?
Should I use (smpl24_kinematic_chain, smpl24_raw_offsets) instead of (t2m_kinematic_chain, t2m_raw_offsets) in the motion_representation process?
Thank you
Hi, sorry for the late reply. I was really busy with my work stuff (under 995 mode).
> Hi @Garfield-kh, thanks for replying to me.
> Actually, I used the 20fps model in the evaluation. I searched for the FID bug and someone suggested downgrading the scipy version to 1.11.1. I don't know how, but it seems to work for me : )
This one is weird.
> Should I use (smpl24_kinematic_chain, smpl24_raw_offsets) instead of (t2m_kinematic_chain, t2m_raw_offsets) in the motion_representation process?
The difference between the two is the joint count: one has 24 joints, the other 22. For the dancing task, we use 60 fps / 24 joints following Bailando. For the t2m evaluation, we use 20 fps / 22 joints following TM2T. If you are using the 60 fps code, it should be the 24-joint one.
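A quick sanity check on which (chain, offsets) pair you are holding: counting the distinct joint indices in the kinematic chain tells you whether it is the 22-joint (TM2T/HumanML3D) or 24-joint (SMPL) skeleton. The `t2m_kinematic_chain` values below are copied from the public HumanML3D paramUtil.py as I read it; verify against your own copy:

```python
def num_joints(kinematic_chain):
    """Count distinct joint indices across all chains."""
    return len({j for chain in kinematic_chain for j in chain})

# t2m (HumanML3D-style) chain -- verify against your local paramUtil.py
t2m_kinematic_chain = [
    [0, 2, 5, 8, 11], [0, 1, 4, 7, 10],
    [0, 3, 6, 9, 12, 15],
    [9, 14, 17, 19, 21], [9, 13, 16, 18, 20],
]
print(num_joints(t2m_kinematic_chain))  # 22 -> the TM2T 20fps skeleton
```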
For your arm bug: @EricGuo5513, can you help?
@Garfield-kh Thanks for your kind reply. Sorry for asking so many questions, but I really want to continue the research following your method.😂
I noticed there are also differences between t2m_raw_offsets and smpl_raw_offsets.
If I want to reproduce aistppml3d_24joijts_60fps, can I process the data in the following steps:

1. Run raw_pose_processing for both the HumanML3D and AIST++ datasets (with 24 joints, 60fps) and save them in the /joints dir.
2. Run motion_representation for all joint data in the /joints dir with smpl_kinematic_chain and smpl_raw_offsets to produce new_joint_vecs and new_joints.
3. Run Mean.py and Std.py.

@Garfield-kh @EricGuo5513, am I doing everything right?
yes yes
> Hi @MingCongSu , Thank you for the question. :)
> For music2dance, we follow Bailando's 60fps setting in the evaluation metrics, as referred to in Table 1 (Music Conditioned Dance Generation) of the tm2d paper. For text2motion, we follow TM2T's 20fps setting in the evaluation metrics, as referred to in Table 3 (The evaluation of text2motion on the HumanML3D dataset) of the tm2d paper.
> So the setup is: using the original TM2T code, we add dance data (20fps) when training the VQ-VAE so that it can cover the dancing patterns, then train the mixed tm2d at 20fps. Did you use the 60fps model in the evaluation?
Hi @Garfield-kh @EricGuo5513. Sorry, I may need to reopen this issue.
You mentioned that you used 60fps to train tm2d with only dance data and got the music2dance results in Table 1 (rows in the blue box), but used 20fps to train the mixed tm2d and got the text2motion results in Table 3.
So my question is: are the music2dance results of the mixed tm2d in Table 1 (rows in the red box) also trained at 20fps, or at 60fps?
Hi @Garfield-kh, thanks for the great work. I am curious about the metric you proposed and have 3 questions.
What do you mean by the following description in "4.2. Evaluation on Music-text Conditioned Dance Generation" in the paper?
I.e., how do you calculate the MPD from future frame (ft) = 10 to ft = 30 and present the results in Table 2? I cannot map this to the definition of MPD.
It would be a great help if you could reply. Thanks.