I encountered an issue while evaluating with the script located at `/home/mluo/aigeeks/mmv2/tm2d_60fps/eval4bailando/metrics_new_axm.py`. Specifically, there is a size mismatch between `pred_features_k` and `gt_features_k` at lines 58 and 59: `pred_features_k` has shape (40, 72), whereas `gt_features_k` has shape (1363, 72). The size of `gt_features_k` is close to the total number of entries in the AIST++ dataset, which suggests a discrepancy, since the ground truth features should ideally match the predictions in size.
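For reference, a minimal sketch of the mismatch (the array names come from the script; the zero-filled arrays below are hypothetical stand-ins with the observed shapes):

```python
import numpy as np

# Hypothetical stand-ins for the arrays loaded at lines 58-59 of
# metrics_new_axm.py, using the shapes observed during evaluation.
pred_features_k = np.zeros((40, 72))    # features of the predicted motions
gt_features_k = np.zeros((1363, 72))    # features of the ground truth

# The feature dimension (72) matches; only the sample axis differs.
assert pred_features_k.shape[1] == gt_features_k.shape[1]
print(pred_features_k.shape, gt_features_k.shape)  # (40, 72) (1363, 72)
```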
Additionally, the prediction features are extracted from motions whose length corresponds to the raw audio, typically around 2000 fps. Could you clarify whether the ground truth features provided in the Google Drive are extracted from motions of similar length, or whether they derive from the entries under `new_joint_vecs`? If it is the latter, this could explain the temporal dimension mismatch between prediction and ground truth.
Upon reviewing the raw JSON file, I noticed that the `dance_array` length aligns with `new_joint_vecs`. Given this, could you provide access to the raw motion dataset whose length matches the raw audio from `music_array`? This would help ensure that the evaluations are performed under consistent conditions.
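To illustrate the length comparison I performed (the field names `dance_array` and `music_array` are taken from the JSON described above; the toy entry below is an assumption mimicking that structure, not the actual data):

```python
# Hedged sketch: a toy entry mimicking the raw JSON structure, with
# dance_array at the new_joint_vecs length (40 frames of 72-dim vectors)
# and music_array at raw-audio length (~2000 samples).
entry = {
    "dance_array": [[0.0] * 72] * 40,
    "music_array": [0.0] * 2000,
}

# If the two lengths differ this much, the motion was stored at a
# different (downsampled) temporal resolution than the audio.
print(len(entry["dance_array"]), len(entry["music_array"]))  # 40 2000
```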
Thank you for looking into this matter.