Hi,
I have noticed that there may be a bug in your modified evaluation code as follows.
https://github.com/GuyTevet/motion-diffusion-model/blob/af061ca7c7077fb144c0094a5a72932b967647b6/data_loaders/humanml/motion_loaders/comp_v6_model_dataset.py#L214
Because the tokens are all padded, if you use `len(tokens[bs_i])` to obtain the `cap_len`, then all sentence lengths will be `max_text_len=20` + 2. This will influence the language feature extraction for computing metrics.

The following code is the original code in HumanML3D, which uses the right token length:
https://github.com/GuyTevet/motion-diffusion-model/blob/af061ca7c7077fb144c0094a5a72932b967647b6/data_loaders/humanml/motion_loaders/comp_v6_model_dataset.py#L100
I think this bug may lead to a performance drop in MatchingScore, R-Precision, and so on.
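To illustrate the difference, here is a minimal sketch. The `'unk/OTHER'` pad token and the exact token format are assumptions based on HumanML3D's word/POS tokenization, not code taken from the repo:

```python
MAX_TEXT_LEN = 20  # value used in the evaluation code

def cap_len_padded(tokens):
    # Buggy variant: after padding, every sample has
    # max_text_len + 2 tokens (sos/eos included), so this
    # always returns the same constant.
    return len(tokens)

def cap_len_real(tokens, pad_token='unk/OTHER'):
    # Mirrors the original HumanML3D logic: count only the
    # real tokens, excluding the padding entries.
    return sum(1 for t in tokens if t != pad_token)

# Hypothetical example sentence in word/POS format.
sent = ['sos/OTHER', 'a/DET', 'person/NOUN', 'walks/VERB', 'eos/OTHER']
padded = sent + ['unk/OTHER'] * (MAX_TEXT_LEN + 2 - len(sent))

print(cap_len_padded(padded))  # 22, regardless of the sentence
print(cap_len_real(padded))    # 5, the true length
```

With the padded length, the text encoder is fed a constant sentence length for every caption, which distorts the extracted language features.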