davrempe / humor

Code for ICCV 2021 paper "HuMoR: 3D Human Motion Model for Robust Pose Estimation"
MIT License

i3DB results #23

Open g-fiche opened 2 years ago

g-fiche commented 2 years ago

Hello, thank you for this great work! I have a question about the reproducibility of the results on the i3DB dataset: when I run your code, the final global joint error is 33.5 cm for HuMoR and 34.3 cm for VPoser-t (versus 28.15 and 31.59, respectively, in the paper). Am I missing something, or were your test settings different from those in the code? Also, is a gap of only about 1 cm between VPoser-t and HuMoR expected, given that the CVAE prior seems to be crucial for good predictions? Thank you!

davrempe commented 2 years ago

The setting used for the quantitative results in the paper was slightly different from the configuration provided in the repo. In particular, we used 3-second sequences (--imapper-seq-len 90 in the config) and --batch-size 6 (though the batch size shouldn't affect accuracy). However, I have just re-run the evaluation on i3DB with this updated configuration and am still seeing worse results than expected. I will have to investigate this further (there may be some slight discrepancy between my original codebase and this cleaned-up release version for i3DB).

Regarding the gap between VPoser-t and HuMoR: the global body joint error is not a great indicator of the key differences between these two methods, since it averages over all body joints and all frames. HuMoR is most helpful when there are heavy occlusions or noise, but even in i3DB the global joint error metric is dominated by joints that are visible and not too noisy. The difference is more obvious when joint errors are measured for body parts that are often occluded, like the legs. You can also see the qualitative difference in the supplemental comparisons on the webpage.
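To illustrate why an average over all joints can hide a large error on a few occluded joints, here is a minimal sketch on synthetic data. It is not the evaluation code from the repo: the function `mean_joint_error`, the joint count, and the `leg_joints` indices are all hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_joint_error(pred, gt, joint_idx=None):
    """Mean Euclidean joint error, in the units of the inputs.

    pred, gt: arrays of shape (T, J, 3) holding T frames of J joints.
    joint_idx: optional list of joint indices to restrict the average to.
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # per-frame, per-joint error, shape (T, J)
    if joint_idx is not None:
        err = err[:, joint_idx]
    return float(err.mean())

# Synthetic example: 4 frames, 6 joints. The joint layout is hypothetical.
leg_joints = [4, 5]
gt = np.zeros((4, 6, 3))
pred = gt.copy()
pred[:, :, 0] = 0.1           # small error on every joint
pred[:, leg_joints, 0] = 0.5  # large error on the (often occluded) leg joints

global_err = mean_joint_error(pred, gt)
leg_err = mean_joint_error(pred, gt, leg_joints)
print(f"global: {global_err:.3f}, legs only: {leg_err:.3f}")
```

In this toy setup the leg error is more than twice the global average, because the four well-estimated joints pull the global number down.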

g-fiche commented 2 years ago

Hello, thanks a lot for your answer! I got much better results with the paper configuration.

By the way, I have a question about the rollout function. I noticed that the first time the rollout function is used (at the beginning of stage 3), the difference between stage2_result and stage3_init_result increases linearly frame by frame (from 0cm to 4m between frame 0 and 90). All metrics also degrade between the end of stage 2 and stage3_init. The stage 3 optimization seems to correct this "reconstruction error" quite quickly, but have you tested the optimization with longer sequences (e.g. 10 or 20 sec) to see whether it can still fix this difference?

Thank you!

davrempe commented 2 years ago

To start stage 3, we have to represent the output of stage 2 (a sequence of SMPL poses) within the VAE (i.e. as an initial pose and a sequence of latent vectors). To do this, we run the VAE encoder on every pair of consecutive frames to get a latent z for each pair. Then, when we roll out the sequence using this latent sequence (i.e. the stage3_init_result), there are naturally some errors in the reconstruction that tend to propagate as the sequence gets longer.
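The propagation described above can be seen in a toy sketch. This is not HuMoR's model: the "encoder" here simply stores the frame-to-frame change, and the "decoder" adds it back with a small constant per-step error (`step_error`, a made-up value). The names `encode_pairs` and `rollout` are hypothetical stand-ins.

```python
import numpy as np

def encode_pairs(states):
    """Toy 'encoder': one latent per pair of consecutive frames (here, their difference)."""
    return np.diff(states, axis=0)

def rollout(x0, latents, step_error):
    """Toy autoregressive 'decoder': each output is built from the previous output,
    so any per-step reconstruction error is carried into all later frames."""
    out = [x0]
    for z in latents:
        out.append(out[-1] + z + step_error)
    return np.stack(out)

# Ground-truth root trajectory over 91 frames (frames 0..90), moving along x.
T = 91
states = np.zeros((T, 3))
states[:, 0] = np.linspace(0.0, 3.0, T)

latents = encode_pairs(states)
step_error = np.array([0.01, 0.0, 0.0])  # assumed 1 cm of decoding error per step
recon = rollout(states[0], latents, step_error)

# Per-frame distance between the rolled-out sequence and the original one.
drift = np.linalg.norm(recon - states, axis=-1)
print(drift[0], drift[45], drift[90])
```

With a constant per-step error the drift grows linearly with the frame index, which matches the qualitative behaviour reported in the question; in a real model the per-step error is neither constant nor this simple.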

For long sequences like 10-20 sec the optimization will be quite difficult: the initialization will be worse as you suggest, but also it's a much larger problem that will take longer and have more local minima (since we must optimize another latent z for every added timestep). This is why we have the option to split up long videos into short sequences of 2-3 sec.
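As a rough sketch of the splitting idea, the helper below cuts a long video into consecutive short windows. It is not the repo's implementation: `split_into_windows` is a hypothetical name, and the 30 fps default is an assumption taken from the 3 seconds = 90 frames setting mentioned earlier in this thread.

```python
def split_into_windows(num_frames, fps=30, window_sec=3.0):
    """Return (start, end) frame ranges, end exclusive, covering the whole video.

    The last window may be shorter than the others.
    """
    win = int(round(fps * window_sec))
    if win <= 0:
        raise ValueError("window length must be at least one frame")
    return [(s, min(s + win, num_frames)) for s in range(0, num_frames, win)]

# A 20-second video at the assumed 30 fps has 600 frames.
windows = split_into_windows(600)
print(len(windows), windows[0], windows[-1])
```

Each window would then be optimized as its own short sequence, which keeps both the number of latent variables and the rollout length small.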

g-fiche commented 2 years ago

OK, I see! Thank you very much!