Walter0807 / MotionBERT

[ICCV 2023] PyTorch Implementation of "MotionBERT: A Unified Perspective on Learning Human Motion Representations"
Apache License 2.0

Prediction Scale #21

Closed RHnejad closed 1 year ago

RHnejad commented 1 year ago

Hi, I have questions about the dimensions of the predicted poses, both in inference and in the evaluation code. I noticed that the predictions of the network in the evaluation function in train.py are multiplied by a factor, and I traced it back to data['test']['2.5d_factor'] in h36m_sh_conf_cam_source_final.pkl. Could you please help me understand how these factors are calculated? Does this mean that the outputs of the network are not expected to have the correct scale of a human (in meters), and only the relative pose is the goal? In inference especially, when I plot the outputs I notice a change in the dimensions of the person (which I guess comes from this), even when using the MB_ft_h36m model with rootrel set to True.
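For reference, the rescaling I mean looks roughly like this (a paraphrased sketch, not the exact train.py code; the tensor names, shapes, and values below are illustrative stand-ins):

```python
import torch

# Stand-in for the network's output on a test batch: normalized
# coordinates of shape (batch, frames, joints, 3), roughly in [-1, 1].
predicted_3d_pos = torch.randn(8, 243, 17, 3)

# Stand-in for the per-clip '2.5d_factor' loaded from the pkl file.
factor = torch.full((8,), 3000.0)

# The step I traced: broadcast the scalar factor over each clip to
# convert normalized coordinates into physical units before computing errors.
pred_scaled = predicted_3d_pos * factor[:, None, None, None]
```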

In general, it would be really appreciated if you could help me understand the scale of the output and how I can convert it to meters.

Thanks in advance for your help.

Walter0807 commented 1 year ago

The model input (2D) and output (3D) are both in normalized pixel coordinates (roughly [-1, 1]); you can check the preprocessing code for in-the-wild inference (dataset_wild.py) to see how this is done. Without camera parameters, it is generally impossible to recover physical world coordinates from in-the-wild monocular input.

To compute the errors on H36M, we follow LCN and use their precomputed scaling factors ('2.5d_factor') to convert the normalized pixel coordinates to physical world coordinates. These factors are specific to the dataset and its camera parameters. You may also refer to "Locally Connected Network for Monocular 3D Human Pose Estimation" (LCN, T-PAMI 2020), Sections 6.2.1-6.2.2, for more details.
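For concreteness, the in-the-wild normalization amounts to centering the 2D keypoints on the image and scaling by half the shorter image side. A simplified sketch of that logic (not a verbatim copy of dataset_wild.py; the function name is mine):

```python
import numpy as np

def normalize_kpts(kpts, w, h):
    """Map 2D pixel keypoints of shape (..., 2) into roughly [-1, 1].

    Centers on the image midpoint and scales by half the shorter image
    side, mirroring (in simplified form) the preprocessing for
    in-the-wild inference.
    """
    scale = min(w, h) / 2.0
    return (kpts - np.array([w / 2.0, h / 2.0])) / scale

# Example: the image center maps to the origin, the top edge maps to y = -1,
# and x spans roughly [-w/h, w/h] for a landscape frame.
print(normalize_kpts(np.array([[960.0, 540.0], [960.0, 0.0]]), 1920, 1080))
```

Inverting this mapping only recovers pixel coordinates; going from there to meters still requires an external scale, which is exactly what the per-clip 2.5d_factor supplies on H36M.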