Walter0807 / MotionBERT

[ICCV 2023] PyTorch Implementation of "MotionBERT: A Unified Perspective on Learning Human Motion Representations"
Apache License 2.0

Clarification on Units of Measurement for MotionBert #121

Open DiegoHerrera1890 opened 6 months ago

DiegoHerrera1890 commented 6 months ago

Hello MotionBert Team,

I have been exploring the MotionBert model for 3D pose estimation and am impressed with its capabilities. I am particularly interested in applying it to a project that involves estimating the step length of a person walking towards the camera in a video. To accurately interpret the model's output and apply it to real-world measurements, I have a few questions regarding the training details of MotionBert:

  1. Units of Measurement: Could you clarify what units of measurement were used during the training of MotionBert? Specifically, are the 3D pose estimations provided in metric units (meters, centimeters), imperial units (inches, feet), or are they normalized in some manner? Understanding this will help me convert the pose estimations into real-world distances accurately.

  2. Depth Interpretation: In the model's output, the 3D coordinates are provided for each keypoint. I assume the third value in each coordinate set represents the depth information. Could you confirm this interpretation and provide guidance on how to accurately translate these depth values into real-world distances, considering the camera's perspective and any scaling factors used by the model?

  3. Practical Application Guidance: Lastly, any additional advice or guidelines on using MotionBert for estimating physical measurements, such as step length from video data, would be greatly appreciated. Tips on calibration or adjustments needed to account for perspective distortion would be particularly helpful.

Thank you for developing MotionBert and for your support in helping users apply it effectively. I look forward to your response and any insights you can provide on the above queries.
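
A note for question 3: one common workaround, if the model's output is only consistent up to a global scale, is to rescale the root-relative pose with a known reference length and then measure the ankle-to-ankle distance. The sketch below assumes the H36M 17-joint ordering and a hypothetical subject height; the joint indices, axis convention, and helper names are illustrative assumptions, not part of MotionBERT.

    import numpy as np

    # Assumed H36M 17-joint indices (verify against the skeleton definition
    # used by your MotionBERT checkpoint): 0 = pelvis/root, 3 = right ankle,
    # 6 = left ankle, 10 = head.
    ROOT, R_ANKLE, L_ANKLE, HEAD = 0, 3, 6, 10

    def rescale_to_metric(pose_3d, known_height_m=1.70):
        """Rescale a (17, 3) root-relative pose so the head-to-ankle span
        matches an assumed subject height in meters. This is a rough proxy:
        it assumes the pose is metrically consistent up to one global scale
        factor and that the subject is roughly upright in this frame."""
        span = np.linalg.norm(pose_3d[HEAD] - pose_3d[R_ANKLE])
        return pose_3d * (known_height_m / (span + 1e-8))

    def step_length_m(pose_3d_metric):
        """Ankle-to-ankle distance on the ground plane, assuming axis 1 (y)
        is vertical -- check the coordinate convention of your output."""
        diff = pose_3d_metric[R_ANKLE] - pose_3d_metric[L_ANKLE]
        return float(np.linalg.norm(diff[[0, 2]]))

    # Usage with a dummy frame; in practice this would be one frame of the
    # (T, 17, 3) array returned by the model.
    frame = np.random.randn(17, 3)
    print(step_length_m(rescale_to_metric(frame)))

Measuring in the lifted 3D space sidesteps perspective distortion in the image plane, but the absolute scale still depends entirely on the reference length you supply.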

Orifish commented 5 months ago

I also wonder about the units of measurement for MotionBERT. In Table 1 of the paper, all results are reported in 'mm'. However, in the evaluation stage of the training code, the predicted 3D keypoints are only denormalized by the code below:

        n_clips = test_data.shape[0]
        test_hw = self.get_hw()
        data = test_data.reshape([n_clips, -1, 17, 3])
        assert len(data) == len(test_hw)
        # denormalize (x, y, z) coordinates of the results
        for idx, item in enumerate(data):
            res_w, res_h = test_hw[idx]
            data[idx, :, :, :2] = (data[idx, :, :, :2] + np.array([1, res_h / res_w])) * res_w / 2
            data[idx, :, :, 2:] = data[idx, :, :, 2:] * res_w / 2

where test_hw can be (1000, 1000) or (1000, 1002). As far as I can tell, the code above is just post-processing that transforms the results from normalized (relative) coordinates into pixel coordinates. I wonder how the results end up in 'mm' after this calculation.
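
To make the scale question concrete: the snippet above maps the normalized (x, y) values from roughly [-1, 1] back to pixel coordinates and scales z by the same factor, so the result stays in pixel units. Reporting errors in millimeters would require an additional metric scale, for example the dataset's scaling factor or a scale fitted against metric ground truth. The following sketch only illustrates such a scale alignment; it is not the repository's evaluation code.

    import numpy as np

    def denormalize(pred_norm, res_w, res_h):
        """Mirror of the repo-style denormalization quoted above: map the
        normalized (x, y) back to pixels and scale z by res_w / 2 so the 3D
        pose stays isotropic. The output is therefore in pixels, not mm."""
        pred = pred_norm.copy()  # shape (T, 17, 3)
        pred[..., :2] = (pred[..., :2] + np.array([1, res_h / res_w])) * res_w / 2
        pred[..., 2:] = pred[..., 2:] * res_w / 2
        return pred

    def fit_scale(pred_px, gt_mm):
        """Least-squares scalar s minimizing ||s * pred - gt||^2 over all
        joints: one way to express pixel-scale predictions in mm *if* metric
        ground truth is available. Purely illustrative."""
        return float(np.sum(pred_px * gt_mm) / (np.sum(pred_px * pred_px) + 1e-8))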