fabro66 / GAST-Net-3DPoseEstimation

A Graph Attention Spatio-temporal Convolutional Networks for 3D Human Pose Estimation in Video (GAST-Net)

Understanding the output #18

Closed. dbrazey closed this issue 3 years ago.

dbrazey commented 3 years ago

Hello,

I am a beginner in the 3D pose estimation field and I have a simple question concerning the output of the algorithm. My question is quite general.

The neural net estimates 3D points from the 2D keypoints. If I understood correctly, and as I read in the VideoPose3D issues, these 3D points are given in "camera space". The function camera_to_world then applies an affine transform (a rotation plus a translation) to map the points into "world space". The points are then translated so that the lowest point of the whole sequence sits at height zero.
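For intuition, here is a minimal numpy sketch of that last step, modeled loosely on the camera_to_world utility in VideoPose3D. The function bodies and the quaternion values below are illustrative, not the repo's exact code:

```python
import numpy as np

def qrot(q, v):
    """Rotate 3D points v of shape (N, 3) by a unit quaternion q = (w, x, y, z)."""
    qvec = q[1:]
    uv = np.cross(qvec, v)
    uuv = np.cross(qvec, uv)
    return v + 2.0 * (q[0] * uv + uuv)

def camera_to_world(X, q, t=0.0):
    """Map camera-space points to world space: rotate by q, then translate by t."""
    return qrot(q, X.reshape(-1, 3)).reshape(X.shape) + t

# (frames, joints, 3) sequence of camera-space joints, e.g. from the network.
pose_cam = np.random.randn(10, 17, 3).astype(np.float32)

# Illustrative unit quaternion standing in for the Human3.6M camera rotation.
q_h36m = np.array([0.1407, -0.1501, -0.7552, 0.6223], dtype=np.float32)

pose_world = camera_to_world(pose_cam, q_h36m)   # null translation, as described
pose_world[..., 2] -= pose_world[..., 2].min()   # lowest point of the sequence -> height 0
```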

The camera-to-world transform is taken from the Human3.6M dataset (a quaternion), with the translation forced to zero. The null translation makes sense to me, since we are mainly interested in the orientation of the points rather than their absolute location.

1 - This means that if my camera setup (e.g. its angle with respect to the ground) differs from the Human3.6M setup, the resulting skeleton will be rotated with respect to the ground, right? So a person standing perfectly upright could appear tilted with respect to the (OXY) plane in the visualization? In that case, I suppose I need to determine my own camera-to-world transform for my setup. Am I right? I am assuming here that the "world space" is defined so that (OXY) is the ground plane and the Z axis is the ground normal, but I haven't found this information stated anywhere yet.
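If you know your camera's tilt, one way to build such a transform yourself is sketched below. This is purely an illustration, assuming OpenCV-style camera axes (x right, y down, z forward), a world frame with Z up, and a camera whose x axis stays parallel to the ground:

```python
import numpy as np
from scipy.spatial.transform import Rotation

pitch_deg = 30.0  # assumed downward tilt of *your* camera from horizontal

# Camera-to-world rotation for that setup: a level camera needs Rx(-90 deg)
# to send its "down" axis to world -Z; pitching down by phi adds -phi more.
R_cam_to_world = Rotation.from_euler('x', -(90.0 + pitch_deg), degrees=True)

pose_cam = np.random.randn(17, 3)            # one frame of camera-space joints
pose_world = R_cam_to_world.apply(pose_cam)  # skeleton now upright w.r.t. the ground
```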

2 - I don't understand the meaning of the 3D coordinates in camera space. When the dataset was created, the points were in world space and were mapped into each camera's reference frame (after calibration, I suppose). So are the dataset's 3D coordinates expressed in meters, or are they normalized?

Thank you for your help.

fabro66 commented 3 years ago

Hello~
I can understand your confusion about the joint coordinates. I recommend that you first understand the relationship between the world, camera, image-plane, and pixel coordinate systems: world-coordinate-to-pixel-coordinate introduction. Once you understand how they relate, my responses below should be easy to follow.
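To make that chain concrete, here is a toy example of the full mapping; all calibration values below are made up for illustration:

```python
import numpy as np

# Hypothetical calibration, purely for illustration.
R = np.eye(3)                  # world-to-camera rotation
t = np.array([0.0, 0.0, 4.0])  # world-to-camera translation (meters)
fx, fy = 1145.0, 1144.0        # focal lengths (pixels)
cx, cy = 512.0, 515.0          # principal point (pixels)

p_world = np.array([0.3, -0.1, 1.5])  # a joint in world coordinates (meters)

# World -> camera: rigid transform (still metric).
p_cam = R @ p_world + t

# Camera -> image plane: perspective division.
x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]

# Image plane -> pixels: apply the intrinsics.
u, v = fx * x + cx, fy * y + cy
print(p_cam, (u, v))
```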

  1. The skeletons reconstructed from the 2D poses are a sequence of 3D coordinates expressed relative to the pelvis (root) joint. These 3D coordinates are given in "camera space".
  2. In our work, the generated 3D joints are expressed in meters (see the sketch after this list).
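A small sketch of what those two points imply in practice; the joint indices follow the common 17-joint Human3.6M layout and are an assumption, and the data here is random so the snippet runs on its own:

```python
import numpy as np

# Stand-in for the network output: (frames, joints, 3) camera-space joints in
# meters. Random values here, purely so the sketch is self-contained.
pose_3d = np.random.randn(100, 17, 3).astype(np.float32) * 0.3
ROOT = 0  # pelvis index in the 17-joint Human3.6M layout (an assumption)

# Root-relative means the pelvis sits at the origin of every frame:
pose_rel = pose_3d - pose_3d[:, ROOT:ROOT + 1, :]

# Because the coordinates are metric, distances come out directly in meters,
# e.g. the left-hip (4) to left-knee (5) bone under the same joint layout:
thigh_len_m = np.linalg.norm(pose_rel[:, 5] - pose_rel[:, 4], axis=-1)
print(thigh_len_m.mean())  # with real output, a plausible thigh length (~0.4 m)
```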