Closed: dbrazey closed this issue 4 years ago
Hello~
I can understand your confusion about the joint coordinates. I recommend that you first understand the relationship between world coordinates, camera coordinates, image-plane coordinates, and pixel coordinates; see this world-coordinate-to-pixel-coordinate introduction. Once you understand how they relate, it is easy to follow my responses below.
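As a rough illustration of that chain (a minimal sketch, not the code from that write-up), a world point is first expressed in the camera frame, then projected onto the image plane, then scaled into pixels. The calibration values `R`, `t`, `fx`, `fy`, `cx`, `cy` below are made up purely for the example:

```python
import numpy as np

def world_to_pixel(p_world, R, t, fx, fy, cx, cy):
    # World -> camera: rigid transform (rotation R, camera centre t), units unchanged
    p_cam = R @ (p_world - t)
    # Camera -> image plane: perspective division by depth
    x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]
    # Image plane -> pixel: scale by focal lengths, shift by principal point
    return np.array([fx * x + cx, fy * y + cy])

# Example with arbitrary values: a point 3 m in front of the camera
print(world_to_pixel(np.array([0.0, 0.0, 3.0]),
                     np.eye(3), np.zeros(3),
                     fx=1145.0, fy=1143.0, cx=512.0, cy=515.0))
```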
Hello,
I am a beginner in the field of 3D pose estimation and I have a simple, fairly general question about the output of the algorithm.
The neural net estimates 3D points from the 2D keypoints. If I understood correctly, and as I read in the VideoPose3D issues, these 3D points are given in "camera space". The function camera_to_world then applies an affine transform to get the points in "world space". The points are then translated so that the lowest height over all points of the sequence is zero.
The transform from camera space to world space is taken from the Human3.6M dataset (a quaternion), with the translation forced to zero. I understand the zero translation, since we are mainly interested in the orientation of the points and not their absolute location.
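Here is how I currently read that step, as a small numpy sketch (not the repository's exact code); the (w, x, y, z) quaternion layout and the variable names are my own assumptions:

```python
import numpy as np

def quat_rotate(q, v):
    # Rotate vectors v with shape (N, 3) by a unit quaternion q = (w, x, y, z)
    w, xyz = q[0], q[1:]
    t = 2.0 * np.cross(xyz, v)
    return v + w * t + np.cross(xyz, t)

def camera_to_world_sketch(pred_cam, R_quat):
    # Rotate camera-frame predictions into the world frame; translation forced to zero
    pred_world = quat_rotate(R_quat, pred_cam)
    # Shift so the lowest point of the whole sequence sits on the ground plane (z = 0)
    pred_world[..., 2] -= pred_world[..., 2].min()
    return pred_world
```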
1 - This means that if my camera setup (for example, its angle with respect to the ground) is different from the Human3.6M setup, the orientation of the resulting skeleton will be rotated with respect to the ground, right? So a person standing perfectly upright could appear rotated, i.e. not upright with respect to the (OXY) plane, in the visualization? I therefore suppose that I need to find my own "camera to world" transform for my setup. Am I right? In saying that, I assume the "world space" is such that (OXY) is the ground and the Z axis is the normal to the ground, but I haven't found this information anywhere yet.
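For context, this is the kind of thing I imagine I would have to do for my own setup (a rough sketch only; the pitch angle, the rotation axis, and the sign convention are all assumptions about my camera, not anything taken from the repository):

```python
import numpy as np

def tilt_to_world_rotation(pitch_deg):
    # Hypothetical example: a camera looking at the scene with a known pitch.
    # Build the rotation mapping camera coordinates into a world frame whose
    # Z axis is the ground normal (OXY = ground), the convention I assume above.
    pitch = np.radians(pitch_deg)
    # Rotation about the camera X axis; axis and sign depend on your own convention
    return np.array([
        [1.0, 0.0, 0.0],
        [0.0, np.cos(pitch), -np.sin(pitch)],
        [0.0, np.sin(pitch),  np.cos(pitch)],
    ])

# pred_world = pred_cam @ tilt_to_world_rotation(70.0).T  # then shift min z to 0
```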
2 - I don't understand the meaning of the 3D coordinates in camera space. When the dataset is created, the points are in world space and are mapped into each camera's reference frame (after a calibration, I suppose). So are the coordinates of the dataset's 3D points expressed in meters? Or are they normalized?
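To make my question concrete, here is a tiny numpy check showing why I would expect the units to carry over: a rigid world-to-camera transform (rotation plus translation) preserves distances, so whatever unit the world annotations use should also be the unit of the camera-space points. The rotation and translation below are made up:

```python
import numpy as np

R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])     # example rotation (90 degrees about Z)
t = np.array([0.5, -1.2, 3.0])       # example camera position in world coordinates

p1_world = np.array([0.0, 0.0, 1.7])  # e.g. head, in meters
p2_world = np.array([0.0, 0.0, 0.0])  # e.g. foot, in meters

p1_cam = R @ (p1_world - t)
p2_cam = R @ (p2_world - t)

print(np.linalg.norm(p1_world - p2_world))  # 1.7
print(np.linalg.norm(p1_cam - p2_cam))      # also 1.7: distances are preserved
```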
Thank you for your help.