fabro66 / GAST-Net-3DPoseEstimation

A Graph Attention Spatio-temporal Convolutional Networks for 3D Human Pose Estimation in Video (GAST-Net)

Understanding the output #18

Closed. dbrazey closed this issue 3 years ago.

dbrazey commented 3 years ago

Hello,

I am a beginner in the 3D pose estimation field and I have a simple question concerning the output of the algorithm. My question is quite general.

The neural net estimates 3D points from the 2D keypoints. If I understood correctly, and as I read in the VideoPose3D issues, these 3D points are given in "camera space". The function camera_to_world then applies an affine transform (a rotation plus a translation) to map the points into "world space". The points are then translated so that the lowest point of the whole sequence sits at height zero.
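For intuition, here is a minimal numpy sketch of that last step, modeled loosely on the camera_to_world utility in VideoPose3D. The function bodies and the quaternion values below are illustrative, not the repo's exact code:

```python
import numpy as np

def qrot(q, v):
    """Rotate 3D points v of shape (N, 3) by a unit quaternion q = (w, x, y, z)."""
    qvec = q[1:]
    uv = np.cross(qvec, v)
    uuv = np.cross(qvec, uv)
    return v + 2.0 * (q[0] * uv + uuv)

def camera_to_world(X, q, t=0.0):
    """Map camera-space points to world space: rotate by q, then translate by t."""
    return qrot(q, X.reshape(-1, 3)).reshape(X.shape) + t

# (frames, joints, 3) sequence of camera-space joints, e.g. from the network.
pose_cam = np.random.randn(10, 17, 3).astype(np.float32)

# Illustrative unit quaternion standing in for the Human3.6M camera rotation.
q_h36m = np.array([0.1407, -0.1501, -0.7552, 0.6223], dtype=np.float32)

pose_world = camera_to_world(pose_cam, q_h36m)   # null translation, as described
pose_world[..., 2] -= pose_world[..., 2].min()   # lowest point of the sequence -> height 0
```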

The camera-to-world transform is taken from the Human3.6M dataset (a quaternion), with the translation forced to zero. The null translation makes sense to me, since we are mainly interested in the orientation of the points rather than their absolute location.

1 - This means that if my camera setup (e.g. its angle with respect to the ground) differs from the Human3.6M setup, the resulting skeleton will be rotated with respect to the ground, right? So a person standing perfectly upright could appear tilted with respect to the (OXY) plane in the visualization? In that case, I suppose I need to determine my own camera-to-world transform for my setup. Am I right? I am assuming here that the "world space" is defined so that (OXY) is the ground plane and the Z axis is the ground normal, but I haven't found this information stated anywhere yet.
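If you know your camera's tilt, one way to build such a transform yourself is sketched below. This is purely an illustration, assuming OpenCV-style camera axes (x right, y down, z forward), a world frame with Z up, and a camera whose x axis stays parallel to the ground:

```python
import numpy as np
from scipy.spatial.transform import Rotation

pitch_deg = 30.0  # assumed downward tilt of *your* camera from horizontal

# Camera-to-world rotation for that setup: a level camera needs Rx(-90 deg)
# to send its "down" axis to world -Z; pitching down by phi adds -phi more.
R_cam_to_world = Rotation.from_euler('x', -(90.0 + pitch_deg), degrees=True)

pose_cam = np.random.randn(17, 3)            # one frame of camera-space joints
pose_world = R_cam_to_world.apply(pose_cam)  # skeleton now upright w.r.t. the ground
```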

2 - I don't understand the meaning of the 3D coordinates in camera space. When the dataset was created, the points were in world space and were mapped into each camera's reference frame (after calibration, I suppose). So are the dataset's 3D coordinates expressed in meters, or are they normalized?

Thank you for your help.

fabro66 commented 3 years ago

Hello~
I can understand your confusion about the joint coordinates. I recommend that you first understand the relationship between the world, camera, image-plane, and pixel coordinate systems: world-coordinate-to-pixel-coordinate introduction. Once you understand how they relate, my responses below should be easy to follow.
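To make that chain concrete, here is a toy example of the full mapping; all calibration values below are made up for illustration:

```python
import numpy as np

# Hypothetical calibration, purely for illustration.
R = np.eye(3)                  # world-to-camera rotation
t = np.array([0.0, 0.0, 4.0])  # world-to-camera translation (meters)
fx, fy = 1145.0, 1144.0        # focal lengths (pixels)
cx, cy = 512.0, 515.0          # principal point (pixels)

p_world = np.array([0.3, -0.1, 1.5])  # a joint in world coordinates (meters)

# World -> camera: rigid transform (still metric).
p_cam = R @ p_world + t

# Camera -> image plane: perspective division.
x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]

# Image plane -> pixels: apply the intrinsics.
u, v = fx * x + cx, fy * y + cy
print(p_cam, (u, v))
```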

  1. The skeletons reconstructed from the 2D poses are a sequence of 3D coordinates expressed relative to the pelvis (root) joint. These 3D coordinates are given in "camera space".
  2. In our work, the generated 3D joints are expressed in meters (see the sketch after this list).
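A small sketch of what those two points imply in practice; the joint indices follow the common 17-joint Human3.6M layout and are an assumption, and the data here is random so the snippet runs on its own:

```python
import numpy as np

# Stand-in for the network output: (frames, joints, 3) camera-space joints in
# meters. Random values here, purely so the sketch is self-contained.
pose_3d = np.random.randn(100, 17, 3).astype(np.float32) * 0.3
ROOT = 0  # pelvis index in the 17-joint Human3.6M layout (an assumption)

# Root-relative means the pelvis sits at the origin of every frame:
pose_rel = pose_3d - pose_3d[:, ROOT:ROOT + 1, :]

# Because the coordinates are metric, distances come out directly in meters,
# e.g. the left-hip (4) to left-knee (5) bone under the same joint layout:
thigh_len_m = np.linalg.norm(pose_rel[:, 5] - pose_rel[:, 4], axis=-1)
print(thigh_len_m.mean())  # with real output, a plausible thigh length (~0.4 m)
```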