Xinyu-Yi / TransPose

A real-time motion capture system that estimates poses and global translations using only 6 inertial measurement units
https://xinyu-yi.github.io/TransPose/
GNU General Public License v3.0

Pose-S3 output dimension #25

Closed. TowoC closed this issue 2 years ago.

TowoC commented 2 years ago

Hello Xinyu, I have a question about the dimension of the Pose-S3 output. As mentioned in the paper, the output of Pose-S3 lies in R^{6(j-1)} with j = 24, so the output dimension should be 138. But in the code (net.py), the pose_s3 output dimension is joint_set.n_reduced * 6, which is 90. I can't figure out why these differ. Can you help me understand?

So, when I want to calculate the loss of Pose-S3, I need to use the function `_reduced_glb_6d_to_full_local_mat` to get the full joint rotation matrices, and then `rotation_matrix_to_r6d` to get the full joint rotations in 6D, right?

Xinyu-Yi commented 2 years ago

We actually use the 90d output (for 15 out of the 23 non-root joints, i.e., without the wrists/hands/ankles/feet, as they are not measured by the IMUs). During evaluation, the ignored 8 joints' rotations are all set to identity, so we do not predict their rotations. In the paper, for conciseness, we said that we estimate all non-root joint rotations. In practice, whether or not we estimate these 8 joints' rotations does not affect the results much. (Also, the DIP-IMU training data does not contain these 8 joints' rotations.)
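
For anyone reconciling the 90d output with the full 24-joint SMPL pose, here is a minimal sketch (not the released code) of how such a reduced prediction can be expanded, with identity rotations for the 8 unmeasured joints. The joint index lists and the 6D-to-matrix helper are illustrative assumptions; the repo keeps its own versions in `joint_set` and its conversion utilities.

```python
import torch

def r6d_to_rotation_matrix(r6d: torch.Tensor) -> torch.Tensor:
    """Convert [N, 6] 6D rotations to [N, 3, 3] matrices (Gram-Schmidt on two 3D vectors)."""
    a1, a2 = r6d[:, :3], r6d[:, 3:]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-1)

# Assumed SMPL joint indices for illustration: 15 estimated joints and the
# 8 ignored ones (ankles, feet, wrists, hands). The real lists live in the repo's config.
REDUCED = [1, 2, 3, 4, 5, 6, 9, 12, 13, 14, 15, 16, 17, 18, 19]
IGNORED = [7, 8, 10, 11, 20, 21, 22, 23]

def reduced_6d_to_full_mat(pose_s3_out: torch.Tensor) -> torch.Tensor:
    """pose_s3_out: [T, 90] reduced 6D -> [T, 24, 3, 3] root-relative rotations."""
    T = pose_s3_out.shape[0]
    reduced = r6d_to_rotation_matrix(pose_s3_out.reshape(-1, 6)).reshape(T, -1, 3, 3)
    full = torch.eye(3).repeat(T, 24, 1, 1)  # ignored joints stay identity
    full[:, REDUCED] = reduced               # assumes network orders joints as in REDUCED
    return full
```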

Xinyu-Yi commented 2 years ago

For the loss, we directly use L2 on the root-relative 6D representation, i.e., just the network output, without converting to rotation matrices or global poses. It's a very simple implementation. You can try different losses.
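
Concretely, a minimal sketch of this loss, assuming the ground truth has already been prepared as root-relative 6D rotations of the same 15 reduced joints flattened to [T, 90], would just be MSE on the raw network output:

```python
import torch

mse = torch.nn.MSELoss()

def pose_s3_loss(pred_r6d: torch.Tensor, gt_r6d: torch.Tensor) -> torch.Tensor:
    """pred_r6d, gt_r6d: [T, 90] root-relative 6D rotations (15 joints x 6)."""
    return mse(pred_r6d, gt_r6d)
```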

TowoC commented 2 years ago

Thanks for the reply, Xinyu.

I have two questions here.

1) As you said, the output of Pose-S3 actually covers only 15 joints, excluding the left/right ankle, foot, wrist, and hand. So when we calculate the loss for this layer, we only use those 15 joints against the ground truth, since the DIP-IMU dataset does not have this information for training and evaluation.

But I have a concern about this: the training dataset contains AMASS and part of DIP-IMU, and AMASS does have all joint information. So do you mean Pose-S3 does not calculate the loss on those joints because DIP-IMU has no such information?

However, AMASS has the information, and the network output is still set to 90d. Does that mean the network will not regress the other 8 joints' rotations during training? And won't that lead to relatively poor performance?

Also, I want to know how the ground truth for Pose-S3 is obtained. Is it generated by applying axis_angle_to_rotation_matrix to data['pose'] first and then rotation_matrix_to_r6d to get the 6D representation? (A rough sketch of this pipeline appears at the end of this comment.) Thanks a lot.

2) Another small question: you said, "For the loss, we directly use L2 on the root-relative 6D representation, i.e., just the network output, not converting to rotation matrices or global poses."

Looking at the released code, net._reduced_glb_6d_to_full_local_mat seems to transform the 15 joint rotations into 24 joint rotations by padding the rest with fixed values, and there is also a function global_to_local_pose in there. Given the names "_reduced_glb_6d_to_full_local_mat" and "global_to_local_pose", doesn't that mean the output of Pose-S3 is a global pose?

Thanks for answering so many of my questions.
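
For concreteness, the ground-truth pipeline proposed in question 1 would look roughly like the sketch below. The helpers are re-implemented here as placeholders for whatever the repo actually provides, and the reduced joint list is an assumption.

```python
import torch

def axis_angle_to_rotation_matrix(aa: torch.Tensor) -> torch.Tensor:
    """aa: [N, 3] axis-angle -> [N, 3, 3] via the Rodrigues formula."""
    angle = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    axis = aa / angle
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    zeros = torch.zeros_like(x)
    K = torch.stack([zeros, -z, y, z, zeros, -x, -y, x, zeros], dim=-1).reshape(-1, 3, 3)
    eye = torch.eye(3, device=aa.device).expand(aa.shape[0], 3, 3)
    s, c = torch.sin(angle)[..., None], torch.cos(angle)[..., None]
    return eye + s * K + (1 - c) * (K @ K)

def rotation_matrix_to_r6d(R: torch.Tensor) -> torch.Tensor:
    """R: [N, 3, 3] -> [N, 6], the first two columns stacked (Zhou et al. 6D)."""
    return R[..., :2].transpose(-1, -2).reshape(-1, 6)

REDUCED = [1, 2, 3, 4, 5, 6, 9, 12, 13, 14, 15, 16, 17, 18, 19]  # assumed 15-joint set

def pose_s3_ground_truth(pose_axis_angle: torch.Tensor) -> torch.Tensor:
    """pose_axis_angle: [T, 24, 3] SMPL pose -> [T, 90] 6D target for Pose-S3."""
    T = pose_axis_angle.shape[0]
    R = axis_angle_to_rotation_matrix(pose_axis_angle.reshape(-1, 3)).reshape(T, 24, 3, 3)
    return rotation_matrix_to_r6d(R[:, REDUCED].reshape(-1, 3, 3)).reshape(T, -1)
```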

Xinyu-Yi commented 2 years ago
  1. These 8 joints' rotations are set to identity. As we put our IMUs on the forearms and lower legs, the wrist/hand/ankle/foot joint movements are naturally not measured. Your method for calculating the ground truth is right.
  2. In the name "_reduced_glb_6d_to_full_local_mat", "global" means "root-relative" and "local" means "parent-relative".
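
To make item 2 concrete, here is a minimal sketch of the relation between the two, assuming the standard SMPL kinematic tree: root-relative ("global") rotations compose parent-relative ("local") ones along the tree, and the inverse mapping peels that composition off again. The parent list is written out here for illustration, not copied from the repo.

```python
import torch

# Assumed SMPL parent indices (joint 0 is the root pelvis).
SMPL_PARENT = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9,
               12, 13, 14, 16, 17, 18, 19, 20, 21]

def local_to_global(R_local: torch.Tensor) -> torch.Tensor:
    """R_local: [24, 3, 3] parent-relative -> [24, 3, 3] root-relative."""
    R_global = R_local.clone()
    for i, p in enumerate(SMPL_PARENT):
        if p > 0:  # children of the root are already root-relative
            R_global[i] = R_global[p] @ R_local[i]
    return R_global

def global_to_local(R_global: torch.Tensor) -> torch.Tensor:
    """R_global: [24, 3, 3] root-relative -> [24, 3, 3] parent-relative."""
    R_local = R_global.clone()
    for i, p in enumerate(SMPL_PARENT):
        if p > 0:
            R_local[i] = R_global[p].transpose(-1, -2) @ R_global[i]
    return R_local
```
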
TowoC commented 2 years ago

Hello Xinyu, thanks for your explanation. Now I understand what is going on.