davrempe / humor

Code for ICCV 2021 paper "HuMoR: 3D Human Motion Model for Robust Pose Estimation"
MIT License

the meaning of prior frame and canonical frame #18

Closed tinatiansjz closed 2 years ago

tinatiansjz commented 2 years ago

Hi Davis, you really did a wonderful job on both your paper and your codebase! Thanks for sharing the code, which is well-organized and well-annotated. I love it and I have learned a lot.

I have some trouble understanding the transformations between coordinate systems, e.g. compute_world2aligned_mat and compute_cam2prior. Could you explain a little bit about the default setup of each coordinate system, and why we need to transform between them? Besides, what is the meaning of "prior frame" and "canonical frame"? Do they both refer to reconstructing motions in the world coordinate system? If not, what is the relationship between the prior frame and the canonical frame?

Looking forward to your reply.

davrempe commented 2 years ago

Hello, I'm glad you're finding the codebase useful.

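There are 3 coordinate systems that are relevant here: the "world" frame, the "canonical" frame (also referred to as the "prior" frame in the code), and the camera frame.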

So compute_world2aligned_mat computes the rotation that goes from the "world" frame to the "canonical" frame. It is used to transform the data inputs into the canonical frame before passing them to the neural network, and to transform the outputs back to the "world" frame after making a prediction. This happens both when training and testing the model with AMASS data and during pose estimation fitting from RGB video.
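To make the idea of a world-to-canonical rotation concrete, here is a minimal NumPy sketch. It is an illustration only, not the repository's compute_world2aligned_mat: the function name `world_to_canonical_rotation` is hypothetical, and it assumes that z is the up axis and that the canonical frame is obtained by rotating about z until the ground-projected body right vector points along +x.

```python
import numpy as np

def world_to_canonical_rotation(body_right_world):
    """Hypothetical helper: rotation about +z that maps the ground-projected
    body right vector onto the +x axis (assumed canonical heading)."""
    r = np.asarray(body_right_world, dtype=float)
    rx, ry = r[0], r[1]
    norm = np.hypot(rx, ry)
    if norm < 1e-8:
        # Right vector is (nearly) vertical, so the heading is undefined.
        return np.eye(3)
    c, s = rx / norm, ry / norm
    # Rotation by -theta about z, where theta = atan2(ry, rx).
    return np.array([[c,   s,   0.0],
                     [-s,  c,   0.0],
                     [0.0, 0.0, 1.0]])

# Usage: move a world-frame vector into the canonical frame and back.
R = world_to_canonical_rotation([1.0, 1.0, 0.2])
v_world = np.array([0.3, -0.5, 1.0])
v_canonical = R @ v_world          # input to the model (assumed convention)
v_world_again = R.T @ v_canonical  # inverse of a rotation is its transpose
```

Because the transform is a pure rotation, mapping predictions back to the "world" frame only needs the transpose.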

On the other hand, compute_cam2prior is only used during fitting to video. This computes the transformation from the camera coordinate frame to the canonical one. This is needed because the poses we predict from a video are in the camera coordinate frame, but HuMoR is only trained in the "canonical" frame. So we have to first transform the current pose to this canonical frame before using HuMoR to roll out the motion. We can figure out the transformation from camera to canonical frame by using the floor plane (which defines the xy plane of the canonical frame at z=0) and the person's current 3D pose (which gives the xy position of the origin and the body right vector).
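The construction described above can be sketched as follows. This is a rough illustration under stated assumptions, not the repository's compute_cam2prior: the function name `camera_to_canonical` is hypothetical, the floor is assumed to be given as a plane `n . x + d = 0` in camera coordinates, the up direction is taken to be the side of the floor the root joint is on, and the canonical +x axis is assumed to be the body right vector projected onto the floor.

```python
import numpy as np

def camera_to_canonical(floor_normal, floor_offset, root_pos, body_right):
    """Hypothetical helper: rigid transform (R, t) such that
    x_canonical = R @ x_camera + t, with the floor at z = 0."""
    n = np.asarray(floor_normal, dtype=float)
    scale = np.linalg.norm(n)
    n, d = n / scale, floor_offset / scale
    p = np.asarray(root_pos, dtype=float)

    # Signed distance of the root from the floor; flip the plane so that
    # the normal points toward the person (assumed "up").
    height = n @ p + d
    if height < 0:
        n, d, height = -n, -d, -height

    # Origin: the root position dropped straight down onto the floor.
    origin = p - height * n

    # x axis: body right vector projected into the floor plane.
    r = np.asarray(body_right, dtype=float)
    x_axis = r - (r @ n) * n
    x_axis /= np.linalg.norm(x_axis)
    # y axis completes a right-handed frame with z = n.
    y_axis = np.cross(n, x_axis)

    R = np.stack([x_axis, y_axis, n])  # rows are the canonical axes
    t = -R @ origin
    return R, t

# Usage with a camera whose y axis points down and a floor 1 m below it.
R, t = camera_to_canonical([0.0, 1.0, 0.0], -1.0,
                           root_pos=[0.2, 0.1, 3.0],
                           body_right=[1.0, 0.0, 0.0])
root_canonical = R @ np.array([0.2, 0.1, 3.0]) + t
```

With this convention the root lands directly above the canonical origin, and every point on the floor gets z = 0, which matches the description of the floor plane defining the xy plane.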

Note: when I say "canonical" or "prior frame" in the code base, it's not always precise... e.g. in viz_fitting_rgb the "prior" frame is actually the coordinate system defined by the "canonical" frame of the first timestep of the motion sequence only.

Hope this helps clear things up a bit. All the coordinate systems here can definitely be confusing!