the meaning of prior frame and canonical frame

Hello, I'm glad you're finding the codebase useful.

There are 3 coordinate systems that are relevant here:

1. The coordinate system of the processed AMASS mocap data. The origin of this system is on the floor with +z pointing "up", orthogonal to the floor (i.e. the floor is the xy plane). This coordinate system is fixed (i.e. it doesn't change throughout a motion sequence). I sometimes refer to this as the "world" frame in the code.
2. The "canonical frame" and "prior frame". These terms are used interchangeably and usually mean the same thing: the coordinate system that the motion model (HuMoR) directly operates in. The origin of this coordinate system changes over time throughout a motion sequence (with respect to the "world" frame). It is described in the paper and supplementary material. In short, it's set up so that +z is up (same as the world frame), the +x axis is parallel to the floor plane and points in the same direction as the body right vector (defined by the SMPL root orientation), and so +y points forward. The xy position of the origin is at the body root position. We need this coordinate system so that the network can easily generalize no matter where the human is in the "world" frame: the network can always assume its inputs have been transformed into this canonical coordinate frame. This makes the network input invariant to the arbitrary position or facing direction of the human in the "world" frame.
3. The camera coordinate frame. This is only relevant when we are fitting to RGB or RGB-D video. Its origin is defined by the camera pose and is fixed over time (assuming the camera is static).
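
As a rough illustration of how the "world" and "canonical" frames relate, here is a minimal NumPy sketch. It is not the code from the repo: the function name `world_to_canonical_frame` is hypothetical, and it assumes the body right vector is already available in the world frame (in practice it would be derived from the SMPL root orientation).

```python
import numpy as np

def world_to_canonical_frame(root_pos, right_vec):
    """Return (R, origin) such that p_canon = R @ (p_world - origin).

    root_pos:  (3,) body root position in the world frame (floor is z = 0).
    right_vec: (3,) body right vector in the world frame (assumed given here).
    """
    up = np.array([0.0, 0.0, 1.0])
    # +x: body right vector projected onto the floor plane, then normalized.
    # (Degenerate if the right vector is exactly vertical.)
    x_axis = right_vec - np.dot(right_vec, up) * up
    x_axis = x_axis / np.linalg.norm(x_axis)
    # +y: forward direction, completing a right-handed frame with +z up.
    y_axis = np.cross(up, x_axis)
    # Rows of R are the canonical axes expressed in world coordinates.
    R = np.stack([x_axis, y_axis, up])
    # The origin sits on the floor, directly below the body root.
    origin = np.array([root_pos[0], root_pos[1], 0.0])
    return R, origin

# Example: a person at (2, 3) facing world +x, so their right is roughly world -y.
root = np.array([2.0, 3.0, 0.9])
right = np.array([0.0, -1.0, 0.2])
R, origin = world_to_canonical_frame(root, right)

# A point one meter in front of the root, expressed in the canonical frame.
ahead_world = root + np.array([1.0, 0.0, 0.0])
ahead_canon = R @ (ahead_world - origin)
```

In this example the root maps to a point directly above the canonical origin, and the point one meter ahead of the person lands on the +y axis, regardless of where the person stands or which way they face in the world frame.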

So compute_world2aligned_mat computes the rotation that goes from the "world" frame to the "canonical" frame. This rotation is used to transform the data inputs to the neural network before passing them in, and its inverse is used to transform the outputs back to the "world" frame after making a prediction. It is used both when training and testing the model with AMASS data and during pose estimation fitting from RGB video.
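
The round trip described above can be sketched as follows. This is only an illustration under stated assumptions, not the repo's implementation: the helper names (`rot_z`, `world_to_canonical`, `canonical_to_world`) are hypothetical, the rotation is taken to be a pure rotation about the up axis, and the origin is the root position projected onto the floor. Note that positions are shifted and rotated, while directions such as velocities are only rotated.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the +z (up) axis by angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def world_to_canonical(points, velocities, R, origin):
    """Map (N, 3) world positions and velocities into the canonical frame."""
    # Positions: translate to the canonical origin, then rotate.
    # Velocities: rotate only, since they are directions, not locations.
    return (points - origin) @ R.T, velocities @ R.T

def canonical_to_world(points, velocities, R, origin):
    """Inverse mapping, applied to network outputs."""
    # R is orthonormal, so its inverse is its transpose.
    return points @ R + origin, velocities @ R

# Hypothetical inputs: two joints and their velocities in the world frame.
R = rot_z(np.deg2rad(40.0))            # stand-in for a world-to-canonical rotation
origin = np.array([1.5, -0.5, 0.0])    # root xy position, on the floor
joints_w = np.array([[1.5, -0.5, 0.9], [1.7, -0.4, 1.4]])
vels_w = np.array([[0.3, 0.1, 0.0], [0.2, 0.0, -0.1]])

joints_c, vels_c = world_to_canonical(joints_w, vels_w, R, origin)
# ... the motion model would operate on joints_c / vels_c here ...
joints_back, vels_back = canonical_to_world(joints_c, vels_c, R, origin)
```

Because the rotation is about the up axis and the origin lies on the floor, heights (z values) are unchanged by the transform, and mapping back recovers the original world-frame values.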

On the other hand, compute_cam2prior is only used during fitting to video. It computes the transformation from the camera coordinate frame to the canonical one. This is needed because the poses we predict from a video are in the camera coordinate frame, but HuMoR is only trained in the "canonical" frame. So we first have to transform the current pose to this canonical frame before using HuMoR to roll out the motion. We can figure out the transformation from the camera frame to the canonical frame by using the floor plane (which defines the xy plane of the canonical frame at z=0) and the person's current 3D pose (which gives the xy position of the origin and the body right vector).
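
To make the geometry concrete, here is a hedged NumPy sketch of building a camera-to-canonical transform from a floor plane and the current pose. It is not the repo's compute_cam2prior: the function name `cam_to_canonical` is hypothetical, the floor is assumed to be parameterized as a unit normal `n` and offset `d` with `n . p + d = 0` for floor points, and the body right vector is assumed to be given in camera coordinates.

```python
import numpy as np

def cam_to_canonical(floor_normal, floor_offset, root_pos_cam, right_vec_cam):
    """Return (R, origin) such that p_canon = R @ (p_cam - origin).

    The floor plane is assumed to satisfy n . p + d = 0 in camera coordinates.
    """
    norm = np.linalg.norm(floor_normal)
    n = floor_normal / norm
    d = floor_offset / norm
    # Make the normal point from the floor toward the person ("up").
    if np.dot(n, root_pos_cam) + d < 0:
        n, d = -n, -d
    height = np.dot(n, root_pos_cam) + d   # root height above the floor
    # +x: body right vector projected onto the floor plane, then normalized.
    x_axis = right_vec_cam - np.dot(right_vec_cam, n) * n
    x_axis = x_axis / np.linalg.norm(x_axis)
    # +y: forward, completing a right-handed frame with +z = n.
    y_axis = np.cross(n, x_axis)
    R = np.stack([x_axis, y_axis, n])
    # Origin: the root position dropped straight down onto the floor.
    origin = root_pos_cam - height * n
    return R, origin

# Example with a camera whose +y axis points down and which sits 1.5 m above the floor.
n_cam = np.array([0.0, -1.0, 0.0])      # floor normal (up) in camera coordinates
d_cam = 1.5                             # floor points satisfy -y + 1.5 = 0
root_cam = np.array([0.2, 0.6, 3.0])    # body root, 0.9 m above the floor
right_cam = np.array([-1.0, 0.1, 0.0])  # person roughly facing the camera

R, origin = cam_to_canonical(n_cam, d_cam, root_cam, right_cam)
root_canon = R @ (root_cam - origin)
floor_pt_canon = R @ (np.array([1.0, 1.5, 2.0]) - origin)
```

In this setup the root ends up directly above the canonical origin at its height over the floor, and any point on the floor plane maps to z=0, which matches the description of the canonical frame above.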

Note: when I say "canonical" or "prior" frame in the code base, it's not always precise. For example, in viz_fitting_rgb the "prior" frame is actually the coordinate system defined by the "canonical" frame of the first timestep of the motion sequence only.

Hope this helps clear things up a bit, all the coordinate systems here can definitely be confusing!
