geopavlakos / hamer

HaMeR: Reconstructing Hands in 3D with Transformers
https://geopavlakos.github.io/hamer/
MIT License

What does global orient mean? #66

Closed hahamini closed 1 week ago

hahamini commented 3 weeks ago

I'm asking this question because I want to calculate the palm vector from the wrist (root).

I guess it's the orientation of the root keypoint. Is that right? If so, what is the reference coordinate system? (For example, is the palm orientation the z-axis?)

Thank you.

geopavlakos commented 3 weeks ago

The parameter global_orient corresponds to the global orientation of the root of the MANO model. As for the reference frame, you will need to check how MANO represents it internally.
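
For example, assuming the rotation-matrix layout that the demo snippets later in this thread use (global_orient as a (1, 3, 3) matrix per sample), you could read it out like this, with out and n as in the demo code:

    import cv2

    # Assumption: global_orient is stored as a (B, 1, 3, 3) rotation matrix.
    R_root = out["pred_mano_params"]["global_orient"][n].detach().cpu().numpy().squeeze()

    # Convert the 3x3 rotation matrix to an axis-angle vector if needed.
    rvec, _ = cv2.Rodrigues(R_root)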

hahamini commented 3 weeks ago

@geopavlakos Thanks for your answer.

I have a few more questions.

  1. I'm using hand_pose and pred_keypoints_3d to project the pose onto a 2D image. The right hand projects correctly, but the left hand is projected to the wrong location, and I don't know why.
  2. What is the index correspondence between hand_pose (rotation matrices for 15 joints) and keypoints_3d (a matrix of 21 keypoints)?
  3. Is the pose of each joint relative to its parent node? For example, for the chain root, 1 (thumb pip), 2 (thumb dip), 3 (thumb tip), is R_root_to_tip = R1 @ R2 @ R3 or just R3?

Here is my demo code.

            for n in range(batch_size):
                global_orient = (
                    out["pred_mano_params"]["global_orient"].detach().cpu().numpy()[n]
                )
                hand_pose = (
                    out["pred_mano_params"]["hand_pose"].detach().cpu().numpy()[n]
                )
                # Translate the predicted keypoints into the full-image camera frame
                keypoints_3d = out["pred_keypoints_3d"][n].detach().cpu().numpy() + (
                    pred_cam_t_full[n]
                )
                # Hand root: convert the 3x3 global orientation to an
                # axis-angle vector for drawFrameAxes
                cv2.drawFrameAxes(
                    canvas,
                    intrinsic.camera_matrix,
                    np.zeros(5),  # no lens distortion
                    cv2.Rodrigues(np.squeeze(global_orient))[0],
                    keypoints_3d[0],
                    0.01,
                )
                # Thumb pip
                cv2.drawFrameAxes(
                    canvas,
                    intrinsic.camera_matrix,
                    np.zeros(5),
                    cv2.Rodrigues(hand_pose[0])[0],
                    keypoints_3d[2],
                    0.01,
                )
                # Thumb tip
                cv2.drawFrameAxes(
                    canvas,
                    intrinsic.camera_matrix,
                    np.zeros(5),
                    cv2.Rodrigues(hand_pose[2])[0],
                    keypoints_3d[4],
                    0.01,
                )

geopavlakos commented 3 weeks ago

1) The hand pose parameters follow the regular MANO order, so I would point you to that. Please check Lines 206-220 here. The 3D keypoints follow the OpenPose order.

2) I believe you will need to flip the 3D keypoints (and pred_cam_t_full) of the left hand across the x-axis (multiply the first dimension by -1). Please check this issue for an explanation.

3) Yes, you are correct: the rotations are expressed relative to the parent joint. For this, we follow the MANO convention, so for more details you can check how this is implemented here. A short sketch of the ordering and the rotation composition follows below.
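
As a concrete sketch: the joint-order lists below reflect the standard MANO and OpenPose conventions as I understand them (double-check against the linked code), and chain_rotation is an illustrative helper, not part of HaMeR:

    import numpy as np

    # Standard MANO hand_pose joint order (15 joints, wrist excluded):
    MANO_JOINT_NAMES = [
        "index1", "index2", "index3",
        "middle1", "middle2", "middle3",
        "pinky1", "pinky2", "pinky3",
        "ring1", "ring2", "ring3",
        "thumb1", "thumb2", "thumb3",
    ]

    # OpenPose 21-keypoint order: 0 = wrist, then 4 keypoints per finger
    # in the order thumb, index, middle, ring, pinky (each ending at the tip).

    def chain_rotation(global_orient, hand_pose, chain):
        """Compose parent-relative rotations along a kinematic chain.

        global_orient: (3, 3) root rotation matrix.
        hand_pose:     (15, 3, 3) parent-relative joint rotation matrices.
        chain:         indices into hand_pose, ordered root -> tip.
        """
        R = global_orient.copy()
        for j in chain:
            R = R @ hand_pose[j]  # each rotation is relative to its parent
        return R

    # e.g. the thumb chain occupies MANO indices 12, 13, 14:
    # R_root_to_thumb_tip = chain_rotation(np.squeeze(global_orient),
    #                                      hand_pose, [12, 13, 14])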

hahamini commented 2 weeks ago

@geopavlakos This was really helpful, thank you.

hahamini commented 2 weeks ago

@geopavlakos Please explain more about 2. I have pred_keypoints_3d and hand_pose of shape (15, 3) for my left hand, and pred_cam_t_full of shape (3,). Actually, I don't know whether pred_keypoints_3d is expressed in the camera coordinate system or in the MANO model's coordinate system. However, since you said it is flipped across the x-axis, I applied an x-axis flip transformation, but the result is wrong.

Could you explain specifically whether I should add pred_cam_t_full to pred_keypoints_3d and then apply the flip, or apply the flip first? Alternatively, pseudocode that computes the left-hand pose from the right-hand 6-DoF pose in the camera coordinate system would be much appreciated.

for n in range(batch_size):
    global_orient = (
        out["pred_mano_params"]["global_orient"].detach().cpu().numpy()[n]
    )
    hand_pose = (
        out["pred_mano_params"]["hand_pose"].detach().cpu().numpy()[n]
    )
    keypoints_3d = out["pred_keypoints_3d"].detach().cpu().numpy()[n]
    if right[n] == 0:
        # note: diag(1, -1, -1) negates the y and z components
        flip_matrix = np.array([[1, 0, 0], [0, -1, 0], [0, 0, -1]])
        keypoints_3d = np.matmul(
            flip_matrix, keypoints_3d.reshape(-1, 3, 1)
        ).reshape(-1, 3)
    keypoints_3d += pred_cam_t_full[n]

geopavlakos commented 2 weeks ago

You should multiply the first dimension by -1, not the second and the third. Similarly, you need to multiply the first dimension of pred_cam_t_full by -1 before you add them together.

hahamini commented 2 weeks ago

@geopavlakos As you said, I multiplied the first dimension of pred_keypoints_3d (shape (21, 3)) by -1 (pred_keypoints_3d[:, 0]), and also multiplied the first dimension of pred_cam_t_full (shape (3,)) by -1 (pred_cam_t_full[0]). However, when I verified it with reprojection, the result was wrong. (The right-hand result is perfect, of course.)

                global_orient = (
                    out["pred_mano_params"]["global_orient"].detach().cpu().numpy()[n]
                )
                hand_pose = (
                    out["pred_mano_params"]["hand_pose"].detach().cpu().numpy()[n]
                )
                keypoints_3d = out["pred_keypoints_3d"].detach().cpu().numpy()[n]
                # keypoints_3d -> (21,3)
                # pred_cam_t_full[n] -> (3,)
                if right[n] == 0:
                    keypoints_3d[:, 0] *= -1
                    pred_cam_t_full[n][0] *= -1
                keypoints_3d += pred_cam_t_full[n]

geopavlakos commented 1 week ago

My bad, the multiplication for the camera translation has already been taken care of in line 138 of the demo. You only need to multiply the first dimension of the keypoints by -1, and that should be it.

hahamini commented 3 days ago

@geopavlakos Does that mean I don't have to change anything in my code, i.e. I can handle the left hand the same way as the right hand?

Do I just need to add pred_cam_t_full to get the 3D keypoints, as for the right hand? But the result was still wrong.

        for batch in dataloader:
            batch = recursive_to(batch, device)
            with torch.no_grad():
                out = model(batch)

            multiplier = 2 * batch["right"] - 1
            pred_cam = out["pred_cam"]

            pred_cam[:, 1] = multiplier * pred_cam[:, 1]
            box_center = batch["box_center"].float()
            box_size = batch["box_size"].float()
            img_size = batch["img_size"].float()
            pred_cam_t_full = (
                cam_crop_to_full(
                    pred_cam,
                    box_center,
                    box_size,
                    img_size,
                    my_intrinsic
                )
                .detach()
                .cpu()
                .numpy()
            )

            batch_size = batch["img"].shape[0]
            for n in range(batch_size):
                global_orient = (
                    out["pred_mano_params"]["global_orient"].detach().cpu().numpy()[n]
                )
                hand_pose = (
                    out["pred_mano_params"]["hand_pose"].detach().cpu().numpy()[n]
                )
                keypoints_3d = out["pred_keypoints_3d"].detach().cpu().numpy()[n]
                # keypoints_3d -> (21,3)
                # pred_cam_t_full[n] -> (3,)
                #if right[n] == 0:
                #    keypoints_3d[:, 0] *= -1
                #    pred_cam_t_full[n][0] *= -1
                keypoints_3d += pred_cam_t_full[n]

++ demo.py(138) Also, you said to multiply the x-axis by -1 because the left and right hands are mirrored across the x-axis. However, when I checked line 138, I found that it multiplies index 1 of pred_cam, which I would expect to be the y-axis. I wonder if this is correct.

geopavlakos commented 2 days ago

I mentioned above that you "only need to multiply the first dimension of the keypoints by -1". By commenting out the whole if-statement in your code, you don't multiply the keypoints of the left hand by -1. The correct code would be:

                if right[n] == 0:
                    keypoints_3d[:, 0] *= -1
                keypoints_3d += pred_cam_t_full[n]
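
To close the loop, here is a minimal end-to-end sketch of the left-hand handling plus a reprojection check. It reuses canvas, intrinsic.camera_matrix, and the other variables from the demo code above; treating the translated keypoints as already being in the camera frame is an assumption you should verify:

    import cv2
    import numpy as np

    # Mirror only the predicted keypoints for left hands; the pred_cam[:, 1]
    # flip in demo.py (line 138) already accounts for the translation, so
    # pred_cam_t_full is used as-is.
    if right[n] == 0:
        keypoints_3d[:, 0] *= -1
    keypoints_3d = keypoints_3d + pred_cam_t_full[n]

    # Reprojection check: the points are in the camera frame, so the
    # rotation and translation passed to projectPoints are zero.
    points_2d, _ = cv2.projectPoints(
        keypoints_3d,
        np.zeros(3),              # rvec: identity rotation
        np.zeros(3),              # tvec: no extra translation
        intrinsic.camera_matrix,
        np.zeros(5),              # no lens distortion
    )
    for u, v in points_2d.reshape(-1, 2).astype(int):
        cv2.circle(canvas, (u, v), 2, (0, 255, 0), -1)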