geopavlakos / hamer

HaMeR: Reconstructing Hands in 3D with Transformers
https://geopavlakos.github.io/hamer/
MIT License

Hand orientation estimation #23

Closed retoc71586 closed 9 months ago

retoc71586 commented 9 months ago

Good evening, a quick question. I am using HaMeR to estimate the hand orientation (rotation) from a monocular video. If I understand the code correctly, pred_mano_params.global_orient represents the rotation matrix that goes from the hand reference frame to the camera reference frame. If the image coordinate system is the usual one, what is the hand coordinate system?

geopavlakos commented 9 months ago

Do you need to define the hand coordinate frame explicitly? It would be simpler to provide the global_orient prediction to the MANO model and let it produce the vertices in the camera frame, instead of reasoning about a separate coordinate frame for the hand.
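
To make the frames concrete, here is a toy numerical illustration of the relationship being discussed; R_hand_to_cam stands in for global_orient and all numbers are made up:

```python
import numpy as np

# global_orient acts as the rotation from the hand/root frame to the camera
# frame, so a point expressed in the hand frame maps to the camera frame as
#   x_cam = R_hand_to_cam @ x_hand + t_cam
R_hand_to_cam = np.array([[0., -1., 0.],
                          [1.,  0., 0.],
                          [0.,  0., 1.]])   # toy 90-degree rotation about z
t_cam = np.array([0.05, 0.02, 0.45])        # toy camera-frame translation (meters)

x_hand = np.array([0.10, 0.0, 0.0])         # a point given in the hand frame
x_cam = R_hand_to_cam @ x_hand + t_cam      # -> [0.05, 0.12, 0.45]
```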

retoc71586 commented 9 months ago

I am using the repo to also track the wrist position and orientation in space, so I need the relative rotation between the camera frame and the wrist frame. I was previously using FrankMocap for this purpose; there I could do it by taking the wrist component of ['pred_hand_pose'], which is the axis-angle representation of this rotation. My idea for doing it with HaMeR was to just use out['pred_mano_params']['global_orient'], convert it to axis-angle, and use it in place of the FrankMocap prediction, but I get some unexpected rotation values. That is why I wanted to know what the reference frame for this rotation is.

Do you have a code example for doing what you suggested and rendering the result on the input image?

Thanks for the help

retoc71586 commented 9 months ago

If I use the visualisation script I was using for FrankMocap (which did what you suggested) but take the rotation from global_orient instead, the result looks wrong:

[Screenshot 2024-02-02 at 10:08:57]

(blue is FrankMocap, red is HaMeR)

That is why I am trying to understand better what the rotation in global_orient means. I am guessing that the reason the two don't coincide is that you are using a different MANO model than the one in FrankMocap, but I'm not sure.

geopavlakos commented 9 months ago

HaMeR should be a drop-in replacement for the hand estimation module of FrankMocap, since they regress the same output. Can you share the rendering script you use (from the FrankMocap regressed output to the rendering)? Is it possible that you are applying further transformations to the FrankMocap output to render according to the renderer conventions?

The demo code we have shows examples of how to render the predicted output both at the bounding-box level and on the full image.
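
For reference, converting the crop-level weak-perspective camera into a translation for full-image rendering generally looks like the sketch below. This is an illustrative reimplementation of the idea rather than the exact demo code, and the function and argument names here are made up:

```python
import numpy as np

def crop_cam_to_full(pred_cam, box_center, box_size, img_size, focal_length):
    """Convert a weak-perspective camera predicted in a hand crop into a
    camera-frame translation for rendering on the full image."""
    s, tx, ty = pred_cam                    # scale and translation in the crop
    img_w, img_h = img_size
    cx, cy = box_center                     # crop center in full-image pixels
    tz = 2.0 * focal_length / (s * box_size)
    tx_full = tx + 2.0 * (cx - img_w / 2.0) / (s * box_size)
    ty_full = ty + 2.0 * (cy - img_h / 2.0) / (s * box_size)
    return np.array([tx_full, ty_full, tz])
```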

retoc71586 commented 9 months ago

This is the rendering code I use. I hope it makes sense even if it is out of context: https://github.com/retoc71586/bullshitting/blob/main/renderer.py

I find the main difference to be your use of the focal_length parameter. It is not fully clear to me how you use this parameter in the renderer and why you do not use the global_orient parameter in the rendering.

geopavlakos commented 9 months ago

Hmm, if I understand correctly, this only shows the initialization of the renderer, not the actual rendering function (for example, I don't see how you handle the wrist rotation you mention above). The focal_length value would affect the estimated translation for the rendering, and based on the visualization you provided, the translation seems roughly accurate, so that part is probably handled as expected. I'm more concerned with global_orient and how it is handled by the MANOGroupLayer, whose definition I don't have.

But instead of digging deeper into the code, I'm wondering whether it would help more to establish the correspondence between HaMeR and FrankMocap. FrankMocap returns a field pred_hand_pose with 48 parameters. The first three correspond to out['pred_mano_params']['global_orient'] for HaMeR and the other 45 to out['pred_mano_params']['hand_pose'] for HaMeR. The only difference is that HaMeR outputs the 3D rotations in rotation-matrix representation, while the FrankMocap output is in axis-angle representation, but based on your previous messages you are aware of that. Beyond that, there is a direct correspondence between pred_hand_betas for FrankMocap and out['pred_mano_params']['betas'] for HaMeR, and (I believe) the same is true for pred_camera of FrankMocap and out['pred_cam'] of HaMeR. If you follow this correspondence, you should get a 1-to-1 mapping between the two.
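
As an illustration of this mapping (not code from either repo), building a FrankMocap-style 48-dimensional axis-angle vector from the HaMeR output could look roughly like this; the tensor shapes assume a batch size of 1 and follow the keys discussed in this thread:

```python
import cv2
import numpy as np

mano_params = out['pred_mano_params']        # HaMeR output dict for one detection

# Stack the wrist rotation and the 15 joint rotations: (16, 3, 3) rotation matrices.
rotmats = np.concatenate([
    mano_params['global_orient'].detach().cpu().numpy().reshape(-1, 3, 3),
    mano_params['hand_pose'].detach().cpu().numpy().reshape(-1, 3, 3),
], axis=0)

# Convert each rotation matrix to axis-angle and flatten to 48 values,
# matching the layout of FrankMocap's pred_hand_pose.
pred_hand_pose = np.concatenate([cv2.Rodrigues(R)[0].reshape(3) for R in rotmats])
```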

With that being said, something that I don't follow in your messages above, and that could be a source of confusion, is this:

the wrist component of ['pred_joints_smpl'] which is the angle axis representation of this rotation

This field in FrankMocap gives only the joint locations. The wrist rotation in axis-angle representation for FrankMocap is actually the first three values of the pred_hand_pose field and, as I wrote above, this corresponds to out['pred_mano_params']['global_orient'] for HaMeR.

I hope this is helpful. If there is still confusion after the 1-to-1 mapping, you might need to share more of the script showing how you handle the FrankMocap output, so we can see what you might be doing differently.

retoc71586 commented 9 months ago

Good morning, thanks for the support on this matter. You are right, the variable in FrankMocap containing the wrist orientation is pred_hand_pose; I have corrected my previous message so as not to mislead anyone who might read this thread in the future.

What you said about the 1-to-1 correspondence (I get the axis-angle from the rotation matrix via aa, _ = cv2.Rodrigues(rot_mat)) makes perfect sense to me, so I agree there must be something strange with the rendering. I have updated my code at https://github.com/retoc71586/bullshitting/blob/main/renderer.py with the whole renderer class, which I call simply via renderer.run().

The piece of code where I get the vertices from MANO is:

    self._mano_group_layer = MANOGroupLayer(
        [self._mano_side],
        [self._mano_betas.astype(np.float32)],
        mano_root=base_path + "/utils/mano/manopth/mano_v1_2/models",
    ).to(self._device)

and consequently:

    mano_pose = self._mano_poses[:, f, 1:]
    pose = torch.from_numpy(mano_pose.astype(np.float32)).to(self._device)
    pose = pose.view(-1, self._mano_group_layer.num_obj * 51)
    vert, joint = self._mano_group_layer(pose)

These vertices and joints are then saved in the renderer's internal variables.

In pose I save pred_hand_pose coming from FrankMocap in the order angle_axis_wrist_rotation, joint_3D_position (this reference frame), wrist_3D_position_in_camera_frame.

I think the reason my FrankMocap code is not compatible with yours is that we use a different MANO layer with a different input shape. I use this code, whereas I see you are using smplx.MANOLayer coming from the Max Planck Institute code.

geopavlakos commented 9 months ago

I tried the manopth package and, for a random input, it gives me exactly the same vertex output as the MPI MANOLayer code. The only difference is that manopth outputs vertices in mm, while MANOLayer outputs them in meters. Is it possible that you apply some extra transformation in the modified manopth code (I see this is not the original), other than simply adding the wrist_3D_position to the vertices? Or could there be an error in the conversion from rotation matrices to axis-angle? (Although I think that's unlikely, given that the joint rotations look ok; a rough version of this consistency check is sketched further down in this comment.) Also, a small comment:

In pose I save pred_hand_pose coming from FrankMocap in the order angle_axis_wrist_rotation, joint_3D_position (this reference frame), wrist_3D_position_in_camera_frame.

I assume you mean joint_3D_rotations instead of joint_3D_positions. So it should be simply the pred_hand_pose of FrankMocap + the wrist_3D_position.

Beyond that, I believe the HaMeR output should be able to directly substitute for FrankMocap and is compatible with manopth.
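
As a rough version of the consistency check mentioned above, one can push the same pose through both layers and compare; the model paths and the flat_hand_mean flag are assumptions that you would need to match on both sides:

```python
import torch
import smplx
from manopth.manolayer import ManoLayer

pose = torch.zeros(1, 48)    # wrist + 15 joint rotations, axis-angle
betas = torch.zeros(1, 10)

# manopth layer (returns vertices in millimetres)
manopth_layer = ManoLayer(mano_root='mano_v1_2/models', side='right',
                          use_pca=False, flat_hand_mean=True)
verts_mm, _ = manopth_layer(pose, betas)

# MPI smplx layer (returns vertices in meters)
mpi_layer = smplx.create('mano_models', model_type='mano', is_rhand=True,
                         use_pca=False, flat_hand_mean=True)
verts_m = mpi_layer(global_orient=pose[:, :3], hand_pose=pose[:, 3:],
                    betas=betas).vertices

print((verts_mm / 1000.0 - verts_m).abs().max())   # close to 0 if the layers agree
```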

Alternatively, if you are not tied to manopth, you could use the MPI MANOLayer and our rendering code. Here is the MANOLayer initialization. Here is the forward pass to get the vertices given the HaMeR output. And here is the function that handles the rendering.
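
For readers without access to those links, a generic simplified overlay along those lines can be sketched with pyrender as below. This is not HaMeR's renderer; the function name and arguments are illustrative, and it assumes camera-frame vertices in meters plus the MANO face indices:

```python
import numpy as np
import trimesh
import pyrender

def render_overlay(verts_cam, faces, image, focal_length):
    """Render MANO vertices (camera frame, meters) on top of an RGB image."""
    h, w = image.shape[:2]
    mesh = trimesh.Trimesh(verts_cam, faces, process=False)
    # pyrender uses an OpenGL camera (y up, z towards the viewer), so flip y and z
    # to go from the usual computer-vision camera convention.
    mesh.apply_transform(np.diag([1.0, -1.0, -1.0, 1.0]))

    scene = pyrender.Scene(bg_color=[0, 0, 0, 0], ambient_light=[0.3, 0.3, 0.3])
    scene.add(pyrender.Mesh.from_trimesh(mesh))
    scene.add(pyrender.IntrinsicsCamera(fx=focal_length, fy=focal_length,
                                        cx=w / 2.0, cy=h / 2.0), pose=np.eye(4))
    scene.add(pyrender.DirectionalLight(intensity=3.0), pose=np.eye(4))

    renderer = pyrender.OffscreenRenderer(w, h)
    color, _ = renderer.render(scene, flags=pyrender.RenderFlags.RGBA)
    renderer.delete()

    # Alpha-blend the rendered hand over the input image.
    alpha = color[..., 3:4] / 255.0
    return (alpha * color[..., :3] + (1.0 - alpha) * image).astype(np.uint8)
```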