geopavlakos / hamer

HaMeR: Reconstructing Hands in 3D with Transformers
https://geopavlakos.github.io/hamer/
MIT License

Getting accurate 3D wrist position #49

Closed · arnavbalaji closed this 2 months ago

arnavbalaji commented 2 months ago

Hello,

I'm trying to extract the 3D wrist positions for different frames of a video. I'm currently using out['pred_keypoints_3d'][0][0] to get them, but they seem to be inaccurate compared to how the hand is moving (I have a simple video where my hand is moving straight in one direction, but the positions are not changing across frames).

I read #30 and looked at the pred_cam, pred_cam_t, and pred_cam_t_full values, but I'm not really sure what to do with them. I can't element-wise add them, because the position is a 1x3 tensor while pred_cam_t is a 2x3 tensor.

I was also hoping to understand the significance of all three of these values, since the paper only mentions a single 3-element translation vector for the camera parameters. Any guidance would be appreciated. Thank you!

geopavlakos commented 2 months ago

Did you check this comment: https://github.com/geopavlakos/hamer/issues/30#issuecomment-1961760722? That should be enough to get an approximation of the hand location in the camera frame.
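
In code, the gist of that comment is roughly the following (a minimal sketch, not a reference implementation; it assumes a single detected hand and that pred_cam_t_full has already been computed with cam_crop_to_full as in demo.py):

# The 3D keypoints are predicted in a hand-centered frame, so adding the
# full-image camera translation gives an approximate position in the camera frame.
keypoints_3d = out['pred_keypoints_3d'][0].detach().cpu().numpy()  # (21, 3), hand-centered
keypoints_cam = keypoints_3d + pred_cam_t_full[0]                  # approximate camera-frame coordinates
wrist_cam = keypoints_cam[0]                                       # joint 0 is the wrist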

arnavbalaji commented 2 months ago

I did; that's what I was confused about. Do I just add the predicted keypoint to pred_cam_t_full?

For example, if I wanted the pose of the left wrist in the camera frame, would I do out["pred_keypoints_3d"][0][0] + pred_cam_t_full[0]?

I was also a little confused about the difference between pred_cam, pred_cam_t, and pred_cam_t_full. What is the significance of each of them?

arnavbalaji commented 2 months ago

I'm asking to confirm because I still seem to be getting incorrect results. The demo.py script, however, produces a pretty good mesh of my hand, so I was wondering how to get the correct poses.

geopavlakos commented 2 months ago

I confirm that the procedure described in the other issue is correct. In what sense do the results seem incorrect?

I believe you will not need the other values, but for completeness: pred_cam is the raw output of the network, and pred_cam_t is the translation of the hand in the bounding box crop (i.e., assuming a virtual camera that captures only the cropped bounding box of the hand).
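
pred_cam_t_full is that same translation expressed in the full image frame. For reference, here is a rough sketch of what cam_crop_to_full computes (simplified; see the actual implementation in the repo for the exact code):

import torch

def cam_crop_to_full_sketch(cam_bbox, box_center, box_size, img_size, focal_length):
    # cam_bbox = pred_cam = (s, tx, ty): weak-perspective camera predicted for the crop.
    img_w, img_h = img_size[:, 0], img_size[:, 1]
    cx, cy, b = box_center[:, 0], box_center[:, 1], box_size
    bs = b * cam_bbox[:, 0] + 1e-9                    # crop scale expressed in full-image pixels
    tz = 2 * focal_length / bs                        # depth implied by the crop scale
    tx = 2 * (cx - img_w / 2.) / bs + cam_bbox[:, 1]  # shift by the crop's offset from the image center (x)
    ty = 2 * (cy - img_h / 2.) / bs + cam_bbox[:, 2]  # shift by the crop's offset from the image center (y)
    return torch.stack([tx, ty, tz], dim=-1)          # this is pred_cam_t_full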

arnavbalaji commented 2 months ago

https://github.com/geopavlakos/hamer/assets/90121393/3192c575-f995-40a8-a381-3793ad5c9e05

Here is my video. Based on this, I was expecting the wrist position to increase (or decrease) steadily along one axis and then move back in the opposite direction along the same axis. Instead, the positions seem arbitrary. Here are the wrist position results for the first 10 frames (I extract around 5 frames per second into the image folder).

[ 0.281043    0.14964627 35.312008  ]
[-0.0940349  0.0878491 32.937767 ]
[-3.4428090e-03  1.0743936e-01  3.5119164e+01]
[-0.08250158  0.08986542 33.565678  ]
[ 0.27820206  0.16157107 33.76223   ]
[ 0.1897386   0.14185128 35.016422  ]
[1.9602448e-02 1.2397959e-01 3.4882965e+01]
[-0.05886994  0.09774382 33.21006   ]
[ 0.29561424  0.15586299 34.29335   ]
[ 0.05398774  0.1252439  36.970455  ]

Also, the value in the z-axis seems to be drastically changing across frames, going from ~33 to ~0.3 back up to ~34 in frames 2, 3, and 4.

Here is the code I'm using to calculate the wrist position.

import numpy as np
import torch

from hamer.utils import recursive_to
from hamer.utils.renderer import cam_crop_to_full

for batch in dataloader:
    batch = recursive_to(batch, device)

    with torch.no_grad():
        out = model(batch)

    # Flip the x-translation of the weak-perspective camera for left hands, as in demo.py.
    multiplier = (2 * batch['right'] - 1)
    pred_cam = out['pred_cam']
    pred_cam[:, 1] = multiplier * pred_cam[:, 1]

    # Convert the crop-level camera into a translation in the full image frame.
    box_center = batch["box_center"].float()
    box_size = batch["box_size"].float()
    img_size = batch["img_size"].float()
    scaled_focal_length = model_cfg.EXTRA.FOCAL_LENGTH / model_cfg.MODEL.IMAGE_SIZE * img_size.max()
    pred_cam_t_full = cam_crop_to_full(pred_cam, box_center, box_size, img_size, scaled_focal_length).detach().cpu().numpy()

    global_orient = out["pred_mano_params"]["global_orient"][0][0].detach().cpu().numpy()
    wrist_position = out["pred_keypoints_3d"][0][0].detach().cpu().numpy()

    # Approximate wrist position in the camera frame for the first detection in the batch.
    wrist_position = np.add(wrist_position, pred_cam_t_full[0])

geopavlakos commented 2 months ago

The demo code doesn't process the images in alphabetical order; it just pulls all the images from the input folder. If you want them processed in order of filename, you need to add an explicit sorting step.
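
For instance, something along these lines (a sketch; the folder path and glob pattern are placeholders to adapt to your setup):

from pathlib import Path

img_folder = 'example_data/frames'                  # placeholder input folder
img_paths = sorted(Path(img_folder).glob('*.jpg'))  # sort frames by filename before building the dataset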

Also, the 3.5119164e+01 value is scientific notation for 35.119164, so the depth values don't actually change that drastically.

arnavbalaji commented 2 months ago

Oh, I read the "+" as a "-"; that makes more sense now.

That also explains why the positions seemed arbitrary. I added a sort and the results make a lot more sense now. Thanks!