facebookresearch / InterHand2.6M

Official PyTorch implementation of "InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image", ECCV 2020

When can we expect the code for using model on our own images? #15

Closed SuhelNaryal closed 3 years ago

SuhelNaryal commented 3 years ago

Hi, congratulations on such great work. I am building an application where I need a robust hand pose estimation model like yours. I tried to figure out how to run the code on my own images but couldn't manage it; parameters like the focal length, principal point, and absolute depths are confusing me. Can you give me some directions on this, and share any potential dates for the release of code for running on our own images? Thank you.

mks0601 commented 3 years ago

Maybe you want some simple demo code? Sorry, but I'm busy nowadays :( I'll write some demo code after a conference deadline (mid-November). Thanks!

SuhelNaryal commented 3 years ago

Hi, I figured some things out from the paper and ran the code on my own image. I just wanted to ask how much difference the absolute root depth makes in the results. I got the results below (attached image). Thanks for your help; you have done great work.

mks0601 commented 3 years ago

Not much. Abs root depth means the absolute depth from the camera to the wrist (the root joint). It affects the 3D coordinates, not the image coordinates like in your result.
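For intuition, here is a minimal sketch of the pinhole back-projection that utils.transforms.pixel2cam performs (as it is used later in this thread); the depth only scales x and y when lifting a pixel to camera space, so the 2D image coordinates do not depend on the absolute root depth:

import numpy as np

def pixel2cam_sketch(pixel_coord, focal, princpt):
    # pixel_coord: (N, 3) array of (u, v, z), z in mm.
    # Pinhole back-projection: X = (u - cx) / fx * z, Y = (v - cy) / fy * z, Z = z.
    # Changing the absolute depth z rescales X, Y, Z but leaves (u, v) untouched.
    x = (pixel_coord[:, 0] - princpt[0]) / focal[0] * pixel_coord[:, 2]
    y = (pixel_coord[:, 1] - princpt[1]) / focal[1] * pixel_coord[:, 2]
    z = pixel_coord[:, 2]
    return np.stack((x, y, z), axis=1)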

SuhelNaryal commented 3 years ago

Okay, Thank you so much.

saqib22 commented 3 years ago

@SuhelNaryal Did you use any extra parameters other than the RGB image while running inference on a custom image?

SuhelNaryal commented 3 years ago

@saqib22 Hi. No, I did not use any extra parameters. I modified the forward function in model.py to use just the image.

saqib22 commented 3 years ago

@SuhelNaryal Thank you! I will try to modify the code myself too, because I want inference results from a single camera.

One more thing: are the coordinates you got on your custom image in 3D or 2D?

SuhelNaryal commented 3 years ago

Feel free to ask in case you need any help. Happy to help.

saqib22 commented 3 years ago

One more thing: are the coordinates you got on your custom image in 3D or 2D?

Thanks

SuhelNaryal commented 3 years ago

These are 3D coordinates.

kingsman0000 commented 3 years ago

@saqib22 Hi. No, I did not use any extra parameters. I modified the forward function in model.py to use just the image.

Hi, I want to use my own image dataset like you did. Can you please tell me where and which function you changed in model.py? I want to input and output only one image.

kingsman0000 commented 3 years ago

Do I need an annotation (.json) file as input for my own dataset for visualization?

saqib22 commented 3 years ago

These are 3D coordinates.

@SuhelNaryal Thanks! So are these 3D coordinates in image space or actual 3D (camera) space?

SuhelNaryal commented 3 years ago

@saqib22 Hi. No, I did not use any extra parameters. I modified the forward function in model.py to use just the image.

Hi, I want to use my own image dataset like you did. Can you please tell me where and which function you changed in model.py? I want to input and output only one image.

Hi, in main -> model.py -> class Model -> function forward you can make the following changes.

def forward(self, inputs, targets=None, meta_info=None, mode=None):
    input_img = inputs['img']
    # targets/meta_info only exist when running on the dataset (train/test modes)
    if mode in ['train', 'test']:
        target_joint_coord, target_rel_root_depth, target_hand_type = targets['joint_coord'], targets['rel_root_depth'], targets['hand_type']
        joint_valid, root_valid, hand_type_valid, inv_trans = meta_info['joint_valid'], meta_info['root_valid'], meta_info['hand_type_valid'], meta_info['inv_trans']

    batch_size = input_img.shape[0]
    img_feat = self.backbone_net(input_img)
    joint_heatmap_out, rel_root_depth_out, hand_type = self.pose_net(img_feat)

    if mode == 'train':
        target_joint_heatmap = self.render_gaussian_heatmap(target_joint_coord)

        loss = {}
        loss['joint_heatmap'] = self.joint_heatmap_loss(joint_heatmap_out, target_joint_heatmap, joint_valid)
        loss['rel_root_depth'] = self.rel_root_depth_loss(rel_root_depth_out, target_rel_root_depth, root_valid)
        loss['hand_type'] = self.hand_type_loss(hand_type, target_hand_type, hand_type_valid)
        return loss
    elif mode == 'test':
        out = {}
        # argmax over the 3D heatmap to get (x, y, z) per joint
        val_z, idx_z = torch.max(joint_heatmap_out, 2)
        val_zy, idx_zy = torch.max(val_z, 2)
        val_zyx, joint_x = torch.max(val_zy, 2)
        joint_x = joint_x[:, :, None]
        joint_y = torch.gather(idx_zy, 2, joint_x)
        joint_z = torch.gather(idx_z, 2, joint_y[:, :, :, None].repeat(1, 1, 1, cfg.output_hm_shape[1]))[:, :, 0, :]
        joint_z = torch.gather(joint_z, 2, joint_x)
        joint_coord_out = torch.cat((joint_x, joint_y, joint_z), 2).float()
        out['joint_coord'] = joint_coord_out
        out['rel_root_depth'] = rel_root_depth_out
        out['hand_type'] = hand_type
        out['inv_trans'] = inv_trans
        out['target_joint'] = target_joint_coord
        out['joint_valid'] = joint_valid
        out['hand_type_valid'] = hand_type_valid
        return out
    else:
        # custom-image mode: no targets or meta_info, just return the predictions
        out = {}
        val_z, idx_z = torch.max(joint_heatmap_out, 2)
        val_zy, idx_zy = torch.max(val_z, 2)
        val_zyx, joint_x = torch.max(val_zy, 2)
        joint_x = joint_x[:, :, None]
        joint_y = torch.gather(idx_zy, 2, joint_x)
        joint_z = torch.gather(idx_z, 2, joint_y[:, :, :, None].repeat(1, 1, 1, cfg.output_hm_shape[1]))[:, :, 0, :]
        joint_z = torch.gather(joint_z, 2, joint_x)
        joint_coord_out = torch.cat((joint_x, joint_y, joint_z), 2).float()
        out['joint_coord'] = joint_coord_out
        out['rel_root_depth'] = rel_root_depth_out
        out['hand_type'] = hand_type
        return out

The idea is just to get the outputs from the model. You will get 3D joint coordinates in a 64x64x64 heatmap space; you will then have to map these coordinates into your image space.
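As a concrete sketch of that mapping (the same scaling as in the visualization snippet later in this thread; cfg is the repo's config and out is the dict returned by the modified forward above, so treat the import path as an assumption):

import numpy as np
from config import cfg  # assumes main/config.py is importable

joint_coord_hm = out['joint_coord'][0].cpu().numpy()  # (42, 3) in 64x64x64 heatmap space
joint_coord_img = joint_coord_hm.copy()
# x, y: heatmap cells -> network input pixels (64 -> 256 by default)
joint_coord_img[:, 0] = joint_coord_img[:, 0] / cfg.output_hm_shape[2] * cfg.input_img_shape[1]
joint_coord_img[:, 1] = joint_coord_img[:, 1] / cfg.output_hm_shape[1] * cfg.input_img_shape[0]
# z: heatmap cells -> depth in mm relative to each hand's root joint
joint_coord_img[:, 2] = (joint_coord_img[:, 2] / cfg.output_hm_shape[0] * 2 - 1) * (cfg.bbox_3d_size / 2)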

SuhelNaryal commented 3 years ago

Do I need an annotation (.json) file as input for my own dataset for visualization?

No, you don't need the JSON. Just build the model, load the pretrained weights, and pass the image to the model; you will get the coordinates.
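For reference, a rough sketch of that flow (the checkpoint and image paths are placeholders; the get_model factory, DataParallel wrapping, 'network' checkpoint key, and ToTensor()/255 preprocessing follow the repo's test code, but treat the exact details as assumptions):

import cv2
import torch
import torchvision.transforms as transforms
from config import cfg
from model import get_model  # main/model.py, with the modified forward() above

# build the network and load a released snapshot (path is a placeholder)
model = get_model('test', 21)
model = torch.nn.DataParallel(model).cuda()
ckpt = torch.load('snapshot_20.pth.tar')
model.load_state_dict(ckpt['network'])
model.eval()

# hand-centered crop, resized to the network input size (256x256 by default), BGR -> RGB
img = cv2.imread('my_hand.jpg', cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
img = cv2.resize(img, (cfg.input_img_shape[1], cfg.input_img_shape[0]))
img = img[:, :, ::-1].astype('float32')  # assumed: the repo's loader also converts to RGB
img = transforms.ToTensor()(img) / 255.0
inputs = {'img': img[None, :, :, :].cuda()}

with torch.no_grad():
    out = model(inputs)  # lands in the custom-image branch of the modified forward()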

SuhelNaryal commented 3 years ago

These are 3D coordinates.

@SuhelNaryal Thanks! So are these 3D coordinates in image space or actual 3D (camera) space?

Yes, these coordinates have been mapped to image space. You can map the 64x64x64 heatmap output to any space you desire.

saqib22 commented 3 years ago

@SuhelNaryal Thank you so much for your help, I am going to try this out now!

kingsman0000 commented 3 years ago

Do I need an annotation (.json) file as input for my own dataset for visualization?

No, you don't need the JSON. Just build the model, load the pretrained weights, and pass the image to the model; you will get the coordinates.

@SuhelNaryal Thanks for your reply. What do you mean by "pass the image to the model"? If I want to use my own img_path, where do I need to change it? In dataset.py, if I only change img_path there will be a lot of problems, because the code below it depends on annot_path, annot_subset, and so on, but my folder only has my own images, not a JSON file.

saqib22 commented 3 years ago

@SuhelNaryal I am done with changing the code in model.py as you suggested, and I wrote a custom test data loader for my own images, so I can now run this on them. Can you comment on how you visualized the result on your own image? Any snippet for that too?

I mean, how do you get the meta_info?

SuhelNaryal commented 3 years ago

@saqib22 Hi, you can try the following code.

import numpy as np
import cv2

from config import cfg
from utils.preprocessing import load_skeleton
from utils.vis import vis_keypoints, vis_3d_keypoints
from utils.transforms import world2cam, cam2pixel, pixel2cam

focal = [1500, 1500]  # x-axis, y-axis
princpt = [256/2, 256/2]
root_joint_idx = {'right': 20, 'left': 41}

skeleton = load_skeleton('path_to_skeleton.txt', 42)  # skeleton.txt is in the annotations zip

# 'out' is the dict returned by the modified forward() above
joint_coord_out = out['joint_coord'].cpu().numpy()
rel_root_depth_out = out['rel_root_depth'].cpu().numpy()
hand_type_out = out['hand_type'].cpu().numpy()
preds = {'joint_coord': [], 'rel_root_depth': [], 'hand_type': []}
for i in range(joint_coord_out.shape[0]):
    preds['joint_coord'].append(joint_coord_out[i])
    preds['rel_root_depth'].append(rel_root_depth_out[i])
    preds['hand_type'].append(hand_type_out[i])

# stack the per-sample predictions: (N, 42, 3), (N, 1), (N, 2)
preds = {k: np.stack(v) for k, v in preds.items()}

preds_joint_coord, preds_rel_root_depth, preds_hand_type = preds['joint_coord'], preds['rel_root_depth'], preds['hand_type']
# map from 64x64x64 heatmap space to input-image space (x, y in pixels, z in mm relative to the root)
pred_joint_coord_img = preds_joint_coord[0].copy()
pred_joint_coord_img[:,0] = pred_joint_coord_img[:,0]/cfg.output_hm_shape[2]*cfg.input_img_shape[1]
pred_joint_coord_img[:,1] = pred_joint_coord_img[:,1]/cfg.output_hm_shape[1]*cfg.input_img_shape[0]
pred_joint_coord_img[:,2] = (pred_joint_coord_img[:,2]/cfg.output_hm_shape[0] * 2 - 1) * (cfg.bbox_3d_size/2)

if preds_hand_type[0][0] > 0.9 and preds_hand_type[0][1] > 0.9:  # both hands present when both scores exceed the threshold; tune 0.9 as needed
    pred_rel_root_depth = (preds_rel_root_depth[0]/cfg.output_root_hm_shape * 2 - 1) * (cfg.bbox_3d_size_root/2)

    pred_left_root_img = pred_joint_coord_img[root_joint_idx['left']].copy()
    pred_left_root_img[2] += pred_rel_root_depth
    pred_left_root_cam = pixel2cam(pred_left_root_img[None,:], focal, princpt)[0]

    pred_right_root_img = pred_joint_coord_img[root_joint_idx['right']].copy()
    pred_right_root_cam = pixel2cam(pred_right_root_img[None,:], focal, princpt)[0]

    pred_rel_root = pred_left_root_cam - pred_right_root_cam

pred_joint_coord_cam = pixel2cam(pred_joint_coord_img, focal, princpt)
joint_type = {'right': np.arange(0,21), 'left': np.arange(21,21*2)}
for h in ('right', 'left'):
    # make each hand root-relative
    pred_joint_coord_cam[joint_type[h]] = pred_joint_coord_cam[joint_type[h]] - pred_joint_coord_cam[root_joint_idx[h],None,:]

joint_valid = [1.0]*21 + [1.0]*21  # set a hand's 21 entries to 0.0 if that hand is not present; the right hand comes first in the output
img_path = 'path to image'
cvimg = cv2.imread(img_path, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
_img = cvimg[:,:,::-1].transpose(2,0,1)
vis_kps = pred_joint_coord_img.copy()
vis_valid = joint_valid.copy()
filename = 'out____2d.jpg'
vis_keypoints(_img, vis_kps, vis_valid, skeleton, filename)
filename = 'out____3d.jpg'
vis_3d_keypoints(pred_joint_coord_cam, joint_valid, skeleton, filename)

This is an example of how you can use the outputs on your own image. I believe it is pretty much correct; you may add an absolute root depth to the z coordinates if needed. Please let me know if something goes wrong with the code.
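If you do want absolute camera-space coordinates, here is a hedged continuation of the snippet above (the per-hand root depths are made-up placeholders; in practice they would come from something like RootNet or known camera geometry):

# hypothetical absolute root (wrist) depths in mm, one per hand -- replace with real estimates
abs_root_depth = {'right': 400.0, 'left': 400.0}

pred_joint_coord_abs = pred_joint_coord_img.copy()
for h in ('right', 'left'):
    pred_joint_coord_abs[joint_type[h], 2] += abs_root_depth[h]
# absolute 3D coordinates in camera space (mm)
pred_joint_coord_cam_abs = pixel2cam(pred_joint_coord_abs, focal, princpt)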

kingsman0000 commented 3 years ago

@SuhelNaryal I am done with changing the code in model.py as you suggested, and I wrote a custom test data loader for my own images, so I can now run this on them. Can you comment on how you visualized the result on your own image? Any snippet for that too?

I mean, how do you get the meta_info?

Hi, where did you write your custom test data loader for your custom images? Is it a separate Python file, or did you change somewhere in the author's code? Can you please comment on how you load your custom images, or where you changed it, or share your custom test data loader code? Thank you!

SuhelNaryal commented 3 years ago

@kingsman0000 I have not done any of that. I modified the code to get outputs on my own image. I have not used any meta info or data loader.

kingsman0000 commented 3 years ago

@SuhelNaryal So how about the code you posted about 3 days ago? Is that a separate Python file, or does it need to be added somewhere in the author's code? I tried to add my own path, but it still fails on things like out[...]. Can I use that code to predict and visualize my own image?

@kingsman0000 I have not done any of that. I modified the code to get outputs on my own image. I have not used any meta info or data loader.

SuhelNaryal commented 3 years ago

@kingsman0000 out is the output from the model.

aswa123 commented 3 years ago

[quotes SuhelNaryal's visualization snippet above in full]

Hello sir, may I know where you use this code? I mean, did you make a separate file, or did you add this code to an existing file? I'm trying to use my own dataset. Can you please help me with this? Thank you so much.

SuhelNaryal commented 3 years ago

@aswa123 Hi, create a new file for this code.

saqib22 commented 3 years ago

@SuhelNaryal Hi, I have used the code for visualization but my results are not good (attached image).

SuhelNaryal commented 3 years ago

@saqib22 I am having this issue as well. Do let me know if you find something on this.

saqib22 commented 3 years ago

@SuhelNaryal Didn't you get the right results in the earlier comments?

saqib22 commented 3 years ago

@saqib22 I am having this issue as well. Do let me know if you find something on this.

Sure Thing, Thanks

SuhelNaryal commented 3 years ago

@SuhelNaryal Didn't you get the right results in the earlier comments?

Only on this particular image. I have not done enough testing yet.

saqib22 commented 3 years ago

@SuhelNaryal Didn't you get the right results in the earlier comments?

Only on this particular image. I have not done enough testing yet.

Then I think there must be some other parameters needed along with the RGB image? Can you reopen this issue?

ravitejageeda commented 3 years ago

@SuhelNaryal Can you please post the code showing how you passed an image to the model to get output? Did you create a custom dataset.py to pass your own image? Also, did you create any .json annotations for the image you passed?

SuhelNaryal commented 3 years ago

@ravitejageeda I have shared the code above. No, I have not created any dataset.py or .json. The metadata is used for testing and training. You can just load the model and pass your image to get the results.

saqib22 commented 3 years ago

@SuhelNaryal Hi, I have used the code for visualization but my results are not good (attached image).

@mks0601 I get results like this when I don't use inv_trans, but after using the inv_trans matrix I get the correct results. So how can I calculate this matrix?

I see a function, preprocessing.augmentation(), that requires the bbox along with some other parameters. Is it possible to get this matrix at test time for my own image? Thanks

ravitejageeda commented 3 years ago

@saqib22 @SuhelNaryal I get the following error when I run the above code.

     10 skeleton = load_skeleton('/content/InterHand2.6M/data/InterHand2.6M/annotations/skeleton.txt', 42) # skeleton.txt is in the annotations zip
---> 11 joint_coord_out = out['joint_coord'].cpu().numpy()
     12 rel_root_depth_out = out['rel_root_depth'].cpu().numpy()
     13 hand_type_out = out['hand_type'].cpu().numpy()

NameError: name 'out' is not defined

I can see that out probably comes from the model, but it seems the prediction step is missing from the code.

Can you please let me know what you did to get out for the prediction?

mks0601 commented 3 years ago

@saqib22 Yes. You need to feed the bbox coordinates, and you can keep the default values of the other parameters for the testing mode (https://github.com/facebookresearch/InterHand2.6M/blob/4e950b6465cc4eb4b26811cd0966997a7ab7b5a6/common/utils/preprocessing.py#L78)
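For example, a rough sketch of calling that function at test time to get inv_trans for a custom image; the argument order follows the linked line, but treat the exact signature, the example bbox values, and the dummy joint/hand arrays as assumptions (the repo also expands the bbox with process_bbox() beforehand; see dataset.py for that call):

import numpy as np
import cv2
from config import cfg
from utils.preprocessing import augmentation

img = cv2.imread('my_hand.jpg', cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)[:, :, ::-1].astype(np.float32)
bbox = np.array([100, 120, 200, 200], dtype=np.float32)  # example hand bbox in the original image, [x, y, w, h]

# dummy ground-truth placeholders -- only the cropped patch and inv_trans are needed here
joint_coord = np.zeros((42, 3), dtype=np.float32)
joint_valid = np.zeros((42,), dtype=np.float32)
hand_type = np.array([1, 0], dtype=np.float32)
joint_type = {'right': np.arange(0, 21), 'left': np.arange(21, 42)}

img_patch, joint_coord, joint_valid, hand_type, inv_trans = augmentation(
    img, bbox, joint_coord, joint_valid, hand_type, 'test', joint_type)

# img_patch is the cfg.input_img_shape crop to feed the network; inv_trans is the 2x3 affine
# that maps predicted 2D points in the crop back to the original image, e.g.:
# xy1 = np.concatenate((pred_xy, np.ones_like(pred_xy[:, :1])), 1)
# orig_xy = np.dot(inv_trans, xy1.T).T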

SuhelNaryal commented 3 years ago

[quotes the NameError: name 'out' is not defined traceback above]

@ravitejageeda Yes, out is the output from the model. You need to load the model and get predictions from it. The script showing how to modify model.py is in the comments above.

saqib22 commented 3 years ago

@mks0601 I have modified the code for inv_trans, but I am not sure about the required bbox format; is it (x, y, w, h)? And at what resolution should I detect the hand, (334 x 512) or (256 x 256)? Thanks

saqib22 commented 3 years ago

@mks0601 Also, why do some bboxes look like this? (attached image)

mks0601 commented 3 years ago

Could you tell me which bbox is wrong? The annotation id would be helpful.

saqib22 commented 3 years ago

@mks0601 I was able to run the code on my custom images, but to run a live demo I want to train my own 2D hand detector. My first attempt at using InterHand2.6M as training data doesn't give good results. Can you suggest some datasets to train a robust hand detector?

Thanks

mks0601 commented 3 years ago

Hi, the images in our dataset are captured in a special multi-view environment, which has a very different image appearance compared with daily images. You should use in-the-wild datasets, such as COCO-WholeBody. I'm working on a new project for in-the-wild hand pose estimation and will upload it to arXiv soon. Please stay tuned.

saqib22 commented 3 years ago

@mks0601 Hi, thanks, I will definitely check that out. But I have tested this repo on my own webcam images and it works fine so far, as far as I have seen. Shouldn't I use this repo?

mks0601 commented 3 years ago

Of course you can use this repo, but I think my new work will definitely be better.

Frank-Dz commented 3 years ago

Hi, I figured some things out from the paper and ran the code on my own image. I just wanted to ask how much difference the absolute root depth makes in the results. I got the results below (attached image). Thanks for your help; you have done great work.

Do you have inference code for running on our own images?

mks0601 commented 3 years ago

Use the demo codes.