facebookresearch / InterWild

Official PyTorch implementation of "Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild", CVPR 2023

How to filter out bad result? #17

Open Javacr opened 8 months ago

Javacr commented 8 months ago

Hello, Moon! I don't know how to filter out bad results: as long as I give InterWild a hand bbox, it always returns keypoints. I tried using the confidence provided by the hand detector, but it wasn't reliable (as shown in the figures below). I wanted to use a confidence for joint_img, but InterWild doesn't provide one. Could you give me some advice?

image

image

image

mks0601 commented 8 months ago

Hi, could you use this one? https://github.com/facebookresearch/InterWild/blob/70d3eca941270f103cefec8907b9b62590321ed6/demo/demo.py#L112

Javacr commented 8 months ago

I've actually tried hand_bbox_conf, but it's not reliable. In fig 1 the palm is occluded, yet InterWild gives a high confidence; in fig 3 the palm is clearly visible, yet the confidence is low. I've decided to add a parallel network that predicts 2D hand joints with per-joint confidences, so that I can judge the quality of the 3D results.

mks0601 commented 8 months ago

Actually, you can do the same things as here (https://github.com/facebookresearch/InterWild/blob/70d3eca941270f103cefec8907b9b62590321ed6/common/nets/module.py#L225) to here (https://github.com/facebookresearch/InterWild/blob/70d3eca941270f103cefec8907b9b62590321ed6/common/nets/module.py#L31) to get joint confidence. Could you try?

Javacr commented 8 months ago

Thank you for your suggestion. I tried it today.

module.py

class PositionNet(nn.Module):
    # 2.5D joint
    def __init__(self):
        super(PositionNet, self).__init__()
        self.joint_num = mano.sh_joint_num
        self.depth_dim = cfg.output_hand_hm_shape[0]
        self.conv = make_conv_layers([2048, self.joint_num*self.depth_dim], kernel=1, stride=1, padding=0, bnrelu_final=False)

    def get_conf(self, hand_hm, rhand_joint, lhand_joint):
        # hand_hm: Size([2, 21, 8, 8, 8]); index 0 is the right hand, index 1 the left hand
        # rhand_joint, lhand_joint: Size([21, 3])
        batch_size, joint_num, depth, height, width = hand_hm.shape
        # softmax over all voxels so that each joint's heatmap sums to 1
        hand_hm = hand_hm.view(batch_size, joint_num, depth*height*width)
        hand_hm = F.softmax(hand_hm, 2)
        hand_hm = hand_hm.view(batch_size, joint_num, depth, height, width)
        # sample each joint's heatmap at its predicted (x, y, z) location
        rjoint_conf = sample_joint_features_3d(hand_hm[0], rhand_joint)
        ljoint_conf = sample_joint_features_3d(hand_hm[1], lhand_joint)
        return rjoint_conf, ljoint_conf

    def forward(self, hand_feat):
        hand_hm = self.conv(hand_feat)
        _, _, height, width = hand_hm.shape
        hand_hm = hand_hm.view(-1,self.joint_num,self.depth_dim,height,width)
        # hand_coord: Size([2, 21, 3])
        hand_coord = soft_argmax_3d(hand_hm)
        rhand_joint, lhand_joint = hand_coord[0,:,:], hand_coord[1,:,:]
        rjoint_conf, ljoint_conf = self.get_conf(hand_hm, rhand_joint, lhand_joint)
        return hand_coord, rjoint_conf, ljoint_conf

transforms.py

def sample_joint_features_3d(heatmap3d, joint_xyz):
    # heatmap3d: Size([21, 8, 8, 8]) -> Size([21, 1, 8, 8, 8]) (each joint becomes one batch item)
    # joint_xyz: Size([21, 3]) -> Size([21, 1, 3])
    heatmap3d = heatmap3d[:,None,:,:,:]
    joint_xyz = joint_xyz[:,None,:]
    depth, height, width = heatmap3d.shape[2:]
    # normalize voxel coordinates to [-1, 1]
    x = joint_xyz[:,:,0] / (width-1) * 2 - 1
    y = joint_xyz[:,:,1] / (height-1) * 2 - 1
    z = joint_xyz[:,:,2] / (depth-1) * 2 - 1

    # grid: Size([21, 1, 1, 1, 3]), last dimension ordered as (x, y, z)
    grid = torch.stack((x,y,z),2)[:,:,None,None,:]
    # F.grid_sample, 5-D case: input is (N, C, D_in, H_in, W_in), grid is (N, D_out, H_out, W_out, 3)
    # the sampled output is Size([21, 1, 1, 1, 1]); the indexing below gives Size([21, 1, 1, 1])
    confidence_joint = F.grid_sample(heatmap3d, grid, align_corners=True)[:,:,:,0]
    return confidence_joint
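
The sampling step above can be checked by hand on a tiny volume. The following is a minimal pure-Python sketch of what trilinear sampling at a fractional voxel coordinate computes, assuming the `align_corners=True` coordinate convention and zero padding outside the volume. `to_normalized` and `trilinear_sample` are illustrative helpers written for this example; they are not part of InterWild or PyTorch.

```python
import math

def to_normalized(coord, size):
    """Map a voxel index in [0, size-1] to [-1, 1] (align_corners=True convention)."""
    return coord / (size - 1) * 2 - 1

def trilinear_sample(volume, x, y, z):
    """Sample volume[z][y][x] at fractional voxel coordinates with trilinear weights."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
    value = 0.0
    # Visit the 8 voxels surrounding (x, y, z).
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                xi, yi, zi = x0 + dx, y0 + dy, z0 + dz
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi)) * (1 - abs(z - zi))
                # Neighbours outside the volume contribute zero (zero padding).
                if 0 <= xi < width and 0 <= yi < height and 0 <= zi < depth:
                    value += weight * volume[zi][yi][xi]
    return value

# A 2x2x2 volume with a single non-zero voxel at (x, y, z) = (1, 1, 1).
volume = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
at_voxel = trilinear_sample(volume, 1, 1, 1)          # exactly on the voxel
at_centre = trilinear_sample(volume, 0.5, 0.5, 0.5)   # centre of the cube
```

Sampling exactly on a voxel returns that voxel's value, while sampling between voxels blends the neighbours, so a joint predicted between heatmap bins receives a weighted average of the surrounding values.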

Here are some results (all joint confidences are drawn on the images). Both hands are clear in the left image but blurry in the right image.

case 1: The mean confidences are higher than 0.08 in the left image and lower than 0.08 in the right image, which is reasonable. image

case 2: The mean confidence is higher than 0.08 in the left image, which is reasonable. However, the mean confidence of the right hand in the right image is 0.1, which is too high. image

case 3: Same as case 2. image
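
To make the 0.08 figure concrete: with a softmax over 8*8*8 = 512 voxels, a completely flat heatmap gives every voxel a value of 1/512, so confidences stay small even for sharp peaks. A minimal sketch of a mean-confidence filter follows, assuming per-joint confidences are already available as plain floats. The function names are hypothetical, and the 0.08 default is only the value observed in the cases above, not a tuned or recommended threshold.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_confidence(joint_confs):
    """Average the per-joint confidences of one hand."""
    return sum(joint_confs) / len(joint_confs)

def is_reliable(joint_confs, threshold=0.08):
    """Keep a hand only if its mean joint confidence reaches the threshold."""
    return mean_confidence(joint_confs) >= threshold

# A flat heatmap over 512 voxels: every voxel gets probability 1/512.
flat = softmax([0.0] * 512)

# 21 joints per hand, as in the shapes noted in the code comments above.
clear_hand = [0.1] * 21
blurry_hand = [0.05] * 21
```

Here `is_reliable(clear_hand)` is true and `is_reliable(blurry_hand)` is false. As cases 2 and 3 show, a single threshold on the softmax value can still let blurry hands through.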

Javacr commented 8 months ago

I removed the softmax when computing the confidence, which means I use the raw value of hand_hm as the confidence. Values smaller than 0 are replaced with 0. The new results seem reasonable. image image image
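
One possible reason the raw values separate the cases better (an assumption, not verified against InterWild) is that softmax is invariant to adding a constant to every logit, so it discards the overall activation level that the raw heatmap keeps. A minimal pure-Python sketch with made-up numbers; `clamped_value` is a hypothetical helper that mirrors the "replace negatives with 0" rule described above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clamped_value(logits, index):
    """Raw heatmap value at `index`, with negative values replaced by 0."""
    return max(logits[index], 0.0)

# Two toy heatmaps with the same shape; `strong` is `weak` shifted up by 3.
weak = [0.0, 1.0, 0.0, 0.0]
strong = [v + 3.0 for v in weak]

same_after_softmax = all(
    abs(a - b) < 1e-12 for a, b in zip(softmax(weak), softmax(strong))
)
raw_weak = clamped_value(weak, 1)      # 1.0
raw_strong = clamped_value(strong, 1)  # 4.0
```

The two heatmaps are indistinguishable after softmax, while the clamped raw values differ.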

yanqi1811 commented 8 months ago

Hi, Javacr! Is your hand detection from the InterWild model, or did you retrain a hand detection model? I have tested the InterWild model and found that the detection results are not good when the whole body is in the image. Thanks!

Javacr commented 8 months ago

Could you paste some results? The hand detector is from InterWild, and it performs well in most cases. Its shortcoming is that it predicts two hands even if only one hand is in the image, so I use the wrist joints from ViTPose to filter out the wrong bbox.
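
The exact wrist-based rule is not spelled out here, so the following is only a sketch of one way it could work: keep a hand bbox only when a sufficiently confident wrist keypoint from the body pose estimator falls inside the (slightly enlarged) box. All names, the dictionary layout, and both thresholds are assumptions made for illustration.

```python
def point_in_box(point, box, margin=0.0):
    """Check whether (x, y) lies inside box = (x_min, y_min, x_max, y_max),
    enlarged on every side by `margin` times the box width/height."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    pad_x = (x_max - x_min) * margin
    pad_y = (y_max - y_min) * margin
    return (x_min - pad_x <= x <= x_max + pad_x) and (y_min - pad_y <= y <= y_max + pad_y)

def filter_hand_boxes(hand_boxes, wrists, min_wrist_conf=0.5, margin=0.2):
    """hand_boxes: {'right': box, 'left': box}
    wrists: {'right': (x, y, conf), 'left': (x, y, conf)} from a body pose estimator.
    Returns only the boxes supported by a confident wrist keypoint."""
    kept = {}
    for side, box in hand_boxes.items():
        wrist = wrists.get(side)
        if wrist is None:
            continue
        x, y, conf = wrist
        if conf >= min_wrist_conf and point_in_box((x, y), box, margin):
            kept[side] = box
    return kept

# Example: the left wrist is not visible, so its keypoint has low confidence.
hand_boxes = {"right": (100, 100, 200, 200), "left": (300, 100, 400, 200)}
wrists = {"right": (150, 195, 0.9), "left": (350, 190, 0.1)}
kept = filter_hand_boxes(hand_boxes, wrists)
```

In this example only the right-hand bbox survives, because the left wrist keypoint falls below the confidence threshold.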

yanqi1811 commented 8 months ago

Like these. The image resolution is 1920*1080, which I think is very large; maybe I should run a body detector first. When I crop the body image, it works well. Thank you!

20231102174143 20231102174724 20231102180356

Javacr commented 8 months ago

As Moon said, "Put input images at images. The image should be a cropped image, which contain a single human. For example, using a human detector. We have a hand detection network, so no worry about the hand positions!" So a human detector is important.
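
A minimal sketch of the cropping step, assuming a person bbox in pixel coordinates is already available from some human detector (which detector is not specified in this thread). `expand_and_clamp` is a hypothetical helper: it adds a margin around the person so that outstretched hands are not cut off, then clamps the result to the image bounds.

```python
def expand_and_clamp(box, image_width, image_height, margin=0.1):
    """Enlarge box = (x_min, y_min, x_max, y_max) by `margin` of its size on
    each side and clamp it to the image. Returns integer (left, top, right, bottom)."""
    x_min, y_min, x_max, y_max = box
    pad_x = round((x_max - x_min) * margin)
    pad_y = round((y_max - y_min) * margin)
    left = max(0, int(x_min) - pad_x)
    top = max(0, int(y_min) - pad_y)
    right = min(image_width, int(x_max) + pad_x)
    bottom = min(image_height, int(y_max) + pad_y)
    return left, top, right, bottom

# Person boxes inside a 1920*1080 frame.
inner = expand_and_clamp((100, 50, 300, 450), 1920, 1080)
near_edge = expand_and_clamp((1800, 900, 1950, 1100), 1920, 1080)
```

With an image stored as a height-by-width array, the crop would then be `image[top:bottom, left:right]`, and that cropped image is what gets passed to InterWild.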

yanqi1811 commented 8 months ago

Thank you!