Javacr opened this issue 8 months ago (Open)
Hi, could you use this one? https://github.com/facebookresearch/InterWild/blob/70d3eca941270f103cefec8907b9b62590321ed6/demo/demo.py#L112
I've actually tried hand_bbox_conf, but it's not reliable. In fig. 1 the palm is occluded, yet InterWild gives a high confidence; in fig. 3 the palm is clearly visible, yet it gives a low confidence. I've decided to add a parallel network that predicts 2D hand joints together with per-joint confidences, so that I can judge the quality of the 3D results.
Actually, you can apply the same thing as here (https://github.com/facebookresearch/InterWild/blob/70d3eca941270f103cefec8907b9b62590321ed6/common/nets/module.py#L225) to here (https://github.com/facebookresearch/InterWild/blob/70d3eca941270f103cefec8907b9b62590321ed6/common/nets/module.py#L31) to get joint confidences. Could you try?
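If I understand the suggestion, the idea is to softmax the 3D heatmap over its flattened volume and read off the peak probability per joint as a confidence. A minimal standalone sketch with toy tensors (my own illustration, not the actual InterWild modules):

```python
import torch
import torch.nn.functional as F

def heatmap_joint_confidence(heatmap3d):
    """Per-joint confidence from a raw 3D heatmap.

    heatmap3d: (batch, joint_num, depth, height, width) logits.
    Returns:   (batch, joint_num) peak probability after softmax over
               the flattened volume, a rough proxy for localization confidence.
    """
    batch, joints = heatmap3d.shape[:2]
    prob = F.softmax(heatmap3d.view(batch, joints, -1), dim=2)
    return prob.max(dim=2).values

# a sharp peak yields higher confidence than a flat (uncertain) map
sharp = torch.zeros(1, 1, 8, 8, 8)
sharp[0, 0, 4, 4, 4] = 10.0
flat = torch.zeros(1, 1, 8, 8, 8)
assert heatmap_joint_confidence(sharp) > heatmap_joint_confidence(flat)
```

A perfectly flat 8×8×8 map gives 1/512 everywhere, so anything near that value means the network is guessing.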
Thank you for your suggestion, I have tried it today.

module.py
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# repo-local helpers from InterWild (unchanged), e.g.:
# from config import cfg
# from nets.layer import make_conv_layers
# from utils.mano import mano
# from utils.transforms import soft_argmax_3d, sample_joint_features_3d

class PositionNet(nn.Module):
    # 2.5D joint
    def __init__(self):
        super(PositionNet, self).__init__()
        self.joint_num = mano.sh_joint_num
        self.depth_dim = cfg.output_hand_hm_shape[0]
        self.conv = make_conv_layers([2048, self.joint_num * self.depth_dim],
                                     kernel=1, stride=1, padding=0, bnrelu_final=False)

    def get_conf(self, hand_hm, rhand_joint, lhand_joint):
        # rhand_joint: Size([21, 3])
        batch_size, joint_num, depth, height, width = hand_hm.shape
        hand_hm = hand_hm.view(batch_size, joint_num, depth * height * width)
        hand_hm = F.softmax(hand_hm, 2)
        hand_hm = hand_hm.view(batch_size, joint_num, depth, height, width)
        # hand_hm: Size([2, 21, 8, 8, 8])
        rjoint_conf = sample_joint_features_3d(hand_hm[0, ::], rhand_joint)
        ljoint_conf = sample_joint_features_3d(hand_hm[1, ::], lhand_joint)
        return rjoint_conf, ljoint_conf

    def forward(self, hand_feat):
        hand_hm = self.conv(hand_feat)
        _, _, height, width = hand_hm.shape
        hand_hm = hand_hm.view(-1, self.joint_num, self.depth_dim, height, width)
        # hand_coord: Size([2, 21, 3])
        hand_coord = soft_argmax_3d(hand_hm)
        rhand_joint, lhand_joint = hand_coord[0, :, :], hand_coord[1, :, :]
        rjoint_conf, ljoint_conf = self.get_conf(hand_hm, rhand_joint, lhand_joint)
        return hand_coord, rjoint_conf, ljoint_conf
```
transforms.py
```python
def sample_joint_features_3d(heatmap3d, joint_xyz):
    # heatmap3d: Size([21, 8, 8, 8]) -> Size([21, 1, 8, 8, 8])
    # joint_xyz: Size([21, 3]) -> Size([21, 1, 3])
    # grid: Size([21, 1, 1, 1, 3])
    heatmap3d = heatmap3d[:, None, :, :, :]
    joint_xyz = joint_xyz[:, None, :]
    depth, height, width = heatmap3d.shape[2:]
    # normalize to [-1, 1]
    x = joint_xyz[:, :, 0] / (width - 1) * 2 - 1
    y = joint_xyz[:, :, 1] / (height - 1) * 2 - 1
    z = joint_xyz[:, :, 2] / (depth - 1) * 2 - 1
    grid = torch.stack((x, y, z), 2)[:, :, None, None, :]
    # F.grid_sample (5-D case):
    #   input: (N, C, D_in, H_in, W_in)
    #   grid:  (N, D_out, H_out, W_out, 3)
    confidence_joint = F.grid_sample(heatmap3d, grid, align_corners=True)[:, :, :, 0]  # Size([21, 1, 1, 1])
    return confidence_joint
```
I give some results (all joint confidences are shown on the images); both hands are clear in the left image but blurry in the right image.
Case 1: the mean confidences are higher than 0.08 in the left image and lower than 0.08 in the right image, which is reasonable.
Case 2: the mean confidence is higher than 0.08 in the left image, which is reasonable; however, the mean confidence of the right hand in the right image is 0.1, which is too high.
Case 3: same as case 2.
I removed the softmax when computing the confidence, i.e. I use the raw value of hand_hm as the confidence; when the value is smaller than 0, I use 0 instead. The new results look reasonable.
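For reference, the no-softmax variant described above can be sketched like this (a simplified nearest-voxel version of my change; the actual code keeps the grid_sample lookup, and `raw_heatmap_confidence` is just an illustrative name):

```python
import torch

def raw_heatmap_confidence(heatmap3d, joint_xyz):
    """Confidence = raw heatmap value at the (rounded) joint location,
    with negative activations clamped to zero.

    heatmap3d: (joint_num, depth, height, width) raw network output
    joint_xyz: (joint_num, 3) joint coordinates as (x, y, z) in
               heatmap-voxel units
    """
    idx = joint_xyz.round().long()
    j = torch.arange(heatmap3d.shape[0])
    vals = heatmap3d[j, idx[:, 2], idx[:, 1], idx[:, 0]]  # index as (z, y, x)
    return torch.clamp(vals, min=0.0)
```

Skipping the softmax means the confidence is no longer forced to sum to 1 over the volume, so a blurry input that produces a weak, spread-out heatmap stays low instead of being renormalized upward.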
Hi, Javacr! Are your hand detections from the InterWild model, or have you retrained a hand detection model? I have tested the InterWild model and found the detection results are not good when the whole body is in the image. Thanks!
Could you paste some results? The hand detector is from InterWild; it performs well in most cases. Its shortcoming is that it will predict two hands even if only one hand is in the image, so I use the wrist joint from ViTPose to filter out the wrong bbox.
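The wrist-based filtering can be sketched roughly as follows (a simplified version; `filter_hand_bboxes`, the bbox/keypoint formats, and the margin value are my own illustration, not an InterWild or ViTPose API):

```python
import numpy as np

def filter_hand_bboxes(bboxes, wrists, margin=0.25):
    """Keep each detected hand bbox only if its matching wrist keypoint
    (e.g. from ViTPose) lies inside the box expanded by `margin`.

    bboxes: (N, 4) as (x_min, y_min, x_max, y_max) in image pixels
    wrists: (N, 2) wrist (x, y) for the same hands, in image pixels
    Returns a boolean mask over the N bboxes.
    """
    bboxes = np.asarray(bboxes, dtype=float)
    wrists = np.asarray(wrists, dtype=float)
    w = bboxes[:, 2] - bboxes[:, 0]
    h = bboxes[:, 3] - bboxes[:, 1]
    inside_x = (wrists[:, 0] >= bboxes[:, 0] - margin * w) & \
               (wrists[:, 0] <= bboxes[:, 2] + margin * w)
    inside_y = (wrists[:, 1] >= bboxes[:, 1] - margin * h) & \
               (wrists[:, 1] <= bboxes[:, 3] + margin * h)
    return inside_x & inside_y
```

A spurious second-hand detection has no wrist keypoint nearby, so its bbox fails the test and gets dropped.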
Like these. These images' resolution is 1920×1080; I think this resolution is very large, so maybe I should run a body detection first. When I crop the body image, it works well. Thank you!
As Moon said, "Put input images at `images`. The image should be a cropped image, which contains a single human, for example using a human detector. We have a hand detection network, so no worry about the hand positions!" A human detector is important.
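So the pre-processing step is just: run any person detector, then crop before calling InterWild. A rough sketch of the crop step (the detector call itself is omitted; the bbox format (x_min, y_min, x_max, y_max), the padding ratio, and `crop_person` are my own assumptions):

```python
import numpy as np

def crop_person(image, bbox, pad_ratio=0.1):
    """Crop a single-person region before feeding it to InterWild.

    image: (H, W, 3) array
    bbox:  (x_min, y_min, x_max, y_max) from any person detector
           (e.g. YOLO or Faster R-CNN)
    pad_ratio expands the box slightly so hands near the body edge
    are not cut off; the crop is clamped to the image bounds.
    """
    h, w = image.shape[:2]
    x_min, y_min, x_max, y_max = bbox
    pad_x = (x_max - x_min) * pad_ratio
    pad_y = (y_max - y_min) * pad_ratio
    x0 = max(int(x_min - pad_x), 0)
    y0 = max(int(y_min - pad_y), 0)
    x1 = min(int(x_max + pad_x), w)
    y1 = min(int(y_max + pad_y), h)
    return image[y0:y1, x0:x1]
```

This keeps the person at a consistent scale for the hand detection network, which is what fails on raw 1920×1080 full-body frames.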
Thank you!
Hello, Moon! I don't know how to filter out the bad results, because as long as I give InterWild a hand bbox, it always returns the keypoints. I tried using the confidence provided by the hand detector, but it wasn't reliable (as shown in the following figs). I wanted to use the confidence of joint_img, but InterWild doesn't provide it. So, can you give me some advice?
![image](https://github.com/facebookresearch/InterWild/assets/44012990/e2af7d48-d49f-4c7b-b6bb-7a9b1b5d3113)