IcarusWizard / PixelwiseRegression

PyTorch release for paper "Pixel-wise Regression: 3D Hand Pose Estimation via Spatial-form Representation and Differentiable Decoder"
MIT License

Poor results when inferring #9

Open TheoCarme opened 1 year ago

TheoCarme commented 1 year ago

Hello,

I am currently trying to implement a program that uses Pixelwise Regression to estimate the pose of at least two hands on a depth video stream (one frame at a time). I am using the Stereo Labs ZED Mini camera. Since Pixelwise Regression can only estimate the pose of one hand per frame, I begin by using Mediapipe Hands (I know it is overkill, I may change this later) to locate the hands and crop them from the frame. Then I resize the cropped hands to 128x128. Finally, I use this code:

    def estimate(self, img):
        # img: cropped depth image of one hand, resized to 128x128

        # Build the low-resolution label image and its validity mask expected by the model
        label_img = Resize(size=[self.label_size, self.label_size])(img)
        label_img = tr.reshape(label_img, (1, 1, self.label_size, self.label_size))
        mask = tr.where(label_img > 0, 1.0, 0.0)

        img = img.to(self.device, non_blocking=True)
        label_img = label_img.to(self.device, non_blocking=True)
        mask = mask.to(self.device, non_blocking=True)

        # Keep only the outputs of the last stage: heatmaps, depth maps and uvd joints
        self.heatmaps, self.depthmaps, hands_uvd = self.model(img, label_img, mask)[-1]
        hands_uvd = hands_uvd.detach().cpu().numpy()
        self.hands_uvd = hands_uvd

        return hands_uvd

To get this result

After looking at the values of img, label_img and mask while running test_samples.py, I got the impression that, unlike mine, those tensors are normalized, which could be the cause of my poor results. Is my impression right, and if so, can you explain how to apply the same processing to my tensors?

P.S. : I tested with both HAND17 and MSRA pretrained models.

Thank you for your work.

IcarusWizard commented 1 year ago

Hi @TheoCarme,

Thanks for your interest in our work.

You are right, the depth image needs to be normalized with respect to the centre of the cropping box. You can compute the COM as done here, then normalise the image as done here. You may need to tune the cube size to get the best performance; start from the default value of the dataset the model was trained on. Also, be sure to denormalise the output uvd as done here before you draw a figure.
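
For concreteness, here is a minimal sketch of that depth normalization, assuming the cropped depth image is a torch tensor in millimetres; the function name and details below are illustrative, not the repository's exact code:

    import torch

    def normalize_depth_crop(depth_crop, cube_size=150.0):
        # Centre the crop on its COM depth and scale to [-1, 1] within +/- cube_size mm
        fg = depth_crop > 0                           # valid (non-zero) depth pixels
        com_z = depth_crop[fg].mean()                 # centre of mass along the depth axis
        norm = torch.zeros_like(depth_crop)
        norm[fg] = ((depth_crop[fg] - com_z) / cube_size).clamp(-1.0, 1.0)
        return norm, com_z

Keep com_z (and the pixel size of the crop box) around, since the same values are needed later to denormalise the predicted uvd.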

Hope these help.

TheoCarme commented 1 year ago

Thank you for your help.

Can you please give me example values and the shape of cube_size? I searched but could not find where to get the default values. Is it right to say that, in my situation, cube_size should approximately match the size of the area in which my hands are present?

IcarusWizard commented 1 year ago

cube_size is a scalar that represents half the side length of the cropping cube. The default value for the models you are using should be 150 (mm). In my experience, the best value is slightly larger than the size of the hand and depends on how good the segmentation is.
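
For intuition, in a projective cropping scheme like this one the metric cube roughly maps to a pixel window as sketched below; fx (the focal length in pixels) and the exact formula are assumptions on my part, the repository's crop code is authoritative:

    def crop_box_pixels(cube_size, com_z, fx):
        # Approximate pixel width of the crop window for a hand at depth com_z (mm),
        # e.g. a 150 mm half-cube at 600 mm depth with fx = 700 px gives about 350 px.
        return 2 * cube_size * fx / com_z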

TheoCarme commented 1 year ago

I did not fully understand how the code you suggested works. I tried to use it anyway, but did not succeed. I get this error:

    uvd[:, :, :2] = (uvd[:, :, :2] * (box_size - 1)).view(-1, 1, 1)
    RuntimeError: The expanded size of the tensor (1) must match the existing size (42) at non-singleton dimension 0. Target sizes: [1, 21, 2]. Tensor sizes: [42, 1, 1]

Instead, I used this function to normalize my cropped image before resizing it to 128×128. Then I made this function to denormalize the uvd values. In the end I obtain this kind of result.

Do you see any error in what I did?

Thank you for your time.

IcarusWizard commented 1 year ago

Sorry for the confusion.

In the first part, the denormalization code raises an error because box_size has the wrong shape. box_size is supposed to have a shape of [batch_size], which in your case is [1].
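
As a hypothetical illustration of the expected shape (the numbers are examples, not values from your setup): for a single image, box_size should be a one-element tensor so that the (box_size - 1).view(-1, 1, 1) factor broadcasts over the [1, 21, 2] uv slice:

    import torch

    uvd = torch.rand(1, 21, 3)                # placeholder for the model's uvd output
    box_size = torch.tensor([128.0])          # one crop of 128 pixels -> shape [batch_size] == [1]
    scale = (box_size - 1).view(-1, 1, 1)     # shape [1, 1, 1], broadcasts over the [1, 21, 2] slice
    uvd[:, :, :2] = uvd[:, :, :2] * scale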

Please follow the procedure in HandDataset.process_single_data to prepare the input data as well as the normalization parameters that you need for the denormalization function. The norm_img function is from an early stage of the project and is not suited to the trained model.

Are these clear to you?

TheoCarme commented 1 year ago

It is clear, thank you, I will try that.

TheoCarme commented 1 year ago

So I made this function to crop and normalize my images. Now, when I try to process an image with this function, I get this error:

    ValueError: Expected more than 1 spatial element when training, got input size torch.Size([1, 128, 1, 1])

Can you help me understand what the problem is, please?

Also, with a cube_size of 10000, the cropping function takes my images from a shape of about 700×700 down to about 20×20. Is this standard?

IcarusWizard commented 1 year ago

Could you share more information about this error, e.g. which line triggered it? Also, can you check the shapes of all the input tensors?

A cube_size of 10000 is definitely too large, and with a number that large the cropped image shouldn't be so small. There are two explanations in my mind. The first, and more likely one, is that your image contains background pixels, which push your COM too far along the z dimension. The second, less likely one, is that your image is not in millimetres. Could you check these two directions?
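
A couple of quick sanity checks along those lines (an illustrative sketch, assuming your cropped depth is a torch tensor; the exact ranges depend on your setup):

    fg = depth_crop[depth_crop > 0]                           # valid depth pixels of your crop
    print("depth range:", fg.min().item(), fg.max().item())   # for a hand near the camera, a few hundred mm
    print("COM z:", fg.mean().item())                         # should sit near the hand, not the background

If the COM z is dominated by background pixels, masking or thresholding the crop before computing the COM should bring the crop box back to a sensible size.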