Daniil-Osokin / lightweight-human-pose-estimation-3d-demo.pytorch

Real-time 3D multi-person pose estimation demo in PyTorch. OpenVINO backend can be used for fast inference on CPU.
Apache License 2.0

Question About 3D Model #7

Closed rodyt closed 4 years ago

rodyt commented 4 years ago

Hi,

First of all, thanks for creating this github repo.

I understand that you aren't the author of the papers, but you seemed really knowledgeable about the subject so I wanted to ask a few questions regarding 3D pose estimation and this repo:

  1. What is the format of the final 3D output of the model? (the dimensionality of the output tensor, what it represents, etc.)

  2. How is the ground truth data for this model annotated? I've heard some methods generate a 3-dimensional heatmap (x, y, z) and take an L2 loss. Other methods break this problem down into a set of 2D maps.

  3. Many papers mention that the z position is "root relative". I don't quite understand how they measure distance along the z-axis. Is the pelvis at z = 0, with a positive ground-truth z in front of it and a negative z (e.g. -10) behind it? Isn't it better to apply a Gaussian kernel? And what if the pelvis is out of frame? What do we do then?

  4. Research papers measure distance along the z-axis in mm, but along the x- and y-axes they use pixels. How does that work?

Again, thank you for making this repo available and open.

Thanks

Daniil-Osokin commented 4 years ago

Hi, a lot of questions =). 3D coordinates for each person are encoded in root-relative order, so the root has coordinates (0, 0, 0) and the rest are encoded as offsets from the root. For single-person 3D pose estimation this is enough, but in the multi-person case the persons still have to be placed in space. In this work that is done by minimizing the 3D->2D keypoint projection error. Inference with this network looks like this: run the model and get the outputs of the 2D and 3D branches. Use the 2D output to find 2D keypoints and group them by person (i.e. do 2D multi-person pose estimation). Read 3D coordinates from the 3D branch output at the found 2D keypoint locations to obtain root-relative 3D poses. Minimize the 3D->2D projection error to find the root position in 3D space and add this offset to the rest of the 3D keypoints.
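
A rough sketch of that flow (not the repo's actual code; `features_3d`, `keypoints_2d`, the stride and the intrinsics `K` are assumed names for illustration):

```python
import numpy as np
from scipy.optimize import least_squares

def read_root_relative_pose(features_3d, keypoints_2d, stride=8):
    """Read regressed root-relative 3D coordinates at the 2D keypoint locations.

    features_3d: (3 * num_kpts, H, W) output of the 3D branch
    keypoints_2d: (num_kpts, 2) pixel coordinates found by the 2D branch
    """
    num_kpts = keypoints_2d.shape[0]
    pose_3d = np.zeros((num_kpts, 3), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints_2d):
        fx, fy = int(x // stride), int(y // stride)
        pose_3d[k] = features_3d[3 * k:3 * k + 3, fy, fx]
    return pose_3d  # root-relative, in mm

def find_root_translation(pose_3d, keypoints_2d, K):
    """Find the root position by minimizing the 3D->2D reprojection error."""
    def residuals(t):
        pts = pose_3d + t                    # place the pose at a candidate root position
        proj = (K @ pts.T).T                 # pinhole projection with intrinsics K
        proj = proj[:, :2] / proj[:, 2:3]
        return (proj - keypoints_2d).ravel()

    t0 = np.array([0.0, 0.0, 3000.0])        # rough initial guess: ~3 m from the camera
    return least_squares(residuals, t0).x

# absolute_pose = pose_3d + find_root_translation(pose_3d, keypoints_2d, K)
```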

  1. The 3D branch outputs 8-times-downsampled feature maps with regressed 3D coordinates in root-relative order. The 3D coordinates are read at the locations of the found 2D keypoints.
  2. Yes, the ground truth is a set of 3-dimensional (x, y, z) heatmaps for each keypoint type, and an L2 loss is used for training.
  3. If the root is outside the image, then that person's keypoints are skipped. The usual accuracy metric is the mean per joint position error (MPJPE), measured in mm; it is just the average L2 distance between the ground-truth and predicted keypoints (a short sketch follows this list).
  4. I have not seen a pixel-based error metric used for 3D estimation.
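
A minimal sketch of the MPJPE metric mentioned in point 3, assuming `pred` and `gt` are (num_joints, 3) arrays in mm:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average L2 distance (mm) between
    predicted and ground-truth 3D joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))
```
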
rodyt commented 4 years ago

Hello,

Thank you for the detailed answer.

So, to confirm what you said (all of this is in the context of single-person pose estimation):

  1. Root relative keypoint gold annotations would look something like:

Root (0, 0, 0), Elbow (-10, 50, 30) = 10 pixels left, 50 pixels below, 30 mm behind (or is it in front?). (I assume pixels and mm are directly convertible to each other through some sort of conversion ratio.)

  2. The ground truth heatmap is a 3D tensor over (x, y, z). There is a 3D Gaussian around the (x, y, z) location of a joint. (Shouldn't this be very memory intensive?)

For example, the output 3D heatmap might be a 64 x 64 x 64 tensor. The joint location falls into one of the 64 x locations, one of the 64 y locations, and one of the 64 z locations, with the root joint at the center z location (the 32nd). Is this correct?

But if the input to the model is a 2D RGB image, how do you extend the dimensionality so that the output is 3D?

  3. Is the L2 loss a per-pixel loss between the predicted and ground-truth 3D (x, y, z) heatmaps, or the Euclidean distance between the predicted and ground-truth (x, y, z) joint coordinates?

Thank you so much for your response

Daniil-Osokin commented 4 years ago

3D coordinate regression is performed in metric units, e.g. mm (there are approaches with volume prediction like the 64x64x64 one you mentioned, but this is not one of them). So the regression target for a particular keypoint type is one feature map with x coordinates placed at the location of the 2D joint, one feature map for y, and one for z. Each such feature map contains the coordinates of all keypoints of that particular type (e.g. all x coordinates for all necks), placed at the locations of the 2D joints. You can check the code: there are no heatmaps for 3D coordinate prediction.
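
A minimal sketch of how such a regression target could be assembled (an illustration with assumed names, not the repo's training code):

```python
import numpy as np

def build_3d_regression_target(poses_3d, keypoints_2d, num_kpts, map_h, map_w, stride=8):
    """poses_3d: (num_persons, num_kpts, 3) root-relative coordinates in mm
    keypoints_2d: (num_persons, num_kpts, 2) pixel coordinates of the 2D joints"""
    target = np.zeros((3 * num_kpts, map_h, map_w), dtype=np.float32)
    for pose_3d, kpts_2d in zip(poses_3d, keypoints_2d):
        for k in range(num_kpts):
            fx = int(kpts_2d[k, 0] // stride)
            fy = int(kpts_2d[k, 1] // stride)
            # x, y, z of every person's keypoint k go into the same three maps,
            # written at that person's 2D joint location
            target[3 * k:3 * k + 3, fy, fx] = pose_3d[k]
    return target
```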

Daniil-Osokin commented 4 years ago

Hope it helped.

maylad31 commented 4 years ago

"read 3D coordinates from 3D branch output at found 2D keypoints locations to obtain root-relative 3D poses. Minimize 3D->2D projection error to find root position in 3D space and add this offset to rest 3D keypoints. " Can you please elaborate more?

Daniil-Osokin commented 4 years ago

What are you interested in?

maylad31 commented 4 years ago

Hello, sorry for the late reply, I was not well. I am using your code to experiment with a setup of two cameras with overlapping views, and I am trying to match a person seen in one camera to the same person in the other. What I want to do is map the (x, y, z) obtained with your code from one camera to the (x, y, z) from the other camera. Mapping (x, y) to (x, y) directly is not possible, since a ray contains many points, so I thought I could use the 3D pose coordinates for the mapping.

Daniil-Osokin commented 4 years ago

Here the coordinates are returned in camera space. They are then mapped to world space using the camera extrinsics. So to map between cameras you can apply the translation between them and rotate by the difference in orientation.
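
A minimal sketch of that mapping, assuming the relative rotation `R_ab` and translation `t_ab` between camera A and camera B are known from calibration:

```python
import numpy as np

def camera_a_to_camera_b(points_a, R_ab, t_ab):
    """Map (N, 3) points from camera A's space to camera B's space:
    p_b = R_ab @ p_a + t_ab for each point."""
    return points_a @ R_ab.T + t_ab
```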

maylad31 commented 4 years ago

Thanks

Fan-loewe commented 3 years ago

Hi Daniil,

I just have a question about the downsampling rate.

  • The 3D branch outputs 8-times-downsampled feature maps with regressed 3D coordinates in root-relative order. The 3D coordinates are read at the locations of the found 2D keypoints.

[screenshot: pipeline figure from the ORPM paper]

However, according to the pipeline in the ORPM paper, the 3D branch outputs 4-times-downsampled feature maps. Did I miss something here?

Thank you :)

Daniil-Osokin commented 3 years ago

Hi, Fan! Yeah, in the original paper the downsampling rate is 4 for the 3D branch. But we used the same downsampling rate of 8 as the 2D branch for faster inference.

Fan-loewe commented 3 years ago

Hi Daniil, I got it. Thank you!