Daniil-Osokin / lightweight-human-pose-estimation-3d-demo.pytorch

Real-time 3D multi-person pose estimation demo in PyTorch. OpenVINO backend can be used for fast inference on CPU.
Apache License 2.0

Re-training/Fine tuning #17

Closed ankitbansal811 closed 4 years ago

ankitbansal811 commented 4 years ago

How can I retrain the model on a custom image dataset, or pick up the existing weights and fine-tune them?

Daniil-Osokin commented 4 years ago

Hi, we did not release training code due to time constraints. I think the easiest way is to take the neighboring repository and add a 3D keypoints estimation branch to it.

ankitbansal811 commented 4 years ago

Hi, do you have any plans to add a training module to this repo (or as a separate repo for 3D pose detection)?

Also, I couldn't really understand the post-processing of features to get the 3D pose. Especially:

  1. Refining keypoint coordinates at the corresponding limb locations
  2. The part where you translate the coordinates based on the ratio of the mean coordinates from the 2D and 3D poses

If you could share some reference materials for the above, it would be great. Thanks!

Daniil-Osokin commented 4 years ago

No plans for now. You can check the paper. Basically, 3D root-relative coordinates are predicted at the spatial locations of the 2D coordinates. The root position in 3D (the translation vector) is found as the one that minimizes the 3D-to-2D coordinate projection error. Actually, you may ask to release the training code here.
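
For illustration only, here is a rough NumPy sketch of that idea (not this repository's actual post-processing; the function and parameter names are assumptions): the perspective projection equations are linearized in the translation vector and solved with least squares.

```python
import numpy as np

def estimate_root_translation(keypoints_3d, keypoints_2d, fx, fy, cx, cy):
    """Find a translation t that minimizes the 3D-to-2D reprojection error.

    Hedged sketch, not the repository's exact code. Uses the linearized
    perspective equations u = fx * (X + tx) / (Z + tz) + cx (and the same
    for v), rewritten as a linear system A @ t = b and solved by least squares.

    keypoints_3d: (N, 3) root-relative 3D joints, keypoints_2d: (N, 2) pixels.
    """
    X, Y, Z = keypoints_3d[:, 0], keypoints_3d[:, 1], keypoints_3d[:, 2]
    u, v = keypoints_2d[:, 0], keypoints_2d[:, 1]

    # fx * (X + tx) = (u - cx) * (Z + tz)  ->  fx*tx - (u - cx)*tz = (u - cx)*Z - fx*X
    rows_u = np.stack([np.full_like(X, fx), np.zeros_like(X), -(u - cx)], axis=1)
    rows_v = np.stack([np.zeros_like(Y), np.full_like(Y, fy), -(v - cy)], axis=1)
    A = np.concatenate([rows_u, rows_v], axis=0)
    b = np.concatenate([(u - cx) * Z - fx * X, (v - cy) * Z - fy * Y], axis=0)

    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t  # (tx, ty, tz)
```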

KarlosMuradyan commented 4 years ago

Hi @Daniil-Osokin,

Firstly, thanks for the repository! I'm exploring it and trying to write code for training procedure. I have two questions regarding it:

  1. As I understood from the paper, which references the VNect paper, the loss function used for the X location map of joint j is:
Loss(X_j) = \lVert H^{GT}_j  \odot (X_j - X^{GT}_j) \rVert

where:

  * \lVert \cdot \rVert is the L2 loss
  * \odot is the Hadamard (element-wise) product
  * H^{GT}_j is the ground-truth confidence map for joint j
  * X_j is the X location map of joint j output by the model
  * X^{GT}_j is the ground-truth X location map of joint j

Everything is clear to me except how to generate the ground truth for the X location map of the j-th joint. Assume we have only one person in the image, at location [a, b] of the already scaled image. I suppose that the ground-truth X location map of joint j should store, at coordinate [a, b], the distance from the person's root joint to that joint. In that case, what are the other values in that matrix?

  2. I didn't see a Contributing.md file in your repo. Does this mean that you are not open to contributions to the repository? I would like to work on the training code and, after a successful implementation, integrate it into your repository. Is that possible?

Best regards, Karlos

Daniil-Osokin commented 4 years ago

Hi, if we talk about one person in a particular image, the other values are copies of this value (the root-relative X coordinate) at the positions of the other joints. This is the redundancy in the 3D coordinates encoding which the paper refers to. So, for example, the 3D left wrist coordinate is stored at the 2D location of the left wrist, at the 2D location of the left elbow, at the 2D location of the left shoulder, and at the 2D location of the root joint (neck in this repository, pelvis in the paper). This helps to read pose coordinates more robustly in case of occlusions: if the wrist is occluded, we can infer its coordinate at the elbow location.
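
For illustration, a rough NumPy sketch of how such a ground-truth X location map could be filled for one person (the function name and the `redundant_locations` mapping are assumptions, not this repository's code):

```python
import numpy as np

def fill_gt_x_location_maps(map_shape, joints_2d, joints_3d_root_relative,
                            redundant_locations):
    """Rough sketch of ground-truth X location map construction (not repo code).

    map_shape: (H, W) of the network output.
    joints_2d: dict joint_name -> (x, y) in output-map coordinates.
    joints_3d_root_relative: dict joint_name -> (X, Y, Z) relative to the root.
    redundant_locations: dict joint_name -> list of joint names whose 2D
        positions also store this joint's coordinate, e.g. for the left wrist:
        ['left_wrist', 'left_elbow', 'left_shoulder', 'neck'].
    """
    gt_maps = {name: np.zeros(map_shape, dtype=np.float32) for name in joints_2d}
    for joint, locations in redundant_locations.items():
        x_root_relative = joints_3d_root_relative[joint][0]
        for loc_joint in locations:
            px, py = np.round(joints_2d[loc_joint]).astype(int)
            if 0 <= py < map_shape[0] and 0 <= px < map_shape[1]:
                # the same root-relative X value is written at the joint's own
                # 2D location and at all of its redundant readout locations
                gt_maps[joint][py, px] = x_root_relative
    return gt_maps
```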

Contributions are very welcome! However, it will require some time and effort :)

KarlosMuradyan commented 4 years ago

Thank you for your response!

I took one of the illustrations from Figure 4 of the paper to confirm what I understood from the paper and from your comment. In this illustration, there are two people: one on the left side and the other in the middle of the picture. Considering the X ORPM of the left elbow only (the ORPM illustrated in the front), I understood that:

  1. The 5 highlighted joints of the person in the middle of the picture have the same value and store the x coordinate of that person's left elbow. At the same time, the 5 highlighted joints of the person on the left side of the picture also share a single value, but they store that person's left elbow's x coordinate. Thus we have 10 highlighted joints, 5 of which store the x coordinate of the first person's left elbow, while the other 5 store the second person's left elbow's x coordinate.
  2. Even though in the illustration there are 5 positions highlighted for each person (3 for the limb, one for the neck, one for the pelvis), in the repo only the neck is used, so only 4 are needed: 3 for the limb and 1 for the neck.
  3. As mentioned in the paper, "a per-pixel L2 loss is enforced in the neighborhood of all possible readout locations. The loss is weighted by a limited support Gaussian centered at the read-out location." Suppose [a, b] is the position of a readout location. It is reasonable to claim that [a-1, b-1] is inside the Gaussian support (we still consider the value stored at that coordinate when computing the loss). What would be the value of the ground-truth X ORPM of the left elbow at coordinate [a-1, b-1]? Is it just a copy of the value stored at coordinate [a, b]?

[Figure 4 from the paper: multiple levels of selective redundancy in the Occlusion-robust Pose Map (ORPM)]

About contribution: yes, I know that it requires time and effort, but I'm ready for that. Although I can't promise anything, I will do everything to successfully implement the training procedure and contribute it to the repo. I'm really interested in this area of research and want to dive deeper :)

Daniil-Osokin commented 4 years ago

Hi! 1 - right. 2 - correct. We do not use the pelvis, because our base model, which detects 2D keypoints, was trained without the pelvis. 3 - yes, it is still the same value within the Gaussian neighborhood. This trick increases the number of positives and thus balances the positives-to-negatives ratio. However, the loss now has weights, so it penalizes less for errors farther from the exact keypoint location (or redundant location). It can be implemented as a separate mask with Gaussians centered at the keypoint locations. I think these questions are worth a separate issue. We can have a call if needed.
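
For example, a minimal PyTorch sketch of such a Gaussian-weighted masked loss could look like this (function names and the sigma value are assumptions, not this repository's code):

```python
import torch

def gaussian_weight_mask(map_shape, readout_locations, sigma=2.0):
    """Mask of limited-support Gaussians centered at all allowed readout
    locations (joint position + redundant positions). Sketch only."""
    h, w = map_shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    mask = torch.zeros(h, w)
    for px, py in readout_locations:
        g = torch.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        g[g < 0.01] = 0.0  # limit the Gaussian support
        mask = torch.maximum(mask, g)
    return mask

def weighted_location_map_loss(pred_map, gt_map, weight_mask):
    """Per-pixel L2 loss, down-weighted away from the readout locations."""
    return torch.sum(weight_mask * (pred_map - gt_map) ** 2)
```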

KarlosMuradyan commented 4 years ago

@Daniil-Osokin, do you want me to create another issue where we can discuss the implementation details if more questions like these arise? Thanks for offering to have a call if needed; at this moment, there is no need.

Daniil-Osokin commented 4 years ago

Yes, you got it right.

Daniil-Osokin commented 4 years ago

So, closing.

guker commented 4 years ago

> Hi, if we talk about one person in a particular image, the other values are copies of this value (the root-relative X coordinate) at the positions of the other joints. This is the redundancy in the 3D coordinates encoding which the paper refers to. So, for example, the 3D left wrist coordinate is stored at the 2D location of the left wrist, at the 2D location of the left elbow, at the 2D location of the left shoulder, and at the 2D location of the root joint (neck in this repository, pelvis in the paper). This helps to read pose coordinates more robustly in case of occlusions: if the wrist is occluded, we can infer its coordinate at the elbow location.
>
> Contributions are very welcome! However, it will require some time and effort :)

  1. Does the location map X_j of joint j have the same value (the root-relative X coordinate) at the other positions of the location map X_j? @Daniil-Osokin

  2. Does the bone length need a loss?

Daniil-Osokin commented 4 years ago

Hi, yes to both questions. For joint j, the location map for the x coordinate, X_j, will have the same value at each allowed location (at this joint's own position + the redundant positions) for the same person. And the loss needs to be computed at all of these locations. This is so that the joint's coordinate can be read at a redundant location (at the left elbow, left shoulder, or neck for the left wrist joint) in case it cannot be read at its own location due to occlusion or truncation.
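
As a rough sketch of the readout side (assumed helper names, not this repository's code): the joint's coordinate is read from its own 2D location when visible, and from the first visible redundant location otherwise.

```python
import numpy as np

def read_joint_x(x_location_map, joints_2d, visibility, readout_order):
    """Read the root-relative X coordinate of a joint from its location map.

    readout_order: joint names in preferred order, e.g. for the left wrist
    ['left_wrist', 'left_elbow', 'left_shoulder', 'neck']. Falls back to a
    redundant location if the preferred one is occluded. Sketch only.
    """
    for name in readout_order:
        if visibility.get(name, False):
            px, py = np.round(joints_2d[name]).astype(int)
            return x_location_map[py, px]
    return None  # the joint could not be read at any allowed location
```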