NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

question about advantage of PnP algorithm #116

Closed huckl3b3rry87 closed 4 years ago

huckl3b3rry87 commented 4 years ago

I am wondering, is there an advantage of using the PnP algorithm over training it to output the 6D pose directly?

Thank you!

TontonTremblay commented 4 years ago

That is a hard question to answer. I have no proof, but here are my two cents.

Neural nets are universal function approximators, so there is no real fundamental problem with regressing to the 6D pose.

DOPE regresses to keypoints, which are represented as 2D maps on the image plane. You can think of it as moving from RGB image space to a 2D heatmap space. The relationship between the two is quite natural: both live in image space, convolutions can easily perform the transformation, and the correlation with texture is very high.

When regressing to the 6D pose directly, you move to a space that is more complicated than the 2D keypoint space we had. The camera viewpoint matters, rotation expressed as a matrix or quaternion is not intuitive, and translation is a function of the object's apparent size and its position in the image. These properties have to be learned by fully connected layers, since convolutions alone won't be enough. Moreover, learning the pose directly bakes the camera intrinsics into the weights.

To me it is more natural to think about this problem using PnP, leveraging tools that are stable and that we know how to optimize. That does not mean you should not explore and research regressing to the pose directly; I am excited to learn about that work and to have my opinion changed.
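
To make the PnP step concrete, here is a minimal sketch of how predicted 2D keypoints and the known 3D cuboid corners can be combined to recover a pose with OpenCV's solver. The keypoint and corner arrays below are placeholders, not DOPE's actual post-processing code; note that the intrinsics only appear here, at inference time.

```python
# Minimal sketch (not the actual DOPE post-processing): recover a 6D pose
# from predicted 2D keypoints with OpenCV's PnP solver.
import cv2
import numpy as np

# Hypothetical 3D cuboid corners in the object frame (meters), one per keypoint.
object_points = np.array([
    [-0.05, -0.05, -0.05], [ 0.05, -0.05, -0.05],
    [ 0.05,  0.05, -0.05], [-0.05,  0.05, -0.05],
    [-0.05, -0.05,  0.05], [ 0.05, -0.05,  0.05],
    [ 0.05,  0.05,  0.05], [-0.05,  0.05,  0.05],
], dtype=np.float64)

# Hypothetical 2D keypoints (pixels) predicted by the network for one object.
image_points = np.array([
    [320.0, 240.0], [400.0, 238.0], [405.0, 300.0], [322.0, 305.0],
    [330.0, 230.0], [410.0, 228.0], [415.0, 290.0], [332.0, 295.0],
], dtype=np.float64)

# Camera intrinsics are supplied here, at inference time, not learned.
camera_matrix = np.array([[615.0,   0.0, 320.0],
                          [  0.0, 615.0, 240.0],
                          [  0.0,   0.0,   1.0]])
dist_coeffs = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs)
R, _ = cv2.Rodrigues(rvec)  # rotation matrix; tvec is the translation
```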

huckl3b3rry87 commented 4 years ago

@TontonTremblay thank you very much for your detailed response!

I was thinking that an issue with estimating the pose directly was that the camera parameters would be baked in, but in DenseFusion, which uses RGB and depth to estimate the pose directly, the camera parameters are adjustable. I just asked a question here to help clarify this.

TontonTremblay commented 4 years ago

Very good question. I am sure that slight variations in the intrinsics won't have a big impact on the predictions. To be honest, it would be interesting to measure this.

huckl3b3rry87 commented 4 years ago

Thanks, I am interested in seeing what @j96w has to say about this

zhanghui-hunan commented 4 years ago

Dear Tremblay! I would like to know: PnP asks for objPoints, which are the 3D coordinates (x, y, z) in the object frame corresponding to the 2D keypoints, but config.yaml only contains the cuboid dimensions. Could you tell me how to get the objPoints for PnP, and what the relationship is between objPoints and the cuboid dimensions? Thank you! @TontonTremblay
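
For context, here is my current understanding as a rough sketch, assuming the object frame sits at the cuboid center and that the dimensions in config.yaml are the cuboid's width, height, and depth; is this the right way to build the objPoints?

```python
# Rough sketch (my assumption): build the 8 cuboid corners used as objPoints
# for PnP from the cuboid dimensions given in config.yaml, with the object
# frame at the cuboid center.
import numpy as np

def cuboid_corners(width, height, depth):
    w, h, d = width / 2.0, height / 2.0, depth / 2.0
    return np.array([
        [ w,  h,  d], [-w,  h,  d], [-w, -h,  d], [ w, -h,  d],
        [ w,  h, -d], [-w,  h, -d], [-w, -h, -d], [ w, -h, -d],
    ])

# Example with made-up dimensions (same units as config.yaml):
obj_points = cuboid_corners(16.4, 21.3, 7.2)
# If the cuboid centroid is also used as a keypoint, it is simply [0, 0, 0]
# in this frame.
```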

mintar commented 4 years ago

I was thinking that an issue with estimating the pose directly was that the camera parameters would be baked in, but in DenseFusion, which used RGB and depth to estimate pose directly, the camera parameters are adjustable. I just asked a question here to help clarify this.

I've read the DenseFusion paper and skimmed the code, and from what I can tell, the camera intrinsics you can supply during inference are only used to project the depth image to a point cloud and back onto the image plane. I still suspect that the camera intrinsics get baked into the network weights during training, and that training on one camera and doing inference on another will give inferior results. I'd also love to get feedback from @j96w on this. DenseFusion is a really cool project, and I'm definitely going to give it a go soon.

I would have to do some experiments to verify that, and in fact that would be easy to do: I've collected a large-scale dataset (the YCB-M dataset) with multiple different RGB-D cameras, so I could just take the pretrained models supplied by LineMod and test what happens on my dataset. Maybe I'll do that once I get back to the office in a couple of months.

I've used the YCB-M dataset to test DOPE (I have an ICRA 2020 paper about this), and DOPE had no problem coping with the different cameras. The reason is that the camera intrinsics don't factor into training the network at all; you supply them during inference.

It is precisely because of the PnP algorithm that DOPE can do this. I have some experience running PoseCNN on different cameras, and it simply does not work: PoseCNN directly regresses to the pose, so the intrinsics are baked in. You have to retrain PoseCNN with the camera you're going to use for inference, and you cannot reuse the weights at all.
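
Just to illustrate: since the intrinsics only enter at the PnP stage, you can build the camera matrix at inference time from whatever camera you are actually running, e.g. from its camera_info topic, and the trained weights are never involved. A rough sketch (the field layout is the standard sensor_msgs/CameraInfo one):

```python
# Rough sketch: take the intrinsics of the camera used at inference time
# from its sensor_msgs/CameraInfo message and hand them to the PnP solver;
# the network weights stay untouched when you switch cameras.
import numpy as np

def intrinsics_from_camera_info(camera_info):
    # CameraInfo.K is the 3x3 intrinsic matrix stored row-major;
    # CameraInfo.D holds the distortion coefficients.
    camera_matrix = np.array(camera_info.K, dtype=np.float64).reshape(3, 3)
    dist_coeffs = np.array(camera_info.D, dtype=np.float64)
    return camera_matrix, dist_coeffs

# Swapping cameras means swapping this message, not retraining:
# camera_matrix, dist_coeffs = intrinsics_from_camera_info(msg)
# cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
```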

TontonTremblay commented 4 years ago

I was not sure if your dataset was available yet, which is why I did not mention it. But I thought for sure it would be the best dataset for exploring this problem and designing solutions.

huckl3b3rry87 commented 4 years ago

FYI @TontonTremblay and @mintar: @j96w responded at https://github.com/j96w/DenseFusion/issues/150

Please close this issue if you guys think that it is appropriate

TontonTremblay commented 4 years ago

Thank you. I will close it; I am not very good at doing that on this repo, as you might have noticed.