ApolloAuto / apollo

An open autonomous driving platform
Apache License 2.0

How does the YOLO 3D box detection from image work? #5819

Open alexbuyval opened 5 years ago

alexbuyval commented 5 years ago

I have some questions regarding 3D box detection from camera image by YOLO.

First of all, there is an inconsistency between the documentation and the code. According to here, the YOLO net outputs 2D rectangles in the image frame and separate code converts 2D to 3D. However, as I see in the code, the YOLO detector outputs 3D bounding boxes directly.

As I understand it, the documentation is a bit outdated. Right?

Are there any papers about such YOLO 3D net?

I wonder whether it is possible to use Apollo's YOLO net with another camera (different focal length, resolution, and so on) and get the same 3D detection accuracy. In other words, does the network generalize across sensors?

Also, I don't understand what the anchors mean.
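From other YOLO write-ups, my rough understanding is that anchors are predefined box-shape priors that the net regresses offsets against, something like the sketch below, though I'm not sure this matches Apollo's net:

```python
import math

def decode_box(tx, ty, tw, th, anchor_w, anchor_h, cell_x, cell_y):
    # Generic YOLO-style decode (my understanding, may differ from Apollo's net):
    # the network predicts offsets relative to a grid cell and a predefined
    # anchor shape rather than absolute box sizes.
    cx = cell_x + 1.0 / (1.0 + math.exp(-tx))  # sigmoid keeps the center in its cell
    cy = cell_y + 1.0 / (1.0 + math.exp(-ty))
    w = anchor_w * math.exp(tw)  # anchors provide the size prior
    h = anchor_h * math.exp(th)
    return cx, cy, w, h

# With zero offsets, the decoded box is just the anchor centered in the cell.
box = decode_box(0.0, 0.0, 0.0, 0.0, anchor_w=3.0, anchor_h=2.0, cell_x=5, cell_y=7)
```

Is that roughly how the anchors here are used?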

Thank you for any advice and information!

lianglia-apollo commented 5 years ago

There can be some lag in documentation updates. Please refer to the actual code when there is a conflict.

vladpaunescu commented 5 years ago

Hi! I'm also interested in the YOLO 3D structure. I know it's called darknet-16c-16x-3d, and I found the deploy.prototxt structure. I would like to understand the meaning of the output layers:

These should be related to standard YOLO 2D box:

What are these?

I don't know the meaning of the last two of them.

Could you shed some light please?

And what dataset did you use to train 3D object detection?

Vlad

vladpaunescu commented 5 years ago

Hi! Could somebody shed some light on the dataset used for training the angle prediction, and on how you regressed the angle, given that the angle interval is circular on [0, 2*pi)?
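The only standard workaround I know of for the circularity is to regress (sin, cos) of the angle and recover it with atan2; something like this sketch, though I don't know whether Apollo does it this way:

```python
import math

def encode_angle(theta):
    # Represent the circular angle as a point on the unit circle, which is
    # continuous across the 0 / 2*pi wrap-around.
    return math.sin(theta), math.cos(theta)

def decode_angle(s, c):
    # atan2 recovers the angle in (-pi, pi]; shift it back into [0, 2*pi).
    return math.atan2(s, c) % (2 * math.pi)

# Angles near the wrap-around map to nearby (sin, cos) targets, so an L2 loss
# on (sin, cos) behaves well where a direct L2 loss on theta would report a
# large spurious error between e.g. 0.01 and 2*pi - 0.01.
theta = 6.2
recovered = decode_angle(*encode_angle(theta))
```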

Many thanks, Vlad

lucasjinreal commented 5 years ago

@vladpaunescu @alexbuyval There is a paper that discusses this: link

Avps1 commented 4 years ago


Hi Mr. Buyval,

You have stated in your issue that "the YOLO detector outputs 3D bounding box directly.", as can be found in "code". Unfortunately, the hyperlink to the code no longer works. It would be of great help if you could tell me the path, or at least the name, of the code you were referring to.

Also, have you found a fitting answer to your query? Kind regards.

alexbuyval commented 4 years ago

Hi @Avps1

You can find the yolo detector here now.

Also, have you found a fitting answer to your query?

Apollo's YOLO detector outputs 2D detections, the 3D dimensions of obstacles, and the yaw orientation. Based on these data, the Transformer calculates the 3D position of the object in the map frame.
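Very roughly, the 2D-to-3D position step can be sketched like this (a simplified geometric illustration with made-up intrinsics, not the actual Transformer code; it assumes a flat ground plane and a known camera height):

```python
import numpy as np

def position_from_box(u_bottom, v_bottom, K, cam_height):
    """Back-project the 2D box bottom-center onto the ground plane.

    u_bottom, v_bottom: pixel coordinates of the 2D box bottom-center.
    K: 3x3 camera intrinsics matrix.
    cam_height: camera height above the ground in meters (y axis points down).
    """
    # Ray direction in the camera frame for this pixel.
    ray = np.linalg.inv(K) @ np.array([u_bottom, v_bottom, 1.0])
    # Intersect the ray with the ground plane y = cam_height.
    scale = cam_height / ray[1]
    return ray * scale  # 3D point (x, y, z) in the camera frame

# Hypothetical intrinsics for illustration only.
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
p = position_from_box(960.0, 700.0, K, cam_height=1.5)
# p[2] is the forward distance to the obstacle's ground contact point.
```

The real pipeline also uses the regressed dimensions and yaw to refine this, but the ray-plane intersection is the basic idea.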

I hope this is helpful for you.

Regards, Alex

Avps1 commented 4 years ago


Thank you so much for your timely response!

mertmerci commented 3 years ago

Hello Mr. Buyval,

As I see, you asked about the YOLO 3D network used in Apollo's object detection task, which outputs not only the 2D bounding box but also the 3D bounding-box dimensions and the yaw angle. Did you find any papers or other relevant publications about the network?

lucasjinreal commented 3 years ago

@mertmerci Try some SOTA models other than Yolo3D, such as SMOKE, R3DNet, etc.; they use camera images and predict 3D information about obstacles.

The training code is also open source and trained on KITTI, and only a monocular camera is needed for inference. However, for production, the issue is that you will need to prepare your own data.

mertmerci commented 3 years ago

@jinfagang Thank you for your quick response. The suggestions you made are really on point, so thank you for the support. I'm checking "Self-supervised Spatiotemporal Learning via Video Clip Order Prediction" for R3DNet. Do you have any other suggestions for 3D BB regression using only monocular camera images?

Thank you for your support, once again.