Owen-Liuyuxuan / ros2_vision_inference

Unified multi-threaded inference nodes for monocular 3D object detection, depth prediction and semantic segmentation

The input size of dla34_deform_576_768.onnx is not really 576x768 #4

Closed mamadouDembele closed 3 weeks ago

mamadouDembele commented 3 weeks ago

Hi, first of all, thank you for providing the pre-trained models for 3D detection and segmentation. When loading the model dla34_deform_576_768.onnx in netron, the displayed input size is 384x1280. Do you have a dla34_deform model with an input size of 576x768?
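
For reference, here is a minimal sketch of how the declared input shape of a downloaded ONNX file can be checked without netron, using the `onnx` Python package (the file path is only an example):

```python
import onnx

# Load the downloaded model and print the declared shape of each graph input.
# The path below is an example; point it at wherever the .onnx file was saved.
model = onnx.load("dla34_deform_576_768.onnx")
for inp in model.graph.input:
    dims = [d.dim_value if d.dim_value > 0 else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
```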

Best

Owen-Liuyuxuan commented 3 weeks ago

Oh sorry, actually there is. There are multiple 576x768 and 384x1280 models on the servers... I made mistakes when manually tagging their names.

Owen-Liuyuxuan commented 3 weeks ago

https://github.com/Owen-Liuyuxuan/ros2_vision_inference/releases/tag/v1.1.1 (It's also available in the new README)

I added both models. Very sorry for making this mistake.

mamadouDembele commented 3 weeks ago

Thank you very much for your response. I have started testing the 3D detection models from the README on a sample KITTI image. The results are as follows:

yolox_3d

dla34_576_768_3d

dla34_384_1280

As you can see, I've tested the three models from the README: mono3d_yolox_576_768.onnx, dla34_deform_384_1280.onnx and dla34_deform_576_768.onnx. For each model, I show the 2D detection results and the 3D detection results (with the top view and the associated ground truth). The dla34_deform_x_y.onnx models seem better than mono3d_yolox_576_768.onnx, and dla34_deform_384_1280.onnx is more accurate than dla34_deform_576_768.onnx. This is likely because the aspect ratio of the KITTI input image (H/W = 375/1242) is quite close to that of the 384x1280 model input (384/1280).
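
As a quick sanity check on the aspect-ratio argument, the numbers can be compared directly (values taken from the image and model sizes above; this is just illustrative arithmetic):

```python
# Aspect ratios (height / width) of the KITTI sample image and the two model inputs.
kitti_ratio  = 375 / 1242   # ~0.302
dla_384_1280 = 384 / 1280   # 0.300
dla_576_768  = 576 / 768    # 0.750

print(f"KITTI image:           {kitti_ratio:.3f}")
print(f"dla34_deform_384_1280: {dla_384_1280:.3f}")
print(f"dla34_deform_576_768:  {dla_576_768:.3f}")
```

The 384x1280 input matches the KITTI aspect ratio almost exactly, while 576x768 presumably requires more aggressive resizing or padding.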

In the coming days, I plan to test the models on images from our autonomous shuttles and compare the results with the lidar segmentation. I'll share the results if I can.

Best

Owen-Liuyuxuan commented 3 weeks ago

Great, that is what I expected.

  1. The 384/1280 models are mainly tailored for KITTI datasets.
  2. DLA34 is a significantly larger backbone than YOLOX.

So the result makes sense to me.