Banconxuan / RTM3D

The official PyTorch Implementation of RTM3D and KM3D for Monocular 3D Object Detection
MIT License

About the robustness and portability of monocular 3D models #10

Open WSTao opened 3 years ago

WSTao commented 3 years ago

Monocular 3D detection depends on the camera parameters. If you switch to a different camera or a different installation method, a model trained on the original dataset will not work. How can you solve this difference?

Banconxuan commented 3 years ago

You just need to make sure the calib matrix is formatted correctly; the parameters can vary from camera to camera. We verified this on the nuScenes dataset: we used the DLA-34 model from the Model Zoo (trained only on the KITTI dataset) and obtained the results shown below without changing any parameters.

Banconxuan commented 3 years ago

(attached results: 1_image, 1_bev)

Banconxuan commented 3 years ago

We format the calib of nuScenes as:

P0: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
P1: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
P2: 1.252813102119e+03 0.000000000000e+00 8.265881147814e+02 0.000000000000e+00 0.000000000000e+00 1.252813102119e+03 4.699846626225e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P3: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
R0_rect: 1.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00
Tr_velo_to_cam: 1.122025939680e-02 -9.998986137987e-01 -8.767434198194e-03 -7.022340992421e-03 5.464515701519e-02 9.368031550067e-03 -9.984618905094e-01 -3.515059821513e-01 9.984427938514e-01 1.072390359095e-02 5.474472849433e-02 -7.332408994883e-01
Tr_imu_to_velo: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
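
For anyone adapting their own camera, here is a minimal sketch (not part of this repo; function and file names are illustrative) of writing a KITTI-style calib file from a 3x3 intrinsic matrix K. Tr_velo_to_cam is zero-filled here because it is only needed when projecting LiDAR points, not for camera-only inference.

```python
import numpy as np

def write_kitti_calib(path, K):
    """Write a KITTI-style calib file whose P2 is built from a 3x3 intrinsic matrix K.

    P0, P1, P3 and Tr_imu_to_velo are zero-filled and R0_rect is identity, as in the
    nuScenes calib above; Tr_velo_to_cam is zero-filled for camera-only inference.
    """
    fmt = lambda vals: " ".join("%.12e" % v for v in vals)
    zeros12 = fmt([0.0] * 12)
    P2 = np.hstack([K, np.zeros((3, 1))])  # 3x4 projection matrix with zero translation
    with open(path, "w") as f:
        f.write("P0: %s\n" % zeros12)
        f.write("P1: %s\n" % zeros12)
        f.write("P2: %s\n" % fmt(P2.flatten()))
        f.write("P3: %s\n" % zeros12)
        f.write("R0_rect: %s\n" % fmt(np.eye(3).flatten()))
        f.write("Tr_velo_to_cam: %s\n" % zeros12)
        f.write("Tr_imu_to_velo: %s\n" % zeros12)

# Example with the nuScenes front-camera intrinsics quoted above
K = np.array([[1252.813102119, 0.0, 826.5881147814],
              [0.0, 1252.813102119, 469.9846626225],
              [0.0, 0.0, 1.0]])
write_kitti_calib("nuscenes_calib_000000.txt", K)
```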

WSTao commented 3 years ago

Thank you very much. In other words, if you use a different camera or a different installation location (so that the camera's intrinsic and extrinsic parameters change), do you need to remake the training dataset?

Banconxuan commented 3 years ago

Yes. For now, you can only align your data to KITTI's format.

WSTao commented 3 years ago

Ok, thanks!

cch2016 commented 3 years ago

In KM3D, you reformulate the geometric constraints as a differentiable version used during training. I wonder whether KM3D easily overfits to the training data's camera parameters, although it does seem to work well on nuScenes data. Compared to RTM3D, is KM3D's generalization to other datasets worse? Did you make a comparison?

walzimmer commented 3 years ago

@cch2016 I have tried to run the pretrained model on my own camera images:

(screenshot attached)

The cars can only be detected at close range (10-20 m) because the image is cropped internally (screenshot attached).

I have used the calibration data (projection matrix) from the KITTI dataset (calib/000000.txt).

@Banconxuan Did you use the calibration data (projection matrix) from KITTI or from NuScenes when doing inference on this image?

(image attached)

cch2016 commented 3 years ago

@walzimmer You should use the projection matrix from your own dataset. Its generalization is pretty good.

a43992899 commented 3 years ago

Hi @walzimmer, did you get the intended result on your custom dataset? I am currently working on my own custom dataset, with cameras mounted at a higher angle. Also, @cch2016 @Banconxuan, do I need to crop my images to the same size as the KITTI images? The calib parameters I need to change are P2, R0_rect, and Tr_velo_to_cam, and I should set the other parameters to zero. Is that correct?

athus1990 commented 3 years ago

So, if I understand correctly, all you need to change is P2. If you look at the testing/inference code, only P2 is read.
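
A minimal sketch, assuming a KITTI-style calib file, of how P2 could be read and pointed at your own camera's calibration instead; the parsing below is illustrative and may differ from the repo's actual loader.

```python
import numpy as np

def load_P2(calib_path):
    """Parse the 3x4 P2 projection matrix from a KITTI-style calib file."""
    with open(calib_path) as f:
        for line in f:
            if line.startswith("P2:"):
                vals = [float(x) for x in line.split()[1:]]
                return np.array(vals, dtype=np.float32).reshape(3, 4)
    raise ValueError("no P2 entry in %s" % calib_path)

# e.g. point this at a calib file built for your own camera instead of KITTI's
P2 = load_P2("calib/000000.txt")
print(P2)
```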

gujiaqivadin commented 3 years ago

Hello, guys! I am also doing some work on the generalization of mono3D methods, and I wonder how the network can be robust to the camera intrinsics. The depth of an instance will differ across cameras and datasets depending on the camera intrinsics, so I think depth estimation will fail (the other 3D box attributes may be fine).
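
To make the concern concrete, a small pinhole-model illustration (the numbers below are made up): an object of real height H at depth z appears with pixel height h ≈ f·H/z, so the same pixel height implies different depths for different focal lengths f.

```python
# Pinhole-model illustration with made-up numbers: the same apparent (pixel)
# height implies different depths once the focal length changes.
H = 1.5    # assumed real object height in meters
h = 50.0   # assumed apparent height in pixels
for f in (721.5, 1252.8):        # roughly KITTI-like vs. nuScenes-like focal lengths
    z = f * H / h                # from the pinhole relation h = f * H / z
    print("f = %6.1f px  ->  z = %.1f m" % (f, z))
```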

KinkangLiu commented 3 years ago

Brothers, I used some images from nuScenes for inference and modified P2 (the camera intrinsics). The results show that the model detects the objects well, but the positions have a large deviation. Is it impossible to infer object positions with a different camera?

Looking at the code, I can see that the network directly outputs the location of each target. Doesn't the model need the intrinsic parameters to recover an object's position from an image? If not, how can the model obtain object positions from images taken by different cameras?

If I use a camera with a different focal length, can the model infer accurate object positions?

If you know the answer, I hope you can give me some advice. Thank you very much!

(result images attached)
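
For reference, a hedged sketch of how a predicted image-plane center plus depth would be back-projected with the camera intrinsics under a pinhole model; this illustrates why P2 matters for position, and is not the repo's actual decoding code.

```python
import numpy as np

def backproject(u, v, z, P2):
    """Back-project image point (u, v) at depth z into camera coordinates,
    assuming a pinhole model and a KITTI-style 3x4 projection matrix P2."""
    fx, fy = P2[0, 0], P2[1, 1]
    cx, cy = P2[0, 2], P2[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Hypothetical example: the same pixel and depth land at different 3D positions
# under two different (KITTI-like vs. nuScenes-like) projection matrices.
P2_kitti_like = np.array([[721.5, 0.0, 609.6, 0.0],
                          [0.0, 721.5, 172.9, 0.0],
                          [0.0, 0.0, 1.0, 0.0]])
P2_nusc_like = np.array([[1252.8, 0.0, 826.6, 0.0],
                         [0.0, 1252.8, 470.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0]])
print(backproject(700.0, 200.0, 20.0, P2_kitti_like))
print(backproject(700.0, 200.0, 20.0, P2_nusc_like))
```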

Ocean-627 commented 2 years ago

Could you please offer some details on how to train KM3D on the nuScenes dataset to obtain the result in the paper (AP = 15.3)? Thank you.