abhi1kumar / DEVIANT

[ECCV 2022] Official PyTorch Code of DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection
https://arxiv.org/abs/2207.10758
MIT License
203 stars 29 forks

Testing on Rope3D Dataset #29

Closed Aangss closed 7 months ago

Aangss commented 8 months ago

I converted the Rope3D dataset to KITTI format and tried to test the model with the KITTI pre-trained weights you provided, but the results are not very satisfactory. A description of my converted Rope3D data:

  1. Original image resolution: 1920×1080; validation resolution set to 960×512
  2. calib:

     ```
     P0: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
     P1: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
     P2: 2.173379882812e+03 0.000000000000e+00 9.618704833984e+02 0.000000000000e+00 0.000000000000e+00 2.322043945312e+03 5.883443603516e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
     P3: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
     R0_rect: 1.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00
     Tr_velo_to_cam: 1.994594966642e-03 -9.998606204387e-01 1.657520384002e-02 -1.115697257486e-01 -2.372202408477e-01 -1.657520384002e-02 -9.713144706380e-01 6.538036584690e+00 9.714538501993e-01 -1.994594966642e-03 -2.372202408477e-01 1.596758475422e+00
     Tr_imu_to_velo: 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00
     ```
  3. Visualisation of validation results (image attached)
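For reference, a minimal sketch of how a calib file like the one above can be parsed into the 3×4 P2 matrix (the key names follow the standard KITTI calib layout; this is an illustration, not code from the DEVIANT repo):

```python
import numpy as np

def read_kitti_calib(path):
    """Parse a KITTI-style calib file into the 3x4 P2 projection matrix."""
    mats = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, vals = line.split(":", 1)
            mats[key.strip()] = np.array([float(v) for v in vals.split()])
    # P matrices are stored row-major as 12 floats -> reshape to 3x4
    return mats["P2"].reshape(3, 4)

# With no skew, the intrinsics sit at fixed positions:
# fx = P2[0, 0], fy = P2[1, 1], cx = P2[0, 2], cy = P2[1, 2]
```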

So I have the following thoughts

  1. For KITTI, the data is acquired with the camera on the acquisition vehicle roughly parallel to the ground. In Rope3D, the camera is mounted on roadside traffic-light poles and is not parallel to the ground. So does the geometric projection prior used by traditional monocular 3D object detectors fail to apply to roadside datasets like Rope3D?
  2. In addition, when I validated on my private dataset, I only provided the P2 intrinsics; I don't know how to combine the rotation and translation matrices of the extrinsics (I suspect this is a gap in my knowledge). I couldn't find an existing question and answer on this, so I am asking here, and I hope you can answer it.
  3. For the next step, I would like to use Rope3D for training. Thank you very much for your outstanding contribution and activity. Salute!!!
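To make point 1 concrete, the pinhole projection that these geometric priors rest on can be sketched as follows (the P2 values are rounded from the calib above; the 3D point is hypothetical):

```python
import numpy as np

# P2 rounded from the calib above (fx, fy, cx, cy; zero translation column)
P2 = np.array([[2173.38,    0.0, 961.87, 0.0],
               [   0.0, 2322.04, 588.34, 0.0],
               [   0.0,    0.0,    1.0,  0.0]])

# Hypothetical 3D point in the camera frame (x right, y down, z forward), metres
X = np.array([2.0, 1.5, 30.0, 1.0])  # homogeneous coordinates

u, v, w = P2 @ X
px, py = u / w, v / w  # pixel coordinates

# The prior: projected object height ~ f * H / z, which assumes the camera's
# y-axis points roughly "down to the ground". A tilted roadside camera breaks
# that assumption, so depth inferred from 2D height degrades.
```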
abhi1kumar commented 8 months ago

Hi @Aangss Thank you again for your interest in DEVIANT.

I tried to test the model with the KITTI pre-training file you provided. But the result is not very satisfactory on Rope3D dataset.

This is a well-known problem in Mono3D. Changing the camera height breaks Mono3D models. See BEVHeight (CVPR 2023).

So the geometric projection a priori for traditional 3D object detection does not apply to Rope3D's similar roadside dataset does it

It applies. Traditional detectors' depth estimates go haywire at inference because the detector relies only on its learned parameters at that point. DEVIANT proposes a network design that keeps depth predictions good and consistent (under depth translations) even during inference. In the paper we argue this from the ego camera moving along the depth axis, which is slightly more intuitive; equivalently, one could say the ego camera stays fixed while the object translates along the depth axis.

I don't know how to combine the rotation and translation matrices of the external reference

P2 is the camera projection matrix and includes the intrinsics (K) as well as the rotation (R) and translation (t):

P2 = K [ R | t ]
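A small numpy sketch of this composition (the intrinsics, pitch angle, and translation below are made-up illustrative values, not Rope3D's actual extrinsics):

```python
import numpy as np

# Hypothetical intrinsics for illustration
K = np.array([[2173.38,    0.0, 961.87],
              [   0.0, 2322.04, 588.34],
              [   0.0,    0.0,    1.0]])

# Hypothetical extrinsics: a pitch rotation about the camera x-axis
# plus a translation (e.g. a roadside camera mounted above the ground)
pitch = np.deg2rad(10.0)
R = np.array([[1.0,           0.0,            0.0],
              [0.0, np.cos(pitch), -np.sin(pitch)],
              [0.0, np.sin(pitch),  np.cos(pitch)]])
t = np.array([[0.0], [5.0], [0.0]])

Rt = np.hstack([R, t])  # 3x4 [R | t]
P2 = K @ Rt             # 3x4 projection matrix: P2 = K [R | t]
```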

You can look into the following references for more details:

I would like to use Rope3D for training.

That is awesome. We welcome contributions to the DEVIANT repo. Please feel free to open a PR adding the corresponding Rope3D config file to this repo.

Aangss commented 8 months ago

Thanks for the information. @abhi1kumar

abhi1kumar commented 7 months ago

Closing due to inactivity.

Aangss commented 7 months ago

I used some of the Rope3D data for training, and the results are biased in pitch angle, which doesn't seem ideal. What could be causing this? (image attached)

abhi1kumar commented 7 months ago

Hi @Aangss I agree, this does not look great. Here are a couple of checks I would do:

```shell
python plot/plot_qualitative_output.py --folder YOUR_FOLDER --show_gt_in_image
```
Aangss commented 7 months ago

I converted the Rope3D dataset to KITTI format, so I didn't change the dataloader code, only the config. And I only used part of the Rope3D dataset for training, about 3000 images. (image attached)

```
2024-03-26 19:44:40,862 INFO conf: {
  random_seed: 444
  dataset: {'type': 'kitti', 'root_dir': 'data/', 'train_split_name': 'train', 'val_split_name': 'val', 'resolution': [960, 512], 'eval_dataset': 'kitti', 'batch_size': 8, 'class_merging': False, 'use_dontcare': False, 'use_3d_center': True, 'writelist': ['Car', 'Pedestrian', 'Cyclist'], 'random_flip': 0.5, 'random_crop': 0.5, 'scale': 0.4, 'shift': 0.1}
  model: {'type': 'gupnet', 'backbone': 'dla34', 'neck': 'DLAUp', 'use_conv': 'sesn', 'replace_style': 'max_scale_after_dla34_layer', 'sesn_norm_per_scale': False, 'sesn_rescale_basis': False, 'sesn_scales': [0.83, 0.9, 1.0], 'scale_index_for_init': 0}
  optimizer: {'type': 'adam', 'lr': 0.00125, 'weight_decay': 1e-05}
  lr_scheduler: {'warmup': True, 'decay_rate': 0.1, 'decay_list': [90, 120]}
  trainer: {'max_epoch': 140, 'eval_frequency': 20, 'save_frequency': 20, 'disp_frequency': 20, 'log_dir': 'output/run331'}
  tester: {'threshold': 0.2}
2024-03-27 01:30:45,042 INFO ------ TRAIN EPOCH 140 ------
2024-03-27 01:30:45,042 INFO Learning Rate: 0.000013
2024-03-27 01:30:45,420 INFO Weights: depth:1.0000, heading:1.0000, offset2d:1.0000, offset3d:1.0000, seg:1.0000, size2d:1.0000, size3d:1.0000,
2024-03-27 01:30:58,833 INFO BATCH[0020/0254] depth_loss:1.1811, heading_loss:0.2900, offset2d_loss:0.2085, offset3d_loss:0.2216, seg_loss:0.3941, size2d_loss:0.5219, size3d_loss:-0.2126,
2024-03-27 01:31:10,127 INFO BATCH[0040/0254] depth_loss:1.2149, heading_loss:0.3047, offset2d_loss:0.2013, offset3d_loss:0.2267, seg_loss:0.4024, size2d_loss:0.4831, size3d_loss:-0.2149,
2024-03-27 01:31:21,314 INFO BATCH[0060/0254] depth_loss:1.2102, heading_loss:0.2911, offset2d_loss:0.2050, offset3d_loss:0.2250, seg_loss:0.4028, size2d_loss:0.4889, size3d_loss:-0.2270,
2024-03-27 01:31:32,509 INFO BATCH[0080/0254] depth_loss:1.2211, heading_loss:0.3034, offset2d_loss:0.2043, offset3d_loss:0.2216, seg_loss:0.3741, size2d_loss:0.5052, size3d_loss:-0.2368,
2024-03-27 01:31:43,674 INFO BATCH[0100/0254] depth_loss:1.2019, heading_loss:0.2767, offset2d_loss:0.1972, offset3d_loss:0.2253, seg_loss:0.3887, size2d_loss:0.4724, size3d_loss:-0.2361,
2024-03-27 01:31:54,819 INFO BATCH[0120/0254] depth_loss:1.2152, heading_loss:0.3006, offset2d_loss:0.1987, offset3d_loss:0.2249, seg_loss:0.3867, size2d_loss:0.4837, size3d_loss:-0.2514,
2024-03-27 01:32:06,065 INFO BATCH[0140/0254] depth_loss:1.2160, heading_loss:0.2976, offset2d_loss:0.1994, offset3d_loss:0.2227, seg_loss:0.3920, size2d_loss:0.5068, size3d_loss:-0.2001,
2024-03-27 01:32:17,229 INFO BATCH[0160/0254] depth_loss:1.2406, heading_loss:0.3051, offset2d_loss:0.2082, offset3d_loss:0.2257, seg_loss:0.3972, size2d_loss:0.5122, size3d_loss:-0.2242,
2024-03-27 01:32:28,375 INFO BATCH[0180/0254] depth_loss:1.2895, heading_loss:0.3546, offset2d_loss:0.2281, offset3d_loss:0.2282, seg_loss:0.4598, size2d_loss:0.5326, size3d_loss:-0.1834,
2024-03-27 01:32:39,558 INFO BATCH[0200/0254] depth_loss:1.2462, heading_loss:0.2810, offset2d_loss:0.2107, offset3d_loss:0.2278, seg_loss:0.4170, size2d_loss:0.4878, size3d_loss:-0.1844,
2024-03-27 01:32:50,735 INFO BATCH[0220/0254] depth_loss:1.2312, heading_loss:0.2889, offset2d_loss:0.2154, offset3d_loss:0.2266, seg_loss:0.3959, size2d_loss:0.5213, size3d_loss:-0.2292,
2024-03-27 01:33:01,952 INFO BATCH[0240/0254] depth_loss:1.1880, heading_loss:0.2983, offset2d_loss:0.1893, offset3d_loss:0.2245, seg_loss:0.3714, size2d_loss:0.4717, size3d_loss:-0.2392,
2024-03-27 01:33:09,422 INFO BATCH[0254/0254] depth_loss:0.8752, heading_loss:0.2215, offset2d_loss:0.1520, offset3d_loss:0.1572, seg_loss:0.2689, size2d_loss:0.3420, size3d_loss:-0.1718,
2024-03-27 01:33:09,882 INFO ==> Saving to checkpoint 'output/run_331/checkpoints/checkpoint_epoch_140'
```

abhi1kumar commented 7 months ago

the results I get will be biased in pitch angle, which doesn't seem to be ideal!

Can you confirm that the image in the comment shows the GT 3D boxes projected into the image?

If yes, why do I see a difference between the green and pink boxes in the BEV? Ideally, these two colored boxes should coincide.
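One way to run this check yourself is to project the GT 3D boxes with your converted P2 and overlay them on the image; below is a minimal sketch using the standard KITTI box parameterization (the helper names are mine, not from the repo, and the box values are hypothetical):

```python
import numpy as np

def box3d_corners(h, w, l, x, y, z, ry):
    """8 corners of a KITTI 3D box; (x, y, z) is the bottom-center in the camera frame."""
    xc = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    yc = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])  # y points down
    zc = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    R = np.array([[ np.cos(ry), 0.0, np.sin(ry)],
                  [        0.0, 1.0,        0.0],
                  [-np.sin(ry), 0.0, np.cos(ry)]])  # rotation about the y-axis
    return R @ np.vstack([xc, yc, zc]) + np.array([[x], [y], [z]])

def project(P2, pts3d):
    """Project 3xN camera-frame points to 2xN pixel coordinates with a 3x4 P2."""
    pts = P2 @ np.vstack([pts3d, np.ones((1, pts3d.shape[1]))])
    return pts[:2] / pts[2:]
```

If the projected corners do not land on the object in the image, the converted calib and labels are inconsistent, which would also explain a systematic pitch bias.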