XuyangBai / TransFusion

[PyTorch] Official implementation of CVPR2022 paper "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers". https://arxiv.org/abs/2203.11496
Apache License 2.0
642 stars 77 forks

Training on KITTI dataset #51

Open Galaxy-ZRX opened 2 years ago

Galaxy-ZRX commented 2 years ago

Hi Xuyang,

Thanks for the great work and your contributions! I have a question about training on KITTI. I tried to modify the config file to fit the point cloud input of the KITTI dataset, but the eval results are almost zero. I will attach my logs below in case they help. Could you please give me some hints on how to modify the config? For example, the VFEs for nuScenes and Waymo are different, and I am not sure whether I should use HardSimpleVFE from the nuScenes config or HardVFE from the Waymo config.

Many thanks!

Galaxy-ZRX commented 2 years ago

```
TorchVision: 0.7.0
OpenCV: 4.6.0
MMCV: 1.2.4
MMCV Compiler: GCC 8.4
MMCV CUDA Compiler: 10.2
MMDetection: 2.10.0
MMDetection3D: 0.11.0+399bda0
```

```
2022-08-03 18:42:31,484 - mmdet - INFO - Distributed training: False
2022-08-03 18:42:32,189 - mmdet - INFO - Config:
point_cloud_range = [0, -40, -3.0, 70.0, 40, 1.0]
class_names = ['Car']
voxel_size = [0.05, 0.05, 0.1]
out_size_factor = 8
evaluation = dict(interval=1)
dataset_type = 'KittiDataset'
data_root = 'data/kitti/'
input_modality = dict(use_lidar=True, use_camera=False, use_radar=False, use_map=False, use_external=False)
train_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True),
    dict(
        type='ObjectSample',
        db_sampler=dict(
            data_root='data/kitti/',
            info_path='data/kitti/kitti_dbinfos_train.pkl',
            rate=1.0,
            prepare=dict(filter_by_difficulty=[-1], filter_by_min_points=dict(Car=5)),
            classes=['Car'],
            sample_groups=dict(Car=15),
            points_loader=dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4))),
    dict(type='GlobalRotScaleTrans', rot_range=[-0.785, 0.785], scale_ratio_range=[0.9, 1.1], translation_std=[0.5, 0.5, 0.5]),
    dict(type='RandomFlip3D', sync_2d=False, flip_ratio_bev_horizontal=0.5, flip_ratio_bev_vertical=0.5),
    dict(type='PointsRangeFilter', point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0]),
    dict(type='ObjectRangeFilter', point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0]),
    dict(type='ObjectNameFilter', classes=['Car']),
    dict(type='PointShuffle'),
    dict(type='DefaultFormatBundle3D', class_names=['Car']),
    dict(type='Collect3D', keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4),
    dict(
        type='MultiScaleFlipAug3D',
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1.0, 1.0], translation_std=[0, 0, 0]),
            dict(type='RandomFlip3D'),
            dict(type='DefaultFormatBundle3D', class_names=['Car'], with_label=False),
            dict(type='Collect3D', keys=['points'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
    train=dict(
        type='RepeatDataset',
        times=2,
        dataset=dict(
            type='KittiDataset',
            data_root='data/kitti/',
            ann_file='data/kitti/kitti_infos_train.pkl',
            split='training',
            pipeline=[
                dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4),
                dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True),
                dict(
                    type='ObjectSample',
                    db_sampler=dict(
                        data_root='data/kitti/',
                        info_path='data/kitti/kitti_dbinfos_train.pkl',
                        rate=1.0,
                        prepare=dict(filter_by_difficulty=[-1], filter_by_min_points=dict(Car=5)),
                        classes=['Car'],
                        sample_groups=dict(Car=15),
                        points_loader=dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4))),
                dict(type='GlobalRotScaleTrans', rot_range=[-0.785, 0.785], scale_ratio_range=[0.9, 1.1], translation_std=[0.5, 0.5, 0.5]),
                dict(type='RandomFlip3D', sync_2d=False, flip_ratio_bev_horizontal=0.5, flip_ratio_bev_vertical=0.5),
                dict(type='PointsRangeFilter', point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0]),
                dict(type='ObjectRangeFilter', point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0]),
                dict(type='ObjectNameFilter', classes=['Car']),
                dict(type='PointShuffle'),
                dict(type='DefaultFormatBundle3D', class_names=['Car']),
                dict(type='Collect3D', keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
            ],
            modality=dict(use_lidar=True, use_camera=False, use_radar=False, use_map=False, use_external=False),
            classes=['Car'],
            test_mode=False)),
    val=dict(
        type='KittiDataset',
        data_root='data/kitti/',
        ann_file='data/kitti/kitti_infos_val.pkl',
        split='training',
        pipeline=[
            dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4),
            dict(
                type='MultiScaleFlipAug3D',
                img_scale=(1333, 800),
                pts_scale_ratio=1,
                flip=False,
                transforms=[
                    dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1.0, 1.0], translation_std=[0, 0, 0]),
                    dict(type='RandomFlip3D'),
                    dict(type='DefaultFormatBundle3D', class_names=['Car'], with_label=False),
                    dict(type='Collect3D', keys=['points'])
                ])
        ],
        modality=dict(use_lidar=True, use_camera=False, use_radar=False, use_map=False, use_external=False),
        classes=['Car'],
        test_mode=True),
    test=dict(
        type='KittiDataset',
        data_root='data/kitti/',
        ann_file='data/kitti/kitti_infos_val.pkl',
        split='training',
        pipeline=[
            dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4),
            dict(
                type='MultiScaleFlipAug3D',
                img_scale=(1333, 800),
                pts_scale_ratio=1,
                flip=False,
                transforms=[
                    dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1.0, 1.0], translation_std=[0, 0, 0]),
                    dict(type='RandomFlip3D'),
                    dict(type='DefaultFormatBundle3D', class_names=['Car'], with_label=False),
                    dict(type='Collect3D', keys=['points'])
                ])
        ],
        modality=dict(use_lidar=True, use_camera=False, use_radar=False, use_map=False, use_external=False),
        classes=['Car'],
        test_mode=True))
model = dict(
    type='TransFusionDetector',
    pts_voxel_layer=dict(
        max_num_points=5,
        voxel_size=[0.05, 0.05, 0.1],
        max_voxels=(16000, 40000),
        point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0]),
    pts_voxel_encoder=dict(type='HardSimpleVFE'),
    pts_middle_encoder=dict(
        type='SparseEncoder',
        in_channels=4,
        output_channels=128,
        sparse_shape=[41, 1600, 1408],
        order=('conv', 'norm', 'act')),
    pts_backbone=dict(
        type='SECOND',
        in_channels=256,
        out_channels=[128, 256],
        layer_nums=[5, 5],
        layer_strides=[1, 2],
        norm_cfg=dict(type='BN', eps=0.001, momentum=0.01),
        conv_cfg=dict(type='Conv2d', bias=False)),
    pts_neck=dict(
        type='SECONDFPN',
        in_channels=[128, 256],
        out_channels=[256, 256],
        upsample_strides=[1, 2],
        norm_cfg=dict(type='BN', eps=0.001, momentum=0.01),
        upsample_cfg=dict(type='deconv', bias=False),
        use_conv_for_no_stride=True),
    pts_bbox_head=dict(
        type='TransFusionHead',
        num_proposals=200,
        auxiliary=True,
        in_channels=512,
        hidden_channel=128,
        num_classes=1,
        num_decoder_layers=1,
        num_heads=8,
        learnable_query_pos=False,
        initialize_by_heatmap=True,
        nms_kernel_size=3,
        ffn_channel=256,
        dropout=0.1,
        bn_momentum=0.1,
        activation='relu',
        common_heads=dict(center=(2, 2), height=(1, 2), dim=(3, 2), rot=(2, 2)),
        bbox_coder=dict(
            type='TransFusionBBoxCoder',
            pc_range=[0, -40],
            voxel_size=[0.05, 0.05],
            out_size_factor=8,
            post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
            score_threshold=0.0,
            code_size=8),
        loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2, alpha=0.25, reduction='mean', loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', reduction='mean', loss_weight=0.25),
        loss_heatmap=dict(type='GaussianFocalLoss', reduction='mean', loss_weight=1.0)),
    train_cfg=dict(
        pts=dict(
            dataset='kitti',
            assigner=dict(
                type='HungarianAssigner3D',
                iou_calculator=dict(type='BboxOverlaps3D', coordinate='lidar'),
                cls_cost=dict(type='FocalLossCost', gamma=2, alpha=0.25, weight=0.15),
                reg_cost=dict(type='BBoxBEVL1Cost', weight=0.25),
                iou_cost=dict(type='IoU3DCost', weight=0.25)),
            pos_weight=-1,
            gaussian_overlap=0.1,
            min_radius=2,
            grid_size=[1600, 1408, 40],
            voxel_size=[0.05, 0.05, 0.1],
            out_size_factor=8,
            code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
            point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0])),
    test_cfg=dict(
        pts=dict(
            dataset='kitti',
            grid_size=[1600, 1408, 40],
            out_size_factor=8,
            pc_range=[0, -40],
            voxel_size=[0.05, 0.05],
            nms_type=None)))
optimizer = dict(type='AdamW', lr=0.0001, weight_decay=0.01)
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
lr_config = dict(policy='cyclic', target_ratio=(10, 0.0001), cyclic_times=1, step_ratio_up=0.4)
momentum_config = dict(policy='cyclic', target_ratio=(0.8947368421052632, 1), cyclic_times=1, step_ratio_up=0.4)
total_epochs = 20
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook'), dict(type='TensorboardLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/transfusion_nusc2kitti_voxel_L_Ruixiao'
load_from = None
resume_from = 'work_dirs/transfusion_nusc2kitti_voxel_L_Ruixiao/epoch_1.pth'
workflow = [('train', 1)]
gpu_ids = range(0, 1)
```

XuyangBai commented 2 years ago

HardSimpleVFE and HardVFE should both work and give similar results. Could you provide the training log on KITTI so that I can better identify the potential reasons?
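For reference, a minimal sketch of the two options for the `pts_voxel_encoder` field. The `HardVFE` parameters below are illustrative assumptions chosen to match this issue's voxel size and range, not values confirmed by the author:

```python
# Option A: HardSimpleVFE (as in the nuScenes config) -- simply averages
# the points inside each voxel, no learned parameters.
pts_voxel_encoder = dict(type='HardSimpleVFE')

# Option B: HardVFE (as in the Waymo config) -- learns per-voxel features
# with a small MLP. feat_channels is an illustrative placeholder.
pts_voxel_encoder = dict(
    type='HardVFE',
    in_channels=4,
    feat_channels=[64],
    voxel_size=[0.05, 0.05, 0.1],
    point_cloud_range=[0, -40, -3.0, 70.0, 40, 1.0])
```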

Galaxy-ZRX commented 2 years ago

Thank you for your reply to both of my questions! My log is: https://drive.google.com/file/d/1dAfESZp5sE3IGwSA-BTvBtxLNIPVfIOQ/view?usp=sharing

I have checked some issues from others and applied the fade strategy: for KITTI, I trained with ObjectSample for 40 epochs and then commented it out for a further 10 epochs of training (i.e. this log). I also noticed that the BEV-based matching cost (BBoxBEVL1Cost) has some problems due to the point cloud range of KITTI, so I used a larger, square point cloud range of [-75.2, -75.2, -4, 75.2, 75.2, 2]. However, the performance was still not good enough, so I used BBox3DL1Cost instead while keeping the point cloud range as above. The best performance I have obtained on KITTI so far, at epoch 8 (+40 epochs before the fade strategy), is:

```
Car AP@0.70, 0.70, 0.70:
bbox AP: 85.6172, 77.9356, 77.6623
bev  AP: 84.1549, 75.9334, 75.1914
3d   AP: 71.1918, 61.9135, 57.7088
aos  AP: 0.25, 1.11, 1.45
Car AP@0.70, 0.50, 0.50:
bbox AP: 85.6172, 77.9356, 77.6623
bev  AP: 87.1550, 85.8918, 85.7408
3d   AP: 87.0314, 85.2156, 84.4186
aos  AP: 0.25, 1.11, 1.45
```

As you can see, it's good but there is still a gap with other methods. As you advised in #52, I will try to change the sparse_shape in pts_middle_encoder; do you think this will improve the performance, or should I also change some other things in the config file?

BTW, may I ask why you suggest using [1, 704, 800] for the sparse_shape instead of [41, xxx, xxx] as in the Waymo and nuScenes configs? I haven't understood the meaning of this part of the setting.

Thank you very much for your reply! I have also sent you an email (rz6u20@soton.ac.uk); if you find it easier to discuss via email or WeChat, that works for me too. Thank you!

Galaxy-ZRX commented 2 years ago

Update: I have just changed the sharing permission of the Google Drive link so that you can open it. Hope it helps, and I look forward to your advice!

XuyangBai commented 2 years ago

> I also noticed that the BEV-based matching cost (BBoxBEVL1Cost) has some problems due to the point cloud range of KITTI, so I used a larger, square point cloud range of [-75.2, -75.2, -4, 75.2, 75.2, 2]. However, the performance was still not good enough, so I used BBox3DL1Cost instead while keeping the point cloud range as above.

Yes, BBoxBEVL1Cost has some problems when the perception range is not square, as discussed in https://github.com/XuyangBai/TransFusion/issues/17. Your solution is the recommended one, but the weight of BBox3DL1Cost might need re-tuning since it has a different scale from BBoxBEVL1Cost; that might be the reason for the performance gap.
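For readers following along, a minimal sketch of the swap being discussed, inside `train_cfg.pts.assigner` of the config above. The weight of 2.0 is the value Galaxy-ZRX reports trying below, not a recommended setting:

```python
assigner = dict(
    type='HungarianAssigner3D',
    iou_calculator=dict(type='BboxOverlaps3D', coordinate='lidar'),
    cls_cost=dict(type='FocalLossCost', gamma=2, alpha=0.25, weight=0.15),
    # reg_cost=dict(type='BBoxBEVL1Cost', weight=0.25),  # problematic when the range is not square
    reg_cost=dict(type='BBox3DL1Cost', weight=2.0),      # different scale, so the weight needs re-tuning
    iou_cost=dict(type='IoU3DCost', weight=0.25))
```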

> As you can see, it's good but there is still a gap with other methods. As you advised in https://github.com/XuyangBai/TransFusion/issues/52, I will try to change the sparse_shape in pts_middle_encoder; do you think this will improve the performance, or should I also change some other things in the config file?

You should first configure the shape correctly; otherwise an error will be raised. And since you are using a square perception range, you will not have this problem.

> BTW, may I ask why you suggest using [1, 704, 800] for the sparse_shape instead of [41, xxx, xxx] as in the Waymo and nuScenes configs? I haven't understood the meaning of this part of the setting.

I use 1 because I copied that part of the code from a pillar-based config file. If you use VoxelNet as your backbone, then the first dimension is not 1 and you should change it accordingly (i.e. if your z range is [-4, 2] and your voxel size is [0.075, 0.075, 0.2], then the first dimension should be 31, that is, 1 + 6/0.2).
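As a quick worked check of the rule above (editor's sketch; the formula is first dim = 1 + (z_max - z_min) / voxel_size_z):

```python
z_min, z_max = -4.0, 2.0
voxel_size_z = 0.2
sparse_z = 1 + int(round((z_max - z_min) / voxel_size_z))
print(sparse_z)  # 31, i.e. 1 + 6 / 0.2
```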

Galaxy-ZRX commented 2 years ago

Thank you for your reply! I set the weight of BBox3DL1Cost to 2.0 before, as you can see from the log. Do you remember whether it should be larger or smaller than the BBoxBEVL1Cost weight? Or do you have any suggestions on how to fine-tune it?

Actually, I am now trying to change it back to BBoxBEVL1Cost since I use a square range. I will try both and see which one is the problem.

Does everything else in my log look fine to you? Many thanks!

Galaxy-ZRX commented 2 years ago

I also noticed that in others' logs, such as #17, the bbox loss drops to ~0.5, but mine is around 2.8 with loss_weight=3.0 (or ~1.6 with loss_weight=2.0). Do you think this is normal, or is something wrong in the box assignment module?

XuyangBai commented 2 years ago

> Thank you for your reply! I set the weight of BBox3DL1Cost to 2.0 before, as you can see from the log. Do you remember whether it should be larger or smaller than the BBoxBEVL1Cost weight? Or do you have any suggestions on how to fine-tune it?

I do not remember the exact value for BBox3DL1Cost. My experience is that a larger weight for the bbox cost gives more accurate bounding box predictions, but might also suffer from redundant predictions. So you can try slightly increasing the weight if you want.

> I also noticed that in others' logs, such as https://github.com/XuyangBai/TransFusion/issues/17, the bbox loss drops to ~0.5, but mine is around 2.8 with loss_weight=3.0 (or ~1.6 with loss_weight=2.0). Do you think this is normal, or is something wrong in the box assignment module?

It is usually hard to compare loss values across different datasets; it should be normal if your results and visualizations look good.

Galaxy-ZRX commented 2 years ago

Thank you! I will keep trying. Once I get a good enough result I will let you know.

Galaxy-ZRX commented 2 years ago

Hi @XuyangBai, sorry to disturb you again. May I also ask whether you have the config file for the PointAugmenting model on Waymo? I noticed that you report its results, but their code only provides the details for nuScenes. Many thanks!

XuyangBai commented 2 years ago

I did not re-implement PointAugmenting on Waymo; I got their results from the authors.

Galaxy-ZRX commented 2 years ago

Hi @XuyangBai, thank you for your previous suggestions, which helped me solve the problems! Now I am running some tests with TransFusion. May I ask exactly how reg_cost and its weight work? I remember you said that when this value is large, we can get more accurate boxes but may also suffer from redundant boxes, but I am not sure why it behaves this way.

Thank you very much, and I look forward to your reply!

anaghasmenon44 commented 1 year ago

Hi @XuyangBai, while attempting to train on the KITTI dataset (TransFusion with both image and LiDAR), the TransFusion head fails at "mmdet3d/models/dense_heads/transfusion_head.py", line 982, in forward_single, where pts_2d[:, 2] = torch.clamp(pts_2d[:, 2], min=1e-5) raises IndexError: too many indices for tensor of dimension 1.

Could you please help with this? Below is the relevant code with my debug prints:

```python
for view_idx in range(self.num_views):
    print(points.size(), num_points, self.num_views)
    # torch.Size([1800, 3]), 1800, 1

    pts_4d = torch.cat([points, points.new_ones(size=(num_points, 1))], dim=-1)
    print(pts_4d.size())
    # torch.Size([1800, 4])

    print(lidar2img_rt[view_idx].t().size())
    # torch.Size([4])

    pts_2d = pts_4d @ lidar2img_rt[view_idx].t()
    print(pts_2d.size(), pts_2d.size()[0])
    # torch.Size([1800]), 1800

    # The error happens at the line below:
    pts_2d[:, 2] = torch.clamp(pts_2d[:, 2], min=1e-5)
    pts_2d[:, 0] /= pts_2d[:, 2]
    pts_2d[:, 1] /= pts_2d[:, 2]
```
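An observation from the debug output above (editor's reading, not a confirmed fix): `lidar2img_rt[view_idx].t().size()` prints `torch.Size([4])`, i.e. a 1-D vector rather than a 4x4 matrix, so `pts_4d @ lidar2img_rt[view_idx].t()` collapses `pts_2d` to 1-D, and `pts_2d[:, 2]` then raises the IndexError. A minimal sketch of a fix, assuming the KITTI loader hands over the projection matrix flattened instead of shaped `(num_views, 4, 4)`:

```python
# Hypothetical fix: reshape the flattened projection matrix back to
# (num_views, 4, 4) before indexing per view. Verify how your data
# pipeline actually stores lidar2img for KITTI's single camera.
lidar2img_rt = lidar2img_rt.reshape(self.num_views, 4, 4)
pts_2d = pts_4d @ lidar2img_rt[view_idx].t()  # now torch.Size([1800, 4])
```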
carry-all-coder commented 1 year ago

> Hi @XuyangBai, while attempting to train on the KITTI dataset (TransFusion with both image and LiDAR), the TransFusion head fails at "mmdet3d/models/dense_heads/transfusion_head.py", line 982, in forward_single with IndexError: too many indices for tensor of dimension 1. (full error and code quoted above)

Same problem! Have you found any solutions? @anaghasmenon44

anaghasmenon44 commented 1 year ago

> Same problem! Have you found any solutions? @anaghasmenon44

Not yet.

hanshibo001213 commented 1 year ago

Hello, I tried to adapt the model to run on KITTI and changed the voxel size, but when I change the sparse tensor shape to match the new voxel size, the following error occurs:

RuntimeError: Given groups=1, weight of size [128, 256, 3, 3], expected input[1, 128, 88, 100] to have 256 channels, but got 128 channels instead.

But after replacing the sparse tensor shape with the one in your configuration ([41, 1504, 1504]), it runs again. May I ask why?
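A possible explanation (editor's sketch based on how mmdet3d's SparseEncoder behaves, not confirmed in this thread): the encoder downsamples the z axis of sparse_shape several times and then folds the remaining z slices into the channel dimension, so SECOND's in_channels=256 only holds when two z slices survive. With a smaller z size only one slice survives and SECOND receives 128 channels, which matches the error above:

```python
# Rough sketch of the z-downsampling in SparseEncoder (three stride-2 sparse
# conv stages with padding 1, then an output conv with kernel 3, stride 2).
def second_in_channels(sparse_z, output_channels=128):
    d = sparse_z
    for _ in range(3):
        d = (d - 1) // 2 + 1   # e.g. 41 -> 21 -> 11 -> 6
    d = (d - 3) // 2 + 1       # e.g. 6 -> 2
    return output_channels * d

print(second_in_channels(41))  # 256 -> matches SECOND's in_channels
print(second_in_channels(31))  # 128 -> reproduces "got 128 channels instead"
```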

DanielDoerr commented 1 year ago

@hanshibo001213 @anaghasmenon44 Has either of you found a solution for training on the KITTI dataset with LiDAR and camera data?

2000lf commented 8 months ago

@anaghasmenon44 When I train with KittiDataset, the error says KittiDataset isn't registered in the mmdet dataset registry. Can you give me some guidance? Thanks a lot.

```
File "tools/train.py", line 255, in <module>
    main()
File "tools/train.py", line 251, in main
    meta=meta)
File "/home/shiying/zjx/envs/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmdet/apis/train.py", line 223, in train_detector
    val_dataset = build_dataset(cfg.data.val, dict(test_mode=True))
File "/home/shiying/zjx/envs/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmdet/datasets/builder.py", line 82, in build_dataset
    dataset = build_from_cfg(cfg, DATASETS, default_args)
File "/home/shiying/zjx/envs/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 62, in build_from_cfg
    f'{obj_type} is not in the {registry.name} registry')
KeyError: 'KittiDataset is not in the dataset registry'
```
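One hedged suggestion (editor's assumption, not a confirmed fix): KittiDataset is registered into the shared DATASETS registry when `mmdet3d.datasets` is imported, so this KeyError usually means the training entry point only imported mmdet. Forcing the import before `build_dataset` runs should populate the registry:

```python
# At the top of tools/train.py (or whichever entry script you use):
import mmdet3d.datasets  # noqa: F401 -- registers KittiDataset and the other 3D datasets
```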

2000lf commented 8 months ago

@Galaxy-ZRX When I was training with the KITTI dataset, an error occurred, indicating that KittiDataset is not registered in mmdetection. Have you encountered this issue before?

Galaxy-ZRX commented 5 months ago

> @Galaxy-ZRX When I was training with the KITTI dataset, an error occurred, indicating that KittiDataset is not registered in mmdetection. Have you encountered this issue before?

Sorry, I haven't encountered this issue before.

HaiCLi commented 2 months ago

Hi,

Does this mean you got BEV detection working on the KITTI dataset?

2000lf commented 2 months ago

> Does this mean you got BEV detection working on the KITTI dataset?

No, I used nuScenes instead.