Closed NaomiEX closed 7 months ago
Please follow the steps in quick_start.md thoroughly, paying particular attention to compiling the `deformable_aggregation` CUDA op. Your current GPU memory usage and training time are abnormal.
I did follow the steps and compiled the CUDA ops. Perhaps the difference in memory and training time is because I'm training on 4 A100s rather than RTX 3090s.
It's difficult to pinpoint the cause. It might be due to batch normalization behaving differently with a different GPU count. You could try again with 8 GPUs, or adjust the learning rate.
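When the GPU count changes, one common heuristic (a sketch of the usual linear scaling rule, not something prescribed in this repo) is to scale the learning rate in proportion to the total batch size. The reference numbers below (8 GPUs × 6 samples at lr 6e-4) come from the config discussed in this thread; the function name is hypothetical:

```python
def scale_lr(base_lr, base_total_batch, new_total_batch):
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    return base_lr * new_total_batch / base_total_batch

# Reference config: 8 GPUs x 6 samples/GPU = total batch 48 at lr 6e-4.
# On 4 A100s with the same per-GPU batch, the total batch halves to 24:
lr = scale_lr(6e-4, base_total_batch=48, new_total_batch=24)
print(lr)  # 0.0003
```

Alternatively, keeping the total batch size at 48 (e.g. 4 GPUs × 12 samples) would let the original learning rate stand, memory permitting.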
Pull the latest code and try again. @NaomiEX
Alright, let me try again
Are the metrics aligned?
Apologies, I haven't been able to retrain it yet; I'll report back once I've trained with the new code when I get the time.
Hi, I also trained the R50 model on 8×RTX 4090 with a total batch size of 48 and lr 6e-4 for 100 epochs (following your config), but I obtained a final mAP of only 45.34, which is much lower than the reported mAP. Here are my configuration and results. Can you give me some suggestions?

```python
plugin = True
plugin_dir = 'projects/mmdet3d_plugin/'
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/sparse4dv3_temporal_r50_1x8_bs6_256x704'
total_batch_size = 48
num_gpus = 8
batch_size = 6
num_iters_per_epoch = 586
num_epochs = 100
checkpoint_epoch_interval = 20
checkpoint_config = dict(interval=11720)
log_config = dict(
    interval=51,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook')
    ])
load_from = None
resume_from = 'work_dirs/sparse4dv3_temporal_r50_1x8_bs6_256x704/iter_23440.pth'
workflow = [('train', 1)]
fp16 = dict(loss_scale=32.0)
input_shape = (704, 256)
tracking_test = True
tracking_threshold = 0.2
class_names = [
    'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
    'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]
num_classes = 10
embed_dims = 256
num_groups = 8
num_decoder = 6
num_single_frame_decoder = 1
use_deformable_func = True
strides = [4, 8, 16, 32]
num_levels = 4
num_depth_layers = 3
drop_out = 0.1
temporal = True
decouple_attn = True
with_quality_estimation = True
model = dict(
    type='Sparse4D',
    use_grid_mask=True,
    use_deformable_func=True,
    img_backbone=dict(
        type='ResNet', depth=50, num_stages=4, frozen_stages=-1,
        norm_eval=False, style='pytorch', with_cp=True,
        out_indices=(0, 1, 2, 3),
        norm_cfg=dict(type='BN', requires_grad=True),
        pretrained='ckpt/resnet50-19c8e357.pth'),
    img_neck=dict(
        type='FPN', num_outs=4, start_level=0, out_channels=256,
        add_extra_convs='on_output', relu_before_extra_convs=True,
        in_channels=[256, 512, 1024, 2048]),
    depth_branch=dict(
        type='DenseDepthNet', embed_dims=256, num_depth_layers=3,
        loss_weight=0.2),
    head=dict(
        type='Sparse4DHead',
        cls_threshold_to_reg=0.05,
        decouple_attn=True,
        instance_bank=dict(
            type='InstanceBank', num_anchor=900, embed_dims=256,
            anchor='_nuscenes_kmeans900.npy',
            anchor_handler=dict(type='SparseBox3DKeyPointsGenerator'),
            num_temp_instances=600, confidence_decay=0.6, feat_grad=False),
        anchor_encoder=dict(
            type='SparseBox3DEncoder', vel_dims=3,
            embed_dims=[128, 32, 32, 64], mode='cat', output_fc=False,
            in_loops=1, out_loops=4),
        num_single_frame_decoder=1,
        operation_order=[
            'deformable', 'ffn', 'norm', 'refine',
            'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine',
            'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine',
            'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine',
            'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine',
            'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine'
        ],
        temp_graph_model=dict(
            type='MultiheadAttention', embed_dims=512, num_heads=8,
            batch_first=True, dropout=0.1),
        graph_model=dict(
            type='MultiheadAttention', embed_dims=512, num_heads=8,
            batch_first=True, dropout=0.1),
        norm_layer=dict(type='LN', normalized_shape=256),
        ffn=dict(
            type='AsymmetricFFN', in_channels=512, pre_norm=dict(type='LN'),
            embed_dims=256, feedforward_channels=1024, num_fcs=2,
            ffn_drop=0.1, act_cfg=dict(type='ReLU', inplace=True)),
        deformable_model=dict(
            type='DeformableFeatureAggregation', embed_dims=256,
            num_groups=8, num_levels=4, num_cams=6, attn_drop=0.15,
            use_deformable_func=True, use_camera_embed=True,
            residual_mode='cat',
            kps_generator=dict(
                type='SparseBox3DKeyPointsGenerator', num_learnable_pts=6,
                fix_scale=[[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0],
                           [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45],
                           [0, 0, -0.45]])),
        refine_layer=dict(
            type='SparseBox3DRefinementModule', embed_dims=256, num_cls=10,
            refine_yaw=True, with_quality_estimation=True),
        sampler=dict(
            type='SparseBox3DTarget', num_dn_groups=5, num_temp_dn_groups=3,
            dn_noise_scale=[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
            max_dn_gt=32, add_neg_dn=True,
            cls_weight=2.0, box_weight=0.25,
            reg_weights=[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
            cls_wise_reg_weights=dict(
                {9: [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]})),
        loss_cls=dict(
            type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25,
            loss_weight=2.0),
        loss_reg=dict(
            type='SparseBox3DLoss',
            loss_box=dict(type='L1Loss', loss_weight=0.25),
            loss_centerness=dict(type='CrossEntropyLoss', use_sigmoid=True),
            loss_yawness=dict(type='GaussianFocalLoss'),
            cls_allow_reverse=[5]),
        decoder=dict(type='SparseBox3DDecoder'),
        reg_weights=[2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]))
dataset_type = 'NuScenes3DDetTrackDataset'
data_root = 'data/nuscenes/'
anno_root = 'data/nuscenes_anno_pkls/'
file_client_args = dict(backend='disk')
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5,
         use_dim=5, file_client_args=dict(backend='disk')),
    dict(type='ResizeCropFlipImage'),
    dict(type='MultiScaleDepthMapGenerator', downsample=[4, 8, 16]),
    dict(type='BBoxRotation'),
    dict(type='PhotoMetricDistortionMultiViewImage'),
    dict(type='NormalizeMultiviewImage',
         mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375],
         to_rgb=True),
    dict(type='CircleObjectRangeFilter',
         class_dist_thred=[55, 55, 55, 55, 55, 55, 55, 55, 55, 55]),
    dict(type='InstanceNameFilter',
         classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer',
                  'barrier', 'motorcycle', 'bicycle', 'pedestrian',
                  'traffic_cone']),
    dict(type='NuScenesSparse4DAdaptor'),
    dict(type='Collect',
         keys=['img', 'timestamp', 'projection_mat', 'image_wh', 'gt_depth',
               'focal', 'gt_bboxes_3d', 'gt_labels_3d'],
         meta_keys=['T_global', 'T_global_inv', 'timestamp', 'instance_id'])
]
test_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='ResizeCropFlipImage'),
    dict(type='NormalizeMultiviewImage',
         mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375],
         to_rgb=True),
    dict(type='NuScenesSparse4DAdaptor'),
    dict(type='Collect',
         keys=['img', 'timestamp', 'projection_mat', 'image_wh'],
         meta_keys=['T_global', 'T_global_inv', 'timestamp'])
]
input_modality = dict(
    use_lidar=False, use_camera=True, use_radar=False, use_map=False,
    use_external=False)
data_basic_config = dict(
    type='NuScenes3DDetTrackDataset',
    data_root='data/nuscenes/',
    classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer',
             'barrier', 'motorcycle', 'bicycle', 'pedestrian',
             'traffic_cone'],
    modality=dict(use_lidar=False, use_camera=True, use_radar=False,
                  use_map=False, use_external=False),
    version='v1.0-trainval')
data_aug_conf = dict(
    resize_lim=(0.4, 0.47), final_dim=(256, 704), bot_pct_lim=(0.0, 0.0),
    rot_lim=(-5.4, 5.4), H=900, W=1600, rand_flip=True,
    rot3d_range=[-0.3925, 0.3925])
data = dict(
    samples_per_gpu=6,
    workers_per_gpu=6,
    train=dict(
        type='NuScenes3DDetTrackDataset',
        data_root='data/nuscenes/',
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer',
                 'barrier', 'motorcycle', 'bicycle', 'pedestrian',
                 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False,
                      use_map=False, use_external=False),
        version='v1.0-trainval',
        ann_file='data/nuscenes_anno_pkls/nuscenes_infos_train.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5,
                 use_dim=5, file_client_args=dict(backend='disk')),
            dict(type='ResizeCropFlipImage'),
            dict(type='MultiScaleDepthMapGenerator', downsample=[4, 8, 16]),
            dict(type='BBoxRotation'),
            dict(type='PhotoMetricDistortionMultiViewImage'),
            dict(type='NormalizeMultiviewImage',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='CircleObjectRangeFilter',
                 class_dist_thred=[55, 55, 55, 55, 55, 55, 55, 55, 55, 55]),
            dict(type='InstanceNameFilter',
                 classes=['car', 'truck', 'construction_vehicle', 'bus',
                          'trailer', 'barrier', 'motorcycle', 'bicycle',
                          'pedestrian', 'traffic_cone']),
            dict(type='NuScenesSparse4DAdaptor'),
            dict(type='Collect',
                 keys=['img', 'timestamp', 'projection_mat', 'image_wh',
                       'gt_depth', 'focal', 'gt_bboxes_3d', 'gt_labels_3d'],
                 meta_keys=['T_global', 'T_global_inv', 'timestamp',
                            'instance_id'])
        ],
        test_mode=False,
        data_aug_conf=dict(
            resize_lim=(0.4, 0.47), final_dim=(256, 704),
            bot_pct_lim=(0.0, 0.0), rot_lim=(-5.4, 5.4), H=900, W=1600,
            rand_flip=True, rot3d_range=[-0.3925, 0.3925]),
        with_seq_flag=True,
        sequences_split_num=2,
        keep_consistent_seq_aug=True),
    val=dict(
        type='NuScenes3DDetTrackDataset',
        data_root='data/nuscenes/',
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer',
                 'barrier', 'motorcycle', 'bicycle', 'pedestrian',
                 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False,
                      use_map=False, use_external=False),
        version='v1.0-trainval',
        ann_file='data/nuscenes_anno_pkls/nuscenes_infos_val.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='ResizeCropFlipImage'),
            dict(type='NormalizeMultiviewImage',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='NuScenesSparse4DAdaptor'),
            dict(type='Collect',
                 keys=['img', 'timestamp', 'projection_mat', 'image_wh'],
                 meta_keys=['T_global', 'T_global_inv', 'timestamp'])
        ],
        data_aug_conf=dict(
            resize_lim=(0.4, 0.47), final_dim=(256, 704),
            bot_pct_lim=(0.0, 0.0), rot_lim=(-5.4, 5.4), H=900, W=1600,
            rand_flip=True, rot3d_range=[-0.3925, 0.3925]),
        test_mode=True,
        tracking=True,
        tracking_threshold=0.2),
    test=dict(
        type='NuScenes3DDetTrackDataset',
        data_root='data/nuscenes/',
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer',
                 'barrier', 'motorcycle', 'bicycle', 'pedestrian',
                 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False,
                      use_map=False, use_external=False),
        version='v1.0-trainval',
        ann_file='data/nuscenes_anno_pkls/nuscenes_infos_val.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='ResizeCropFlipImage'),
            dict(type='NormalizeMultiviewImage',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='NuScenesSparse4DAdaptor'),
            dict(type='Collect',
                 keys=['img', 'timestamp', 'projection_mat', 'image_wh'],
                 meta_keys=['T_global', 'T_global_inv', 'timestamp'])
        ],
        data_aug_conf=dict(
            resize_lim=(0.4, 0.47), final_dim=(256, 704),
            bot_pct_lim=(0.0, 0.0), rot_lim=(-5.4, 5.4), H=900, W=1600,
            rand_flip=True, rot3d_range=[-0.3925, 0.3925]),
        test_mode=True,
        tracking=True,
        tracking_threshold=0.2))
optimizer = dict(
    type='AdamW',
    lr=0.0006,
    weight_decay=0.001,
    paramwise_cfg=dict(custom_keys=dict(img_backbone=dict(lr_mult=0.5))))
optimizer_config = dict(grad_clip=dict(max_norm=25, norm_type=2))
lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    min_lr_ratio=0.001)
runner = dict(type='IterBasedRunner', max_iters=58600)
vis_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='Collect', keys=['img'], meta_keys=['timestamp', 'lidar2img'])
]
evaluation = dict(
    interval=11720,
    pipeline=[
        dict(type='LoadMultiViewImageFromFiles', to_float32=True),
        dict(type='Collect', keys=['img'],
             meta_keys=['timestamp', 'lidar2img'])
    ])
gpu_ids = range(0, 8)
```
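As a sanity check, the iteration counts in this config are mutually consistent (all values copied from the config above):

```python
num_iters_per_epoch = 586
num_epochs = 100
checkpoint_epoch_interval = 20

# runner max_iters should equal epochs * iters/epoch:
max_iters = num_iters_per_epoch * num_epochs
# checkpoint/eval interval should equal 20 epochs' worth of iterations:
ckpt_interval = num_iters_per_epoch * checkpoint_epoch_interval

print(max_iters, ckpt_interval)  # 58600 11720
```

These match `runner.max_iters=58600` and `checkpoint_config.interval=11720` / `evaluation.interval=11720`, so the schedule itself looks internally consistent.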
mAP: 0.4534
mATE: 0.5459
mASE: 0.2624
mAOE: 0.4592
mAVE: 0.2185
mAAE: 0.1890
NDS: 0.5592
Eval time: 94.8s
Per-class results:

| Object Class | AP | ATE | ASE | AOE | AVE | AAE |
|---|---|---|---|---|---|---|
| car | 0.661 | 0.363 | 0.142 | 0.053 | 0.184 | 0.197 |
| truck | 0.371 | 0.586 | 0.193 | 0.068 | 0.175 | 0.210 |
| bus | 0.404 | 0.637 | 0.186 | 0.103 | 0.389 | 0.245 |
| trailer | 0.155 | 0.966 | 0.266 | 0.605 | 0.203 | 0.082 |
| construction_vehicle | 0.111 | 0.901 | 0.467 | 1.045 | 0.119 | 0.371 |
| pedestrian | 0.551 | 0.526 | 0.287 | 0.526 | 0.290 | 0.152 |
| motorcycle | 0.474 | 0.530 | 0.247 | 0.616 | 0.261 | 0.251 |
| bicycle | 0.479 | 0.413 | 0.259 | 0.993 | 0.127 | 0.005 |
| traffic_cone | 0.717 | 0.254 | 0.299 | nan | nan | nan |
| barrier | 0.610 | 0.284 | 0.277 | 0.123 | nan | nan |
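For reference, the NDS above is consistent with the summary metrics: nuScenes defines NDS = (5·mAP + Σ (1 − min(1, mTP))) / 10 over the five true-positive errors. Recomputing from the numbers reported above:

```python
mAP = 0.4534
# mATE, mASE, mAOE, mAVE, mAAE from the summary above:
tp_errors = [0.5459, 0.2624, 0.4592, 0.2185, 0.1890]

# nuScenes NDS: weight mAP by 5, each TP error contributes (1 - min(1, err)).
nds = (5 * mAP + sum(1.0 - min(1.0, e) for e in tp_errors)) / 10
print(round(nds, 4))  # 0.5592
```

So the NDS of 0.5592 follows directly from the reported mAP and TP errors; the gap to the paper is concentrated in mAP.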
NDS is the more stable metric, so prioritize NDS; judged by NDS, the fluctuations don't appear significant.
These fluctuations could possibly be caused by resuming from a checkpoint; completing a full training run might normalize the metrics.
@zhaoyangwei123
Hi, I trained the model with a total batch size of 48 and lr 6e-4 for 100 epochs; however, I obtained a final mAP of only 44.1, which is much lower than the reported mAP. I've attached my log; let me know if you have any suggestions as to why this could be.
I've also seen the issue about metric fluctuations; however, the average reported there was 46.32 mAP, which is still much higher than what I got.
s4d.log