exiawsh / StreamPETR

[ICCV 2023] StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

config of vit large #7

Closed wzc9401 closed 1 year ago

wzc9401 commented 1 year ago

Thanks for sharing the great work!

I was wondering if you could also share the training config files and checkpoints for the nuScenes leaderboard version, which achieves 0.62 mAP and 0.676 NDS.

Looking forward to your reply

exiawsh commented 1 year ago

We plan to release the config files of StreamPETR-Large in October. Here are some training details about StreamPETR-Large: we didn't modify the detector and used a MIM-pretrained ViT-Large backbone. The image size is 1600x640; we set the backbone learning rate to 0.1x the base learning rate and apply layer-wise learning rate decay. StreamPETR-Large was trained for 24 epochs on the trainval set using stream training (without CBGS), which takes about 60 hours on 8x A100 GPUs.
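For reference, layer-wise decay of this kind can be expressed through mmcv's `paramwise_cfg` with `custom_keys`. The sketch below is one possible way to write it, not the authors' released config: the `img_backbone.blocks.{i}` key pattern, the decay factor of 0.9, and the `patch_embed` key are all assumptions for a generic ViT-Large.

```python
# Hypothetical sketch of 0.1x backbone LR plus layer-wise decay via custom_keys.
# Assumes the ViT-Large blocks are named 'img_backbone.blocks.0' ... 'blocks.23';
# the actual module names and decay factor in StreamPETR may differ.
num_layers = 24          # ViT-Large depth
layer_decay = 0.9        # assumed decay factor, not confirmed by the authors
backbone_lr_mult = 0.1   # 0.1x base learning rate, as stated above

custom_keys = {
    # trailing dot so 'blocks.2.' does not also match 'blocks.23.*'
    f'img_backbone.blocks.{i}.': dict(
        lr_mult=backbone_lr_mult * layer_decay ** (num_layers - 1 - i))
    for i in range(num_layers)
}
custom_keys['img_backbone.patch_embed'] = dict(
    lr_mult=backbone_lr_mult * layer_decay ** num_layers)

optimizer = dict(
    type='AdamW',
    lr=4e-4,
    paramwise_cfg=dict(custom_keys=custom_keys),
    weight_decay=0.01)
```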

wzc9401 commented 1 year ago

@exiawsh Thank you for replying! When I try to increase the resolution to 512 x 1408, the model's performance deteriorates. What else needs to be modified besides changing the resolution in the config?

exiawsh commented 1 year ago

@exiawsh Thank you for replying! When I try to increase the resolution to 512 x 1408, the model's performance deteriorates. What else needs to be modified besides changing the resolution in the config?

Would you please provide your config files? I will check it.

wzc9401 commented 1 year ago

@exiawsh Here it is, thanks a lot!

```python
_base_ = [
    '../../../mmdetection3d-1.0.0rc6/configs/_base_/datasets/nus-3d.py',
    '../../../mmdetection3d-1.0.0rc6/configs/_base_/default_runtime.py'
]
backbone_norm_cfg = dict(type='LN', requires_grad=True)
plugin = True
plugin_dir = 'projects/mmdet3d_plugin/'

point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]
voxel_size = [0.2, 0.2, 8]
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

class_names = [
    'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
    'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]

ida_aug_conf = {
    "resize_lim": (0.38, 0.55),
    "final_dim": (512, 1408),
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (0.0, 0.0),
    "H": 900,
    "W": 1600,
    "rand_flip": True,
}

num_gpus = 8
batch_size = 1
num_iters_per_epoch = 28130 // (num_gpus * batch_size)
num_epochs = 24

queue_length = 1
num_frame_losses = 1
collect_keys = ['lidar2img', 'intrinsics', 'extrinsics', 'timestamp',
                'img_timestamp', 'ego_pose', 'ego_pose_inv']
input_modality = dict(
    use_lidar=False,
    use_camera=True,
    use_radar=False,
    use_map=False,
    use_external=True)
model = dict(
    type='Petr3D',
    num_frame_head_grads=num_frame_losses,
    num_frame_backbone_grads=num_frame_losses,
    num_frame_losses=num_frame_losses,
    use_grid_mask=True,
    img_backbone=dict(
        pretrained='torchvision://resnet50',
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(2, 3),
        frozen_stages=-1,
        norm_cfg=dict(type='BN2d', requires_grad=False),
        norm_eval=True,
        with_cp=True,
        style='pytorch'),
    img_neck=dict(
        type='CPFPN',  # remove unused parameters
        in_channels=[1024, 2048],
        out_channels=256,
        num_outs=2),
    img_roi_head=dict(
        type='FocalHead',
        num_classes=10,
        in_channels=256,
        loss_cls2d=dict(
            type='QualityFocalLoss', use_sigmoid=True, beta=2.0, loss_weight=2.0),
        loss_centerness=dict(
            type='GaussianFocalLoss', reduction='mean', loss_weight=1.0),
        loss_bbox2d=dict(type='L1Loss', loss_weight=5.0),
        loss_iou2d=dict(type='GIoULoss', loss_weight=2.0),
        loss_centers2d=dict(type='L1Loss', loss_weight=10.0),
        train_cfg=dict(
            assigner2d=dict(
                type='HungarianAssigner2D',
                cls_cost=dict(type='FocalLossCost', weight=2.),
                reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
                iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0),
                centers2d_cost=dict(type='BBox3DL1Cost', weight=10.0)))),
    pts_bbox_head=dict(
        type='StreamPETRHead',
        num_classes=10,
        in_channels=256,
        num_query=644,
        memory_len=1024,
        topk_proposals=256,
        num_propagated=256,
        with_ego_pos=True,
        match_with_velo=False,
        scalar=10,  # noise groups
        noise_scale=1.0,
        dn_weight=1.0,  # dn loss weight
        split=0.75,  # positive rate
        LID=True,
        with_position=True,
        position_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
        code_weights=[2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
        transformer=dict(
            type='PETRTemporalTransformer',
            decoder=dict(
                type='PETRTransformerDecoder',
                return_intermediate=True,
                num_layers=6,
                transformerlayers=dict(
                    type='PETRTemporalDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.1),
                        dict(
                            type='PETRMultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.1),
                    ],
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    with_cp=True,  # use checkpointing to save memory
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        bbox_coder=dict(
            type='NMSFreeCoder',
            post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
            pc_range=point_cloud_range,
            max_num=300,
            voxel_size=voxel_size,
            num_classes=10),
        loss_cls=dict(
            type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25,
            loss_weight=2.0),
        loss_bbox=dict(type='L1Loss', loss_weight=0.25),
        loss_iou=dict(type='GIoULoss', loss_weight=0.0)),
    train_cfg=dict(pts=dict(
        grid_size=[512, 512, 1],
        voxel_size=voxel_size,
        point_cloud_range=point_cloud_range,
        out_size_factor=4,
        assigner=dict(
            type='HungarianAssigner3D',
            cls_cost=dict(type='FocalLossCost', weight=2.0),
            reg_cost=dict(type='BBox3DL1Cost', weight=0.25),
            iou_cost=dict(type='IoUCost', weight=0.0),  # fake cost, only for compatibility with the DETR head
            pc_range=point_cloud_range))))

dataset_type = 'CustomNuScenesDataset'
data_root = '/data/nuscenes/'

file_client_args = dict(backend='disk')

train_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True,
         with_bbox=True, with_label=True, with_bbox_depth=True),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectNameFilter', classes=class_names),
    dict(type='ResizeCropFlipRotImage', data_aug_conf=ida_aug_conf, training=True),
    dict(type='GlobalRotScaleTransImage',
         rot_range=[-0.3925, 0.3925],
         translation_std=[0, 0, 0],
         scale_ratio_range=[0.95, 1.05],
         reverse_angle=True,
         training=True),
    dict(type='NormalizeMultiviewImage', **img_norm_cfg),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='PETRFormatBundle3D', class_names=class_names,
         collect_keys=collect_keys + ['prev_exists']),
    dict(type='Collect3D',
         keys=['gt_bboxes_3d', 'gt_labels_3d', 'img', 'gt_bboxes', 'gt_labels',
               'centers2d', 'depths', 'prev_exists'] + collect_keys,
         meta_keys=('filename', 'ori_shape', 'img_shape', 'pad_shape',
                    'scale_factor', 'flip', 'box_mode_3d', 'box_type_3d',
                    'img_norm_cfg', 'scene_token', 'gt_bboxes_3d', 'gt_labels_3d'))
]
test_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='ResizeCropFlipRotImage', data_aug_conf=ida_aug_conf, training=False),
    dict(type='NormalizeMultiviewImage', **img_norm_cfg),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='MultiScaleFlipAug3D',
         img_scale=(1333, 800),
         pts_scale_ratio=1,
         flip=False,
         transforms=[
             dict(type='PETRFormatBundle3D',
                  collect_keys=collect_keys,
                  class_names=class_names,
                  with_label=False),
             dict(type='Collect3D', keys=['img'] + collect_keys,
                  meta_keys=('filename', 'ori_shape', 'img_shape', 'pad_shape',
                             'scale_factor', 'flip', 'box_mode_3d', 'box_type_3d',
                             'img_norm_cfg', 'scene_token'))
         ])
]

data = dict(
    samples_per_gpu=batch_size,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file=data_root + 'nuscenes2d_temporal_infos_train.pkl',
        num_frame_losses=num_frame_losses,
        seq_split_num=2,  # streaming video training
        seq_mode=True,  # streaming video training
        pipeline=train_pipeline,
        classes=class_names,
        modality=input_modality,
        collect_keys=collect_keys + ['img', 'prev_exists', 'img_metas'],
        queue_length=queue_length,
        test_mode=False,
        use_valid_flag=True,
        box_type_3d='LiDAR'),
    val=dict(
        type=dataset_type,
        pipeline=test_pipeline,
        data_root=data_root,
        collect_keys=collect_keys + ['img', 'img_metas'],
        queue_length=queue_length,
        ann_file=data_root + 'nuscenes2d_temporal_infos_val.pkl',
        classes=class_names,
        modality=input_modality),
    test=dict(
        type=dataset_type,
        pipeline=test_pipeline,
        data_root=data_root,
        collect_keys=collect_keys + ['img', 'img_metas'],
        queue_length=queue_length,
        ann_file=data_root + 'nuscenes2d_temporal_infos_val.pkl',
        classes=class_names,
        modality=input_modality),
    shuffler_sampler=dict(type='InfiniteGroupEachSampleInBatchSampler'),
    nonshuffler_sampler=dict(type='DistributedSampler'))

optimizer = dict(
    type='AdamW',
    lr=4e-4,  # bs 8: 2e-4 || bs 16: 4e-4
    paramwise_cfg=dict(
        custom_keys={
            'img_backbone': dict(lr_mult=0.25),  # 0.25 only for Focal-PETR with R50-in1k pretrained weights
        }),
    weight_decay=0.01)

optimizer_config = dict(
    type='Fp16OptimizerHook',
    loss_scale='dynamic',
    grad_clip=dict(max_norm=35, norm_type=2))

lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    min_lr_ratio=1e-3)

evaluation = dict(interval=num_iters_per_epoch * num_epochs, pipeline=test_pipeline)
find_unused_parameters = False  # when using checkpointing, find_unused_parameters must be False
checkpoint_config = dict(interval=num_iters_per_epoch, max_keep_ckpts=3)
runner = dict(type='IterBasedRunner', max_iters=num_epochs * num_iters_per_epoch)
load_from = None
resume_from = None
```

exiawsh commented 1 year ago

You need to change the `ida_aug_conf` to:

```python
ida_aug_conf = {
    "resize_lim": (0.8, 1.0),
    "final_dim": (512, 1408),
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (0.0, 0.0),
    "H": 900,
    "W": 1600,
    "rand_flip": True,
}
```

The `resize_lim` should be modified based on your `final_dim`.
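The rule of thumb implied here (my reading of the advice, not an official formula) is that the resize range should bracket the ratio of the target width to the source width, so the crop to `final_dim` doesn't discard most of the image. A small hypothetical helper, where `suggest_resize_lim` and the `margin` of 0.1 are illustrative, not part of the StreamPETR codebase:

```python
# Center the resize range on final width / source width (nuScenes images are 900x1600).
def suggest_resize_lim(final_dim, src_hw=(900, 1600), margin=0.1):
    scale = final_dim[1] / src_hw[1]  # the width ratio drives the resize
    return (round(scale - margin, 2), round(scale + margin, 2))

print(suggest_resize_lim((512, 1408)))  # (0.78, 0.98), close to the suggested (0.8, 1.0)
print(suggest_resize_lim((256, 704)))   # (0.34, 0.54), close to the default (0.38, 0.55)
```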

wzc9401 commented 1 year ago

Thanks, I will try it!

exiawsh commented 1 year ago

Thanks, I will try it!

I have found that your learning rate setting may have some problems. You set `num_gpus = 8` and `batch_size = 1`, so the total batch size is 8 x 1 = 8, and the learning rate should be set to 2e-4 (2e-4 for total batch size 8, 4e-4 for total batch size 16). I recommend you use a total batch size of 16 (e.g. `num_gpus = 8`, `batch_size = 2`).
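This is the standard linear scaling rule: keep the ratio of learning rate to total batch size constant. The arithmetic, using the numbers from this thread:

```python
# Linear LR scaling: lr / total_batch_size stays constant.
base_lr, base_total_bs = 4e-4, 16  # reference setting from the config comment
num_gpus, batch_size = 8, 1
total_bs = num_gpus * batch_size   # = 8
lr = base_lr * total_bs / base_total_bs
print(lr)                          # 2e-4, matching the recommendation above
```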

wzc9401 commented 1 year ago

Yes, when I started the training, I passed a parameter overriding the learning rate to 2e-4.

FANG-MING commented 1 year ago

We plan to release the config files of StreamPETR-Large in October. Here are some training details about StreamPETR-Large: we didn't modify the detector and used a MIM-pretrained ViT-Large backbone. The image size is 1600x640; we set the backbone learning rate to 0.1x the base learning rate and apply layer-wise learning rate decay. StreamPETR-Large was trained for 24 epochs on the trainval set using stream training (without CBGS), which takes about 60 hours on 8x A100 GPUs.

When I use layer-wise learning rate decay, the loss always becomes NaN. Could you provide the config for layer-wise learning rate decay, if it is convenient?

iuhiyuh commented 1 year ago

We plan to release the config files of StreamPETR-Large in October. Here are some training details about StreamPETR-Large: we didn't modify the detector and used a MIM-pretrained ViT-Large backbone. The image size is 1600x640; we set the backbone learning rate to 0.1x the base learning rate and apply layer-wise learning rate decay. StreamPETR-Large was trained for 24 epochs on the trainval set using stream training (without CBGS), which takes about 60 hours on 8x A100 GPUs.

@exiawsh Hello, I've implemented a standard Facebook ViT version, but it runs into an out-of-memory error on 8x A100 with a resolution of 1600x640. Are there any modifications to the model architecture, or other potential causes for this? Can you provide some advice?

yingfei1016 commented 1 year ago

@exiawsh Hello, I've implemented a standard Facebook ViT version, but it runs into an out-of-memory error on 8x A100 with a resolution of 1600x640. Are there any modifications to the model architecture, or other potential causes for this? Can you provide some advice?

You can see https://github.com/exiawsh/StreamPETR/blob/main/docs/ViT_Large.md
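Beyond what that doc describes, a generic mitigation for ViT out-of-memory at high resolution (not necessarily what the authors did) is gradient checkpointing over the transformer blocks, trading recomputation for activation memory. A minimal sketch, where `CheckpointedBlocks` and its attribute names are illustrative rather than StreamPETR's actual modules:

```python
import torch.nn as nn
import torch.utils.checkpoint as cp

class CheckpointedBlocks(nn.Module):
    """Wraps a sequence of transformer blocks with gradient checkpointing."""

    def __init__(self, blocks, with_cp=True):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.with_cp = with_cp

    def forward(self, x):
        for blk in self.blocks:
            if self.with_cp and x.requires_grad:
                # Discard intermediate activations and recompute them in backward.
                x = cp.checkpoint(blk, x)
            else:
                x = blk(x)
        return x
```

Note the pasted config above already uses `with_cp=True` in the decoder layers for the same purpose.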

yingfei1016 commented 1 year ago

The mismatch of the 'rope_glb' parameters doesn't need to be handled, as the 'rope_glb' parameters are not learnable. The model will re-compute them.

FANG-MING commented 1 year ago

@exiawsh Hello, I've implemented a standard Facebook ViT version, but it runs into an out-of-memory error on 8x A100 with a resolution of 1600x640. Are there any modifications to the model architecture, or other potential causes for this? Can you provide some advice?

You can see https://github.com/exiawsh/StreamPETR/blob/main/docs/ViT_Large.md

But I got: `ViT __init__() got an unexpected keyword argument 'global_window_size'`

FANG-MING commented 1 year ago

The mismatch of the 'rope_glb' parameters doesn't need to be handled, as the 'rope_glb' parameters are not learnable. The model will re-compute them.

And I also got:

```
  def forward(self, t):
      return t * self.freqs_cos + rotate_half(t) * self.freqs_sin
RuntimeError: The size of tensor a (1000) must match the size of tensor b (400) at non-singleton dimension 2
```
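That size mismatch is consistent with rotary-embedding tables cached for one feature-map size being applied to another: the token count scales with the input resolution, so `freqs_cos`/`freqs_sin` built for one grid cannot be reused at another. A rough illustration of the arithmetic; the patch size of 16 and the example resolutions (chosen only because they reproduce the 1000-vs-400 counts) are assumptions, not the actual values here:

```python
# Token count grows with resolution: (H // patch) * (W // patch).
def num_tokens(h, w, patch=16):
    return (h // patch) * (w // patch)

print(num_tokens(640, 400))  # 1000 tokens on one grid
print(num_tokens(320, 320))  # 400 tokens on another -> mismatched rotary tables
```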

yingfei1016 commented 1 year ago

But I got: `ViT __init__() got an unexpected keyword argument 'global_window_size'`

[Screenshot of the relevant section of docs/ViT_Large.md]

You need to read it carefully!