Sense-X / Co-DETR

[ICCV 2023] DETRs with Collaborative Hybrid Assignments Training
MIT License

Distributed training problems that occur with the DINO model do not occur with Deformable DETR #62

Closed · yimeng436 closed this issue 4 months ago

yimeng436 commented 11 months ago

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

yimeng436 commented 11 months ago

I fixed the problem by changing with_cp to False in the configuration file, but I don't know why it happened.
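For anyone else hitting this: the field lives in the encoder section of the config (it appears as with_cp=False in the full config dump later in this thread). A minimal fragment showing where the change goes:

```python
encoder = dict(
    type='DetrTransformerEncoder',
    num_layers=6,
    with_cp=False,  # disable encoder-layer checkpointing to work around the DDP error
    # ... transformerlayers settings unchanged ...
)
```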

yimeng436 commented 11 months ago

What is the function of the with_cp configuration item? Will it affect the training results? If anyone knows, please let me know. Thank you very much.

TempleX98 commented 11 months ago

with_cp means the number of checkpointing layers in the transformer encoder. It only affects training memory and speed.
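For context on the error above: with_cp enables PyTorch gradient (activation) checkpointing on those encoder layers, and reentrant checkpointing is known to conflict with DDP, producing exactly the "Expected to mark a variable ready only once" error. Below is a minimal, generic sketch of this style of layer checkpointing (not Co-DETR's actual code; the Encoder class and num_cp name are illustrative). On recent PyTorch versions, use_reentrant=False is the usual way to make checkpointing DDP-friendly:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Encoder(nn.Module):
    """Toy transformer encoder where the first num_cp layers are checkpointed."""

    def __init__(self, num_layers=6, num_cp=4, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        self.num_cp = num_cp  # plays the role of with_cp

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if self.training and i < self.num_cp:
                # Drop this layer's activations now and recompute them during
                # backward, trading extra compute for lower peak memory.
                # use_reentrant=False avoids the DDP "marked ready only once" issue.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x

enc = Encoder()
out = enc(torch.randn(2, 100, 256, requires_grad=True))
out.sum().backward()
```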

yimeng436 commented 11 months ago

Thank you very much. I'm a novice; may I ask what graphics card you use for training? I can't run Swin-L + DINO with a 3090's 24G of video memory.

TempleX98 commented 11 months ago
  1. We use A100 80G GPUs.
  2. You can enable backbone checkpointing by setting use_checkpoint=True (see the config fragment below).
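For reference, this toggle sits in the backbone section of the model config; a minimal fragment, with field names taken from the full config dump later in this thread:

```python
model = dict(
    type='CoDETR',
    backbone=dict(
        type='SwinTransformerV1',
        use_checkpoint=True,  # recompute Swin activations during backward to save memory
        # ... remaining backbone settings unchanged ...
    ),
)
```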
yimeng436 commented 11 months ago

Thank you very much. I'll try

zimenglan-sysu-512 commented 11 months ago

hi @yimeng436, can you successfully run Swin-L + DINO with use_checkpoint=True?

zimenglan-sysu-512 commented 11 months ago

hi @yimeng436, can a V100 graphics card train the ViT-L model that gets 66.0% mAP with use_checkpoint=True?

yimeng436 commented 11 months ago

> hi @yimeng436, can you successfully run Swin-L + DINO with use_checkpoint=True?

Yeah, with this configuration I was able to run the co_dino_5scale_swin_large_3x config successfully. However, the co_dino_5scale_swin_large_16e_o365tococo config still does not work with the maximum data-augmentation scale (1536, 2048).

yimeng436 commented 11 months ago

> hi @yimeng436, can a V100 graphics card train the ViT-L model that gets 66.0% mAP with use_checkpoint=True?

I'm a newbie; I don't know how to use the ViT-L model, I just ran with the author's config files. I just want to ask how to train with the ViT-L model. Can you teach me?

TempleX98 commented 11 months ago

We have not released the ViT settings. And the 3090 GPU has insufficient memory to train the ViT-L model.

yimeng436 commented 11 months ago

Thank you very much for your answer. Do you plan to release the ViT-L configuration later?

zimenglan-sysu-512 commented 11 months ago

hi @TempleX98, can you share some details about how to train the ViT model, like what kind of graphics card to use, the batch size, whether gradient accumulation is needed, and so on? Thanks.

TempleX98 commented 11 months ago

@zimenglan-sysu-512 Hi, we may release the ViT model and config in the future. The model settings are presented in the appendix of our paper. We used 56 A100 80G GPUs with img_per_gpu set to 4 during pretraining (a global batch size of 224). If you want to train this model on smaller graphics cards, you may need to use FSDP.
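FSDP shards parameters, gradients, and optimizer state across ranks, so a model too large for one GPU's memory can still train. This repo's configs don't include FSDP wiring; the sketch below is only generic PyTorch FSDP usage (the wrap_with_fsdp helper is illustrative), assuming the process group is launched via torchrun:

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def wrap_with_fsdp(model: torch.nn.Module) -> FSDP:
    # Assumes torchrun has set the usual env vars; initialize NCCL if needed.
    if not dist.is_initialized():
        dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    # Shard any submodule with >= 100M parameters as its own FSDP unit.
    policy = functools.partial(size_based_auto_wrap_policy,
                               min_num_params=int(1e8))
    return FSDP(model.cuda(), auto_wrap_policy=policy)
```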

yimeng436 commented 11 months ago

Hi, I used a 3090 to run Swin-L + DINO and could not reproduce the 0.8 accuracy improvement mentioned in your paper. Is this normal?

TempleX98 commented 11 months ago

Can you show me your training log and config?

yimeng436 commented 11 months ago

```python
dataset_type = 'CocoDataset'
data_root = '/mnt/share/zyh/mmdetection-master/data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='AutoAugment',
        policies=[
            [{
                'type': 'Resize',
                'img_scale': [(480, 1333), (512, 1333), (544, 1333),
                              (576, 1333), (608, 1333), (640, 1333),
                              (672, 1333), (704, 1333), (736, 1333),
                              (768, 1333), (800, 1333)],
                'multiscale_mode': 'value',
                'keep_ratio': True
            }],
            [{
                'type': 'Resize',
                'img_scale': [(400, 4200), (500, 4200), (600, 4200)],
                'multiscale_mode': 'value',
                'keep_ratio': True
            }, {
                'type': 'RandomCrop',
                'crop_type': 'absolute_range',
                'crop_size': (384, 600),
                'allow_negative_crop': True
            }, {
                'type': 'Resize',
                'img_scale': [(480, 1333), (512, 1333), (544, 1333),
                              (576, 1333), (608, 1333), (640, 1333),
                              (672, 1333), (704, 1333), (736, 1333),
                              (768, 1333), (800, 1333)],
                'multiscale_mode': 'value',
                'override': True,
                'keep_ratio': True
            }]
        ]),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=1),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=1),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type='CocoDataset',
        ann_file='/mnt/share/zyh/mmdetection-master/data/coco/annotations/instances_train2017.json',
        img_prefix='/mnt/share/zyh/mmdetection-master/data/coco/train2017/',
        # the dumped config repeats the train_pipeline definition verbatim here
        pipeline=train_pipeline,
        filter_empty_gt=False),
    val=dict(
        type='CocoDataset',
        ann_file='/mnt/share/zyh/mmdetection-master/data/coco/annotations/instances_val2017.json',
        img_prefix='/mnt/share/zyh/mmdetection-master/data/coco/val2017/',
        # identical to test_pipeline above in the dump
        pipeline=test_pipeline),
    test=dict(
        type='CocoDataset',
        ann_file='/mnt/share/zyh/mmdetection-master/data/coco/annotations/instances_val2017.json',
        img_prefix='/mnt/share/zyh/mmdetection-master/data/coco/val2017/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='bbox')
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = '/mnt/share/zyh/Co-DETR-new/tools/co_detr_dino/log/latest.pth'
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
num_dec_layer = 6
lambda_2 = 2.0
model = dict(
    type='CoDETR',
    backbone=dict(
        type='SwinTransformerV1',
        embed_dim=192,
        depths=[2, 2, 18, 2],
        num_heads=[6, 12, 24, 48],
        out_indices=(0, 1, 2, 3),
        window_size=12,
        ape=False,
        drop_path_rate=0.3,
        patch_norm=True,
        use_checkpoint=True,
        pretrained='/mnt/share/zyh/Co-DETR-base/tools/pretrain_weights/swin_large_patch4_window12_384_22k.pth'),
    neck=dict(
        type='ChannelMapper',
        in_channels=[192, 384, 768, 1536],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        norm_cfg=dict(type='GN', num_groups=32),
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            octave_base_scale=4,
            scales_per_octave=3,
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=12.0),
        loss_bbox=dict(type='L1Loss', loss_weight=12.0)),
    query_head=dict(
        type='CoDINOHead',
        num_query=900,
        num_classes=80,
        num_feature_levels=5,
        in_channels=2048,
        sync_cls_avg_factor=True,
        as_two_stage=True,
        with_box_refine=True,
        mixed_selection=True,
        dn_cfg=dict(
            type='CdnQueryGenerator',
            noise_scale=dict(label=0.5, box=1.0),
            group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=100)),
        transformer=dict(
            type='CoDinoTransformer',
            with_pos_coord=True,
            with_coord_feat=False,
            num_co_heads=2,
            num_feature_levels=5,
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                with_cp=False,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=256,
                        num_levels=5,
                        dropout=0.0),
                    feedforward_channels=2048,
                    ffn_dropout=0.0,
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='DinoTransformerDecoder',
                num_layers=6,
                return_intermediate=True,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.0),
                        dict(
                            type='MultiScaleDeformableAttention',
                            embed_dims=256,
                            num_levels=5,
                            dropout=0.0)
                    ],
                    feedforward_channels=2048,
                    ffn_dropout=0.0,
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        positional_encoding=dict(
            type='SinePositionalEncoding',
            num_feats=128,
            temperature=20,
            normalize=True),
        loss_cls=dict(
            type='QualityFocalLoss', use_sigmoid=True, beta=2.0, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    roi_head=[
        dict(
            type='CoStandardRoIHead',
            bbox_roi_extractor=dict(
                type='SingleRoIExtractor',
                roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
                out_channels=256,
                featmap_strides=[4, 8, 16, 32, 64],
                finest_scale=56),
            bbox_head=dict(
                type='Shared2FCBBoxHead',
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=80,
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0.0, 0.0, 0.0, 0.0],
                    target_stds=[0.1, 0.1, 0.2, 0.2]),
                reg_class_agnostic=False,
                reg_decoded_bbox=True,
                loss_cls=dict(
                    type='CrossEntropyLoss', use_sigmoid=False, loss_weight=12.0),
                loss_bbox=dict(type='GIoULoss', loss_weight=120.0)))
    ],
    bbox_head=[
        dict(
            type='CoATSSHead',
            num_classes=80,
            in_channels=256,
            stacked_convs=1,
            feat_channels=256,
            anchor_generator=dict(
                type='AnchorGenerator',
                ratios=[1.0],
                octave_base_scale=8,
                scales_per_octave=1,
                strides=[4, 8, 16, 32, 64, 128]),
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            loss_cls=dict(
                type='FocalLoss',
                use_sigmoid=True,
                gamma=2.0,
                alpha=0.25,
                loss_weight=12.0),
            loss_bbox=dict(type='GIoULoss', loss_weight=24.0),
            loss_centerness=dict(
                type='CrossEntropyLoss', use_sigmoid=True, loss_weight=12.0))
    ],
    train_cfg=[
        dict(
            assigner=dict(
                type='HungarianAssigner',
                cls_cost=dict(type='FocalLossCost', weight=2.0),
                reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
                iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0))),
        dict(
            rpn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.7,
                    neg_iou_thr=0.3,
                    min_pos_iou=0.3,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=256,
                    pos_fraction=0.5,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=False),
                allowed_border=-1,
                pos_weight=-1,
                debug=False),
            rpn_proposal=dict(
                nms_pre=4000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.5,
                    neg_iou_thr=0.5,
                    min_pos_iou=0.5,
                    match_low_quality=False,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=512,
                    pos_fraction=0.25,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=True),
                pos_weight=-1,
                debug=False)),
        dict(
            assigner=dict(type='ATSSAssigner', topk=9),
            allowed_border=-1,
            pos_weight=-1,
            debug=False)
    ],
    test_cfg=[
        dict(max_per_img=300, nms=dict(type='soft_nms', iou_threshold=0.8)),
        dict(
            rpn=dict(
                nms_pre=1000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                score_thr=0.0,
                nms=dict(type='nms', iou_threshold=0.5),
                max_per_img=100)),
        dict(
            nms_pre=1000,
            min_bbox_size=0,
            score_thr=0.0,
            nms=dict(type='nms', iou_threshold=0.6),
            max_per_img=100)
    ])
optimizer = dict(
    type='AdamW',
    lr=0.0002,
    weight_decay=0.0001,
    paramwise_cfg=dict(custom_keys=dict(backbone=dict(lr_mult=0.1))))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
lr_config = dict(policy='step', step=[11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
pretrained = '/mnt/share/zyh/Co-DETR-base/tools/pretrain_weights/swin_large_patch4_window12_384_22k.pth'
work_dir = './co_detr_dino'
auto_resume = False
gpu_ids = [0]
```

This is my config.

yimeng436 commented 11 months ago

[screenshot: training log]

This is the information for the last epoch

TempleX98 commented 11 months ago

@yimeng436, I notice that you use only 1 GPU for training (1 image per GPU). Our default batch size is 16 and the learning rate should be linearly scaled: 0.0002*1/16=1.25e-05.
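The arithmetic, spelled out: the config above sets auto_scale_lr = dict(enable=False, base_batch_size=16), so no automatic scaling happens and the adjustment must be made by hand:

```python
base_lr = 2e-4         # default lr in the config, tuned for batch size 16
base_batch = 16        # auto_scale_lr base_batch_size
actual_batch = 1 * 1   # 1 GPU * samples_per_gpu=1

scaled_lr = base_lr * actual_batch / base_batch
print(scaled_lr)       # 1.25e-05
```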

yimeng436 commented 11 months ago

oh thanks a lot