Closed yimeng436 closed 4 months ago
I fixed the problem by changing `with_cp` in the configuration file to `False`, but I don't know why it happened.
What does the `with_cp` configuration item do? Will it affect the training result? If anyone knows, please let me know. Thank you very much.
`with_cp` means the number of checkpointing layers in the transformer encoder. It only affects the training memory and speed.
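For readers wondering how a layer count can double as an on/off switch, here is a minimal sketch. It assumes (this is not confirmed in the thread) that an integer `with_cp` marks the first `with_cp` encoder layers for activation checkpointing, and that `False`, 0, or a negative value turns it off. The helper name `checkpointed_layer_indices` is hypothetical and is not part of Co-DETR or MMDetection.

```python
def checkpointed_layer_indices(num_layers, with_cp):
    """Return the indices of encoder layers that would use checkpointing.

    Assumption for illustration only: an integer `with_cp` selects the
    first `with_cp` layers; False, 0, or a negative value selects none.
    """
    if with_cp is False or with_cp <= 0:
        return []
    # Never select more layers than the encoder actually has.
    return list(range(min(with_cp, num_layers)))


# The posted config uses a 6-layer encoder.
print(checkpointed_layer_indices(6, False))  # checkpointing disabled
print(checkpointed_layer_indices(6, 2))      # only the first two layers
print(checkpointed_layer_indices(6, 6))      # every encoder layer
```

Checkpointed layers discard their intermediate activations in the forward pass and recompute them during backward, which is why the setting changes memory use and speed rather than the result.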
Thank you very much. I am a novice and would like to ask which graphics card you use for training. I cannot run Swin-L + DINO on a 3090 with 24 GB of video memory.
- We use A100 80G GPUs.
- You can enable backbone checkpointing by setting `use_checkpoint=True`.
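As a config fragment, the switch sits in the backbone dict. The field values below are copied from the config posted later in this thread. Only a few keys are shown, so treat this as an excerpt rather than a complete model definition.

```python
# Excerpt of an MMDetection-style config; the other model keys are omitted.
model = dict(
    backbone=dict(
        type='SwinTransformerV1',
        embed_dim=192,
        depths=[2, 2, 18, 2],
        num_heads=[6, 12, 24, 48],
        window_size=12,
        # Recompute backbone activations during backward to save GPU memory.
        use_checkpoint=True,
    ),
)

print(model['backbone']['use_checkpoint'])
```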
Thank you very much. I'll try
Hi @yimeng436, were you able to run Swin-L + DINO successfully with `use_checkpoint=True`?
Hi @yimeng436, can a V100 graphics card train the ViT-L model that reaches 66.0% mAP with `use_checkpoint=True`?
Yes. With this setting I was able to run the `co_dino_5scale_swin_large_3x` configuration file successfully. However, the `co_dino_5scale_swin_large_16e_o365tococo` configuration still does not run at its maximum data augmentation scale (1536, 2048).
I'm a newbie and don't know how to use the ViT-L model; I simply ran the authors' config file. I would also like to ask how to train with the ViT-L model. Could you show me?
We have not released the ViT settings. And the 3090 GPU has insufficient memory to train the ViT-L model.
Thank you very much for your answer. Do you plan to release the ViT-L configuration later?
Hi @TempleX98, could you share some details on how to train the ViT model, such as which graphics card to use, the batch size, and whether gradient accumulation is needed? Thanks.
@zimenglan-sysu-512 Hi, we may release the ViT model and config in the future. The model settings are presented in the appendix of our paper. We used 56 A100 80G GPUs with `img_per_gpu` set to 4 during pretraining. If you want to train this model with smaller graphics cards, you may need to use FSDP.
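To put those numbers in perspective, here is a small arithmetic sketch. It assumes the effective batch size is simply GPUs times images per GPU; the thread does not say whether gradient accumulation was used, so the accumulation helper is only a what-if calculation. Both function names are hypothetical.

```python
def effective_batch_size(num_gpus, imgs_per_gpu, accumulation_steps=1):
    """Images contributing to one optimizer step."""
    return num_gpus * imgs_per_gpu * accumulation_steps


def accumulation_steps_needed(target_batch, num_gpus, imgs_per_gpu):
    """Smallest number of accumulation steps that reaches target_batch."""
    per_step = num_gpus * imgs_per_gpu
    return -(-target_batch // per_step)  # ceiling division


# Setup described above: 56 GPUs with 4 images each.
target = effective_batch_size(56, 4)
print(target)

# Hypothetical smaller machine: 8 GPUs with 1 image each.
print(accumulation_steps_needed(target, 8, 1))
```

Note that gradient accumulation only recovers the batch size. It does not reduce the memory needed for a single image, which is presumably why FSDP is suggested for smaller cards.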
Hi, I ran Swin-L + DINO on a 3090 and could not reproduce the 0.8 accuracy difference mentioned in your paper. Is this normal?
Can you show me your training log and config?
    dataset_type = 'CocoDataset'
    data_root = '/mnt/share/zyh/mmdetection-master/data/coco/'
    img_norm_cfg = dict(
        mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(
            type='AutoAugment',
            policies=[
                [
                    {
                        'type': 'Resize',
                        'img_scale': [(480, 1333), (512, 1333), (544, 1333),
                                      (576, 1333), (608, 1333), (640, 1333),
                                      (672, 1333), (704, 1333), (736, 1333),
                                      (768, 1333), (800, 1333)],
                        'multiscale_mode': 'value',
                        'keep_ratio': True
                    }
                ],
                [
                    {
                        'type': 'Resize',
                        'img_scale': [(400, 4200), (500, 4200), (600, 4200)],
                        'multiscale_mode': 'value',
                        'keep_ratio': True
                    },
                    {
                        'type': 'RandomCrop',
                        'crop_type': 'absolute_range',
                        'crop_size': (384, 600),
                        'allow_negative_crop': True
                    },
                    {
                        'type': 'Resize',
                        'img_scale': [(480, 1333), (512, 1333), (544, 1333),
                                      (576, 1333), (608, 1333), (640, 1333),
                                      (672, 1333), (704, 1333), (736, 1333),
                                      (768, 1333), (800, 1333)],
                        'multiscale_mode': 'value',
                        'override': True,
                        'keep_ratio': True
                    }
                ]
            ]),
        dict(
            type='Normalize',
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            to_rgb=True),
        dict(type='Pad', size_divisor=1),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ]
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1333, 800),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=1),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img'])
            ])
    ]
    data = dict(
        samples_per_gpu=1,
        workers_per_gpu=1,
        train=dict(
            type='CocoDataset',
            ann_file='/mnt/share/zyh/mmdetection-master/data/coco/annotations/instances_train2017.json',
            img_prefix='/mnt/share/zyh/mmdetection-master/data/coco/train2017/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='AutoAugment',
                    policies=[
                        [
                            {
                                'type': 'Resize',
                                'img_scale': [(480, 1333), (512, 1333),
                                              (544, 1333), (576, 1333),
                                              (608, 1333), (640, 1333),
                                              (672, 1333), (704, 1333),
                                              (736, 1333), (768, 1333),
                                              (800, 1333)],
                                'multiscale_mode': 'value',
                                'keep_ratio': True
                            }
                        ],
                        [
                            {
                                'type': 'Resize',
                                'img_scale': [(400, 4200), (500, 4200),
                                              (600, 4200)],
                                'multiscale_mode': 'value',
                                'keep_ratio': True
                            },
                            {
                                'type': 'RandomCrop',
                                'crop_type': 'absolute_range',
                                'crop_size': (384, 600),
                                'allow_negative_crop': True
                            },
                            {
                                'type': 'Resize',
                                'img_scale': [(480, 1333), (512, 1333),
                                              (544, 1333), (576, 1333),
                                              (608, 1333), (640, 1333),
                                              (672, 1333), (704, 1333),
                                              (736, 1333), (768, 1333),
                                              (800, 1333)],
                                'multiscale_mode': 'value',
                                'override': True,
                                'keep_ratio': True
                            }
                        ]
                    ]),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=1),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ],
            filter_empty_gt=False),
        val=dict(
            type='CocoDataset',
            ann_file='/mnt/share/zyh/mmdetection-master/data/coco/annotations/instances_val2017.json',
            img_prefix='/mnt/share/zyh/mmdetection-master/data/coco/val2017/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(
                    type='MultiScaleFlipAug',
                    img_scale=(1333, 800),
                    flip=False,
                    transforms=[
                        dict(type='Resize', keep_ratio=True),
                        dict(type='RandomFlip'),
                        dict(
                            type='Normalize',
                            mean=[123.675, 116.28, 103.53],
                            std=[58.395, 57.12, 57.375],
                            to_rgb=True),
                        dict(type='Pad', size_divisor=1),
                        dict(type='ImageToTensor', keys=['img']),
                        dict(type='Collect', keys=['img'])
                    ])
            ]),
        test=dict(
            type='CocoDataset',
            ann_file='/mnt/share/zyh/mmdetection-master/data/coco/annotations/instances_val2017.json',
            img_prefix='/mnt/share/zyh/mmdetection-master/data/coco/val2017/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(
                    type='MultiScaleFlipAug',
                    img_scale=(1333, 800),
                    flip=False,
                    transforms=[
                        dict(type='Resize', keep_ratio=True),
                        dict(type='RandomFlip'),
                        dict(
                            type='Normalize',
                            mean=[123.675, 116.28, 103.53],
                            std=[58.395, 57.12, 57.375],
                            to_rgb=True),
                        dict(type='Pad', size_divisor=1),
                        dict(type='ImageToTensor', keys=['img']),
                        dict(type='Collect', keys=['img'])
                    ])
            ]))
    evaluation = dict(interval=1, metric='bbox')
    checkpoint_config = dict(interval=1)
    log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
    custom_hooks = [dict(type='NumClassCheckHook')]
    dist_params = dict(backend='nccl')
    log_level = 'INFO'
    load_from = None
    resume_from = '/mnt/share/zyh/Co-DETR-new/tools/co_detr_dino/log/latest.pth'
    workflow = [('train', 1)]
    opencv_num_threads = 0
    mp_start_method = 'fork'
    auto_scale_lr = dict(enable=False, base_batch_size=16)
    num_dec_layer = 6
    lambda_2 = 2.0
    model = dict(
        type='CoDETR',
        backbone=dict(
            type='SwinTransformerV1',
            embed_dim=192,
            depths=[2, 2, 18, 2],
            num_heads=[6, 12, 24, 48],
            out_indices=(0, 1, 2, 3),
            window_size=12,
            ape=False,
            drop_path_rate=0.3,
            patch_norm=True,
            use_checkpoint=True,
            pretrained='/mnt/share/zyh/Co-DETR-base/tools/pretrain_weights/swin_large_patch4_window12_384_22k.pth'),
        neck=dict(
            type='ChannelMapper',
            in_channels=[192, 384, 768, 1536],
            kernel_size=1,
            out_channels=256,
            act_cfg=None,
            norm_cfg=dict(type='GN', num_groups=32),
            num_outs=5),
        rpn_head=dict(
            type='RPNHead',
            in_channels=256,
            feat_channels=256,
            anchor_generator=dict(
                type='AnchorGenerator',
                octave_base_scale=4,
                scales_per_octave=3,
                ratios=[0.5, 1.0, 2.0],
                strides=[4, 8, 16, 32, 64, 128]),
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[1.0, 1.0, 1.0, 1.0]),
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=True, loss_weight=12.0),
            loss_bbox=dict(type='L1Loss', loss_weight=12.0)),
        query_head=dict(
            type='CoDINOHead',
            num_query=900,
            num_classes=80,
            num_feature_levels=5,
            in_channels=2048,
            sync_cls_avg_factor=True,
            as_two_stage=True,
            with_box_refine=True,
            mixed_selection=True,
            dn_cfg=dict(
                type='CdnQueryGenerator',
                noise_scale=dict(label=0.5, box=1.0),
                group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=100)),
            transformer=dict(
                type='CoDinoTransformer',
                with_pos_coord=True,
                with_coord_feat=False,
                num_co_heads=2,
                num_feature_levels=5,
                encoder=dict(
                    type='DetrTransformerEncoder',
                    num_layers=6,
                    with_cp=False,
                    transformerlayers=dict(
                        type='BaseTransformerLayer',
                        attn_cfgs=dict(
                            type='MultiScaleDeformableAttention',
                            embed_dims=256,
                            num_levels=5,
                            dropout=0.0),
                        feedforward_channels=2048,
                        ffn_dropout=0.0,
                        operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
                decoder=dict(
                    type='DinoTransformerDecoder',
                    num_layers=6,
                    return_intermediate=True,
                    transformerlayers=dict(
                        type='DetrTransformerDecoderLayer',
                        attn_cfgs=[
                            dict(
                                type='MultiheadAttention',
                                embed_dims=256,
                                num_heads=8,
                                dropout=0.0),
                            dict(
                                type='MultiScaleDeformableAttention',
                                embed_dims=256,
                                num_levels=5,
                                dropout=0.0)
                        ],
                        feedforward_channels=2048,
                        ffn_dropout=0.0,
                        operation_order=('self_attn', 'norm', 'cross_attn',
                                         'norm', 'ffn', 'norm')))),
            positional_encoding=dict(
                type='SinePositionalEncoding',
                num_feats=128,
                temperature=20,
                normalize=True),
            loss_cls=dict(
                type='QualityFocalLoss',
                use_sigmoid=True,
                beta=2.0,
                loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=5.0),
            loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
        roi_head=[
            dict(
                type='CoStandardRoIHead',
                bbox_roi_extractor=dict(
                    type='SingleRoIExtractor',
                    roi_layer=dict(
                        type='RoIAlign', output_size=7, sampling_ratio=0),
                    out_channels=256,
                    featmap_strides=[4, 8, 16, 32, 64],
                    finest_scale=56),
                bbox_head=dict(
                    type='Shared2FCBBoxHead',
                    in_channels=256,
                    fc_out_channels=1024,
                    roi_feat_size=7,
                    num_classes=80,
                    bbox_coder=dict(
                        type='DeltaXYWHBBoxCoder',
                        target_means=[0.0, 0.0, 0.0, 0.0],
                        target_stds=[0.1, 0.1, 0.2, 0.2]),
                    reg_class_agnostic=False,
                    reg_decoded_bbox=True,
                    loss_cls=dict(
                        type='CrossEntropyLoss',
                        use_sigmoid=False,
                        loss_weight=12.0),
                    loss_bbox=dict(type='GIoULoss', loss_weight=120.0)))
        ],
        bbox_head=[
            dict(
                type='CoATSSHead',
                num_classes=80,
                in_channels=256,
                stacked_convs=1,
                feat_channels=256,
                anchor_generator=dict(
                    type='AnchorGenerator',
                    ratios=[1.0],
                    octave_base_scale=8,
                    scales_per_octave=1,
                    strides=[4, 8, 16, 32, 64, 128]),
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0.0, 0.0, 0.0, 0.0],
                    target_stds=[0.1, 0.1, 0.2, 0.2]),
                loss_cls=dict(
                    type='FocalLoss',
                    use_sigmoid=True,
                    gamma=2.0,
                    alpha=0.25,
                    loss_weight=12.0),
                loss_bbox=dict(type='GIoULoss', loss_weight=24.0),
                loss_centerness=dict(
                    type='CrossEntropyLoss', use_sigmoid=True, loss_weight=12.0))
        ],
        train_cfg=[
            dict(
                assigner=dict(
                    type='HungarianAssigner',
                    cls_cost=dict(type='FocalLossCost', weight=2.0),
                    reg_cost=dict(
                        type='BBoxL1Cost', weight=5.0, box_format='xywh'),
                    iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0))),
            dict(
                rpn=dict(
                    assigner=dict(
                        type='MaxIoUAssigner',
                        pos_iou_thr=0.7,
                        neg_iou_thr=0.3,
                        min_pos_iou=0.3,
                        match_low_quality=True,
                        ignore_iof_thr=-1),
                    sampler=dict(
                        type='RandomSampler',
                        num=256,
                        pos_fraction=0.5,
                        neg_pos_ub=-1,
                        add_gt_as_proposals=False),
                    allowed_border=-1,
                    pos_weight=-1,
                    debug=False),
                rpn_proposal=dict(
                    nms_pre=4000,
                    max_per_img=1000,
                    nms=dict(type='nms', iou_threshold=0.7),
                    min_bbox_size=0),
                rcnn=dict(
                    assigner=dict(
                        type='MaxIoUAssigner',
                        pos_iou_thr=0.5,
                        neg_iou_thr=0.5,
                        min_pos_iou=0.5,
                        match_low_quality=False,
                        ignore_iof_thr=-1),
                    sampler=dict(
                        type='RandomSampler',
                        num=512,
                        pos_fraction=0.25,
                        neg_pos_ub=-1,
                        add_gt_as_proposals=True),
                    pos_weight=-1,
                    debug=False)),
            dict(
                assigner=dict(type='ATSSAssigner', topk=9),
                allowed_border=-1,
                pos_weight=-1,
                debug=False)
        ],
        test_cfg=[
            dict(max_per_img=300, nms=dict(type='soft_nms', iou_threshold=0.8)),
            dict(
                rpn=dict(
                    nms_pre=1000,
                    max_per_img=1000,
                    nms=dict(type='nms', iou_threshold=0.7),
                    min_bbox_size=0),
                rcnn=dict(
                    score_thr=0.0,
                    nms=dict(type='nms', iou_threshold=0.5),
                    max_per_img=100)),
            dict(
                nms_pre=1000,
                min_bbox_size=0,
                score_thr=0.0,
                nms=dict(type='nms', iou_threshold=0.6),
                max_per_img=100)
        ])
    optimizer = dict(
        type='AdamW',
        lr=0.0002,
        weight_decay=0.0001,
        paramwise_cfg=dict(custom_keys=dict(backbone=dict(lr_mult=0.1))))
    optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
    lr_config = dict(policy='step', step=[11])
    runner = dict(type='EpochBasedRunner', max_epochs=12)
    pretrained = '/mnt/share/zyh/Co-DETR-base/tools/pretrain_weights/swin_large_patch4_window12_384_22k.pth'
    work_dir = './co_detr_dino'
    auto_resume = False
    gpu_ids = [0]
This is my config.
This is the information for the last epoch
@yimeng436, I notice that you use only 1 GPU for training (1 image per GPU). Our default batch size is 16 and the learning rate should be linearly scaled: `0.0002*1/16=1.25e-05`.
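The scaling rule above can be written as a small helper. This is only a sketch of the arithmetic in the reply; the function name `scale_lr` is hypothetical, and the base values (learning rate 0.0002, batch size 16) come from the posted config and the maintainer's comment.

```python
def scale_lr(base_lr, base_batch_size, num_gpus, samples_per_gpu):
    """Linearly scale the learning rate with the total batch size."""
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


# Setup from the posted config: 1 GPU, samples_per_gpu=1.
lr = scale_lr(base_lr=0.0002, base_batch_size=16, num_gpus=1, samples_per_gpu=1)
print(lr)

# The value would then replace lr=0.0002 in the optimizer dict.
optimizer = dict(type='AdamW', lr=lr, weight_decay=0.0001)
```

The posted config also contains `auto_scale_lr = dict(enable=False, base_batch_size=16)`, so with that option disabled the learning rate has to be adjusted by hand as shown.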
oh thanks a lot
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.