Open zhongzee opened 4 months ago
For this part we use the LVIS-base/novel annotation files provided by ovdet (which essentially just split the categories into common+frequent vs. rare). The config uses the same settings as the COCO fine-tuning script, with a few basic differences:
- The number of training classes is still 80, and the number of test classes is 1203 (we did not explore this further).
- CLIP is unfrozen for training, with a 0.01x learning rate.
- Compared with the YOLOv8 baseline experiments, "class-balanced sampling" is not used.

Thanks for your reply. Didn't you say in the paper that you train on LVIS-base? Why are there 80 training classes, then? Also, if the test uses all 1203 classes, what is the point of splitting common+frequent from rare? Is the validation dataset config below correct? lvis_v1_val is the full LVIS annotation with 1203 classes, and the training set is plain COCO 2017 with 80 classes; here I set num_classes = 1203 and num_training_classes = 80:

```python
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='/mnt/afs/huangtao3/wzz/YOLO-World/COCO2017',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
                 test_mode=True,
                 ann_file='annotations/lvis_v1_val.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/data/texts/lvis_v1_class_texts.json',
    pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
```

Hoping for your reply, thank you!
The train_dataset needs to use LVIS data, with LVIS annotations that contain only the base classes.
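For anyone reproducing this, a base-only (common + frequent) LVIS training annotation can be derived from the full lvis_v1_train.json with a small filter along these lines. This is a sketch assuming the standard LVIS v1 json layout, where every category carries a 'frequency' field of 'r'/'c'/'f'; whether to also prune the category list itself is a separate choice and may differ from what ovdet ships:

```python
def filter_lvis_base(ann):
    """Drop annotations of rare ('r') categories, keeping common ('c') and
    frequent ('f') ones. The category list is left intact so category ids
    stay aligned with the full LVIS taxonomy."""
    base_ids = {c['id'] for c in ann['categories']
                if c['frequency'] in ('c', 'f')}
    out = dict(ann)  # shallow copy; the input dict is not modified
    out['annotations'] = [a for a in ann['annotations']
                          if a['category_id'] in base_ids]
    return out
```

Writing the returned dict back out with json.dump gives a norare-style annotation file.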
> The train_dataset needs to use LVIS data, with LVIS annotations that contain only the base classes.

Thanks for your reply. Below is my modified setting. YOLOv5LVISV1TrainNoRareDataset is my custom dataset that contains only the base-class LVIS annotations, with the meta-info changed accordingly. Is this correct? Also, is num_training_classes=80 right, or should it be changed to the number of base classes? And should class_text_path='data/texts/lvis_v1_class_texts.json' be changed to contain only the base-class names?

```python
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5LVISV1TrainNoRareDataset',
        data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
        ann_file='annotations/lvis_v1_train_norare.json',
        data_prefix=dict(img=''),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/lvis_v1_class_texts.json',
    pipeline=train_pipeline)
```
Yes, the current setup with YOLOv5LVISV1TrainNoRareDataset is fine.
Thank you for your reply and for the excellent work!
> For this part we use the LVIS-base/novel annotation files provided by ovdet (which essentially just split the categories into common+frequent vs. rare). The config uses the same settings as the COCO fine-tuning script, with a few basic differences:
> - The number of training classes is still 80, and the number of test classes is 1203 (we did not explore this further).
> - CLIP is unfrozen for training, with a 0.01x learning rate.
> - Compared with the YOLOv8 baseline experiments, "class-balanced sampling" is not used.

When you say "class-balanced sampling" is not used, do you mean not setting the following?

```python
text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]
```

Also, for the validation set, we just test on the full 1203 classes, right? (And did you use V100s? What batch size?)

```python
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
                 test_mode=True,
                 ann_file='annotations/lvis_v1_val.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/data/texts/lvis_v1_class_texts.json',
    pipeline=test_pipeline)
```
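As an aside, "class-balanced sampling" in the LVIS literature usually refers to repeat-factor sampling over images rather than anything in RandomLoadText. A minimal sketch of that computation, with an illustrative threshold value:

```python
import math

def repeat_factors(img_categories, thresh=1e-3):
    """LVIS-style repeat-factor sampling: r(c) = max(1, sqrt(t / f(c))),
    where f(c) is the fraction of images containing category c; each
    image is repeated by the max repeat factor over its categories."""
    num_images = len(img_categories)
    # f(c) * num_images: number of images containing category c
    freq = {}
    for cats in img_categories:
        for c in set(cats):
            freq[c] = freq.get(c, 0) + 1
    cat_rep = {c: max(1.0, math.sqrt(thresh * num_images / n))
               for c, n in freq.items()}
    return [max((cat_rep[c] for c in set(cats)), default=1.0)
            for cats in img_categories]
```

Frequent categories keep a factor of 1.0, while images containing rare categories get oversampled.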
When I tried the OVD setting on COCO, training with 48 base classes and testing on 65 classes (48 base + 17 novel), the model lost its open-vocabulary detection ability! (Do I need to unfreeze CLIP here, i.e. remove frozen_modules=['all']?)
2024/05/18 15:45:43 - mmengine - INFO - bbox_mAP_copypaste: 0.044 0.059 0.048 0.032 0.045 0.060
2024/05/18 15:45:44 - mmengine - INFO - Epoch(val) [21][605/605] coco/bbox_mAP: 0.0440 coco/bbox_mAP_50: 0.0590 coco/bbox_mAP_75: 0.0480 coco/bbox_mAP_s: 0.0320 coco/bbox_mAP_m: 0.0450 coco/bbox_mAP_l: 0.0600 data_time: 0.0011 time: 0.0712
2024/05/18 15:46:25 - mmengine - INFO - Epoch(train) [22][ 50/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:55:06 time: 0.8143 data_time: 0.0942 memory: 10889 grad_norm: 567.7311 loss: 378.1312 loss_cls: 120.1962 loss_bbox: 118.0376 loss_dfl: 139.8975
2024/05/18 15:47:00 - mmengine - INFO - Epoch(train) [22][100/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:54:27 time: 0.6991 data_time: 0.0034 memory: 11075 grad_norm: 602.8765 loss: 382.0110 loss_cls: 122.0841 loss_bbox: 120.9391 loss_dfl: 138.9878
2024/05/18 15:47:36 - mmengine - INFO - Epoch(train) [22][150/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:53:51 time: 0.7212 data_time: 0.0033 memory: 11395 grad_norm: 644.6851 loss: 384.3241 loss_cls: 121.6649 loss_bbox: 121.9390 loss_dfl: 140.7202
2024/05/18 15:48:11 - mmengine - INFO - Epoch(train) [22][200/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:53:12 time: 0.6936 data_time: 0.0035 memory: 11302 grad_norm: 621.4471 loss: 377.9176 loss_cls: 120.0765 loss_bbox: 118.8016 loss_dfl: 139.0394
2024/05/18 15:48:46 - mmengine - INFO - Epoch(train) [22][250/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:52:33 time: 0.6985 data_time: 0.0034 memory: 11195 grad_norm: inf loss: 382.9778 loss_cls: 121.5206 loss_bbox: 122.4125 loss_dfl: 139.0447
2024/05/18 15:49:22 - mmengine - INFO - Epoch(train) [22][300/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:51:57 time: 0.7176 data_time: 0.0035 memory: 11355 grad_norm: 613.2525 loss: 383.9486 loss_cls: 124.0898 loss_bbox: 120.0726 loss_dfl: 139.7862
2024/05/18 15:49:34 - mmengine - INFO - Exp name: yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_mask-refine_finetune_coco_test1_ovd_orin_test65_20240518_111944
2024/05/18 15:49:57 - mmengine - INFO - Epoch(train) [22][350/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:51:18 time: 0.7022 data_time: 0.0037 memory: 11182 grad_norm: 638.4827 loss: 381.9660 loss_cls: 121.1140 loss_bbox: 121.0842 loss_dfl: 139.7678
2024/05/18 15:50:32 - mmengine - INFO - Epoch(train) [22][400/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:50:39 time: 0.6952 data_time: 0.0035 memory: 11129 grad_norm: 602.5377 loss: 382.5560 loss_cls: 121.8785 loss_bbox: 120.7352 loss_dfl: 139.9422
2024/05/18 15:51:08 - mmengine - INFO - Epoch(train) [22][450/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:50:04 time: 0.7286 data_time: 0.0035 memory: 11369 grad_norm: 616.4025 loss: 385.9237 loss_cls: 124.0875 loss_bbox: 121.1290 loss_dfl: 140.7072
2024/05/18 15:51:43 - mmengine - INFO - Epoch(train) [22][500/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:49:25 time: 0.6933 data_time: 0.0035 memory: 11089 grad_norm: 615.7453 loss: 385.0143 loss_cls: 123.3356 loss_bbox: 121.5186 loss_dfl: 140.1600
2024/05/18 15:52:18 - mmengine - INFO - Epoch(train) [22][550/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:48:47 time: 0.7064 data_time: 0.0033 memory: 11249 grad_norm: 634.6645 loss: 373.9372 loss_cls: 118.9502 loss_bbox: 117.0383 loss_dfl: 137.9486
2024/05/18 15:52:55 - mmengine - INFO - Epoch(train) [22][600/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:48:12 time: 0.7251 data_time: 0.0035 memory: 11195 grad_norm: 623.2596 loss: 374.9919 loss_cls: 119.9821 loss_bbox: 117.1996 loss_dfl: 137.8102
2024/05/18 15:53:30 - mmengine - INFO - Epoch(train) [22][650/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:47:35 time: 0.7127 data_time: 0.0035 memory: 11049 grad_norm: 623.1412 loss: 383.1364 loss_cls: 122.4215 loss_bbox: 120.2235 loss_dfl: 140.4914
2024/05/18 15:54:04 - mmengine - INFO - Epoch(train) [22][700/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:46:55 time: 0.6841 data_time: 0.0033 memory: 10915 grad_norm: 619.5305 loss: 382.5745 loss_cls: 122.5780 loss_bbox: 119.1784 loss_dfl: 140.8181
2024/05/18 15:54:41 - mmengine - INFO - Epoch(train) [22][750/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:46:20 time: 0.7289 data_time: 0.0048 memory: 11289 grad_norm: 624.5849 loss: 381.7764 loss_cls: 121.5486 loss_bbox: 120.0085 loss_dfl: 140.2192
2024/05/18 15:55:16 - mmengine - INFO - Epoch(train) [22][800/842] base_lr: 2.0000e-04 lr: 1.5050e-04 eta: 9:45:43 time: 0.7090 data_time: 0.0035 memory: 11169 grad_norm: 643.3431 loss: 386.9809 loss_cls: 124.8837 loss_bbox: 121.1803 loss_dfl: 140.9169
2024/05/18 15:55:45 - mmengine - INFO - Exp name: yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_mask-refine_finetune_coco_test1_ovd_orin_test65_20240518_111944
2024/05/18 15:55:49 - mmengine - INFO - Epoch(val) [22][ 50/605] eta: 0:00:38 time: 0.0701 data_time: 0.0009 memory: 10822
2024/05/18 15:55:52 - mmengine - INFO - Epoch(val) [22][100/605] eta: 0:00:35 time: 0.0700 data_time: 0.0004 memory: 1754
2024/05/18 15:55:56 - mmengine - INFO - Epoch(val) [22][150/605] eta: 0:00:32 time: 0.0733 data_time: 0.0043 memory: 1754
2024/05/18 15:56:00 - mmengine - INFO - Epoch(val) [22][200/605] eta: 0:00:28 time: 0.0702 data_time: 0.0003 memory: 1754
2024/05/18 15:56:03 - mmengine - INFO - Epoch(val) [22][250/605] eta: 0:00:25 time: 0.0735 data_time: 0.0039 memory: 1754
2024/05/18 15:56:07 - mmengine - INFO - Epoch(val) [22][300/605] eta: 0:00:21 time: 0.0697 data_time: 0.0004 memory: 1754
2024/05/18 15:56:10 - mmengine - INFO - Epoch(val) [22][350/605] eta: 0:00:18 time: 0.0700 data_time: 0.0004 memory: 1754
2024/05/18 15:56:14 - mmengine - INFO - Epoch(val) [22][400/605] eta: 0:00:14 time: 0.0698 data_time: 0.0004 memory: 1754
2024/05/18 15:56:17 - mmengine - INFO - Epoch(val) [22][450/605] eta: 0:00:10 time: 0.0697 data_time: 0.0004 memory: 1754
2024/05/18 15:56:21 - mmengine - INFO - Epoch(val) [22][500/605] eta: 0:00:07 time: 0.0699 data_time: 0.0004 memory: 1754
2024/05/18 15:56:24 - mmengine - INFO - Epoch(val) [22][550/605] eta: 0:00:03 time: 0.0694 data_time: 0.0004 memory: 1754
2024/05/18 15:56:28 - mmengine - INFO - Epoch(val) [22][600/605] eta: 0:00:00 time: 0.0747 data_time: 0.0038 memory: 1754
2024/05/18 15:56:44 - mmengine - INFO - Evaluating bbox...
2024/05/18 15:58:05 - mmengine - INFO - bbox_mAP_copypaste: 0.044 0.059 0.048 0.032 0.045 0.060
The config file is as follows, based on yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_mask-refine_finetune_coco.py. YOLOv5OVDTrainBCocoDataset has its metainfo changed to only 48 classes, YOLOv5OVDValTCocoDataset has 65 classes, and instances_val2017_all_2_text contains the 65 class names, but class_text_path='data/texts/coco_class_texts.json' was left unchanged and still has 80 classes. Could that be the problem?
```python
num_classes = 65
num_training_classes = 48
max_epochs = 80  # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 16
load_from = '/mnt/afs/huangtao3/wzz/YOLO-World/weights/YOLO-World/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
text_model_name = '/mnt/afs/huangtao3/wzz/YOLO-World/weights/clip-vit-base-patch32'
persistent_workers = False

model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            model_name=text_model_name,
            frozen_modules=['all'])),
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
    bbox_head=dict(type='YOLOWorldHead',
                   head_module=dict(type='YOLOWorldHeadModule',
                                    use_bn_head=True,
                                    embed_dims=text_channels,
                                    num_classes=num_training_classes)),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]
mosaic_affine_transform = [
    dict(
        type='MultiModalMosaic',
        img_scale=_base_.img_scale,
        pad_val=114.0,
        pre_transform=_base_.pre_transform),
    dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        max_aspect_ratio=100.,
        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
        border_val=(114, 114, 114),
        min_area_ratio=_base_.min_area_ratio,
        use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
    *_base_.pre_transform,
    *mosaic_affine_transform,
    dict(
        type='YOLOv5MultiModalMixUp',
        prob=_base_.mixup_prob,
        pre_transform=[*_base_.pre_transform, *mosaic_affine_transform]),
    *_base_.last_transform[:-1],
    *text_transform
]
train_pipeline_stage2 = [*_base_.train_pipeline_stage2[:-1], *text_transform]
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5OVDTrainBCocoDataset',
        data_root='/mnt/afs/huangtao3/wzz/YOLO-World/COCO2017',
        ann_file='annotations/ovd_ins_train2017_b.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
train_dataloader = dict(
    persistent_workers=persistent_workers,
    batch_size=train_batch_size_per_gpu,
    collate_fn=dict(type='yolow_collate'),
    dataset=coco_train_dataset)
test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5OVDValTCocoDataset',
        data_root='/mnt/afs/huangtao3/wzz/YOLO-World/COCO2017',
        ann_file='annotations/instances_val2017_all_2.json',
        data_prefix=dict(img='val2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/COCO2017/annotations/instances_val2017_all_2_text.json',
    pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
```
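One way to test the class_text_path suspicion directly: as far as I understand, RandomLoadText pairs text entries with ground-truth labels by index, so the text json (a list of per-class name lists) must line up one-to-one with the dataset metainfo. A hypothetical checker along these lines (check_text_alignment is not part of the repo):

```python
def check_text_alignment(metainfo_classes, class_texts):
    """Return a list of mismatches between dataset metainfo class names and
    the class-text entries, which are assumed to be indexed by label id."""
    problems = []
    if len(class_texts) != len(metainfo_classes):
        problems.append('count mismatch: %d texts vs %d classes'
                        % (len(class_texts), len(metainfo_classes)))
    for i, (name, names) in enumerate(zip(metainfo_classes, class_texts)):
        if name not in names:
            problems.append('index %d: metainfo %r not among texts %r'
                            % (i, name, names))
    return problems
```

An empty return value means the 48-class metainfo and the text file agree; a long list of mismatches would confirm the 80-class coco_class_texts.json is pairing labels with the wrong words.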
```
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.033
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=300 catIds=all] = 0.043
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=300 catIds=all] = 0.035
Average Precision  (AP) @[ IoU=0.50:0.95 | area=     s | maxDets=300 catIds=all] = 0.025
Average Precision  (AP) @[ IoU=0.50:0.95 | area=     m | maxDets=300 catIds=all] = 0.044
Average Precision  (AP) @[ IoU=0.50:0.95 | area=     l | maxDets=300 catIds=all] = 0.064
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  r] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  c] = 0.001
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  f] = 0.085
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.048
Average Recall     (AR) @[ IoU=0.50:0.95 | area=     s | maxDets=300 catIds=all] = 0.035
Average Recall     (AR) @[ IoU=0.50:0.95 | area=     m | maxDets=300 catIds=all] = 0.066
Average Recall     (AR) @[ IoU=0.50:0.95 | area=     l | maxDets=300 catIds=all] = 0.093
05/20 15:41:30 - mmengine - INFO - Epoch(val) [10][2477/2477]  lvis/bbox_AP: 0.0330  lvis/bbox_AP50: 0.0430  lvis/bbox_AP75: 0.0350  lvis/bbox_APs: 0.0250  lvis/bbox_APm: 0.0440  lvis/bbox_APl: 0.0640  lvis/bbox_APr: 0.0000  lvis/bbox_APc: 0.0010  lvis/bbox_APf: 0.0850  data_time: 0.0054  time: 1.4089
```

Hello author, after switching to LVIS data the results are as above. The full parameters are below (YOLOv5LVISV1TrainNoRareDataset corresponds to the 866 base classes, and the json it uses also has 866 classes). I have tried many times and cannot figure out where the problem is:

```python
num_classes = 1203
num_training_classes = 80
max_epochs = 80  # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 4
load_from = '/mnt/afs/huangtao3/wzz/YOLO-World/weights/YOLO-World/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
text_model_name = '/mnt/afs/huangtao3/wzz/YOLO-World/weights/clip-vit-base-patch32'
persistent_workers = False

model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            model_name=text_model_name)),  # frozen_modules=['all'] removed
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
    bbox_head=dict(type='YOLOWorldHead',
                   head_module=dict(type='YOLOWorldHeadModule',
                                    use_bn_head=True,
                                    embed_dims=text_channels,
                                    num_classes=num_training_classes)),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]
mosaic_affine_transform = [
    dict(
        type='MultiModalMosaic',
        img_scale=_base_.img_scale,
        pad_val=114.0,
        pre_transform=_base_.pre_transform),
    dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        max_aspect_ratio=100.,
        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
        border_val=(114, 114, 114),
        min_area_ratio=_base_.min_area_ratio,
        use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
    *_base_.pre_transform,
    *mosaic_affine_transform,
    dict(
        type='YOLOv5MultiModalMixUp',
        prob=_base_.mixup_prob,
        pre_transform=[*_base_.pre_transform, *mosaic_affine_transform]),
    *_base_.last_transform[:-1],
    *text_transform
]
train_pipeline_stage2 = [*_base_.train_pipeline_stage2[:-1], *text_transform]
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5LVISV1TrainNoRareDataset',
        data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
        ann_file='annotations/lvis_v1_train_norare.json',
        data_prefix=dict(img=''),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/annotations/norare_category_names_text.json',
    pipeline=train_pipeline)
train_dataloader = dict(
    persistent_workers=persistent_workers,
    batch_size=train_batch_size_per_gpu,
    collate_fn=dict(type='yolow_collate'),
    dataset=coco_train_dataset)
test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
                 test_mode=True,
                 ann_file='annotations/lvis_v1_val.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/data/texts/lvis_v1_class_texts.json',
    pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
default_hooks = dict(
    param_scheduler=dict(
        scheduler_type='linear',
        lr_factor=0.01,
        max_epochs=max_epochs),
    checkpoint=dict(
        max_keep_ckpts=-1,
        save_best=None,
        interval=save_epoch_intervals))
custom_hooks = [
    dict(
        type='EMAHook',
        ema_type='ExpMomentumEMA',
        momentum=0.0001,
        update_buffers=True,
        strict_load=False,
        priority=49),
    dict(
        type='mmdet.PipelineSwitchHook',
        switch_epoch=max_epochs - close_mosaic_epochs,
        switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
    max_epochs=max_epochs,
    val_interval=5,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])
optim_wrapper = dict(
    optimizer=dict(
        _delete_=True,
        type='AdamW',
        lr=base_lr,
        weight_decay=weight_decay,
        batch_size_per_gpu=train_batch_size_per_gpu),
    paramwise_cfg=dict(
        custom_keys={'backbone.text_model': dict(lr_mult=0.01),
                     'logit_scale': dict(weight_decay=0.0)}),
    constructor='YOLOWv5OptimizerConstructor')
val_evaluator = dict(
    type='mmdet.LVISMetric',
    ann_file='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/annotations/lvis_v1_val.json',
    metric='bbox')
test_evaluator = val_evaluator
```
Hello author! For the Table 7 setting, should the training set use the 866 base classes and the validation set all 1203 classes? Or are both training and validation done with 866 classes, with the full 1203 classes used only at test time?
The complete config is:
_base_ = (
'../../../../../third_party/mmyolo/configs/yolov8/'
'yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(
imports=['projects.YoloW.yolow'],
allow_failed_imports=False)
# hyper-parameters
num_classes = 1203
num_training_classes = 80
max_epochs = 80 # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
# wrong!
weight_decay = 0.05
train_batch_size_per_gpu = 8
persistent_workers = False
# Polygon2Mask
downsample_ratio = 4
mask_overlap = False
use_mask2refine = True
max_aspect_ratio = 100
min_area_ratio = 0.01
persistent_workers = False
# zero-shot lvis mini:28.6 AP
load_from = 'outputs/pretrain_yolow-v8_l_clipv2_te_sattneck_2e-3adamw_16xb8-100e_obj365v1_cc3m_noregress_gqa_train_lviseval_clip10ep/epoch_100.pth'
# model settings
model = dict(
type='YOLOWDetector',
mm_neck=True,
num_train_classes=num_training_classes,
num_test_classes=num_classes,
data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
backbone=dict(
_delete_=True,
type='MMTransformer',
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackboneV2',
model_name=
'/group/30042/adriancheng/pretrained_models/clip-vit-base-patch32-projection',
frozen_modules=[])),
neck=dict(type='TextEnhancedYOLOWv8PAFPN',
guide_channels=text_channels,
embed_channels=neck_embed_channels,
num_heads=neck_num_heads,
block_cfg=dict(type='SigmoidAttnCSPLayerWithTwoConv'),
text_enhancder=dict(type='TextEnhanceModuleV2',
embed_channels=256,
num_heads=8,
pool_size=3)),
bbox_head=dict(type='YOLOWv8Head',
head_module=dict(type='YOLOWv8HeadModule',
embed_dims=text_channels,
num_classes=num_training_classes)),
train_cfg=dict(assigner=dict(num_classes=num_training_classes)))
# dataset settings
text_transform = [
dict(type='RandomLoadText',
num_neg_samples=(num_classes, num_classes),
max_num_samples=num_training_classes,
padding_to_max=True,
padding_value=''),
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
'flip_direction', 'texts'))
]
mosaic_affine_transform = [
dict(
type='MultiModalMosaic',
img_scale=_base_.img_scale,
pad_val=114.0,
pre_transform=_base_.pre_transform),
dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
dict(
type='YOLOv5RandomAffine',
max_rotate_degree=0.0,
max_shear_degree=0.0,
max_aspect_ratio=100.,
scaling_ratio_range=(1 - _base_.affine_scale,
1 + _base_.affine_scale),
# img_scale is (width, height)
border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
border_val=(114, 114, 114),
min_area_ratio=_base_.min_area_ratio,
use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
*_base_.pre_transform,
*mosaic_affine_transform,
dict(
type='YOLOv5MultiModalMixUp',
prob=_base_.mixup_prob,
pre_transform=[*_base_.pre_transform,
*mosaic_affine_transform]),
*_base_.last_transform[:-1],
*text_transform
]
train_pipeline_stage2 = [
*_base_.train_pipeline_stage2[:-1],
*text_transform
]
coco_train_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
data_root='data/coco',
ann_file='lvis/lvis_v1_train_base.json',
data_prefix=dict(img=''),
filter_cfg=dict(filter_empty_gt=True, min_size=32)),
class_text_path='data/captions/lvis_v1_base_class_captions.json',
pipeline=train_pipeline)
train_dataloader = dict(
persistent_workers=persistent_workers,
batch_size=train_batch_size_per_gpu,
collate_fn=dict(type='yolow_collate'),
dataset=coco_train_dataset)
test_pipeline = [
*_base_.test_pipeline[:-1],
dict(type='LoadText'),
dict(
type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(type='YOLOv5LVISV1Dataset',
data_root='data/coco/',
test_mode=True,
ann_file='lvis/lvis_v1_minival_inserted_image_name.json',
data_prefix=dict(img=''),
batch_shapes_cfg=None),
class_text_path='data/captions/lvis_v1_class_captions.json',
pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
val_evaluator = dict(type='mmdet.LVISMetric',
ann_file='data/coco/lvis/lvis_v1_minival_inserted_image_name.json',
metric='bbox')
test_evaluator = val_evaluator
# training settings
default_hooks = dict(
param_scheduler=dict(
scheduler_type='linear',
lr_factor=0.01,
max_epochs=max_epochs),
checkpoint=dict(
max_keep_ckpts=-1,
save_best=None,
interval=save_epoch_intervals))
custom_hooks = [
dict(
type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
strict_load=False,
priority=49),
dict(
type='mmdet.PipelineSwitchHook',
switch_epoch=max_epochs - close_mosaic_epochs,
switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
max_epochs=max_epochs,
val_interval=5,
dynamic_intervals=[((max_epochs - close_mosaic_epochs),
_base_.val_interval_stage2)])
optim_wrapper = dict(
optimizer=dict(
_delete_=True,
type='AdamW',
lr=base_lr,
weight_decay=weight_decay,
batch_size_per_gpu=train_batch_size_per_gpu),
paramwise_cfg=dict(
bias_decay_mult=0.0,
norm_decay_mult=0.0,
custom_keys={'backbone.text_model': dict(lr_mult=0.01),
'logit_scale': dict(weight_decay=0.0)}),
constructor='YOLOWv5OptimizerConstructor')
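For what it's worth, the 0.01x CLIP learning rate mentioned earlier comes from the custom_keys={'backbone.text_model': dict(lr_mult=0.01)} entry above. Stripped of the mmengine constructor machinery, the effect is equivalent to building optimizer parameter groups along these lines (an illustrative, framework-free sketch, not the actual YOLOWv5OptimizerConstructor):

```python
def build_param_groups(named_params, base_lr, text_lr_mult=0.01):
    """Put text-backbone parameters in their own group with a scaled-down
    learning rate, mirroring custom_keys={'backbone.text_model':
    dict(lr_mult=0.01)} in paramwise_cfg."""
    text, other = [], []
    for name, param in named_params:
        (text if 'backbone.text_model' in name else other).append(param)
    return [
        {'params': other, 'lr': base_lr},
        {'params': text, 'lr': base_lr * text_lr_mult},
    ]
```

The returned list can be passed directly to an AdamW constructor, so the CLIP text encoder trains at 2e-6 while the rest of the model trains at 2e-4.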
NOTE: this config is from the experimental codebase; you cannot copy-paste it directly,
since the Python class names are not consistent with those in this repo.
Thank you so much, I will try this config. One question about the validation set below: does lvis_v1_minival_inserted_image_name refer to the LVIS minival dataset, and is lvis_v1_val.json the LVIS val dataset from Table 1 (AP_val)?

```python
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 data_root='data/coco/',
                 test_mode=True,
                 ann_file='lvis/lvis_v1_minival_inserted_image_name.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    class_text_path='data/captions/lvis_v1_class_captions.json',
    pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
val_evaluator = dict(type='mmdet.LVISMetric',
                     ann_file='data/coco/lvis/lvis_v1_minival_inserted_image_name.json',
                     metric='bbox')
test_evaluator = val_evaluator
```
This part needs to be changed to the standard LVIS 1.0 val annotation; do not use that version of the annotation file. lvis_v1_minival_inserted_image_name is only used for evaluation during pre-training.
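Concretely, the swap described above amounts to changing the annotation file in both the val dataset and the evaluator. A minimal sketch, with paths assumed to follow the layout used elsewhere in this thread:

```python
# Sketch: validate on full LVIS 1.0 val instead of the minival file,
# which is only meant for pre-training evaluation. Paths are assumptions
# based on the data layout used in this thread.
val_ann_file = 'data/coco/lvis/lvis_v1_val.json'

coco_val_dataset_patch = dict(
    dataset=dict(ann_file='lvis/lvis_v1_val.json'))  # relative to data_root='data/coco/'
val_evaluator_patch = dict(type='mmdet.LVISMetric',
                           ann_file=val_ann_file,  # evaluator path resolves from the cwd
                           metric='bbox')
```

Note the dataset `ann_file` is resolved against `data_root`, while the evaluator `ann_file` is a plain path; both must point at the same annotation file.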
Got it, thank you!
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5LVISV1Dataset',
        data_root='data/coco',
        ann_file='lvis/lvis_v1_train_base.json',
        data_prefix=dict(img=''),
        filter_cfg=dict(filter_empty_gt=True, min_size=32)),
    class_text_path='data/captions/lvis_v1_base_class_captions.json',
    pipeline=train_pipeline)

A question: lvis_v1_train_base here only has 866 classes, while the metainfo of YOLOv5LVISV1Dataset lists 1203 classes. Won't this affect validation? Can the category_ids still be matched up? Also, if my validation set has fewer than 1203 classes, do I still only need to change ann_file without touching the metainfo?
Hi, I tried the configuration above, but the results still look very strange. I would appreciate your guidance.
2024/05/22 01:00:37 - mmengine - INFO - Evaluating bbox... 2024/05/22 01:12:21 - mmengine - INFO - Epoch(val) [5][2477/2477] lvis/bbox_AP: 0.0320 lvis/bbox_AP50: 0.0420 lvis/bbox_AP75: 0.0340 lvis/bbox_APs: 0.0240 lvis/bbox_APm: 0.0430 lvis/bbox_APl: 0.0610 lvis/bbox_APr: 0.0000 lvis/bbox_APc: 0.0000 lvis/bbox_APf: 0.0820 data_time: 0.0077 time: 1.4106
2024/05/22 04:35:35 - mmengine - INFO - Evaluating bbox... 2024/05/22 04:47:39 - mmengine - INFO - Epoch(val) [10][2477/2477] lvis/bbox_AP: 0.0330 lvis/bbox_AP50: 0.0430 lvis/bbox_AP75: 0.0350 lvis/bbox_APs: 0.0250 lvis/bbox_APm: 0.0440 lvis/bbox_APl: 0.0640 lvis/bbox_APr: 0.0000 lvis/bbox_APc: 0.0000 lvis/bbox_APf: 0.0840 data_time: 0.0073 time: 1.4091
The full log is attached: 20240521_213221.log
Could you share the lvis_v1_train_base.json from your experiments? I wonder whether the issue is related to that training data. My full config is as follows:
_base_ = (
'../../third_party/mmyolo/configs/yolov8/'
'yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(
imports=['yolo_world'],
allow_failed_imports=False)
# hyper-parameters
num_classes = 1203
num_training_classes = 80
max_epochs = 80 # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 4
load_from = '/mnt/afs/huangtao3/wzz/YOLO-World/weights/YOLO-World/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
# text_model_name = '../pretrained_models/clip-vit-base-patch32-projection'
text_model_name = '/mnt/afs/huangtao3/wzz/YOLO-World/weights/clip-vit-base-patch32'
persistent_workers = False
# model settings
model = dict(
type='YOLOWorldTransformerDetector',
mm_neck=True,
num_train_classes=num_training_classes,
num_test_classes=num_classes,
use_LLM="description",
use_lvis=True,
data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
backbone=dict(
_delete_=True,
type='MultiModalYOLOBackbone',
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackbone',
model_name=text_model_name,
frozen_modules=[])),  # frozen_modules=['all'] removed so the CLIP text encoder is trained
neck=dict(type='YOLOWorldPAFPN',
guide_channels=text_channels,
embed_channels=neck_embed_channels,
num_heads=neck_num_heads,
block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
bbox_head=dict(type='YOLOWorldHead',
head_module=dict(type='YOLOWorldHeadModule',
use_bn_head=True,
embed_dims=text_channels,
num_classes=num_training_classes)),
train_cfg=dict(assigner=dict(num_classes=num_training_classes)))
# dataset settings
text_transform = [
dict(type='RandomLoadText',
num_neg_samples=(num_classes, num_classes),
max_num_samples=num_training_classes,
padding_to_max=True,
padding_value=''),
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
'flip_direction', 'texts'))
]
mosaic_affine_transform = [
dict(
type='MultiModalMosaic',
img_scale=_base_.img_scale,
pad_val=114.0,
pre_transform=_base_.pre_transform),
dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
dict(
type='YOLOv5RandomAffine',
max_rotate_degree=0.0,
max_shear_degree=0.0,
max_aspect_ratio=100.,
scaling_ratio_range=(1 - _base_.affine_scale,
1 + _base_.affine_scale),
# img_scale is (width, height)
border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
border_val=(114, 114, 114),
min_area_ratio=_base_.min_area_ratio,
use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
*_base_.pre_transform,
*mosaic_affine_transform,
dict(
type='YOLOv5MultiModalMixUp',
prob=_base_.mixup_prob,
pre_transform=[*_base_.pre_transform,
*mosaic_affine_transform]),
*_base_.last_transform[:-1],
*text_transform
]
train_pipeline_stage2 = [
*_base_.train_pipeline_stage2[:-1],
*text_transform
]
coco_train_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
ann_file='annotations/lvis_v1_train_norare.json',
data_prefix=dict(img=''),
filter_cfg=dict(filter_empty_gt=False, min_size=32)),
class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/annotations/norare_category_names_text.json',
pipeline=train_pipeline)
train_dataloader = dict(
persistent_workers=persistent_workers,
batch_size=train_batch_size_per_gpu,
collate_fn=dict(type='yolow_collate'),
dataset=coco_train_dataset)
test_pipeline = [
*_base_.test_pipeline[:-1],
dict(type='LoadText'),
dict(
type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(type='YOLOv5LVISV1Dataset',
data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
test_mode=True,
ann_file='annotations/lvis_v1_val.json',
data_prefix=dict(img=''),
batch_shapes_cfg=None),
class_text_path='/mnt/afs/huangtao3/wzz/YOLO-World/data/texts/lvis_v1_class_texts.json',
pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
# training settings
default_hooks = dict(
param_scheduler=dict(
scheduler_type='linear',
lr_factor=0.01,
max_epochs=max_epochs),
checkpoint=dict(
max_keep_ckpts=-1,
save_best=None,
interval=save_epoch_intervals))
custom_hooks = [
dict(
type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
strict_load=False,
priority=49),
dict(
type='mmdet.PipelineSwitchHook',
switch_epoch=max_epochs - close_mosaic_epochs,
switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
max_epochs=max_epochs,
val_interval=5,
dynamic_intervals=[((max_epochs - close_mosaic_epochs),
_base_.val_interval_stage2)])
optim_wrapper = dict(
optimizer=dict(
_delete_=True,
type='AdamW',
lr=base_lr,
weight_decay=weight_decay,
batch_size_per_gpu=train_batch_size_per_gpu),
paramwise_cfg=dict(
custom_keys={'backbone.text_model': dict(lr_mult=0.01),
'logit_scale': dict(weight_decay=0.0)}),
constructor='YOLOWv5OptimizerConstructor')
# evaluation settings
val_evaluator = dict(type='mmdet.LVISMetric',
ann_file='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/annotations/lvis_v1_val.json',
metric='bbox')
test_evaluator = val_evaluator
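One detail worth double-checking in the config above (an observation, not a confirmed diagnosis): num_classes is 1203 while class_text_path points at a no-rare text file, so the number of text entries may differ from num_classes. A hypothetical sanity-check helper (the file name and counts are assumptions from the config above):

```python
import json

# Hypothetical check: compare how many class-text entries a json like
# norare_category_names_text.json provides against the num_classes the
# model is evaluated with. A mismatch is not necessarily fatal
# (RandomLoadText just samples fewer negatives), but it is worth knowing
# when debugging a near-zero APr.
def check_class_texts(path, expected_num_classes):
    with open(path) as f:
        texts = json.load(f)
    if len(texts) != expected_num_classes:
        print(f'{path}: {len(texts)} text entries, expected {expected_num_classes}')
    return len(texts) == expected_num_classes
```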
I ran into the same problem: the APr value is extremely small. What could be causing this?
_base_ = (
'../third_party/mmyolo/configs/yolov8/'
'yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(
imports=['yolo_world'],
allow_failed_imports=False)
# hyper-parameters
num_classes = 1203
num_training_classes = 80
max_epochs = 80 # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 6
# load_from = 'pretrained_models/yolo_world_s_clip_t2i_bn_2e-3adamw_32xb16-100e_obj365v1_goldg_train-55b943ea.pth'
load_from = 'checkpoints/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth'
# text_model_name = '../pretrained_models/clip-vit-base-patch32-projection'
text_model_name = 'openai/clip-vit-base-patch32'
persistent_workers = False
mixup_prob = 0.15
copypaste_prob = 0.3
# model settings
model = dict(
type='YOLOWorldDetector',
mm_neck=True,
num_train_classes=num_training_classes,
num_test_classes=num_classes,
data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
backbone=dict(
_delete_=True,
type='MultiModalYOLOBackbone',
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackbone',
model_name=text_model_name,
frozen_modules=['all'])),
neck=dict(type='YOLOWorldPAFPN',
guide_channels=text_channels,
embed_channels=neck_embed_channels,
num_heads=neck_num_heads,
block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
bbox_head=dict(type='YOLOWorldHead',
head_module=dict(type='YOLOWorldHeadModule',
use_bn_head=True,
embed_dims=text_channels,
num_classes=num_training_classes)),
train_cfg=dict(assigner=dict(num_classes=num_training_classes)))
# dataset settings
text_transform = [
dict(type='RandomLoadText',
num_neg_samples=(num_classes, num_classes),
max_num_samples=num_training_classes,
padding_to_max=True,
padding_value=''),
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
'flip_direction', 'texts'))
]
mosaic_affine_transform = [
dict(
type='MultiModalMosaic',
img_scale=_base_.img_scale,
pad_val=114.0,
pre_transform=_base_.pre_transform),
dict(type='YOLOv5CopyPaste', prob=copypaste_prob),
dict(
type='YOLOv5RandomAffine',
max_rotate_degree=0.0,
max_shear_degree=0.0,
max_aspect_ratio=100.,
scaling_ratio_range=(1 - _base_.affine_scale,
1 + _base_.affine_scale),
# img_scale is (width, height)
border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
border_val=(114, 114, 114),
min_area_ratio=_base_.min_area_ratio,
use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
*_base_.pre_transform,
*mosaic_affine_transform,
dict(
type='YOLOv5MultiModalMixUp',
prob=mixup_prob,
pre_transform=[*_base_.pre_transform,
*mosaic_affine_transform]),
*_base_.last_transform[:-1],
*text_transform
]
train_pipeline_stage2 = [
*_base_.train_pipeline_stage2[:-1],
*text_transform
]
coco_train_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
data_root='data/coco',
ann_file='lvis/lvis_v1_train_base.json',
data_prefix=dict(img=''),
filter_cfg=dict(filter_empty_gt=False, min_size=32)),
class_text_path='data/texts/lvis_v1_class_texts.json',
pipeline=train_pipeline)
train_dataloader = dict(
persistent_workers=persistent_workers,
batch_size=train_batch_size_per_gpu,
collate_fn=dict(type='yolow_collate'),
dataset=coco_train_dataset)
test_pipeline = [
*_base_.test_pipeline[:-1],
dict(type='LoadText'),
dict(
type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
data_root='data/coco',
test_mode=True,
ann_file='lvis/lvis_v1_val.json',
data_prefix=dict(img=''),
filter_cfg=dict(filter_empty_gt=False, min_size=32)),
class_text_path='data/texts/lvis_v1_class_texts.json',
pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
# training settings
default_hooks = dict(
param_scheduler=dict(
scheduler_type='linear',
lr_factor=0.01,
max_epochs=max_epochs),
checkpoint=dict(
max_keep_ckpts=-1,
save_best=None,
interval=save_epoch_intervals))
custom_hooks = [
dict(
type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
strict_load=False,
priority=49),
dict(
type='mmdet.PipelineSwitchHook',
switch_epoch=max_epochs - close_mosaic_epochs,
switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
max_epochs=max_epochs,
val_interval=5,
dynamic_intervals=[((max_epochs - close_mosaic_epochs),
_base_.val_interval_stage2)])
optim_wrapper = dict(
optimizer=dict(
_delete_=True,
type='AdamW',
lr=base_lr,
weight_decay=weight_decay,
batch_size_per_gpu=train_batch_size_per_gpu),
paramwise_cfg=dict(
custom_keys={'backbone.text_model': dict(lr_mult=0.01),
'logit_scale': dict(weight_decay=0.0)}),
constructor='YOLOWv5OptimizerConstructor')
# evaluation settings
val_evaluator = dict(
_delete_=True,
type='mmdet.LVISMetric',
proposal_nums=(100, 1, 10),
ann_file='data/coco/lvis/lvis_v1_val.json',
metric='bbox')
@Flier-01, you can unfreeze CLIP for training.
OK, I will try again. Is it enough to just change frozen_modules=['all'] to frozen_modules=[]?
yep
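For clarity, the two config fragments involved in unfreezing CLIP, shown in isolation (model_name is illustrative; both settings already appear in the configs above):

```python
# Unfreeze the CLIP text encoder ...
text_model = dict(
    type='HuggingCLIPLanguageBackbone',
    model_name='openai/clip-vit-base-patch32',
    frozen_modules=[])  # was: frozen_modules=['all']

# ... while keeping it at a 0.01x learning rate via paramwise_cfg
paramwise_cfg = dict(
    custom_keys={'backbone.text_model': dict(lr_mult=0.01),
                 'logit_scale': dict(weight_decay=0.0)})
```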
May I ask, did you manage to reproduce it?
yep
Hello, could you provide links to lvis_v1_train_base.json and lvis_v1_val.json? I used exactly the same configuration as @Flier-01 but still got no results, and I wonder whether that is related. Also, is data/coco just the standard COCO dataset?
I unfroze CLIP for training as you suggested, i.e. frozen_modules=[], but it did not seem to help; the APr result actually got worse. Where could the problem be? Could you provide a complete config file? Looking forward to your reply.
Hi, I used your config, but I could not even get your result. Which version of this codebase did you use? Or could you share links to the training/validation sets referenced in your config? We could look into this problem together. Thanks a lot.
Download the LVIS dataset annotation files; you will get lvis_v1_train.json and lvis_v1_val.json. Then simply run the script below to extract lvis_v1_train_base.json from lvis_v1_train.json:
import argparse
import json
from tqdm import tqdm
parser = argparse.ArgumentParser()
parser.add_argument("--json_path", default="data/coco/lvis/lvis_v1_train.json")
parser.add_argument("--out_path", default="data/coco/lvis/lvis_v1_train_base.json")
args = parser.parse_args()
with open(args.json_path, 'r') as f:
json_coco = json.load(f)
annotations = []
cat_id2cat_info = {cat_info['id']: cat_info for cat_info in json_coco['categories']}
for ann in tqdm(json_coco['annotations']):
cat_id = ann['category_id']
cat_info = cat_id2cat_info[cat_id]
frequency = cat_info['frequency']
if frequency in ['f', 'c']:
annotations.append(ann)
json_coco['annotations'] = annotations
with open(args.out_path, 'w') as f:
json.dump(json_coco, f)
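A quick self-contained sanity check of the filtering logic above, run on a toy LVIS-style dict (made-up ids and frequencies) instead of the real annotation file:

```python
# Toy LVIS-style dict to sanity-check the filtering logic of the script
# above: categories carry a 'frequency' of 'f'/'c'/'r'.
toy_coco = {
    'categories': [
        {'id': 1, 'frequency': 'f'},
        {'id': 2, 'frequency': 'c'},
        {'id': 3, 'frequency': 'r'},
    ],
    'annotations': [
        {'category_id': 1}, {'category_id': 3}, {'category_id': 2},
    ],
}

cat_id2cat_info = {c['id']: c for c in toy_coco['categories']}
kept = [ann for ann in toy_coco['annotations']
        if cat_id2cat_info[ann['category_id']]['frequency'] in ['f', 'c']]

# Rare ('r') annotations are dropped, while 'categories' is left untouched,
# so ids in the base file still line up with the full-class metainfo.
```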
Thank you very much, I will try again!
The config file was written by me, so it may have some problems; I am also seeking the author's help.
I tried your code, and the base.json it produces is exactly identical to the one I generated before. That suggests our datasets are the same, yet with your config file all my mAP values come out close to 0, which is very strange.
Sorry for the late reply! The config I provided earlier seems incorrect; let me repost it here:
_base_ = (
'../../../../../third_party/mmyolo/configs/yolov8/'
'yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(
imports=['projects.YoloW.yolow'],
allow_failed_imports=False)
# hyper-parameters
num_classes = 1203
num_training_classes = 80
max_epochs = 80 # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
# wrong!
weight_decay = 0.05
train_batch_size_per_gpu = 8
persistent_workers = False
# Polygon2Mask
downsample_ratio = 4
mask_overlap = False
use_mask2refine = True
max_aspect_ratio = 100
min_area_ratio = 0.01
persistent_workers = False
load_from = '<load from>'
# model settings
model = dict(
type='YOLOWDetector',
mm_neck=True,
num_train_classes=num_training_classes,
num_test_classes=num_classes,
data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
backbone=dict(
_delete_=True,
type='MMTransformer',
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackboneV2',
model_name=
'/group/30042/adriancheng/pretrained_models/clip-vit-base-patch32-projection',
frozen_modules=[])),
neck=dict(type='TextEnhancedYOLOWv8PAFPN',
guide_channels=text_channels,
embed_channels=neck_embed_channels,
num_heads=neck_num_heads,
block_cfg=dict(type='SigmoidAttnCSPLayerWithTwoConv'),
text_enhancder=dict(type='TextEnhanceModuleV2',
embed_channels=256,
num_heads=8,
pool_size=3)),
bbox_head=dict(type='YOLOWv8Head',
head_module=dict(type='YOLOWv8HeadModule',
embed_dims=text_channels,
num_classes=num_training_classes)),
train_cfg=dict(assigner=dict(num_classes=num_training_classes)))
# dataset settings
text_transform = [
dict(type='RandomLoadText',
num_neg_samples=(num_classes, num_classes),
max_num_samples=num_training_classes,
padding_to_max=True,
padding_value=''),
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
'flip_direction', 'texts'))
]
mosaic_affine_transform = [
dict(
type='MultiModalMosaic',
img_scale=_base_.img_scale,
pad_val=114.0,
pre_transform=_base_.pre_transform),
dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
dict(
type='YOLOv5RandomAffine',
max_rotate_degree=0.0,
max_shear_degree=0.0,
max_aspect_ratio=100.,
scaling_ratio_range=(1 - _base_.affine_scale,
1 + _base_.affine_scale),
# img_scale is (width, height)
border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
border_val=(114, 114, 114),
min_area_ratio=_base_.min_area_ratio,
use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
*_base_.pre_transform,
*mosaic_affine_transform,
dict(
type='YOLOv5MultiModalMixUp',
prob=_base_.mixup_prob,
pre_transform=[*_base_.pre_transform,
*mosaic_affine_transform]),
*_base_.last_transform[:-1],
*text_transform
]
train_pipeline_stage2 = [
*_base_.train_pipeline_stage2[:-1],
*text_transform
]
coco_train_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
data_root='data/coco',
ann_file='lvis/lvis_v1_train_base.json',
data_prefix=dict(img=''),
filter_cfg=dict(filter_empty_gt=True, min_size=32)),
class_text_path='data/captions/lvis_v1_base_class_captions.json',
pipeline=train_pipeline)
train_dataloader = dict(
persistent_workers=persistent_workers,
batch_size=train_batch_size_per_gpu,
collate_fn=dict(type='yolow_collate'),
dataset=coco_train_dataset)
test_pipeline = [
*_base_.test_pipeline[:-1],
dict(type='LoadTextFixed'),
dict(
type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(type='YOLOv5LVISV1Dataset',
data_root='data/coco/',
test_mode=True,
ann_file='lvis/lvis_v1_val.json',
data_prefix=dict(img=''),
batch_shapes_cfg=None),
class_text_path='data/captions/lvis_v1_class_captions.json',
pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
val_evaluator = dict(type='mmdet.LVISMetric',
ann_file='data/coco/lvis/lvis_v1_val.json',
metric='bbox')
test_evaluator = val_evaluator
# training settings
default_hooks = dict(
param_scheduler=dict(
scheduler_type='linear',
lr_factor=0.01,
max_epochs=max_epochs),
checkpoint=dict(
max_keep_ckpts=-1,
save_best=None,
interval=save_epoch_intervals))
custom_hooks = [
dict(
type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
strict_load=False,
priority=49),
dict(
type='mmdet.PipelineSwitchHook',
switch_epoch=max_epochs - close_mosaic_epochs,
switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
max_epochs=max_epochs,
val_interval=5,
dynamic_intervals=[((max_epochs - close_mosaic_epochs),
_base_.val_interval_stage2)])
optim_wrapper = dict(
optimizer=dict(
_delete_=True,
type='AdamW',
lr=base_lr,
weight_decay=weight_decay,
batch_size_per_gpu=train_batch_size_per_gpu),
paramwise_cfg=dict(
bias_decay_mult=0.0,
norm_decay_mult=0.0,
custom_keys={'backbone.text_model': dict(lr_mult=0.01),
'logit_scale': dict(weight_decay=0.0)}),
constructor='YOLOWv5OptimizerConstructor')
Please adapt it as needed; this is a very early config file, so it will not fully match the currently released code.
Here is a training log: 20231118_135730.log
Hi, I tried your configuration, but on YOLOv8-s the result is still that the novel AP is approximately 0 while the other metrics are normal. Below is my config file:
_base_ = (
'../../third_party/mmyolo/configs/yolov8/'
'yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(
imports=['yolo_world'],
allow_failed_imports=False)
# hyper-parameters
num_classes = 1203
num_training_classes = 80
max_epochs = 80 # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4
weight_decay = 0.05
train_batch_size_per_gpu = 8
load_from = '<load from>'
text_model_name = 'work_dirs/clip-vit-base-patch32'
persistent_workers = False
mixup_prob = 0.15
copypaste_prob = 0.3
# model settings
model = dict(
type='YOLOWorldDetector',
mm_neck=True,
num_train_classes=num_training_classes,
num_test_classes=num_classes,
data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
backbone=dict(
_delete_=True,
type='MultiModalYOLOBackbone',
image_model={{_base_.model.backbone}},
text_model=dict(
type='HuggingCLIPLanguageBackbone',
model_name=text_model_name,
frozen_modules=[])), # clip not frozen
neck=dict(type='YOLOWorldPAFPN',
guide_channels=text_channels,
embed_channels=neck_embed_channels,
num_heads=neck_num_heads,
block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
bbox_head=dict(type='YOLOWorldHead',
head_module=dict(type='YOLOWorldHeadModule',
use_bn_head=True,
embed_dims=text_channels,
num_classes=num_training_classes)),
train_cfg=dict(assigner=dict(num_classes=num_training_classes)))
# dataset settings
text_transform = [
dict(type='RandomLoadText',
num_neg_samples=(num_classes, num_classes),
max_num_samples=num_training_classes,
padding_to_max=True,
padding_value=''),
dict(type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
'flip_direction', 'texts'))
]
mosaic_affine_transform = [
dict(
type='MultiModalMosaic',
img_scale=_base_.img_scale,
pad_val=114.0,
pre_transform=_base_.pre_transform),
dict(type='YOLOv5CopyPaste', prob=copypaste_prob),
dict(
type='YOLOv5RandomAffine',
max_rotate_degree=0.0,
max_shear_degree=0.0,
max_aspect_ratio=100.,
scaling_ratio_range=(1 - _base_.affine_scale,
1 + _base_.affine_scale),
# img_scale is (width, height)
border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
border_val=(114, 114, 114),
min_area_ratio=_base_.min_area_ratio,
use_mask_refine=_base_.use_mask2refine)
]
train_pipeline = [
*_base_.pre_transform,
*mosaic_affine_transform,
dict(
type='YOLOv5MultiModalMixUp',
prob=mixup_prob,
pre_transform=[*_base_.pre_transform,
*mosaic_affine_transform]),
*_base_.last_transform[:-1],
*text_transform
]
train_pipeline_stage2 = [
*_base_.train_pipeline_stage2[:-1],
*text_transform
]
lvis_train_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
test_mode=False,
data_root='data/coco',
ann_file='lvis/lvis_v1_train_base.json',
data_prefix=dict(img=''),
filter_cfg=dict(filter_empty_gt=True, min_size=32)),
class_text_path='data/texts/lvis_v1_class_texts.json',
pipeline=train_pipeline)
train_dataloader = dict(
persistent_workers=persistent_workers,
batch_size=train_batch_size_per_gpu,
collate_fn=dict(type='yolow_collate'),
dataset=lvis_train_dataset)
test_pipeline = [
*_base_.test_pipeline[:-1],
dict(type='LoadText'),
dict(
type='mmdet.PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'pad_param', 'texts'))
]
lvis_val_dataset = dict(
_delete_=True,
type='MultiModalDataset',
dataset=dict(
type='YOLOv5LVISV1Dataset',
data_root='data/coco',
test_mode=True,
ann_file='lvis/lvis_v1_val.json',
data_prefix=dict(img=''),
batch_shapes_cfg=None),
class_text_path='data/texts/lvis_v1_class_texts.json',
pipeline=test_pipeline)
val_dataloader = dict(dataset=lvis_val_dataset)
test_dataloader = val_dataloader
# training settings
default_hooks = dict(
param_scheduler=dict(
scheduler_type='linear',
lr_factor=0.01,
max_epochs=max_epochs),
checkpoint=dict(
max_keep_ckpts=-1,
save_best=None,
interval=save_epoch_intervals))
custom_hooks = [
dict(
type='EMAHook',
ema_type='ExpMomentumEMA',
momentum=0.0001,
update_buffers=True,
strict_load=False,
priority=49),
dict(
type='mmdet.PipelineSwitchHook',
switch_epoch=max_epochs - close_mosaic_epochs,
switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
max_epochs=max_epochs,
val_interval=5,
dynamic_intervals=[((max_epochs - close_mosaic_epochs),
_base_.val_interval_stage2)])
optim_wrapper = dict(
optimizer=dict(
_delete_=True,
type='AdamW',
lr=base_lr,
weight_decay=weight_decay,
batch_size_per_gpu=train_batch_size_per_gpu),
paramwise_cfg=dict(
custom_keys={'backbone.text_model': dict(lr_mult=0.01),
'logit_scale': dict(weight_decay=0.0)}),
constructor='YOLOWv5OptimizerConstructor')
# evaluation settings
val_evaluator = dict(
_delete_=True,
type='mmdet.LVISMetric',
ann_file='data/coco/lvis/lvis_v1_val.json',
metric='bbox')
test_evaluator = val_evaluator
The training results are as follows:
2024/07/05 06:22:11 - mmengine - INFO - Epoch(val) [80][2477/2477] lvis/bbox_AP: 0.2590 lvis/bbox_AP50: 0.3530 lvis/bbox_AP75: 0.2750 lvis/bbox_APs: 0.1840 lvis/bbox_APm: 0.3580 lvis/bbox_APl: 0.4420 lvis/bbox_APr: 0.0010 lvis/bbox_APc: 0.2600 lvis/bbox_APf: 0.3720 data_time: 0.0005 time: 0.1143
The config file was written by me, so it may have some problems; I was also seeking the author's help. I found the annotation file in the original ovdet repository: https://drive.google.com/file/d/1ahmCUXyFAQqnlMb-ZDDSQUMnIosYqhu5/view
At the moment there are only finetune-coco and zero-shot config files under configs. How should the config file be set up to obtain these results? And how should the numbers of training and testing classes be set?