AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Problem when evaluating on the obj365 val set #210

Open 1170300714 opened 3 months ago

1170300714 commented 3 months ago

Thanks for your great work!

I want to evaluate the performance of yolo_world_s_clip_base_dual_vlpan_2e-3adamw_32xb16_100e_o365_goldg_train_pretrained-18bea4d2.pth on the val set of obj365v1.

I modified the config configs/pretrain_v1/yolo_world_s_dual_vlpan_l2norm_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py as follows:

_base_ = ('../../third_party/mmyolo/configs/yolov8/'
          'yolov8_s_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(imports=['yolo_world'],
                      allow_failed_imports=False)

# hyper-parameters
num_classes = 365
num_training_classes = 80
max_epochs = 100  # Maximum training epochs
close_mosaic_epochs = 2
save_epoch_intervals = 2
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-3
weight_decay = 0.05 / 2
train_batch_size_per_gpu = 16

# model settings
model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            #model_name='openai/clip-vit-base-patch32',
            model_name='/hdd2/lsk/YOLO-World-master/huggingfaceclip',
            frozen_modules=['all'])),
    neck=dict(type='YOLOWorldDualPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv'),
              text_enhancder=dict(type='ImagePoolingAttentionModule',
                                  embed_channels=256,
                                  num_heads=8)),
    bbox_head=dict(type='YOLOWorldHead',
                   head_module=dict(type='YOLOWorldHeadModule',
                                    embed_dims=text_channels,
                                    num_classes=num_training_classes)),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

# dataset settings
text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]
train_pipeline = [
    *_base_.pre_transform,
    dict(type='MultiModalMosaic',
         img_scale=_base_.img_scale,
         pad_val=114.0,
         pre_transform=_base_.pre_transform),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
        max_aspect_ratio=_base_.max_aspect_ratio,
        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
        border_val=(114, 114, 114)),
    *_base_.last_transform[:-1],
    *text_transform,
]
train_pipeline_stage2 = [*_base_.train_pipeline_stage2[:-1], *text_transform]
obj365v1_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5Objects365V1Dataset',
        data_root='data/objects365v1/',
        ann_file='annotations/objects365_train.json',
        data_prefix=dict(img='train/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/obj365v1_class_texts.json',
    pipeline=train_pipeline)

mg_train_dataset = dict(type='YOLOv5MixedGroundingDataset',
                        data_root='data/mixed_grounding/',
                        ann_file='annotations/final_mixed_train_no_coco.json',
                        data_prefix=dict(img='gqa/images/'),
                        filter_cfg=dict(filter_empty_gt=False, min_size=32),
                        pipeline=train_pipeline)

flickr_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/flickr/',
    ann_file='annotations/final_flickr_separateGT_train.json',
    data_prefix=dict(img='full_images/'),
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=train_pipeline)

train_dataloader = dict(batch_size=train_batch_size_per_gpu,
                        collate_fn=dict(type='yolow_collate'),
                        dataset=dict(_delete_=True,
                                     type='ConcatDataset',
                                     datasets=[
                                         obj365v1_train_dataset,
                                         flickr_train_dataset, mg_train_dataset
                                     ],
                                     ignore_keys=['classes', 'palette']))

test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                    'scale_factor', 'pad_param', 'texts'))
]
obj365v1_val_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5Objects365V1Dataset',
        data_root='/home/huangzitong/lsk/YOLO-World-master/val',
        test_mode=True,
        ann_file='/home/huangzitong/lsk/YOLO-World-master/objects365_val.json',
        data_prefix=dict(img=''),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/obj365v1_class_texts.json',
    pipeline=test_pipeline)
'''
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 data_root='data/coco/',
                 test_mode=True,
                 ann_file='lvis/lvis_v1_minival_inserted_image_name.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    class_text_path='data/texts/lvis_v1_class_texts.json',
    pipeline=test_pipeline)'''
val_dataloader = dict(
    _delete_=True,
    batch_size=1,
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=obj365v1_val_dataset)
test_dataloader = val_dataloader

val_evaluator = dict(type='mmdet.CocoMetric',
                     ann_file='/home/huangzitong/lsk/YOLO-World-master/objects365_val.json',
                     metric='bbox')
test_evaluator = val_evaluator

# training settings
default_hooks = dict(param_scheduler=dict(max_epochs=max_epochs),
                     checkpoint=dict(interval=save_epoch_intervals,
                                     rule='greater'))
custom_hooks = [
    dict(type='EMAHook',
         ema_type='ExpMomentumEMA',
         momentum=0.0001,
         update_buffers=True,
         strict_load=False,
         priority=49),
    dict(type='mmdet.PipelineSwitchHook',
         switch_epoch=max_epochs - close_mosaic_epochs,
         switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(max_epochs=max_epochs,
                 val_interval=10,
                 dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                                     _base_.val_interval_stage2)])
optim_wrapper = dict(optimizer=dict(
    _delete_=True,
    type='AdamW',
    lr=base_lr,
    weight_decay=weight_decay,
    batch_size_per_gpu=train_batch_size_per_gpu),
                     paramwise_cfg=dict(bias_decay_mult=0.0,
                                        norm_decay_mult=0.0,
                                        custom_keys={
                                            'backbone.text_model':
                                            dict(lr_mult=0.01),
                                            'logit_scale':
                                            dict(weight_decay=0.0)
                                        }),
                     constructor='YOLOWv5OptimizerConstructor')

and tested with the following command:

bash ./tools/dist_test.sh configs/pretrain_v1/yolo_world_s_dual_vlpan_l2norm_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py checkpoints/yolo_world_s_clip_base_dual_vlpan_2e-3adamw_32xb16_100e_o365_goldg_train_pretrained-18bea4d2.pth 3 --out results.pkl

But I get very low performance:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.003
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.004
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.007
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.014
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.007
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.019
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.032
 04/03 20:50:33 - mmengine - INFO - bbox_mAP_copypaste: 0.002 0.003 0.002 0.000 0.002 0.004
 04/03 20:51:50 - mmengine - INFO - Results has been saved to results.pkl.

I think there is some mistake in my rewritten config, so could you please help me check it? Thanks a lot!

wondervictor commented 3 months ago

You need to compare the text JSON and the validation annotation file.
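
A minimal sketch of one way to do that comparison (paths are illustrative, and the text JSON is assumed to be a list of single-element lists of class names, as in the released YOLO-World text files):

import json

# Class texts released with YOLO-World.
with open('data/texts/obj365v1_class_texts.json') as f:
    texts = [t[0] for t in json.load(f)]

# COCO-style validation annotations from the Objects365 website.
with open('data/objects365v1/annotations/objects365_val.json') as f:
    categories = json.load(f)['categories']

# Check name, index, and order against the annotation file's category list.
for i, cat in enumerate(categories):
    name = texts[i] if i < len(texts) else None
    if name != cat['name']:
        print(f"mismatch at index {i}: category {cat['id']} is "
              f"'{cat['name']}', text is '{name}'")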

1170300714 commented 3 months ago

Thanks for your reply. In fact, I use the obj365v1_class_texts.json that you released as the text JSON, and the objects365_val.json released on the obj365 official website as the validation annotation file.

Furthermore, I have checked the class names, indices, and order in these two files, and they match exactly:

the text JSON file: [screenshot]

the annotation file: [screenshots]

wondervictor commented 3 months ago

Hi @1170300714, I've checked this bug. You need to sort the categories of Objects365 first, since the category order is not consistent between the train and val annotation files.

Please modify the evaluation metric as follows:

val_evaluator = dict(type='mmdet.CocoMetric',
                     ann_file='data/objects365v1/annotations/objects365_val.json',
                     metric='bbox',
                     sort_categories=True,
                     format_only=False)
test_evaluator = val_evaluator
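
For context, sort_categories=True makes mmdet.CocoMetric sort the annotation file's categories by id before building its label-to-category mapping. A minimal sketch of the effect (the helper name is hypothetical; this is not the actual mmdet code):

from pycocotools.coco import COCO

def label_to_cat_id(ann_file, sort_categories=True):
    # Hypothetical helper sketching the label map the evaluator uses.
    coco = COCO(ann_file)
    cats = coco.dataset['categories']
    if sort_categories:
        # objects365_train.json and objects365_val.json list categories in
        # different orders, so sort by id for a consistent label assignment.
        cats = sorted(cats, key=lambda c: c['id'])
    return {label: cat['id'] for label, cat in enumerate(cats)}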

You will obtain the right results, for example (YOLO-World-v2-L):

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.266
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.354
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.290
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.132
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.298
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.292
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.507
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.538
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.348
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.598
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.726

Visualization examples:

[images: o365_vis_sample2, o365_vis_sample]

1170300714 commented 3 months ago

Thanks for your help! It works~

zhongzee commented 2 months ago
sort_categories=True

Hello, and thanks for your excellent work! I hit a similar problem while reproducing YOLO-World's zero-shot results on LVIS minival. Part of my config is below; the LVIS images here are just the COCO val2017 validation images. Do I also need to set sort_categories=True, or is there some other configuration problem?

coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 data_root='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS',
                 test_mode=True,
                 ann_file='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/lvis_v1_minival_inserted_image_name.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    class_text_path='data/texts/lvis_v1_class_texts.json',
    pipeline=test_pipeline)

val_evaluator = dict(type='mmdet.LVISMetric',
                     ann_file='/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/lvis_v1_minival_inserted_image_name.json',
                     metric='bbox')

The results are:

2024/05/04 00:24:47 - mmengine - INFO - Evaluating bbox...
2024/05/04 00:26:33 - mmengine - INFO - Epoch(test) [4809/4809]  lvis/bbox_AP: 0.0230  lvis/bbox_AP50: 0.0320  lvis/bbox_AP75: 0.0250  lvis/bbox_APs: 0.0160  lvis/bbox_APm: 0.0380  lvis/bbox_APl: 0.0670  lvis/bbox_APr: 0.0000  lvis/bbox_APc: 0.0000  lvis/bbox_APf: 0.0480  data_time: 0.0006  time: 1.5346

The config file is yolo_world_v2_l_clip_large_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py.
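
One quick diagnostic in the spirit of the earlier fix (a minimal sketch; the path is taken from the comment above) is to check whether the categories in the LVIS minival annotation file are already listed in ascending id order, in which case re-sorting them would change nothing and the low AP would have another cause:

import json

# Annotation file path as given in the comment above.
ann_file = ('/mnt/afs/huangtao3/wzz/YOLO-World/pretrain_data/LVIS/'
            'lvis_v1_minival_inserted_image_name.json')
with open(ann_file) as f:
    cat_ids = [c['id'] for c in json.load(f)['categories']]

print('categories already sorted by id:', cat_ids == sorted(cat_ids))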