AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

I cannot finetune LVIS dataset because of IndexError #506

Open caiweihan opened 5 days ago

caiweihan commented 5 days ago

I can successfully finetune on the COCO dataset, and I can test the pretrained segmentation weights with:

./tools/dist_test.sh /homedata/whcai/YOLO-World/configs/segmentation/yolo_world_seg_m_dual_vlpan_2e-4_80e_8gpus_allmodules_finetune_lvis.py /mnt/nas/TrueNas1/whcai/YOLO-World/YOLO-World-huggingface/yolo_world_seg_m_dual_vlpan_2e-4_80e_8gpus_allmodules_finetune_lvis-ca465825.pth 3

But when I try to finetune on the LVIS dataset, I get the following error:

Traceback (most recent call last):
  File "/homedata/whcai/YOLO-World/./tools/train.py", line 120, in <module>
    main()
  File "/homedata/whcai/YOLO-World/./tools/train.py", line 116, in main
    runner.train()
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/mmengine/runner/loops.py", line 98, in run
    self.run_epoch()
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/mmengine/runner/loops.py", line 114, in run_epoch
    for idx, data_batch in enumerate(self.dataloader):
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/homedata/whcai/YOLO-World/yolo_world/datasets/mm_dataset.py", line 86, in __getitem__
    return self.pipeline(data_info)
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/mmengine/dataset/base_dataset.py", line 60, in __call__
    data = t(data)
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/mmcv/transforms/base.py", line 12, in __call__
    return self.transform(results)
  File "/homedata/whcai/miniconda3/envs/YOLO-World/lib/python3.10/site-packages/mmdet/structures/bbox/box_type.py", line 267, in wrapper
    return func(self, results)
  File "/homedata/whcai/YOLO-World/yolo_world/datasets/transformers/mm_mix_img_transforms.py", line 194, in transform
    results = self._update_label_text(results)
  File "/homedata/whcai/YOLO-World/yolo_world/datasets/transformers/mm_mix_img_transforms.py", line 103, in _update_label_text
    text = res['texts'][label]
IndexError: list index out of range
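
The error is raised at text = res['texts'][label] inside the mosaic transform, so it looks like some label index is larger than the number of entries in the class text file. Below is a small check I wrote for myself (only a sketch, not code from this repo; it assumes the class text JSON is a plain list with one entry per class, and it uses the paths from my config below):

import json

ann_file = '/mnt/nas/TrueNas1/whcai/YOLO-World/coco2017/lvis/lvis_v1_train_base.json'
text_file = '/homedata/whcai/YOLO-World/data/texts/lvis_v1_base_class_captions.json'

# number of categories the training annotations can produce labels for
with open(ann_file) as f:
    num_categories = len(json.load(f)['categories'])

# number of class texts the dataset loads via class_text_path
with open(text_file) as f:
    num_texts = len(json.load(f))

print('categories in annotations:', num_categories)
print('entries in class text file:', num_texts)
if num_categories > num_texts:
    print('labels can reach index', num_categories - 1,
          'but only', num_texts, 'texts are available -> IndexError')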

What is the problem? This is my configuration code:

_base_ = (
    '../../third_party/mmyolo/configs/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco.py'
)
custom_imports = dict(imports=['yolo_world'], allow_failed_imports=False)
# hyper-parameters
num_classes = 1203
num_training_classes = 80
max_epochs = 80  # Maximum training epochs
close_mosaic_epochs = 10
save_epoch_intervals = 5
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 2e-4

weight_decay = 0.05
train_batch_size_per_gpu = 8
# load_from = 'pretrained_models/yolo_world_m_clip_base_dual_vlpan_2e-3adamw_32xb16_100e_o365_goldg_train_pretrained-2b7bd1be.pth'
load_from = '/mnt/nas/TrueNas1/whcai/YOLO-World/YOLO-World-huggingface/yolo_world_seg_m_dual_vlpan_2e-4_80e_8gpus_allmodules_finetune_lvis-ca465825.pth'
persistent_workers = False

# Polygon2Mask
downsample_ratio = 4
mask_overlap = False
use_mask2refine = True
max_aspect_ratio = 100
min_area_ratio = 0.01

# model settings
model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            # model_name='openai/clip-vit-base-patch32',
            model_name='/homedata/whcai/YOLO-World/pretrained_models/clip-vit-base-patch32-projection',
            frozen_modules=[])),
    neck=dict(type='YOLOWorldDualPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv'),
              text_enhancder=dict(type='ImagePoolingAttentionModule',
                                  embed_channels=256,
                                  num_heads=8)),
    bbox_head=dict(type='YOLOWorldSegHead',
                   head_module=dict(type='YOLOWorldSegHeadModule',
                                    embed_dims=text_channels,
                                    num_classes=num_training_classes,
                                    mask_channels=32,
                                    proto_channels=256),
                   mask_overlap=mask_overlap,
                   loss_mask=dict(type='mmdet.CrossEntropyLoss',
                                  use_sigmoid=True,
                                  reduction='none'),
                   loss_mask_weight=1.0),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)),
    test_cfg=dict(mask_thr_binary=0.5, fast_test=True))

pre_transform = [
    dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='LoadAnnotations',
         with_bbox=True,
         with_mask=True,
         mask2bbox=True)
]

last_transform = [
    dict(type='mmdet.Albu',
         transforms=_base_.albu_train_transforms,
         bbox_params=dict(type='BboxParams',
                          format='pascal_voc',
                          label_fields=['gt_bboxes_labels',
                                        'gt_ignore_flags']),
         keymap={
             'img': 'image',
             'gt_bboxes': 'bboxes'
         }),
    dict(type='YOLOv5HSVRandomAug'),
    dict(type='mmdet.RandomFlip', prob=0.5),
    dict(type='Polygon2Mask',
         downsample_ratio=downsample_ratio,
         mask_overlap=mask_overlap),
]

# dataset settings
text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]
mosaic_affine_transform = [
    dict(type='MultiModalMosaic',
         img_scale=_base_.img_scale,
         pad_val=114.0,
         pre_transform=pre_transform),
    dict(type='YOLOv5CopyPaste', prob=_base_.copypaste_prob),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        max_aspect_ratio=100.,
        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
        # img_scale is (width, height)
        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
        border_val=(114, 114, 114),
        min_area_ratio=_base_.min_area_ratio,
        use_mask_refine=True)
]
train_pipeline = [
    *pre_transform, *mosaic_affine_transform,
    dict(type='YOLOv5MultiModalMixUp',
         prob=_base_.mixup_prob,
         pre_transform=[*pre_transform, *mosaic_affine_transform]),
    *last_transform, *text_transform
]

_train_pipeline_stage2 = [
    *pre_transform,
    dict(type='YOLOv5KeepRatioResize', scale=_base_.img_scale),
    dict(type='LetterResize',
         scale=_base_.img_scale,
         allow_scale_up=True,
         pad_val=dict(img=114.0)),
    dict(type='YOLOv5RandomAffine',
         max_rotate_degree=0.0,
         max_shear_degree=0.0,
         scaling_ratio_range=(1 - _base_.affine_scale,
                              1 + _base_.affine_scale),
         max_aspect_ratio=_base_.max_aspect_ratio,
         border_val=(114, 114, 114),
         min_area_ratio=min_area_ratio,
         use_mask_refine=use_mask2refine), *last_transform
]
train_pipeline_stage2 = [*_train_pipeline_stage2, *text_transform]
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 # data_root='data/coco',
                 data_root='/mnt/nas/TrueNas1/whcai/YOLO-World/coco2017',
                 ann_file='lvis/lvis_v1_train_base.json',
                 data_prefix=dict(img=''),
                 filter_cfg=dict(filter_empty_gt=True, min_size=32)),
    # class_text_path='data/texts/lvis_v1_base_class_texts.json',
    class_text_path='/homedata/whcai/YOLO-World/data/texts/lvis_v1_base_class_captions.json',
    pipeline=train_pipeline)
train_dataloader = dict(persistent_workers=persistent_workers,
                        batch_size=train_batch_size_per_gpu,
                        collate_fn=dict(type='yolow_collate'),
                        dataset=coco_train_dataset)

test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                    'scale_factor', 'pad_param', 'texts'))
]

# training settings
default_hooks = dict(param_scheduler=dict(scheduler_type='linear',
                                          lr_factor=0.01,
                                          max_epochs=max_epochs),
                     checkpoint=dict(max_keep_ckpts=-1,
                                     save_best=None,
                                     interval=save_epoch_intervals))
custom_hooks = [
    dict(type='EMAHook',
         ema_type='ExpMomentumEMA',
         momentum=0.0001,
         update_buffers=True,
         strict_load=False,
         priority=49),
    dict(type='mmdet.PipelineSwitchHook',
         switch_epoch=max_epochs - close_mosaic_epochs,
         switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(max_epochs=max_epochs,
                 val_interval=5,
                 dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                                     _base_.val_interval_stage2)])
optim_wrapper = dict(optimizer=dict(
    _delete_=True,
    type='AdamW',
    lr=base_lr,
    weight_decay=weight_decay,
    batch_size_per_gpu=train_batch_size_per_gpu),
                     paramwise_cfg=dict(bias_decay_mult=0.0,
                                        norm_decay_mult=0.0,
                                        custom_keys={
                                            'backbone.text_model':
                                            dict(lr_mult=0.01),
                                            'logit_scale':
                                            dict(weight_decay=0.0)
                                        }),
                     constructor='YOLOWv5OptimizerConstructor')

# evaluation settings
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5LVISV1Dataset',
                 # data_root='data/coco/',
                 data_root='/mnt/nas/TrueNas1/whcai/YOLO-World/coco2017',
                 test_mode=True,
                 ann_file='lvis/lvis_v1_val.json',
                 data_prefix=dict(img=''),
                 batch_shapes_cfg=None),
    # class_text_path='data/captions/lvis_v1_class_captions.json',
    class_text_path='/homedata/whcai/YOLO-World/data/texts/lvis_v1_class_texts.json',
    pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader

val_evaluator = dict(type='mmdet.LVISMetric',
                     # ann_file='data/coco/lvis/lvis_v1_val.json',
                     ann_file='/mnt/nas/TrueNas1/whcai/YOLO-World/coco2017/lvis/lvis_v1_val.json',
                     metric=['bbox', 'segm'])
test_evaluator = val_evaluator
find_unused_parameters = True
caiweihan commented 5 days ago

My problem is the same as in this issue: https://github.com/AILab-CVC/YOLO-World/issues/381 But that issue was closed, so I don't know whether the author solved it.

caiweihan commented 4 days ago

A few problems I ran into while finetuning the Seg model: why does this issue say you can simply replace lvis_v1_train_base.json with lvis_v1_train.json (https://github.com/AILab-CVC/YOLO-World/issues/330), while this issue says lvis_v1_train_base.json has to be extracted from lvis_v1_train.json with a script (https://github.com/AILab-CVC/YOLO-World/issues/322)?

Also, coco_train_dataset uses class_text_path='data/texts/lvis_v1_base_class_texts.json' and coco_val_dataset uses class_text_path='data/captions/lvis_v1_class_captions.json', but I cannot find either file in the repository. I can only find data/texts/lvis_v1_class_texts.json and data/texts/lvis_v1_base_class_captions.json, so both the subfolder and the file names differ. Is the config wrong, or are there other files that need to be placed in those locations?
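
For reference, this is my current guess at how lvis_v1_train_base.json could be produced from lvis_v1_train.json (only a sketch, based on the usual open-vocabulary split where the 866 frequent + common categories are the base classes and the 337 rare ones are held out; I don't know whether this matches the script mentioned in #322):

import json

with open('lvis/lvis_v1_train.json') as f:
    lvis = json.load(f)

# LVIS v1 categories carry a 'frequency' field: 'r' (rare), 'c' (common), 'f' (frequent);
# keep only the non-rare ("base") categories
base_cats = [c for c in lvis['categories'] if c['frequency'] != 'r']
base_ids = {c['id'] for c in base_cats}

# keep only annotations of base categories, and only images that still have annotations
base_anns = [a for a in lvis['annotations'] if a['category_id'] in base_ids]
img_ids = {a['image_id'] for a in base_anns}
base_imgs = [img for img in lvis['images'] if img['id'] in img_ids]

base = dict(lvis, categories=base_cats, annotations=base_anns, images=base_imgs)
with open('lvis/lvis_v1_train_base.json', 'w') as f:
    json.dump(base, f)

print(len(base_cats), 'base categories,', len(base_anns), 'annotations kept')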