AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Fine tuning on a custom dataset #333

Open MrOCW opened 3 months ago

MrOCW commented 3 months ago

Hi, I'd like to fine-tune YOLO-World on a custom dataset. In this dataset I have images, their respective captions, and bounding boxes matching the captions, e.g.:

    image1 of multiple colored candies. Caption: "blue crystal candy". bbox = [x, y, w, h]
    image2 of multiple colored candies. Caption: "red sweet". bbox = [x, y, w, h]

How should I structure my dataset, number of classes, etc.? Is the number of classes simply the number of unique captions in my training dataset?

wondervictor commented 3 months ago

Hi @MrOCW,

  1. You need to organize your dataset in the COCO format, refer to: https://cocodataset.org/#format-data
  2. You can add the caption for each box annotation (see the sketch after this list).
  3. You need to use MixedGroundingDataset to load your dataset.
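
To make steps 1 and 2 concrete, here is a minimal sketch of a COCO-style annotation file for the candy example with a caption attached to each box annotation. The per-box "caption" key, the file names, and the image sizes are assumptions for illustration; please verify the exact fields MixedGroundingDataset reads in the repo.

    # Minimal sketch of a COCO-style annotation file with a per-box caption.
    # The "caption" key on each annotation, the file names, and the sizes are
    # assumptions for illustration; verify against MixedGroundingDataset.
    import json
    import os

    custom_ann = {
        "images": [
            {"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480},
            {"id": 2, "file_name": "image2.jpg", "width": 640, "height": 480},
        ],
        "annotations": [
            {"id": 1, "image_id": 1, "category_id": 1, "iscrowd": 0,
             "bbox": [100, 120, 40, 40], "area": 40 * 40,  # COCO [x, y, w, h]
             "caption": "blue crystal candy"},
            {"id": 2, "image_id": 2, "category_id": 2, "iscrowd": 0,
             "bbox": [220, 60, 35, 35], "area": 35 * 35,
             "caption": "red sweet"},
        ],
        "categories": [
            {"id": 1, "name": "blue crystal candy"},
            {"id": 2, "name": "red sweet"},
        ],
    }

    os.makedirs("data/candies/annotations", exist_ok=True)
    with open("data/candies/annotations/train.json", "w") as f:
        json.dump(custom_ann, f)

Boxes follow the COCO convention of [x, y, w, h] in absolute pixels.
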
MrOCW commented 3 months ago

@wondervictor

  1. How about num_classes and num_training_classes?
  2. Based on:

    coco_train_dataset = dict(
        _delete_=True,
        type='MultiModalDataset',
        dataset=dict(
            type='YOLOv5CocoDataset',
            data_root='data/coco',
            ann_file='annotations/instances_train2017.json',
            data_prefix=dict(img='train2017/'),
            filter_cfg=dict(filter_empty_gt=False, min_size=32)),
        class_text_path='data/texts/coco_class_texts.json',
        pipeline=train_pipeline)

    What should be in class_text_path? Do I create a list of classes that matches the unique captions in my VLM data? (A sketch of such a file follows this list.)

  3. I am supposed to modify the configs in configs/finetune_coco, right?
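
Regarding class_text_path in item 2: as far as I can tell, the bundled text files such as data/texts/coco_class_texts.json are a JSON list with one list of text prompts per class, so a matching file for the candy captions might look like the sketch below. The file name candy_class_texts.json is hypothetical.

    # Sketch of a class-text JSON, assuming it mirrors the layout of
    # data/texts/coco_class_texts.json (one list of text prompts per class).
    # The file name "candy_class_texts.json" is hypothetical.
    import json
    import os

    class_texts = [
        ["blue crystal candy"],
        ["red sweet"],
    ]

    os.makedirs("data/texts", exist_ok=True)
    with open("data/texts/candy_class_texts.json", "w") as f:
        json.dump(class_texts, f)

If that layout is right, the number of entries in this file (two in this example) would be the class count the config should be consistent with.
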
Unicorn123455678 commented 3 months ago

Regarding item 3 (MixedGroundingDataset): how about num_classes and num_training_classes? What's the difference between them? I need the answer too.

demooooooo0303 commented 3 months ago

Hi @MrOCW,

  1. You need to organize your dataset into a coco-format, refer to: https://cocodataset.org/#format-data
  2. You can add the caption for each box annotation.
  3. You need to use MixedGroundingDataset to load your dataset.

What is MixedGroundingDataset? In the fine-tune config there is type='MultiModalDataset'.
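
For what it's worth, the pretraining configs in the repo seem to use a grounding-style dataset entry instead of MultiModalDataset, which may be what was meant. The snippet below is only a sketch adapted to the candy example: the type string 'YOLOv5MixedGroundingDataset' and every path are my assumptions, so double-check them against the classes in yolo_world/datasets and the pretrain configs.

    # Sketch of a custom dataset entry using the grounding-style loader.
    # The type name 'YOLOv5MixedGroundingDataset' and all paths here are
    # assumptions; check the registered dataset classes in yolo_world/datasets.
    custom_train_dataset = dict(
        _delete_=True,
        type='YOLOv5MixedGroundingDataset',   # assumed registry name
        data_root='data/candies/',            # hypothetical custom data root
        ann_file='annotations/train.json',    # COCO-style file with per-box captions
        data_prefix=dict(img='images/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32),
        pipeline=train_pipeline)
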