AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Hoping for some guidance in open set detection finetuning (Long post) #360

XieKaiwen opened this issue 5 months ago

XieKaiwen commented 5 months ago

Hi, no offense at all to the authors of this repo, who have worked very hard both on the project and on answering my questions. I just want to suggest that there should be a more detailed guide (or more resources) here for people who want to train for open-set detection, because the support for closed-set detection already seems quite sufficient. For example, the example fine-tuning config provided is for the COCO dataset, which is normally used for closed-set detection rather than for MixedGroundingDatasets, and a lot of the settings in the config files in this repo are not tailored specifically to open-set fine-tuning. This has made it harder for me to build on your work, because I was uncertain about exactly what to change to adapt it to my own use case.

That said, the reason I am rather confused and lost with fine-tuning on a MixedGroundingDataset is my strange results (summary below).

Here is a comparison between the pretrained model and my fine-tuned model on the same image with the same caption.

Pretrained model: [image attachment]

Fine-tuned model: [image attachment]

Original image: image_26

Caption used: "blue and white commercial aircraft . red, white, and blue fighter jet . white, black, and grey missile . white drone . black fighter jet . red and white missile . " - at inference time I split the caption into class names on "." instead of "," (see the sketch below).
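
For reference, this is roughly how the caption string becomes per-class text prompts at inference time. It is a minimal sketch; the trailing blank entry mirrors what the official demo script appears to do, so treat the exact details as an assumption rather than the repo's exact code.

caption = ("blue and white commercial aircraft . red, white, and blue fighter jet . "
           "white, black, and grey missile . white drone . black fighter jet . "
           "red and white missile . ")

# One inner list per class name, split on '.', plus a trailing blank entry
# (the official demo appends [' '] as a padding / "no object" text).
texts = [[t.strip()] for t in caption.split('.') if t.strip()] + [[' ']]
# [['blue and white commercial aircraft'], ['red, white, and blue fighter jet'], ...]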

Problem: The bounding boxes themselves have clearly improved, and their precision is better. But I am confused because each item in this picture should have a unique label. Instead, some objects are classified wrongly, and not just wrongly but with a confidence of 1.0, which makes any general post-processing of the predictions impossible. This fine-tuned model was trained for 5 epochs on my MixedGroundingDataset (a dataset example is further below).

My configuration file:

_backend_args = None
_multiscale_resize_transforms = [
    dict(
        transforms=[
            dict(scale=(
                640,
                640,
            ), type='YOLOv5KeepRatioResize'),
            dict(
                allow_scale_up=False,
                pad_val=dict(img=114),
                scale=(
                    640,
                    640,
                ),
                type='LetterResize'),
        ],
        type='Compose'),
    dict(
        transforms=[
            dict(scale=(
                320,
                320,
            ), type='YOLOv5KeepRatioResize'),
            dict(
                allow_scale_up=False,
                pad_val=dict(img=114),
                scale=(
                    320,
                    320,
                ),
                type='LetterResize'),
        ],
        type='Compose'),
    dict(
        transforms=[
            dict(scale=(
                960,
                960,
            ), type='YOLOv5KeepRatioResize'),
            dict(
                allow_scale_up=False,
                pad_val=dict(img=114),
                scale=(
                    960,
                    960,
                ),
                type='LetterResize'),
        ],
        type='Compose'),
]
affine_scale = 0.9
albu_train_transforms = [
    dict(p=0.01, type='Blur'),
    dict(p=0.01, type='MedianBlur'),
    dict(p=0.01, type='ToGray'),
    dict(p=0.01, type='CLAHE'),
]
backend_args = None
base_lr = 0.0002
batch_shapes_cfg = None
close_mosaic_epochs = 5
custom_hooks = [
    dict(
        ema_type='ExpMomentumEMA',
        momentum=0.0001,
        priority=49,
        strict_load=False,
        type='EMAHook',
        update_buffers=True),
    dict(
        switch_epoch=0,
        switch_pipeline=[
            dict(backend_args=None, type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(scale=(
                1536,
                896,
            ), type='YOLOv5KeepRatioResize'),
            dict(
                allow_scale_up=True,
                pad_val=dict(img=114.0),
                scale=(
                    1536,
                    896,
                ),
                type='LetterResize'),
            dict(
                border_val=(
                    114,
                    114,
                    114,
                ),
                max_aspect_ratio=100,
                max_rotate_degree=0.0,
                max_shear_degree=0.0,
                scaling_ratio_range=(
                    0.09999999999999998,
                    1.9,
                ),
                type='YOLOv5RandomAffine'),
            dict(
                bbox_params=dict(
                    format='pascal_voc',
                    label_fields=[
                        'gt_bboxes_labels',
                        'gt_ignore_flags',
                    ],
                    type='BboxParams'),
                keymap=dict(gt_bboxes='bboxes', img='image'),
                transforms=[
                    dict(p=0.01, type='Blur'),
                    dict(p=0.01, type='MedianBlur'),
                    dict(p=0.01, type='ToGray'),
                    dict(p=0.01, type='CLAHE'),
                ],
                type='mmdet.Albu'),
            dict(type='YOLOv5HSVRandomAug'),
            dict(prob=0.5, type='mmdet.RandomFlip'),
            dict(
                max_num_samples=80,
                num_neg_samples=(
                    80,
                    80,
                ),
                padding_to_max=True,
                padding_value='',
                type='RandomLoadText'),
            dict(
                meta_keys=(
                    'img_id',
                    'img_path',
                    'ori_shape',
                    'img_shape',
                    'flip',
                    'flip_direction',
                    'texts',
                ),
                type='mmdet.PackDetInputs'),
        ],
        type='mmdet.PipelineSwitchHook'),
]
custom_imports = dict(
    allow_failed_imports=False, imports=[
        'yolo_world',
    ])
data_root = 'yolo_train_dataset/'
dataset_type = 'YOLOv5MixedGroundingDataset'
deepen_factor = 1.0
default_hooks = dict(
    checkpoint=dict(
        interval=1, max_keep_ckpts=-1, save_best=None, type='CheckpointHook'),
    logger=dict(interval=50, type='LoggerHook'),
    param_scheduler=dict(
        lr_factor=0.01,
        max_epochs=5,
        scheduler_type='linear',
        type='YOLOv5ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(type='mmdet.DetVisualizationHook'))
default_scope = 'mmyolo'
env_cfg = dict(
    cudnn_benchmark=True,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
img_scale = (
    1536,
    896,
)
img_scales = [
    (
        640,
        640,
    ),
    (
        320,
        320,
    ),
    (
        960,
        960,
    ),
]
last_stage_out_channels = 512
last_transform = [
    dict(
        bbox_params=dict(
            format='pascal_voc',
            label_fields=[
                'gt_bboxes_labels',
                'gt_ignore_flags',
            ],
            type='BboxParams'),
        keymap=dict(gt_bboxes='bboxes', img='image'),
        transforms=[
            dict(p=0.01, type='Blur'),
            dict(p=0.01, type='MedianBlur'),
            dict(p=0.01, type='ToGray'),
            dict(p=0.01, type='CLAHE'),
        ],
        type='mmdet.Albu'),
    dict(type='YOLOv5HSVRandomAug'),
    dict(prob=0.5, type='mmdet.RandomFlip'),
    dict(
        meta_keys=(
            'img_id',
            'img_path',
            'ori_shape',
            'img_shape',
            'flip',
            'flip_direction',
        ),
        type='mmdet.PackDetInputs'),
]
launcher = 'pytorch'
load_from = 'pretrained_weights/yolo_world_v2_l_obj365v1_goldg_pretrain_1280ft-9babe3f6.pth'
log_level = 'INFO'
log_processor = dict(by_epoch=True, type='LogProcessor', window_size=50)
loss_bbox_weight = 7.5
loss_cls_weight = 0.5
loss_dfl_weight = 0.375
lr_factor = 0.01
max_aspect_ratio = 100
max_epochs = 5
max_keep_ckpts = 2
mixup_prob = 0.15
model = dict(
    backbone=dict(
        image_model=dict(
            act_cfg=dict(inplace=True, type='SiLU'),
            arch='P5',
            deepen_factor=1.0,
            last_stage_out_channels=512,
            norm_cfg=dict(eps=0.001, momentum=0.03, type='BN'),
            type='YOLOv8CSPDarknet',
            widen_factor=1.0),
        text_model=dict(
            frozen_modules=[
                'all',
            ],
            model_name='./configs/openai/clip-vit-base-patch32',
            type='HuggingCLIPLanguageBackbone'),
        type='MultiModalYOLOBackbone'),
    bbox_head=dict(
        bbox_coder=dict(type='DistancePointBBoxCoder'),
        head_module=dict(
            act_cfg=dict(inplace=True, type='SiLU'),
            embed_dims=512,
            featmap_strides=[
                8,
                16,
                32,
            ],
            in_channels=[
                256,
                512,
                512,
            ],
            norm_cfg=dict(eps=0.001, momentum=0.03, type='BN'),
            num_classes=80,
            reg_max=16,
            type='YOLOWorldHeadModule',
            widen_factor=1.0),
        loss_bbox=dict(
            bbox_format='xyxy',
            iou_mode='ciou',
            loss_weight=7.5,
            reduction='sum',
            return_iou=False,
            type='IoULoss'),
        loss_cls=dict(
            loss_weight=0.5,
            reduction='none',
            type='mmdet.CrossEntropyLoss',
            use_sigmoid=True),
        loss_dfl=dict(
            loss_weight=0.375,
            reduction='mean',
            type='mmdet.DistributionFocalLoss'),
        prior_generator=dict(
            offset=0.5, strides=[
                8,
                16,
                32,
            ], type='mmdet.MlvlPointGenerator'),
        type='YOLOWorldHead'),
    data_preprocessor=dict(
        bgr_to_rgb=True,
        mean=[
            0.0,
            0.0,
            0.0,
        ],
        std=[
            255.0,
            255.0,
            255.0,
        ],
        type='YOLOWDetDataPreprocessor'),
    mm_neck=True,
    neck=dict(
        act_cfg=dict(inplace=True, type='SiLU'),
        block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv'),
        deepen_factor=1.0,
        embed_channels=[
            128,
            256,
            256,
        ],
        guide_channels=512,
        in_channels=[
            256,
            512,
            512,
        ],
        norm_cfg=dict(eps=0.001, momentum=0.03, type='BN'),
        num_csp_blocks=3,
        num_heads=[
            4,
            8,
            8,
        ],
        out_channels=[
            256,
            512,
            512,
        ],
        text_enhancder=dict(
            embed_channels=256,
            num_heads=8,
            type='ImagePoolingAttentionModule'),
        type='YOLOWorldDualPAFPN',
        widen_factor=1.0),
    num_test_classes=80,
    num_train_classes=80,
    test_cfg=dict(
        max_per_img=300,
        multi_label=True,
        nms=dict(iou_threshold=0.7, type='nms'),
        nms_pre=30000,
        score_thr=0.001),
    train_cfg=dict(
        assigner=dict(
            alpha=0.5,
            beta=6.0,
            eps=1e-09,
            num_classes=80,
            topk=10,
            type='BatchTaskAlignedAssigner',
            use_ciou=True)),
    type='YOLOWorldDetector')
model_test_cfg = dict(
    max_per_img=300,
    multi_label=True,
    nms=dict(iou_threshold=0.7, type='nms'),
    nms_pre=30000,
    score_thr=0.001)
mosaic_affine_transform = [
    dict(
        img_scale=(
            1536,
            896,
        ),
        pad_val=114.0,
        pre_transform=[
            dict(backend_args=None, type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
        ],
        type='MultiModalMosaic'),
    dict(
        border=(
            -768,
            -448,
        ),
        border_val=(
            114,
            114,
            114,
        ),
        max_aspect_ratio=100.0,
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        scaling_ratio_range=(
            0.09999999999999998,
            1.9,
        ),
        type='YOLOv5RandomAffine'),
]
neck_embed_channels = [
    128,
    256,
    256,
]
neck_num_heads = [
    4,
    8,
    8,
]
norm_cfg = dict(eps=0.001, momentum=0.03, type='BN')
num_classes = 80
num_det_layers = 3
num_training_classes = 80
optim_wrapper = dict(
    clip_grad=dict(max_norm=10.0),
    constructor='YOLOWv5OptimizerConstructor',
    loss_scale='dynamic',
    optimizer=dict(
        batch_size_per_gpu=4, lr=0.0002, type='AdamW', weight_decay=0.05),
    paramwise_cfg=dict(
        custom_keys=dict({
            'backbone.text_model': dict(lr_mult=0.01),
            'logit_scale': dict(weight_decay=0.0)
        })),
    type='AmpOptimWrapper')
param_scheduler = None
persistent_workers = True
pre_transform = [
    dict(backend_args=None, type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
]
resume = False
save_epoch_intervals = 1
strides = [
    8,
    16,
    32,
]
tal_alpha = 0.5
tal_beta = 6.0
tal_topk = 10
text_channels = 512
text_transform = [
    dict(
        max_num_samples=80,
        num_neg_samples=(
            80,
            80,
        ),
        padding_to_max=True,
        padding_value='',
        type='RandomLoadText'),
    dict(
        meta_keys=(
            'img_id',
            'img_path',
            'ori_shape',
            'img_shape',
            'flip',
            'flip_direction',
            'texts',
        ),
        type='mmdet.PackDetInputs'),
]
train_ann_file = 'annotations/yolo_world_train.json'
train_batch_size_per_gpu = 4
train_cfg = dict(
    dynamic_intervals=[
        (
            0,
            20,
        ),
    ],
    max_epochs=5,
    type='EpochBasedTrainLoop',
    val_interval=100)
train_data_prefix = 'unaug_padded_images/'
train_dataloader = dict(
    batch_size=4,
    collate_fn=dict(type='yolow_collate'),
    dataset=dict(
        datasets=[
            dict(
                ann_file='annotations/yolo_world_train.json',
                data_prefix=dict(img='unaug_padded_images/'),
                data_root='../yolo_train_dataset/',
                filter_cfg=dict(filter_empty_gt=False, min_size=32),
                pipeline=[
                    dict(backend_args=None, type='LoadImageFromFile'),
                    dict(type='LoadAnnotations', with_bbox=True),
                    dict(
                        img_scale=(
                            1536,
                            896,
                        ),
                        pad_val=114.0,
                        pre_transform=[
                            dict(backend_args=None, type='LoadImageFromFile'),
                            dict(type='LoadAnnotations', with_bbox=True),
                        ],
                        type='MultiModalMosaic'),
                    dict(
                        border=(
                            -768,
                            -448,
                        ),
                        border_val=(
                            114,
                            114,
                            114,
                        ),
                        max_aspect_ratio=100.0,
                        max_rotate_degree=0.0,
                        max_shear_degree=0.0,
                        scaling_ratio_range=(
                            0.09999999999999998,
                            1.9,
                        ),
                        type='YOLOv5RandomAffine'),
                    dict(
                        pre_transform=[
                            dict(backend_args=None, type='LoadImageFromFile'),
                            dict(type='LoadAnnotations', with_bbox=True),
                            dict(
                                img_scale=(
                                    1536,
                                    896,
                                ),
                                pad_val=114.0,
                                pre_transform=[
                                    dict(
                                        backend_args=None,
                                        type='LoadImageFromFile'),
                                    dict(
                                        type='LoadAnnotations',
                                        with_bbox=True),
                                ],
                                type='MultiModalMosaic'),
                            dict(
                                border=(
                                    -768,
                                    -448,
                                ),
                                border_val=(
                                    114,
                                    114,
                                    114,
                                ),
                                max_aspect_ratio=100.0,
                                max_rotate_degree=0.0,
                                max_shear_degree=0.0,
                                scaling_ratio_range=(
                                    0.09999999999999998,
                                    1.9,
                                ),
                                type='YOLOv5RandomAffine'),
                        ],
                        prob=0.15,
                        type='YOLOv5MultiModalMixUp'),
                    dict(
                        bbox_params=dict(
                            format='pascal_voc',
                            label_fields=[
                                'gt_bboxes_labels',
                                'gt_ignore_flags',
                            ],
                            type='BboxParams'),
                        keymap=dict(gt_bboxes='bboxes', img='image'),
                        transforms=[
                            dict(p=0.01, type='Blur'),
                            dict(p=0.01, type='MedianBlur'),
                            dict(p=0.01, type='ToGray'),
                            dict(p=0.01, type='CLAHE'),
                        ],
                        type='mmdet.Albu'),
                    dict(type='YOLOv5HSVRandomAug'),
                    dict(prob=0.5, type='mmdet.RandomFlip'),
                    dict(
                        max_num_samples=80,
                        num_neg_samples=(
                            80,
                            80,
                        ),
                        padding_to_max=True,
                        padding_value='',
                        type='RandomLoadText'),
                    dict(
                        meta_keys=(
                            'img_id',
                            'img_path',
                            'ori_shape',
                            'img_shape',
                            'flip',
                            'flip_direction',
                            'texts',
                        ),
                        type='mmdet.PackDetInputs'),
                ],
                type='YOLOv5MixedGroundingDataset'),
        ],
        ignore_keys=[
            'classes',
            'palette',
        ],
        type='ConcatDataset'),
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
train_num_workers = 4
train_pipeline = [
    dict(backend_args=None, type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        img_scale=(
            1536,
            896,
        ),
        pad_val=114.0,
        pre_transform=[
            dict(backend_args=None, type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
        ],
        type='MultiModalMosaic'),
    dict(
        border=(
            -768,
            -448,
        ),
        border_val=(
            114,
            114,
            114,
        ),
        max_aspect_ratio=100.0,
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        scaling_ratio_range=(
            0.09999999999999998,
            1.9,
        ),
        type='YOLOv5RandomAffine'),
    dict(
        pre_transform=[
            dict(backend_args=None, type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                img_scale=(
                    1536,
                    896,
                ),
                pad_val=114.0,
                pre_transform=[
                    dict(backend_args=None, type='LoadImageFromFile'),
                    dict(type='LoadAnnotations', with_bbox=True),
                ],
                type='MultiModalMosaic'),
            dict(
                border=(
                    -768,
                    -448,
                ),
                border_val=(
                    114,
                    114,
                    114,
                ),
                max_aspect_ratio=100.0,
                max_rotate_degree=0.0,
                max_shear_degree=0.0,
                scaling_ratio_range=(
                    0.09999999999999998,
                    1.9,
                ),
                type='YOLOv5RandomAffine'),
        ],
        prob=0.15,
        type='YOLOv5MultiModalMixUp'),
    dict(
        bbox_params=dict(
            format='pascal_voc',
            label_fields=[
                'gt_bboxes_labels',
                'gt_ignore_flags',
            ],
            type='BboxParams'),
        keymap=dict(gt_bboxes='bboxes', img='image'),
        transforms=[
            dict(p=0.01, type='Blur'),
            dict(p=0.01, type='MedianBlur'),
            dict(p=0.01, type='ToGray'),
            dict(p=0.01, type='CLAHE'),
        ],
        type='mmdet.Albu'),
    dict(type='YOLOv5HSVRandomAug'),
    dict(prob=0.5, type='mmdet.RandomFlip'),
    dict(
        max_num_samples=80,
        num_neg_samples=(
            80,
            80,
        ),
        padding_to_max=True,
        padding_value='',
        type='RandomLoadText'),
    dict(
        meta_keys=(
            'img_id',
            'img_path',
            'ori_shape',
            'img_shape',
            'flip',
            'flip_direction',
            'texts',
        ),
        type='mmdet.PackDetInputs'),
]
train_pipeline_stage2 = [
    dict(backend_args=None, type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(scale=(
        1536,
        896,
    ), type='YOLOv5KeepRatioResize'),
    dict(
        allow_scale_up=True,
        pad_val=dict(img=114.0),
        scale=(
            1536,
            896,
        ),
        type='LetterResize'),
    dict(
        border_val=(
            114,
            114,
            114,
        ),
        max_aspect_ratio=100,
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        scaling_ratio_range=(
            0.09999999999999998,
            1.9,
        ),
        type='YOLOv5RandomAffine'),
    dict(
        bbox_params=dict(
            format='pascal_voc',
            label_fields=[
                'gt_bboxes_labels',
                'gt_ignore_flags',
            ],
            type='BboxParams'),
        keymap=dict(gt_bboxes='bboxes', img='image'),
        transforms=[
            dict(p=0.01, type='Blur'),
            dict(p=0.01, type='MedianBlur'),
            dict(p=0.01, type='ToGray'),
            dict(p=0.01, type='CLAHE'),
        ],
        type='mmdet.Albu'),
    dict(type='YOLOv5HSVRandomAug'),
    dict(prob=0.5, type='mmdet.RandomFlip'),
    dict(
        max_num_samples=80,
        num_neg_samples=(
            80,
            80,
        ),
        padding_to_max=True,
        padding_value='',
        type='RandomLoadText'),
    dict(
        meta_keys=(
            'img_id',
            'img_path',
            'ori_shape',
            'img_shape',
            'flip',
            'flip_direction',
            'texts',
        ),
        type='mmdet.PackDetInputs'),
]
tta_model = dict(
    tta_cfg=dict(max_per_img=300, nms=dict(iou_threshold=0.65, type='nms')),
    type='mmdet.DetTTAModel')
tta_pipeline = [
    dict(backend_args=None, type='LoadImageFromFile'),
    dict(
        transforms=[
            [
                dict(
                    transforms=[
                        dict(scale=(
                            640,
                            640,
                        ), type='YOLOv5KeepRatioResize'),
                        dict(
                            allow_scale_up=False,
                            pad_val=dict(img=114),
                            scale=(
                                640,
                                640,
                            ),
                            type='LetterResize'),
                    ],
                    type='Compose'),
                dict(
                    transforms=[
                        dict(scale=(
                            320,
                            320,
                        ), type='YOLOv5KeepRatioResize'),
                        dict(
                            allow_scale_up=False,
                            pad_val=dict(img=114),
                            scale=(
                                320,
                                320,
                            ),
                            type='LetterResize'),
                    ],
                    type='Compose'),
                dict(
                    transforms=[
                        dict(scale=(
                            960,
                            960,
                        ), type='YOLOv5KeepRatioResize'),
                        dict(
                            allow_scale_up=False,
                            pad_val=dict(img=114),
                            scale=(
                                960,
                                960,
                            ),
                            type='LetterResize'),
                    ],
                    type='Compose'),
            ],
            [
                dict(prob=1.0, type='mmdet.RandomFlip'),
                dict(prob=0.0, type='mmdet.RandomFlip'),
            ],
            [
                dict(type='mmdet.LoadAnnotations', with_bbox=True),
            ],
            [
                dict(
                    meta_keys=(
                        'img_id',
                        'img_path',
                        'ori_shape',
                        'img_shape',
                        'scale_factor',
                        'pad_param',
                        'flip',
                        'flip_direction',
                    ),
                    type='mmdet.PackDetInputs'),
            ],
        ],
        type='TestTimeAug'),
]
val_interval_stage2 = 20
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    name='visualizer',
    type='mmdet.DetLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ])
vlm_train_dataset = dict(
    ann_file='annotations/yolo_world_train.json',
    data_prefix=dict(img='unaug_padded_images/'),
    data_root='../yolo_train_dataset/',
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=[
        dict(backend_args=None, type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True),
        dict(
            img_scale=(
                1536,
                896,
            ),
            pad_val=114.0,
            pre_transform=[
                dict(backend_args=None, type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
            ],
            type='MultiModalMosaic'),
        dict(
            border=(
                -768,
                -448,
            ),
            border_val=(
                114,
                114,
                114,
            ),
            max_aspect_ratio=100.0,
            max_rotate_degree=0.0,
            max_shear_degree=0.0,
            scaling_ratio_range=(
                0.09999999999999998,
                1.9,
            ),
            type='YOLOv5RandomAffine'),
        dict(
            pre_transform=[
                dict(backend_args=None, type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    img_scale=(
                        1536,
                        896,
                    ),
                    pad_val=114.0,
                    pre_transform=[
                        dict(backend_args=None, type='LoadImageFromFile'),
                        dict(type='LoadAnnotations', with_bbox=True),
                    ],
                    type='MultiModalMosaic'),
                dict(
                    border=(
                        -768,
                        -448,
                    ),
                    border_val=(
                        114,
                        114,
                        114,
                    ),
                    max_aspect_ratio=100.0,
                    max_rotate_degree=0.0,
                    max_shear_degree=0.0,
                    scaling_ratio_range=(
                        0.09999999999999998,
                        1.9,
                    ),
                    type='YOLOv5RandomAffine'),
            ],
            prob=0.15,
            type='YOLOv5MultiModalMixUp'),
        dict(
            bbox_params=dict(
                format='pascal_voc',
                label_fields=[
                    'gt_bboxes_labels',
                    'gt_ignore_flags',
                ],
                type='BboxParams'),
            keymap=dict(gt_bboxes='bboxes', img='image'),
            transforms=[
                dict(p=0.01, type='Blur'),
                dict(p=0.01, type='MedianBlur'),
                dict(p=0.01, type='ToGray'),
                dict(p=0.01, type='CLAHE'),
            ],
            type='mmdet.Albu'),
        dict(type='YOLOv5HSVRandomAug'),
        dict(prob=0.5, type='mmdet.RandomFlip'),
        dict(
            max_num_samples=80,
            num_neg_samples=(
                80,
                80,
            ),
            padding_to_max=True,
            padding_value='',
            type='RandomLoadText'),
        dict(
            meta_keys=(
                'img_id',
                'img_path',
                'ori_shape',
                'img_shape',
                'flip',
                'flip_direction',
                'texts',
            ),
            type='mmdet.PackDetInputs'),
    ],
    type='YOLOv5MixedGroundingDataset')
weight_decay = 0.05
widen_factor = 1.0
work_dir = 'training_epochs/retrain_unaug_images_5epochs'

I removed the test and validation parts of the configuration file because I do not have a validation dataset. I set img_scale to (1536, 896) and padded my images to match.

Below are examples of my dataset: [three image attachments]

Below are the logs for my training epochs (1st vs 5th)

2024/05/28 12:54:25 - mmengine - INFO - Epoch(train) [1][  50/1277]  base_lr: 2.0000e-04 lr: 2.5581e-06  eta: 2:06:56  time: 1.2023  data_time: 0.0481  memory: 10132  grad_norm: nan  loss: 37.3969  loss_cls: 25.2205  loss_bbox: 7.0688  loss_dfl: 5.1076
2024/05/28 12:55:07 - mmengine - INFO - Epoch(train) [1][ 100/1277]  base_lr: 2.0000e-04 lr: 5.1684e-06  eta: 1:46:33  time: 0.8322  data_time: 0.0065  memory: 9217  grad_norm: 288.2513  loss: 35.2861  loss_cls: 24.9258  loss_bbox: 5.8003  loss_dfl: 4.5601
2024/05/28 12:55:50 - mmengine - INFO - Epoch(train) [1][ 150/1277]  base_lr: 2.0000e-04 lr: 7.7786e-06  eta: 1:40:19  time: 0.8619  data_time: 0.0074  memory: 9218  grad_norm: 246.2677  loss: 33.5517  loss_cls: 24.4261  loss_bbox: 5.0191  loss_dfl: 4.1065
2024/05/28 12:56:33 - mmengine - INFO - Epoch(train) [1][ 200/1277]  base_lr: 2.0000e-04 lr: 1.0389e-05  eta: 1:36:50  time: 0.8612  data_time: 0.0072  memory: 9217  grad_norm: 258.0434  loss: 32.1251  loss_cls: 23.8915  loss_bbox: 4.3540  loss_dfl: 3.8796
2024/05/28 12:57:18 - mmengine - INFO - Epoch(train) [1][ 250/1277]  base_lr: 2.0000e-04 lr: 1.2999e-05  eta: 1:35:13  time: 0.8985  data_time: 0.0082  memory: 9217  grad_norm: 253.4781  loss: 31.3735  loss_cls: 23.3204  loss_bbox: 4.3102  loss_dfl: 3.7428
2024/05/28 12:58:02 - mmengine - INFO - Epoch(train) [1][ 300/1277]  base_lr: 2.0000e-04 lr: 1.5610e-05  eta: 1:33:37  time: 0.8831  data_time: 0.0077  memory: 9217  grad_norm: 234.5045  loss: 30.9059  loss_cls: 22.9524  loss_bbox: 4.2570  loss_dfl: 3.6965
2024/05/28 14:39:47 - mmengine - INFO - Epoch(train) [5][1100/1277]  base_lr: 2.0000e-04 lr: 8.1200e-05  eta: 0:03:00  time: 1.3310  data_time: 0.0070  memory: 9217  grad_norm: 68.3267  loss: 22.0745  loss_cls: 14.8527  loss_bbox: 3.8226  loss_dfl: 3.3991
2024/05/28 14:40:55 - mmengine - INFO - Epoch(train) [5][1150/1277]  base_lr: 2.0000e-04 lr: 8.1200e-05  eta: 0:02:10  time: 1.3614  data_time: 0.0064  memory: 9217  grad_norm: 64.3287  loss: 22.1809  loss_cls: 14.9217  loss_bbox: 3.8726  loss_dfl: 3.3866
2024/05/28 14:41:58 - mmengine - INFO - Epoch(train) [5][1200/1277]  base_lr: 2.0000e-04 lr: 8.1200e-05  eta: 0:01:19  time: 1.2616  data_time: 0.0058  memory: 9218  grad_norm: 72.3350  loss: 22.2899  loss_cls: 14.9003  loss_bbox: 3.9856  loss_dfl: 3.4040
2024/05/28 14:43:00 - mmengine - INFO - Epoch(train) [5][1250/1277]  base_lr: 2.0000e-04 lr: 8.1200e-05  eta: 0:00:27  time: 1.2284  data_time: 0.0062  memory: 9217  grad_norm: 63.7254  loss: 22.1008  loss_cls: 14.8993  loss_bbox: 3.8408  loss_dfl: 3.3607

As can be seen, the loss trends downward, which is a good sign (and consistent with the better bounding boxes). But that still does not explain why the confidence of almost every prediction is so close to 1.0, with low-confidence boxes being very rare.

Hence I am wondering whether I made a mistake in my training process, my configuration file, or my dataset, or whether I simply have not trained the model enough. (I previously ran 20-epoch and 30-epoch trainings as well and the same thing happened; this run was only 5 epochs because I am experimenting to try to isolate the issue.)

Summary: training runs of various lengths all lead to questionable results: bounding boxes with suspicious confidence values, even when misclassified. I am asking for help in any way or form to diagnose the root cause. I personally suspect I edited something wrongly in my configuration file or dataset, or simply did not use good enough hyperparameters. Thanks for any help provided, it is very appreciated!

wondervictor commented 5 months ago

Hi @XieKaiwen, I'll reply to you as soon as possible. Indeed, it's an important issue.

MarxMelencio commented 5 months ago

Hi @wondervictor ... Any update regarding this? Thanks!

aliencaocao commented 5 months ago

Hi fellow TIL 2024 participant,

Two things I can point out from the info you provided:

  1. You set num_classes to 80, which is the number of COCO classes, but have you checked that you actually have exactly 80 unique captions/flying objects? If not, you should set it accordingly; otherwise your model is forced to learn the wrong class/text pairing.
  2. As for the confidence scores being too extreme, you can use a technique called label smoothing, which changes all ground-truth labels from 1.0 to 0.9 (assuming label smoothing = 0.1) and all negative labels from 0.0 to 0.1 (sketched just below this list). This usually helps with overfitting and the extreme confidence scores it causes.
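
A minimal PyTorch sketch of the label-smoothing idea, using the 0.9/0.1 values above. As far as I know the YOLO-World configs do not expose a label-smoothing option directly, so you would have to wire something like this into the classification loss yourself; treat it as a generic illustration, not this repo's API.

import torch
import torch.nn.functional as F

def smooth_bce_targets(targets: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    # Positives become 1 - eps (0.9), negatives become eps (0.1).
    return targets * (1.0 - eps) + (1.0 - targets) * eps

logits = torch.randn(4, 80)        # per-prediction class logits (dummy values)
targets = torch.zeros(4, 80)
targets[0, 3] = 1.0                # one positive ground-truth label
loss = F.binary_cross_entropy_with_logits(logits, smooth_bce_targets(targets))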

As an add-on, the folks at Ultralytics have implemented YOLO-World v2 training in a much easier-to-use way; you can check that out (a rough usage sketch is below). This repository is more for academic use, which could explain its general lack of documentation, but it offers much higher flexibility, which you probably don't need yet.
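
A sketch of the Ultralytics side, from memory; the weight file name, the data YAML, and the argument names are assumptions, so check their docs before relying on this.

from ultralytics import YOLOWorld

model = YOLOWorld("yolov8l-worldv2.pt")            # assumed weight name
# Fine-tune on a dataset described by a YAML file (hypothetical file name).
model.train(data="custom_data.yaml", epochs=5, imgsz=640)
# Open-vocabulary inference with your own prompts.
model.set_classes(["commercial aircraft", "fighter jet", "missile", "drone"])
results = model.predict("image_26.jpg", conf=0.25)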

CZ999-07 commented 3 months ago

How do you define your custom dataset and what method do you use to add the caption?

XieKaiwen commented 3 months ago

@CZ999-07 I followed the format of the LVIS dataset. For the captions, I merged the names of all the different objects in the image into one sentence, then used the positive and negative tokens in the annotations to point at each one.

CZ999-07 commented 3 months ago


The custom dataset I used for fine-tuning is just the traditional YOLOv5 format converted to COCO format; nothing else has changed, and it has no captions. I want to give the model some linguistic cues so it is more accurate when reasoning, so I think the dataset should have captions for that. Should there be one caption per category, or one for all targets in a picture? For example, for a picture of an apple tree with a lot of fruit on it, is there just one caption for the apple category, or one for every apple? Also, is there any script or method required to add captions to a dataset? I am a novice in this area, thanks for answering this question.

XieKaiwen commented 3 months ago

@CZ999-07 Your captions really depend on how you want to use the model. I merged the names of all the different classes into one sentence, e.g. "plane . helicopter . missile .", because the only thing I was doing was open-set detection and nothing more, so I did not have to include any complex reasoning in my captions (e.g. "upside down plane" or something like that).

So in your apple example, if you want to detect each apple individually, you can follow my format, which is just "apple". But if you only want to detect the apples on the tree, then your caption needs to be more specific, like "apples on the tree" instead of just "apple". In the end it depends on what you want to do with your model: if you want a bounding box over every apple in the picture, you just need to use "apple" as the text prompt.

If you want the model to take in more of the linguistic cues, you can also experiment with the text threshold hyperparameter (in one of the config files, if I remember correctly). I think the positive and negative tokens you indicate for each annotation will also affect how the model learns.

As for a script to add captions to a dataset: if your captions follow a very fixed structure, you can consider writing code for it (a rough sketch is below); if you don't think that is feasible, manual annotation might be needed.
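
Something along these lines is what I mean: join the class names into one caption and record the character span of each name, assuming the usual [start, end] tokens_positive convention. This is a sketch, not tested against the dataset loader.

def build_caption_and_spans(class_names):
    # Join class names into "a . b . c . " and return each name's
    # [start, end) character span inside that caption.
    parts, spans, cursor = [], [], 0
    for name in class_names:
        spans.append([cursor, cursor + len(name)])
        parts.append(name)
        cursor += len(name) + len(" . ")
    return " . ".join(parts) + " . ", spans

caption, spans = build_caption_and_spans(["white drone", "black fighter jet"])
# caption == "white drone . black fighter jet . "
# spans   == [[0, 11], [14, 31]]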

I will insert a snippet below to show what the dataset format is like, because I also had problems with it when I was working on this. You have to go and find the Flickr dataset yourself, because the link in this GitHub repo does not work (only the link on the Ultralytics page worked). You can reference the Flickr dataset format to create your own dataset.
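
Roughly, an entry in that format looks like the following. The values here are made up for illustration rather than copied from my file, and the field names follow the Flickr/GoldG-style COCO JSON as I understand it, with the caption stored per image and tokens_positive spans stored per annotation.

example_annotation_file = {
    "images": [{
        "id": 26,
        "file_name": "image_26.jpg",
        "height": 896,
        "width": 1536,
        "caption": "white drone . black fighter jet . ",
    }],
    "annotations": [{
        "id": 1,
        "image_id": 26,
        "bbox": [100.0, 120.0, 80.0, 60.0],     # [x, y, w, h]
        "area": 4800.0,
        "iscrowd": 0,
        "category_id": 1,
        "tokens_positive": [[0, 11]],           # span of "white drone" in the caption
    }],
    "categories": [{"id": 1, "name": "object"}],
}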

You can find the Flickr dataset under the prepare-datasets section of their docs.

Note: I do not know whether open-set detection here has been fixed yet; it seems there is an issue with it that the repo author is still working on as well.

CZ999-07 commented 3 months ago


Are you using the YOLO-World integrated into YOLOv8 (Ultralytics), or the source code from AILab?

XieKaiwen commented 3 months ago

@CZ999-07 I used the code from this GitHub repo.