AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Zero-shot performance about YOLOWorldPromptDetector #154

Open taofuyu opened 4 months ago

taofuyu commented 4 months ago

I ran into the same problem as in #71 and #78. I modified the config in configs/prompt_tuning_coco/, generated a custom embedding file, and fine-tuned on my dataset, which has 4 categories. At inference time, I generate a new embedding file with 7 categories (the 4 classes seen during training plus 3 new classes) and replace the old embedding file in the config. The 3 new classes CANNOT be detected, even with the score threshold set to 0.01. The model seems to have lost its open-vocabulary/zero-shot ability.
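
For reference, this is roughly how I generate the embedding file (a minimal sketch using the HuggingFace CLIP text encoder; the class names and output path are placeholders, and the repo may provide its own script for this):

import numpy as np
import torch
from transformers import CLIPModel, CLIPTokenizer

# 4 classes seen in training + 3 new classes for zero-shot testing (placeholder names)
classes = ["car", "truck", "bus", "van", "bicycle", "scooter", "trailer"]

model_name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(model_name)
clip = CLIPModel.from_pretrained(model_name)

with torch.no_grad():
    inputs = tokenizer(classes, padding=True, return_tensors="pt")
    feats = clip.get_text_features(**inputs)           # (num_classes, 512)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize as CLIP expects

# This .npy file is what the embedding path in the config points to.
np.save("custom_7_classes.npy", feats.numpy())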

wondervictor commented 4 months ago

Hi @taofuyu, you need to freeze all parameters (backbone, head, and neck) except the embeddings. However, I need to double-check whether all layers are frozen.
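At the PyTorch level that amounts to something like the following (just a sketch; in practice the freezing is driven by the config, and the attribute name containing "embeddings" is an assumption to be checked against the actual module names):

# model is the built YOLOWorldPromptDetector (e.g. runner.model)
for name, param in model.named_parameters():
    # keep only the learnable prompt embeddings trainable
    param.requires_grad = "embeddings" in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable), "trainable parameter tensors:", trainable)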

taofuyu commented 4 months ago

OK, I will give it a try and update with the results.

wondervictor commented 4 months ago

You can evaluate the 4-category detection and 3-category detection separately and then perform the joint evaluation.

taofuyu commented 4 months ago

But if the parameters of the backbone, head, and neck are all frozen, and the only updated parameters (the embeddings) are not saved to disk (at inference time the pre-computed embedding file is still used), then effectively nothing in the model has changed?

taofuyu commented 4 months ago

This seems to confirm my suspicion. After running 10 epochs, the model can only detect 'car', a category that appears in the pre-training datasets; the other new categories cannot be detected (they can be detected when the model is not frozen).

Hudaodao99 commented 4 months ago

@taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear about the all_fine_tuning config file.

taofuyu commented 4 months ago

> @taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear about the all_fine_tuning config file.

You can compare the two config files, e.g. with VS Code's diff. The main difference is the value of freeze_all (True or False).
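
Roughly, the difference looks like this (a sketch of the relevant config fragment; key names such as freeze_prompt and embedding_path reflect my understanding and should be verified against the files in configs/):

# prompt tuning: the detector is frozen, only the prompt embeddings are optimized
model = dict(
    type='YOLOWorldPromptDetector',
    embedding_path='embeddings/your_custom_classes.npy',  # pre-computed text embeddings
    freeze_prompt=False,                                   # embeddings remain trainable
    neck=dict(freeze_all=True),
    bbox_head=dict(head_module=dict(freeze_all=True)))

# all fine-tuning: same structure, but freeze_all=False so backbone/neck/head all update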

wondervictor commented 4 months ago

@Hudaodao99 It's my fault, I should have started a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters and removes the need for a text encoder.

taofuyu commented 4 months ago

> But if the parameters of the backbone, head, and neck are all frozen, and the only updated parameters (the embeddings) are not saved to disk (at inference time the pre-computed embedding file is still used), then effectively nothing in the model has changed?

@wondervictor

Hudaodao99 commented 4 months ago

> @Hudaodao99 It's my fault, I should have started a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters and removes the need for a text encoder.

Thanks for your answer!

wondervictor commented 4 months ago

> But if the parameters of the backbone, head, and neck are all frozen, and the only updated parameters (the embeddings) are not saved to disk (at inference time the pre-computed embedding file is still used), then effectively nothing in the model has changed?

@taofuyu I'll check it.

Hudaodao99 commented 4 months ago

@taofuyu I met the same problem. But during prompt tuning on my custom dataset (10 classes), I find that if I pass fewer than 10 prompt texts, I get an error like the one below (I passed just 2 prompt texts, neither of which is in my dataset, while the configured number of classes is more than 2):

class= [1 2 4 4 3 6]
confidence= [0.97107273 0.90503085 0.8864812 0.86314565 0.32898653 0.20567985]
Traceback (most recent call last):
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 198, in <module>
    inference_detector(runner,
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 108, in inference_detector
    labels = [
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 109, in <listcomp>
    f"{texts[class_id][0]} {confidence:0.2f}" for class_id, confidence in
IndexError: list index out of range

Have you run into the same issue?

taofuyu commented 4 months ago

> @taofuyu I met the same problem. But during prompt tuning on my custom dataset (10 classes), I find that if I pass fewer than 10 prompt texts, I get an error like the one below (I passed just 2 prompt texts, neither of which is in my dataset, while the configured number of classes is more than 2):
>
> IndexError: list index out of range
>
> Have you run into the same issue?

The detection results still follow the embeddings/num_classes set in the config, while the texts are whatever you type on the command line; if the two counts differ, the dimensions don't match. The correct approach is: at test time, generate new embeddings for exactly the classes you want to test, update num_classes accordingly, and keep them consistent with the texts passed on the command line.
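
In other words, the three must describe the same class list. A quick sanity check before running the demo (a sketch; the class names and file paths are placeholders):

import numpy as np

# classes passed on the command line to image_demo.py
texts = [[t] for t in ["plate", "helmet"]]

# embeddings and num_classes configured in the model
embeddings = np.load("embeddings/test_2_classes.npy")  # regenerated for exactly these classes
num_classes = 2                                         # must match the config

assert embeddings.shape[0] == len(texts) == num_classes, (
    "texts, embedding rows and num_classes must match; otherwise a predicted "
    "class_id can index past len(texts) and raise IndexError")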

Hudaodao99 commented 4 months ago

> The detection results still follow the embeddings/num_classes set in the config, while the texts are whatever you type on the command line; if the two counts differ, the dimensions don't match. The correct approach is: at test time, generate new embeddings for exactly the classes you want to test, update num_classes accordingly, and keep them consistent with the texts passed on the command line.

Thanks!

taofuyu commented 4 months ago

I'm trying to find a way out of this issue, so I've been learning more about OVD algorithms. The MM-Grounding-DINO documentation mentions that closed-set fine-tuning loses OVD generality. Maybe this is the reason why my model cannot detect the 3 new classes; I'm not sure. You can take this as a reference. @wondervictor

taofuyu commented 4 months ago

Furthermore, it mentions that mixing COCO data with some of the pre-training data improves performance on the COCO dataset as much as possible without compromising generalization. My experiments confirm this: I mixed Flickr30k/GQA with my custom data to train YOLOWorldDetector, and the model can detect my categories while retaining its OVD ability. But if so, it means YOLOWorldPromptDetector can only be fine-tuned as a closed-set detector, because grounding data cannot be used when training YOLOWorldPromptDetector.
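
The mixing itself is just a ConcatDataset in the config, e.g. (a sketch modeled on the pre-training configs; paths, annotation files and the pipeline variable are placeholders):

custom_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='data/custom/',
        ann_file='annotations/train.json',
        data_prefix=dict(img='images/')),
    class_text_path='data/texts/custom_class_texts.json',
    pipeline=train_pipeline)

flickr_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/flickr/',
    ann_file='annotations/final_flickr_separateGT_train.json',
    data_prefix=dict(img='images/'),
    pipeline=train_pipeline)

train_dataloader = dict(
    collate_fn=dict(type='yolow_collate'),
    dataset=dict(
        type='ConcatDataset',
        datasets=[custom_train_dataset, flickr_train_dataset],
        ignore_keys=['classes', 'palette']))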

wondervictor commented 4 months ago

We did not expect that. The original intention of prompt tuning is to retain the zero-shot capability and generalization while achieving stronger performance on custom datasets.

wondervictor commented 4 months ago

Hi @taofuyu, it seems that the configs in configs/prompt_tuning_coco wrongly use base_lr=2e-3. That's a mistake on my part. For fine-tuning all modules, base_lr should be set to 2e-4. As for training the prompts only, I'm going to check again.
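
That is, in the fine-tuning config (fragment only):

base_lr = 2e-4   # was mistakenly 2e-3 in configs/prompt_tuning_coco
optim_wrapper = dict(optimizer=dict(lr=base_lr))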

taofuyu commented 4 months ago

> Hi @taofuyu, it seems that the configs in configs/prompt_tuning_coco wrongly use base_lr=2e-3. That's a mistake on my part. For fine-tuning all modules, base_lr should be set to 2e-4. As for training the prompts only, I'm going to check again.

Thanks, I had already changed the lr to 2e-4 during my fine-tuning.

Hudaodao99 commented 4 months ago

> @Hudaodao99 It's my fault, I should have started a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters and removes the need for a text encoder.

@wondervictor Hi! I'm not quite sure what the difference is between the goals of all-tuning and prompt-tuning. Can all-tuning achieve open-vocabulary detection and custom detection together, like prompt-tuning? Also, with prompt-tuning, can we generate and export our own custom .npy file?

mio410 commented 4 months ago

@taofuyu Hi, how did the fine-tuning go after you changed the learning rate to 2e-4? Does it solve the problem of losing open-set detection ability after fine-tuning?

xiyangyang99 commented 4 months ago

I have the same problem. I fine-tuned on my own dataset locally (20 classes, each with a different text prompt), hoping to keep the zero-shot ability of the original pre-trained CLIP weights after fine-tuning on my dataset. But the result does not seem to work that way: common prompts such as 'person', 'people', and 'human' can still be detected, but for the categories in my own dataset, other text prompts cannot be detected.

taofuyu commented 3 months ago

@mio410 No. @xiyangyang99 Same question here. @wondervictor Hello, any updates on this issue?

wondervictor commented 3 months ago

Hi @taofuyu, @xiyangyang99, @Hudaodao99, and @mio410, sorry for the delay. I'll check it and provide solutions ASAP. Please stay tuned, and let me know if you have any updates.

Yindong-Zhang commented 3 months ago

Can separate inference solve the problem? It occurs to me that interference between the prompts may be causing the problem. @taofuyu

taofuyu commented 3 months ago

> Can separate inference solve the problem? It occurs to me that interference between the prompts may be causing the problem. @taofuyu

Sorry, could you please explain this in more detail?

Yindong-Zhang commented 3 months ago

One text prompt may interfere with the inference process of another; you can refer to the text-guided CSPLayer in the paper. I would also like to use the prompt-tuning technique and hope to solve this issue. As mentioned in https://github.com/AILab-CVC/YOLO-World/issues/154#issuecomment-2006452067, if separate inference and evaluation works correctly, it may get around the problem.

Yindong-Zhang commented 3 months ago

@taofuyu Any update? In case you didn't notice the answer above.

wondervictor commented 3 months ago

@Yindong-Zhang, ongoing

taofuyu commented 3 months ago

I think just tuning the custom data together with GoldG is fine. The model can then detect custom categories and retain its OVD ability at the same time.

wondervictor commented 3 months ago

Adding VG (or GoldG) for fine-tuning does maintain the zero-shot performance. I'm now looking into more efficient approaches, such as regularization, for efficient fine-tuning.
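
One possible form of such regularization (my sketch, not something currently in the repo) is to penalize drift of the tuned class embeddings away from the pre-trained CLIP embeddings:

import torch

def prompt_regularization(tuned: torch.Tensor,
                          pretrained: torch.Tensor,
                          weight: float = 0.1) -> torch.Tensor:
    """L2 penalty that keeps tuned class embeddings close to the frozen CLIP ones."""
    return weight * (tuned - pretrained).pow(2).sum(dim=-1).mean()

# total_loss = detection_loss + prompt_regularization(model_embeddings, clip_embeddings)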

mandyxiaomeng commented 2 months ago

Hi all, I wonder if there is any update? How can I fine-tune so that the model retains its zero-shot capability and generalization while still achieving stronger performance on my custom dataset?

Thank you!

taofuyu commented 2 months ago

> Hi all, I wonder if there is any update? How can I fine-tune so that the model retains its zero-shot capability and generalization while still achieving stronger performance on my custom dataset?
>
> Thank you!

Tune the custom data together with GoldG.

trihook commented 2 months ago

> Hi all, I wonder if there is any update? How can I fine-tune so that the model retains its zero-shot capability and generalization while still achieving stronger performance on my custom dataset? Thank you!
>
> Tune the custom data together with GoldG.

Does that mean the grounding dataset is the key to building open-vocabulary/zero-shot ability?

taofuyu commented 2 months ago

> Hi all, I wonder if there is any update? How can I fine-tune so that the model retains its zero-shot capability and generalization while still achieving stronger performance on my custom dataset? Thank you!
>
> Tune the custom data together with GoldG.
>
> Does that mean the grounding dataset is the key to building open-vocabulary/zero-shot ability?

Yes, I think so

Ricardoluffy commented 1 week ago

> Hi all, I wonder if there is any update? How can I fine-tune so that the model retains its zero-shot capability and generalization while still achieving stronger performance on my custom dataset? Thank you!
>
> Tune the custom data together with GoldG.

Hi, I'm fine-tuning with COCO + GQA, but I've run into a problem: no matter how I set the parameters, after a few training epochs grad_norm starts to become very large, the loss also becomes very large, and then it stays at 0. Could you advise what might cause this?

Ricardoluffy commented 1 week ago

The config file I am using is as follows:

_base_ = ('../../third_party/mmyolo/configs/yolov8/'
          'yolov8_l_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(imports=['yolo_world'], allow_failed_imports=False)

# hyper-parameters
num_classes = 80
num_training_classes = 80
max_epochs = 30  # Maximum training epochs
close_mosaic_epochs = 30
save_epoch_intervals = 2
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 1e-4
weight_decay = 0.05
train_batch_size_per_gpu = 8

load_from = '/mnt/sdc/lishen/yolo-world-model/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
text_model_name = 'openai/clip-vit-base-patch32'

model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(
            type='HuggingCLIPLanguageBackbone',
            model_name=text_model_name,
            frozen_modules=['all'])),
    neck=dict(
        type='YOLOWorldPAFPN',
        guide_channels=text_channels,
        embed_channels=neck_embed_channels,
        num_heads=neck_num_heads,
        block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
    bbox_head=dict(
        type='YOLOWorldHead',
        head_module=dict(
            type='YOLOWorldHeadModule',
            use_bn_head=True,
            embed_dims=text_channels,
            num_classes=num_training_classes)),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]

train_pipeline = [
    *_base_.pre_transform,
    dict(type='MultiModalMosaic',
         img_scale=_base_.img_scale,
         pad_val=114.0,
         pre_transform=_base_.pre_transform),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_shear_degree=0.0,
        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
        max_aspect_ratio=_base_.max_aspect_ratio,
        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
        border_val=(114, 114, 114)),
    *_base_.last_transform[:-1],
    *text_transform,
]
train_pipeline_stage2 = [*_base_.train_pipeline_stage2[:-1], *text_transform]

mg_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='/mnt/sdc/lishen/Dataset/GQA',
    ann_file='annotations/final_mixed_train_no_coco.json',
    data_prefix=dict(img='images/'),
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=train_pipeline)

coco_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='/mnt/sdc/Datasets/public/COCO',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)

train_dataloader = dict(
    batch_size=train_batch_size_per_gpu,
    collate_fn=dict(type='yolow_collate'),
    dataset=dict(
        _delete_=True,
        type='ConcatDataset',
        datasets=[mg_train_dataset, coco_train_dataset],
        ignore_keys=['classes', 'palette']))

test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                    'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        data_root='/mnt/sdc/Datasets/public/COCO',
        test_mode=True,
        ann_file='annotations/instances_val2017.json',
        data_prefix=dict(img='val2017/'),
        batch_shapes_cfg=None),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=test_pipeline)

val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader

val_evaluator = dict(
    _delete_=True,
    type='mmdet.CocoMetric',
    proposal_nums=(100, 1, 10),
    ann_file='/mnt/sdc/Datasets/public/COCO/annotations/instances_val2017.json',
    metric='bbox')
test_evaluator = val_evaluator

default_hooks = dict(
    param_scheduler=dict(
        scheduler_type='linear',
        lr_factor=0.01,
        max_epochs=max_epochs),
    checkpoint=dict(
        max_keep_ckpts=-1,
        save_best=None,
        interval=save_epoch_intervals))
custom_hooks = [
    dict(
        type='EMAHook',
        ema_type='ExpMomentumEMA',
        momentum=0.0001,
        update_buffers=True,
        strict_load=False,
        priority=49),
    dict(
        type='mmdet.PipelineSwitchHook',
        switch_epoch=max_epochs - close_mosaic_epochs,
        switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(
    max_epochs=max_epochs,
    val_interval=2,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                        _base_.val_interval_stage2)])

optim_wrapper = dict(
    optimizer=dict(
        _delete_=True,
        type='SGD',
        lr=base_lr,
        momentum=0.937,
        nesterov=True,
        weight_decay=weight_decay,
        batch_size_per_gpu=train_batch_size_per_gpu),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.text_model': dict(lr_mult=0.01),
            'logit_scale': dict(weight_decay=0.0)
        }),
    constructor='YOLOWv5OptimizerConstructor')

lvke9529 commented 23 hours ago

@taofuyu

> I think just tuning the custom data together with GoldG is fine. The model can then detect custom categories and retain its OVD ability at the same time.

Hi, do you mean that training the custom dataset with a GoldG-based config is enough, e.g. yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py, without mixing in other datasets such as Flickr or GQA for joint training?