taofuyu opened this issue 8 months ago
Hi @taofuyu, you need to freeze all parameters (backbone, head, and neck) except the embeddings. However, I need to double-check whether all layers are frozen.
OK, I will give it a try and update with the result.
You can evaluate the 4-category detection and 3-category detection separately and then perform the joint evaluation.
But the parameters of the backbone, head, and neck are all frozen, and the only updated parameters, the 'embeddings', are not saved to disk (during inference, the pre-computed embedding file is still used), so it seems nothing in the model actually changes?
This seems to validate my idea. After running 10 epochs, the model can only detect 'car', which appears in the pre-training datasets; the other new categories cannot be detected (they can be detected when the model is not frozen).
@taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear about the config file of all_fine_tuning
You can compare the two config files with VSCode or a similar diff tool. The main difference is the value of freeze_all, True or False.
@Hudaodao99 It's my fault; I should have started a branch to avoid misleading anyone. Prompt tuning only optimizes the embeddings, while all fine-tuning optimizes all parameters without needing a text encoder.
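For later readers, a minimal sketch of that difference, using the freeze_all flag mentioned above; the detector type, embedding path, and other field names are assumptions based on the configs under configs/prompt_tuning_coco, so verify them against your checkout:

```python
# Prompt tuning: only the class embeddings are optimized; backbone, neck and
# head stay frozen. All names below are assumptions taken from the
# prompt-tuning configs, not a verified API.
model = dict(
    type='SimpleYOLOWorldDetector',     # assumed detector type for prompt tuning
    num_train_classes=4,
    num_test_classes=4,
    embedding_path='embeddings/custom_4_classes.npy',  # hypothetical pre-computed text embeddings
    prompt_dim=512,
    num_prompts=4,
    freeze_all=True)                    # prompt tuning

# All fine-tuning: the same config except freeze_all=False, so every module is
# updated and no text encoder is needed during training.
```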
@wondervictor
Thanks for your answer!
@taofuyu I'll check it.
@taofuyu I met the same problem. But during prompt tuning on my custom dataset (10 classes), I found that if I supply fewer than 10 prompt texts, it raises an error like the one below (I passed only 2 prompt texts, neither of which is in my dataset, yet the predicted class indices go beyond 2):

class= [1 2 4 4 3 6]
confidence= [0.97107273 0.90503085 0.8864812 0.86314565 0.32898653 0.20567985]
Traceback (most recent call last):
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 198, in <module>
    inference_detector(runner,
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 108, in inference_detector
    labels = [
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 109, in <listcomp>
    f"{texts[class_id][0]} {confidence:0.2f}" for class_id, confidence in
IndexError: list index out of range

Have you met the same issue?
The detection results still follow the embeddings / num_classes set in the config, while the texts are whatever you type on the command line; if the two counts differ, the dimensions do not match and you get this error. The correct approach is: for whichever classes you want to test, generate a new embedding file for exactly those classes, change num_classes accordingly, and keep both consistent with the texts given on the command line.
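In case it helps, a minimal sketch (not the repo's own script) of generating such an embedding file with the same CLIP text model used in the configs ('openai/clip-vit-base-patch32'); whether the file must be L2-normalized is an assumption worth double-checking:

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPTokenizer

# The exact classes you want to test with; num_classes in the config and the
# texts passed to image_demo.py must match this list.
classes = ['class a', 'class b', 'class c']

tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-base-patch32')
clip = CLIPModel.from_pretrained('openai/clip-vit-base-patch32').eval()

with torch.no_grad():
    tokens = tokenizer(classes, padding=True, return_tensors='pt')
    feats = clip.get_text_features(**tokens)           # (N, 512)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize (assumed to match the released .npy files)

np.save('custom_class_embeddings.npy', feats.numpy())  # point embedding_path at this file
```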
Thanks!
I'm trying to find a way out of this issue, so I've been reading more about OVD algorithms. MM-Grounding-DINO mentions that closed-set fine-tuning will lose OVD generality. Maybe this is the reason my model cannot detect the 3 new classes; I'm not sure. You can take this as a reference. @wondervictor
Furthermore, it mentions that mixing COCO data with some of the pre-training data will "improve performance on the COCO dataset as much as possible without compromising generalization". My experiments confirm this: I mixed Flickr30k/GQA with my custom data to train YOLOWorldDetector, and the model can detect my categories while retaining its OVD ability.
But if so, it means YOLOWorldPromptDetector can only be fine-tuned as a closed-set detector, because grounding data cannot be used when training YOLOWorldPromptDetector.
We did not expect that. The original intention of prompt tuning is to retain the zero-shot capability and generalization while achieving stronger performance on custom datasets.
Hi @taofuyu, it seems that the configs in configs/prompt_tuning_coco wrongly use base_lr=2e-3. It's a mistake I've made. For fine-tuning all modules, the base_lr should be set to 2e-4. As for training prompts only, I'm going to check again.
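A minimal sketch of that fix (assuming the config defines base_lr and feeds it into optim_wrapper, as the example config later in this thread does):

```python
base_lr = 2e-4   # was wrongly 2e-3 in configs/prompt_tuning_coco; use 2e-4 for all-module fine-tuning
optim_wrapper = dict(optimizer=dict(lr=base_lr))  # mmengine merges this over the base optimizer settings
```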
Thanks, I already changed the lr to 2e-4 for my fine-tuning.
@wondervictor Hi! I'm not quite sure what the difference is between the purpose of all-tuning and prompt-tuning. Can all-tuning achieve open-vocabulary detection and custom detection together, like prompt-tuning? Also, through prompt-tuning, can we generate and export our own custom npy file?
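On that last point, the thread does not confirm an export tool, but here is a speculative sketch of pulling the tuned embeddings out of a prompt-tuning checkpoint and saving them as a .npy, assuming they are stored as a parameter in the checkpoint's state_dict (the 'embeddings' key name and paths are assumptions; inspect state.keys() first):

```python
import numpy as np
import torch

ckpt = torch.load('work_dirs/prompt_tuning_custom/epoch_80.pth', map_location='cpu')  # hypothetical path
state = ckpt.get('state_dict', ckpt)

# Find the learned prompt-embedding tensor; the key is an assumption, so check
# state.keys() against your own checkpoint.
key = next(k for k in state if 'embeddings' in k)
np.save('tuned_custom_embeddings.npy', state[key].float().numpy())
```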
@taofuyu Hi, how did the fine-tuning go after you changed the learning rate to 2e-4? Does it solve the problem of losing open-set detection ability after fine-tuning?
I have the same problem. I fine-tuned on my own dataset locally (20 classes, each with a different text prompt). After fine-tuning, I want to keep the zero-shot ability of the original pre-trained CLIP weights, but the results don't seem to work that way. Common prompts such as person, people, and human can all be detected, but for the categories in my own dataset, text prompts other than the ones used in training cannot be detected.
@mio410 No.
@xiyangyang99 Same question.
@wondervictor Hello, any updates on this question?
Hi @taofuyu, @xiyangyang99, @Hudaodao99, and @mio410, sorry for the delay. I'll check it and provide solutions asap. Please stay tuned and please let me know if you have any updates.
Can separate inference solve the problem? It occurs to me that some interference between prompts may cause the problem. @taofuyu
Sorry, could you please explain this in detail?
One text prompt may interfere with the inference of another; you can refer to the text-guided CSPLayer in the paper. I would also like to use the prompt tuning technique and hope to solve this issue, as mentioned in https://github.com/AILab-CVC/YOLO-World/issues/154#issuecomment-2006452067. If separate inference and evaluation is correct, it may sidestep the problem.
@taofuyu Any update? In case you didn't notice the answer above.
@Yindong-Zhang, ongoing
I think just tuning on custom data together with GoldG is fine. The model can detect custom categories and retain OVD ability at the same time.
Adding VG (or GoldG) for fine-tuning does maintain the zero-shot performance. I'm now looking into more efficient approaches, such as regularization, for efficient fine-tuning.
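A rough sketch of that dataset mix, following the pattern of the full config pasted later in this thread (the paths, the class-text json, and train_pipeline are placeholders taken from that config, not verified values):

```python
# Custom detection data with per-class texts.
custom_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5CocoDataset',
                 data_root='data/custom/',                       # placeholder
                 ann_file='annotations/train.json',
                 data_prefix=dict(img='images/'),
                 filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/custom_class_texts.json',        # placeholder
    pipeline=train_pipeline)

# Grounding data (GoldG / GQA / Flickr30K) to keep the open-vocabulary ability.
mg_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/mixed_grounding/',                           # placeholder
    ann_file='annotations/final_mixed_train_no_coco.json',
    data_prefix=dict(img='images/'),
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=train_pipeline)

# Concatenate both for fine-tuning.
train_dataloader = dict(
    dataset=dict(_delete_=True,
                 type='ConcatDataset',
                 datasets=[custom_train_dataset, mg_train_dataset],
                 ignore_keys=['classes', 'palette']))
```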
Hi all, I wonder if there is any update? How may I fine-tune to retain the zero-shot capability and generalization and still to achieve a stronger performance on my custom datasets?
Thank you!
Regarding "tuning custom data with GoldG": does this mean the grounding dataset is key to building open-vocabulary/zero-shot ability?
Yes, I think so
Hello, I used COCO + GQA for fine-tuning, but I ran into a problem: no matter how I set the parameters, after a few training epochs grad_norm becomes very large, the loss also becomes very large, and then the loss stays at 0. Could you tell me what causes this?
The config file I used is as follows:
_base_ = ('../../third_party/mmyolo/configs/yolov8/'
          'yolov8_l_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(imports=['yolo_world'], allow_failed_imports=False)

num_classes = 80
num_training_classes = 80
max_epochs = 30  # Maximum training epochs
close_mosaic_epochs = 30
save_epoch_intervals = 2
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]
base_lr = 1e-4
weight_decay = 0.05
train_batch_size_per_gpu = 8

load_from = '/mnt/sdc/lishen/yolo-world-model/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth'
text_model_name = 'openai/clip-vit-base-patch32'

model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    num_train_classes=num_training_classes,
    num_test_classes=num_classes,
    data_preprocessor=dict(type='YOLOWDetDataPreprocessor'),
    backbone=dict(
        _delete_=True,
        type='MultiModalYOLOBackbone',
        image_model={{_base_.model.backbone}},
        text_model=dict(type='HuggingCLIPLanguageBackbone',
                        model_name=text_model_name,
                        frozen_modules=['all'])),
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
    bbox_head=dict(type='YOLOWorldHead',
                   head_module=dict(type='YOLOWorldHeadModule',
                                    use_bn_head=True,
                                    embed_dims=text_channels,
                                    num_classes=num_training_classes)),
    train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

text_transform = [
    dict(type='RandomLoadText',
         num_neg_samples=(num_classes, num_classes),
         max_num_samples=num_training_classes,
         padding_to_max=True,
         padding_value=''),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
                    'flip_direction', 'texts'))
]
train_pipeline = [
    *_base_.pre_transform,
    dict(type='MultiModalMosaic',
         img_scale=_base_.img_scale,
         pad_val=114.0,
         pre_transform=_base_.pre_transform),
    dict(type='YOLOv5RandomAffine',
         max_rotate_degree=0.0,
         max_shear_degree=0.0,
         scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
         max_aspect_ratio=_base_.max_aspect_ratio,
         border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
         border_val=(114, 114, 114)),
    *_base_.last_transform[:-1],
    *text_transform,
]
train_pipeline_stage2 = [*_base_.train_pipeline_stage2[:-1], *text_transform]

mg_train_dataset = dict(type='YOLOv5MixedGroundingDataset',
                        data_root='/mnt/sdc/lishen/Dataset/GQA',
                        ann_file='annotations/final_mixed_train_no_coco.json',
                        data_prefix=dict(img='images/'),
                        filter_cfg=dict(filter_empty_gt=False, min_size=32),
                        pipeline=train_pipeline)
coco_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5CocoDataset',
                 data_root='/mnt/sdc/Datasets/public/COCO',
                 ann_file='annotations/instances_train2017.json',
                 data_prefix=dict(img='train2017/'),
                 filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=train_pipeline)
train_dataloader = dict(batch_size=train_batch_size_per_gpu,
                        collate_fn=dict(type='yolow_collate'),
                        dataset=dict(_delete_=True,
                                     type='ConcatDataset',
                                     datasets=[mg_train_dataset, coco_train_dataset],
                                     ignore_keys=['classes', 'palette']))

test_pipeline = [
    *_base_.test_pipeline[:-1],
    dict(type='LoadText'),
    dict(type='mmdet.PackDetInputs',
         meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                    'scale_factor', 'pad_param', 'texts'))
]
coco_val_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(type='YOLOv5CocoDataset',
                 data_root='/mnt/sdc/Datasets/public/COCO',
                 test_mode=True,
                 ann_file='annotations/instances_val2017.json',
                 data_prefix=dict(img='val2017/'),
                 batch_shapes_cfg=None),
    class_text_path='data/texts/coco_class_texts.json',
    pipeline=test_pipeline)
val_dataloader = dict(dataset=coco_val_dataset)
test_dataloader = val_dataloader
val_evaluator = dict(_delete_=True,
                     type='mmdet.CocoMetric',
                     proposal_nums=(100, 1, 10),
                     ann_file='/mnt/sdc/Datasets/public/COCO/annotations/instances_val2017.json',
                     metric='bbox')
test_evaluator = val_evaluator

default_hooks = dict(
    param_scheduler=dict(scheduler_type='linear',
                         lr_factor=0.01,
                         max_epochs=max_epochs),
    checkpoint=dict(max_keep_ckpts=-1,
                    save_best=None,
                    interval=save_epoch_intervals))
custom_hooks = [
    dict(type='EMAHook',
         ema_type='ExpMomentumEMA',
         momentum=0.0001,
         update_buffers=True,
         strict_load=False,
         priority=49),
    dict(type='mmdet.PipelineSwitchHook',
         switch_epoch=max_epochs - close_mosaic_epochs,
         switch_pipeline=train_pipeline_stage2)
]
train_cfg = dict(max_epochs=max_epochs,
                 val_interval=2,
                 dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                                     _base_.val_interval_stage2)])

optim_wrapper = dict(optimizer=dict(_delete_=True,
                                    type='SGD',
                                    lr=base_lr,
                                    momentum=0.937,
                                    nesterov=True,
                                    weight_decay=weight_decay,
                                    batch_size_per_gpu=train_batch_size_per_gpu),
                     paramwise_cfg=dict(custom_keys={
                         'backbone.text_model': dict(lr_mult=0.01),
                         'logit_scale': dict(weight_decay=0.0)
                     }),
                     constructor='YOLOWv5OptimizerConstructor')
@taofuyu
I think just tuning on custom data together with GoldG is fine. The model can detect custom categories and retain OVD ability at the same time.
Hello, do you mean that training my custom dataset with a config that already includes GoldG is enough, e.g. yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py, without mixing in other datasets such as Flickr or GQA?
GoldG is just the collective name for those grounding datasets (Flickr30K and the others).
Hello, I used COCO + GQA for fine-tuning, but after a few epochs grad_norm and the loss become very large and then the loss stays at 0. What could be the reason?
The config looks fine to me; I'm not sure about the exact cause.
Hello, do the 'embeddings' mentioned above (the only parameters updated during prompt tuning) refer to the text embeddings? How are they updated? Through the I-Pooling Attention?
@taofuyu Hello, I also want to add the GoldG dataset to keep the zero-shot ability, but I don't know how to set it up. Would you be willing to share your relevant config file? If you can, please send it to this email: mr.pengc@foxmail.com. Thank you very much for sharing.
I ran into the same issue as before: #71, #78. I modified the config in configs/prompt_tuning_coco/ and generated a custom embedding file to fine-tune on my dataset, which has 4 categories. At inference time, I generate a new embedding file with 7 categories (the 4 old classes seen in training and 3 new classes) and replace the old embedding file in the config. These 3 new classes CANNOT be detected, even with the score threshold set to 0.01. It seems the model loses its open-vocabulary/zero-shot ability.