AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Question about one-shot inference with image prompts #374

Open SimonWulp opened 5 months ago

SimonWulp commented 5 months ago

Hello, thanks for your amazing work on the model. I am currently trying to prompt the model in a zero-shot manner with a custom image, but so far I have not been able to get it working. Hopefully you can help me.

I obtain my image embedding using generate_image_prompts.py, but I do not understand how and where to use the resulting embeddings file.
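For reference, this is roughly how I produce the embedding file. It is a simplified sketch of what I assume generate_image_prompts.py does: encode a single crop of the target object with the CLIP ViT-B/32 image encoder (here via open_clip, so the exact preprocessing may differ from the script) and save the L2-normalized embedding as a .npy file:

import numpy as np
import open_clip
import torch
from PIL import Image

# CLIP ViT-B/32 image encoder (my assumption of what generate_image_prompts.py uses).
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
model.eval()

# Encode a single cropped image of the target object (a beagle in my case).
image = preprocess(Image.open('beagle.jpg')).unsqueeze(0)
with torch.no_grad():
    embedding = model.encode_image(image)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)  # L2-normalize

# Saved as (num_prompts, prompt_dim) = (1, 512) and referenced by embedding_path below.
np.save('clip_vit_b32_beagle_embeddings.npy', embedding.cpu().numpy())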

My config file looks like this:

_base_ = ('../../third_party/mmyolo/configs/yolov8/'
          'yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
custom_imports = dict(imports=['yolo_world'], allow_failed_imports=False)

# hyper-parameters
num_classes = 80
num_training_classes = 80
text_channels = 512
neck_embed_channels = [128, 256, _base_.last_stage_out_channels // 2]
neck_num_heads = [4, 8, _base_.last_stage_out_channels // 2 // 32]

# model settings
model = dict(type='SimpleYOLOWorldDetector',
             mm_neck=True,
             num_train_classes=num_training_classes,
             num_test_classes=num_classes,
             embedding_path='/yw_tests/work/YOLO-World/data/embeddings/clip_vit_b32_beagle_embeddings.npy',
             prompt_dim=text_channels,
             num_prompts=1,
             data_preprocessor=dict(type='YOLOv5DetDataPreprocessor'),
             backbone=dict(_delete_=True,
                           type='MultiModalYOLOBackbone',
                           text_model=None,
                           image_model={{_base_.model.backbone}},
                           frozen_stages=4,
                           with_text_model=False),
             neck=dict(type='YOLOWorldPAFPN',
                       freeze_all=True,
                       guide_channels=text_channels,
                       embed_channels=neck_embed_channels,
                       num_heads=neck_num_heads,
                       block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv')),
             bbox_head=dict(type='YOLOWorldHead',
                            head_module=dict(
                                type='YOLOWorldHeadModule',
                                freeze_all=True,
                                use_bn_head=True,
                                embed_dims=text_channels,
                                num_classes=num_training_classes)),
             train_cfg=dict(assigner=dict(num_classes=num_training_classes)))

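# dataset settings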
coco_val_dataset = dict(type='YOLOv5CocoDataset',
                        data_root='data/coco',
                        ann_file='annotations/instances_val2017.json',
                        data_prefix=dict(img='val2017/'),
                        filter_cfg=dict(filter_empty_gt=False, min_size=32),
                        pipeline=_base_.test_pipeline)
test_dataloader = dict(dataset=coco_val_dataset)

Here, clip_vit_b32_beagle_embeddings.npy is the embeddings file obtained from generate_image_prompts.py. When I run inference on a new image with this config, the model fails to make accurate predictions.
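To rule out a shape mismatch, I also check the file against the config values. I am assuming here that SimpleYOLOWorldDetector expects an array of shape (num_prompts, prompt_dim), i.e. (1, 512) with the settings above:

import numpy as np

# embedding_path from the config above; I assume it must match
# (num_prompts, prompt_dim) = (1, 512).
emb = np.load('data/embeddings/clip_vit_b32_beagle_embeddings.npy')
print(emb.shape, emb.dtype)
assert emb.shape == (1, 512), 'embedding shape should match num_prompts x prompt_dim'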

Do I need to use more images when generating the embeddings in generate_image_prompts.py? Or is there some other step I'm missing, such as which weights to use or other settings that should be configured?

Thanks in advance!

wondervictor commented 5 months ago

Hi @SimonWulp, sorry for the late reply and thanks for your interest in YOLO-World! First, you need to change the detector to SimpleYOLOWorldDetector; you can refer to the config prompt_tuning_coco/yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_prompt_tuning_coco.py. Second, due to the weak alignment between the CLIP image encoder and the text encoder, directly using the image encoder to encode image prompts currently tends to give inferior results compared to text prompts. We are working on it. However, you can try increasing the image resolution and making sure each image contains only one object. Using multiple image prompts for one class works better than a single image prompt per class.
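For example, one simple way to combine several image prompts for a single class (just a sketch, not necessarily what generate_image_prompts.py does) is to average the L2-normalized CLIP embeddings of the crops and re-normalize the mean:

import numpy as np

def build_class_prompt(crop_embeddings: np.ndarray) -> np.ndarray:
    """Average L2-normalized CLIP image embeddings of shape (N, 512)
    into a single (1, 512) prototype for one class."""
    crop_embeddings = crop_embeddings / np.linalg.norm(crop_embeddings, axis=-1, keepdims=True)
    prototype = crop_embeddings.mean(axis=0, keepdims=True)
    return prototype / np.linalg.norm(prototype, axis=-1, keepdims=True)

# e.g. stack the embeddings of several object crops, then save the prototype:
# crops = np.stack([...])  # (N, 512), one row per cropped image
# np.save('class_prototype.npy', build_class_prompt(crops))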

vincentlux commented 3 months ago

Hi @wondervictor, thanks a lot for the work! This is amazing. I wonder if there are any plans to improve the performance of image prompts in the near future? I tried using image prompts directly, but the results are not as good as using text as guidance. Do you think there is room for improvement? Thanks!