Open SimonWulp opened 5 months ago
Hi @SimonWulp, sorry for the late reply and thanks for your interest in YOLO-World!
Firstly, you need to change the detector to `SimpleYOLOWorldDetector`; you can refer to the config `prompt_tuning_coco/yolo_world_v2_l_vlpan_bn_2e-4_80e_8gpus_prompt_tuning_coco.py`.
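For reference, a rough sketch of the detector section of such a config is below; the field names (`embedding_path`, `prompt_dim`, `num_prompts`) are assumptions based on the referenced prompt-tuning config and should be verified against the actual file:

```python
# Rough sketch of the detector section, modeled on the prompt-tuning config.
# Field names (embedding_path, prompt_dim, num_prompts) are assumptions and
# should be checked against the referenced config file.
model = dict(
    type='SimpleYOLOWorldDetector',
    mm_neck=True,
    num_train_classes=80,
    num_test_classes=80,
    # pre-computed prompt embeddings (text or image) saved as a .npy array
    embedding_path='embeddings/clip_vit_b32_coco_80_embeddings.npy',
    prompt_dim=512,    # embedding dimension of CLIP ViT-B/32
    num_prompts=80,    # total number of prompt embeddings in the .npy file
    # backbone, neck, bbox_head, etc. stay as in the base YOLO-World config
)
```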
Secondly, due to the weak alignment between the CLIP image encoder and text encoder, directly using the image encoder to encode image prompts currently tends to give inferior results compared to text prompts. We are working on it.
However, you can try increasing the image resolution and making sure each image contains only one object. Using multiple image prompts for one class works better than a single image prompt per class.
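In case it helps, here is a minimal sketch of how several image prompts per class could be encoded and averaged into one embedding with the HuggingFace CLIP vision model and saved as a `.npy` file; this is only an approximation of what `generate_image_prompts.py` does, and the class names and file paths are placeholders:

```python
# Sketch: build one averaged CLIP image embedding per class from several
# cropped object images (each crop containing a single object) and save them
# in a .npy file for use as prompt embeddings.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

model_name = 'openai/clip-vit-base-patch32'
vision_model = CLIPVisionModelWithProjection.from_pretrained(model_name).eval()
processor = CLIPImageProcessor.from_pretrained(model_name)

# several example crops per class (placeholder paths)
class_images = {
    'beagle': ['beagle_1.jpg', 'beagle_2.jpg', 'beagle_3.jpg'],
}

embeddings = []
for name, paths in class_images.items():
    images = [Image.open(p).convert('RGB') for p in paths]
    inputs = processor(images=images, return_tensors='pt')
    with torch.no_grad():
        feats = vision_model(**inputs).image_embeds       # (N, 512)
    feats = feats / feats.norm(dim=-1, keepdim=True)       # L2-normalize each prompt
    class_embed = feats.mean(dim=0)                        # average the prompts
    class_embed = class_embed / class_embed.norm()
    embeddings.append(class_embed.numpy())

np.save('clip_vit_b32_custom_embeddings.npy', np.stack(embeddings))  # (num_classes, 512)
```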
Hi @wondervictor, thanks a lot for the work! This is amazing. I wonder if there are any plans to improve the performance of image prompts in the near future? I tried using image prompts directly, but the results are not as good as using text as guidance. Do you think there is room for improvement? Thanks!
Hello, thanks for your amazing work on the model. I am currently trying to use a custom image to prompt the model in a zero-shot manner, but unfortunately I have not been able to do so successfully. Hopefully you can help me.
I obtain my image embeddings using `generate_image_prompts.py`, but I fail to understand how and where to use this embeddings file. My config file looks like this:

Where `clip_vit_b32_beagle_embeddings.npy` is the embeddings file obtained from `generate_image_prompts.py`. When I test this on a new image, the model fails to make accurate predictions. Do I need to use more images when generating the embeddings in `generate_image_prompts.py`? Or is there some other step I'm missing, such as what weights to use or other settings that should be configured? Thanks in advance!
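A quick sanity check on the saved embeddings file, assuming the detector expects a 2D array of shape `(num_prompts, embedding_dim)` with `embedding_dim=512` for CLIP ViT-B/32, would be:

```python
# Sanity-check the embeddings file produced by generate_image_prompts.py.
# The expected shape, (num_prompts, 512) for CLIP ViT-B/32, is an assumption
# and should match what the detector config expects.
import numpy as np

embeds = np.load('clip_vit_b32_beagle_embeddings.npy')
print(embeds.shape, embeds.dtype)
assert embeds.ndim == 2 and embeds.shape[1] == 512, 'unexpected embedding shape'
```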