AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

How can I use CLIP model to obtain the image and text embeddings? #260

Open zhujiajian98 opened 5 months ago

zhujiajian98 commented 5 months ago

Now I need to use the prompt training scheme to fine-tune YOLO-World. According to docs/prompt_yolo_world.md, I need to extract the image/text embeddings with the CLIP model and save them as .npy files. How can I use the CLIP model to obtain the image and text embeddings?

wondervictor commented 5 months ago

Hi @zhujiajian98, text embeddings can be extracted by generate_text_prompts.py, I'll provide the scripts for image embeddings in a day (before the next Monday).
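For reference, text embeddings like those produced by generate_text_prompts.py can be obtained directly from CLIP. Below is a minimal sketch using the Hugging Face transformers CLIP API; the class names and the output filename are illustrative, and the checkpoint choice (`openai/clip-vit-base-patch32`) is an assumption, not necessarily what the repo's script uses.

```python
# Sketch: extract CLIP text embeddings for a list of category names
# and save them as an .npy file (illustrative, not the repo's script).
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["person", "bicycle", "car"]  # hypothetical category list
inputs = processor(text=class_names, return_tensors="pt", padding=True)
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)

# L2-normalize, since CLIP similarity is computed on unit vectors
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
np.save("text_embeddings.npy", text_embeds.cpu().numpy())
```

Each row of the saved array is one category's embedding (512-dim for this checkpoint), in the same order as the input class list.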

liping-ren commented 5 months ago

Hello, I am interested in using images for fine-tuning and would like to ask whether you can provide the script that generates the image embeddings. Thank you very much.

wondervictor commented 5 months ago

Hi @liping-ren, it has been added to tools/generate_image_prompts.py.
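The image side works analogously to the text side. Here is a minimal sketch with the same transformers CLIP API; the blank placeholder image, output filename, and checkpoint are assumptions for illustration, and tools/generate_image_prompts.py remains the authoritative script.

```python
# Sketch: extract a CLIP image embedding and save it as an .npy file
# (illustrative, not the repo's tools/generate_image_prompts.py).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image; in practice load your exemplar crops with Image.open(...)
image = Image.new("RGB", (224, 224))
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embeds = model.get_image_features(**inputs)

# L2-normalize so image and text embeddings live in the same unit space
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
np.save("image_embeddings.npy", image_embeds.cpu().numpy())
```

The resulting array has one row per input image, matching the dimensionality of the text embeddings above.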