AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

How to detect any object using caption inputs? #334

Open LuletterSoul opened 2 months ago

LuletterSoul commented 2 months ago

Hello, thank you for sharing this excellent work. Currently, the model computes the similarity between text tokens and image features and selects the top-1 match as the class. If I input a single sentence as the text (similar to Grounding DINO), will the model still work correctly? If so, how should it be modified?
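For concreteness, the top-1 selection described above can be sketched roughly as follows. This is a simplified illustration in plain Python, not YOLO-World's actual code: the function names (`l2_normalize`, `top1_class`) and the toy 3-dimensional vectors are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top1_class(region_feature, text_embeddings):
    """Return the text entry whose embedding is most similar to the region feature.

    region_feature: list of floats (one image region; hypothetical toy feature).
    text_embeddings: dict mapping a text prompt to its embedding (same length).
    """
    region = l2_normalize(region_feature)
    scores = {}
    for name, emb in text_embeddings.items():
        text = l2_normalize(emb)
        # Cosine similarity: dot product of the two unit vectors.
        scores[name] = sum(r * t for r, t in zip(region, text))
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings (made up for illustration): one class name and one full caption.
vocabulary = {
    "dog": [1.0, 0.0, 0.0],
    "a dog running on the grass": [0.6, 0.8, 0.0],
}
region_feature = [0.9, 0.1, 0.0]

label, score = top1_class(region_feature, vocabulary)
print(label, round(score, 3))
```

In this sketch a full sentence is just one more vocabulary entry with a single embedding, so a region is scored against the sentence as a whole, not against the individual objects it mentions. That is the behaviour I am asking about.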

wondervictor commented 2 months ago

Please see https://github.com/AILab-CVC/YOLO-World/issues/315#issuecomment-2114844157 for this. Given the many requests for caption input, I'll add it to the demo.