Karine-Huang / T2I-CompBench

[NeurIPS 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
https://arxiv.org/pdf/2307.06350.pdf
MIT License

Other potential detection models? #4

Open yinanyz opened 1 year ago

yinanyz commented 1 year ago

I'm curious about the choice of detection model (i.e., UniDet). How did you choose it, and have you by any chance tried other detection models and compared them with it? Thanks!

yinanyz commented 1 year ago

Another question related to object detection: what if multiple objects are detected in the image? For example, for "a horse to the right of a person", how do you handle the case where there are multiple horses in the image? I saw that in the code you're doing obj1_pos = obj.index(obj1) # position of object 1, so I'm wondering what happens if there are multiple obj1.
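To make the concern concrete, here is a minimal sketch of how Python's built-in list.index behaves when a label occurs more than once. The list name labels and its contents are hypothetical and only stand in for the detected class names; this is not the repository's data.

```python
# Hypothetical list of detected class names (not taken from the repository).
labels = ["horse", "person", "horse"]

# list.index returns only the position of the first match.
obj1_pos = labels.index("horse")
print(obj1_pos)  # 0

# Collecting every position requires an explicit scan.
all_positions = [i for i, name in enumerate(labels) if name == "horse"]
print(all_positions)  # [0, 2]
```

So with a plain index lookup, any second horse in the list is ignored.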

Karine-Huang commented 1 year ago

Thanks for the question!

For Q1: We tried multi-modal models such as miniGPT4, mPlug-Owl, MultiModal-GPT, InternChat, and BLIP, but they may not perform well at spatial understanding. We therefore chose a more accurate and intuitive approach, object detection. UniDet suits the current task because its strong performance on standard object detection benchmarks such as COCO makes it a good fit for tasks that require accurately detecting a wide range of objects. Other object detection methods might also be able to accomplish this task.

For Q2: If multiple objects are detected in an image, we determine a main object and prioritize it over the others. We first detect the objects in the image, rank them by probability, and select the one with the highest probability as the main object. We then compare the spatial position of that main object with the other objects in the image.
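The selection described for Q2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the Detection class, the main_object and is_right_of helpers, the per-label selection, and the center-based "right of" rule are all hypothetical, and the detections are made-up values.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str
    score: float  # detection probability
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def main_object(detections: List[Detection], label: str) -> Optional[Detection]:
    """Return the highest-probability detection with the given label, if any."""
    candidates = [d for d in detections if d.label == label]
    return max(candidates, key=lambda d: d.score) if candidates else None


def center_x(d: Detection) -> float:
    """Horizontal center of the bounding box."""
    return (d.box[0] + d.box[2]) / 2


def is_right_of(a: Detection, b: Detection) -> bool:
    """Assumed rule: a is right of b if its box center has a larger x."""
    return center_x(a) > center_x(b)


# Made-up detections for the prompt "a horse to the right of a person".
detections = [
    Detection("horse", 0.62, (10, 40, 110, 160)),
    Detection("horse", 0.91, (300, 50, 420, 170)),
    Detection("person", 0.88, (150, 30, 210, 180)),
]

horse = main_object(detections, "horse")    # picks the 0.91 horse
person = main_object(detections, "person")
print(is_right_of(horse, person))
```

In this sketch the lower-scoring horse on the left is ignored, so the spatial check depends only on the most confident detection of each class.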