Open yinanyz opened 1 year ago
Another question related to object detection: what if multiple objects of the same class are detected in the image? For example, for "a horse to the right of a person", how do you handle the case where there are multiple horses in the image? I saw that in the code you're doing
obj1_pos = obj.index(obj1)  # position of object 1
so I'm wondering what happens if there are multiple instances of obj1?
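For context on why this matters: Python's `list.index` returns only the position of the first matching element, so any later duplicates are silently ignored. A minimal illustration, where the `labels` list is a made-up example and not data from the repository:

```python
# Hypothetical list of detected class labels (illustrative only).
labels = ["horse", "person", "horse"]

# list.index stops at the first match, so the second horse is never seen.
first_horse = labels.index("horse")

# Collecting every match requires an explicit scan.
all_horses = [i for i, label in enumerate(labels) if label == "horse"]

print(first_horse)  # 0
print(all_horses)   # [0, 2]
```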
Thanks for the question! For Q1: We tried multimodal models such as miniGPT4, mPlug-Owl, MultiModal-GPT, InternChat, and BLIP, and found that they may not perform well at spatial understanding. We therefore chose a more accurate and intuitive approach, object detection. UniDet's strong performance on standard object detection benchmarks such as COCO makes it a suitable choice for tasks that require accurately detecting a wide range of objects. Other object detection methods might also be able to accomplish this task. For Q2: If multiple objects are detected in an image, we determine a main object and prioritize it over the others: we first detect the objects in the image, rank them by probability, and select the object with the highest probability as the main object. We then compare the spatial position of that main object with the other objects in the image.
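The selection step described for Q2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's actual code: the `Detection` class, `main_object`, and `is_right_of` are hypothetical names, the ranking is assumed to happen per class label, and "to the right of" is assumed to mean comparing the horizontal centers of the bounding boxes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Hypothetical container for one detector output (not the UniDet API)."""
    label: str
    score: float                             # detection probability
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def center_x(det: Detection) -> float:
    """Horizontal center of the bounding box."""
    return (det.box[0] + det.box[2]) / 2


def main_object(detections: List[Detection], label: str) -> Optional[Detection]:
    """Pick the highest-probability detection with the given label, if any.

    Assumption: ranking is done among detections of the same class.
    """
    candidates = [d for d in detections if d.label == label]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.score)


def is_right_of(a: Detection, b: Detection) -> bool:
    """Assumed rule: a is right of b if its box center has a larger x."""
    return center_x(a) > center_x(b)


# Made-up detections: two horses and one person.
detections = [
    Detection("horse", 0.62, (10, 40, 110, 140)),
    Detection("horse", 0.91, (300, 50, 420, 160)),
    Detection("person", 0.88, (150, 30, 210, 170)),
]

horse = main_object(detections, "horse")    # the 0.91 horse is chosen
person = main_object(detections, "person")
print(is_right_of(horse, person))           # True
```

With this rule the lower-scoring horse on the left is ignored, so the outcome depends entirely on which instance the detector is most confident about.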
I'm curious about the choice of detection model (i.e., UniDet): how did you choose it, and have you by chance tried other detection models and compared them with it? Thanks!