Karine-Huang / T2I-CompBench

[Neurips 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
https://arxiv.org/pdf/2307.06350.pdf
MIT License

About the GLIP install #3

Open superhero-7 opened 1 year ago

superhero-7 commented 1 year ago

I wonder which PyTorch version should be used. I installed PyTorch 1.12.1 + cu113, but the GLIP install fails with: fatal error: THC/THC.h: No such file or directory. Any suggestions? Thanks in advance.

Karine-Huang commented 1 year ago

Thanks for the feedback! I have updated requirements.txt and removed the GLIP install (-e git+https://github.com/microsoft/GLIP.git@24ec0ddd8c61534ad5b17e4144864df7003dc7ef#egg=maskrcnn_benchmark); please try "pip install -r requirements.txt". By the way, the torch version I use is 2.0.1. Please let me know if there is any problem.
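
A quick sanity check after running "pip install -r requirements.txt" could look like the sketch below (assuming torch 2.0.1 as mentioned above; the expected version string is only illustrative):

```python
# Minimal environment sanity check (assumes the torch==2.0.1 setup described above).
import torch

print(torch.__version__)          # expect something like "2.0.1+cuXXX"
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine
```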

superhero-7 commented 1 year ago

I see. I noticed you do not actually use GLIP, so I gave up on installing it. By the way, PyTorch 1.12.1 also supports BLIP and CLIP. So far I can use your code normally. Thanks for your nice work!

superhero-7 commented 1 year ago

By the way, why not use Grounding DINO as a detector?

Karine-Huang commented 1 year ago

Karine-Huang/T2I-CompBench#4 We tried multi-modal models such as miniGPT4, mPlug-Owl, MultiModal-GPT, InternChat, and BLIP, but they may not perform well in spatial understanding. Therefore, a more accurate and intuitive approach, object detection, was selected. UniDet is well suited to the current task: its strong performance on standard object detection benchmarks such as COCO and PASCAL VOC makes it a good choice when a wide range of objects must be detected accurately. Other object detectors might also be able to accomplish this task.
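
For intuition, here is a hypothetical sketch (not the repository's actual UniDet evaluation code) of how a detection-based spatial check can work: compare the centers of two detected bounding boxes to decide a "left of" relation. The box values and object names are made up for illustration.

```python
# Hypothetical detection-based spatial check (illustrative only, not the
# repo's evaluation code): decide whether object A is to the left of object B
# by comparing the x-coordinates of their bounding-box centers.

def box_center(box):
    """box is (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def is_left_of(box_a, box_b, margin=0.0):
    """True if A's center is left of B's center by at least `margin` pixels."""
    (cx_a, _), (cx_b, _) = box_center(box_a), box_center(box_b)
    return cx_a + margin < cx_b

# Example: a detector returns boxes for "dog" and "cat" in a generated image.
dog_box = (40, 120, 180, 300)
cat_box = (260, 100, 400, 310)
print(is_left_of(dog_box, cat_box))  # True -> "a dog to the left of a cat" is satisfied
```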

superhero-7 commented 1 year ago

Thank you! I'm still curious about recent developments, though: would GroundingDINO yield better results?

Karine-Huang commented 12 months ago

Thanks for the question! The strengths of recent models often lie in their ability to ground visual elements effectively. However, for spatial relationships the key factor is fundamental spatial understanding (such as distinguishing basic positions like left and right). While a powerful detector can contribute to overall results, the primary challenge in spatial relationships still lies in the model's understanding rather than its detection ability. Of course, for more delicate and complex spatial relationships, powerful detectors have the potential to enhance overall performance in visual tasks.