X2FD / LVIS-INSTRUCT4V

MIT License

Nice Work! Question about caption #4

Closed BlueBlueFF closed 1 year ago

BlueBlueFF commented 1 year ago

The LVIS dataset only has category and bbox info. Do you use captions (from COCO) when generating the instructions?

wdrink commented 1 year ago

No. For the conversational question-answer data, we feed GPT-4V only the image (along with a carefully designed prompt). For the high-quality image descriptions, we input both the image and its LVIS box annotations (again with a prompt) to GPT-4V. Please refer to the Appendix of our arXiv paper (https://arxiv.org/abs/2311.07574) for more details.
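
To make the difference between the two modes concrete, here is a minimal sketch of how the inputs might be assembled. It is not the authors' code: `format_boxes`, `build_request`, the returned dict layout, the box text format, and the prompt strings are all hypothetical, and the real prompts are the ones in the paper's Appendix. The only thing taken from LVIS itself is its COCO-style annotation layout (`bbox` as `[x, y, width, height]` plus a `category_id`).

```python
def format_boxes(annotations, categories):
    """Render LVIS-style boxes ([x, y, width, height]) as one text line each.

    The "name: [x1, y1, x2, y2]" output format is an assumption for
    illustration, not the format used in the paper.
    """
    id_to_name = {c["id"]: c["name"] for c in categories}
    lines = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]
        name = id_to_name[ann["category_id"]]
        lines.append(f"{name}: [{x:.0f}, {y:.0f}, {x + w:.0f}, {y + h:.0f}]")
    return "\n".join(lines)


def build_request(image_path, prompt, annotations=None, categories=None):
    """Assemble the inputs for one image (hypothetical structure).

    - Conversation mode: image + prompt only (no annotations passed).
    - Description mode: image + prompt + box annotations as text.
    No captions are used in either mode.
    """
    text = prompt
    if annotations:
        box_text = format_boxes(annotations, categories)
        text = f"{prompt}\n\nObjects (x1, y1, x2, y2):\n{box_text}"
    return {"image": image_path, "text": text}
```

Calling `build_request(path, prompt)` corresponds to the conversational QA setting, while passing `annotations` and `categories` as well corresponds to the detailed-description setting.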