No. For the conversational question-answer data, we feed GPT-4V only the image (along with a carefully designed prompt), while for the high-quality image descriptions, we feed GPT-4V both the image and its box annotations from LVIS (along with a prompt). Please refer to the Appendix of our arXiv paper (https://arxiv.org/abs/2311.07574) for more details.
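For illustration, a minimal sketch of how those two kinds of GPT-4V requests might be assembled is shown below. This is not the authors' actual pipeline: the prompt wording, the `gpt-4-vision-preview` model name, the annotation schema, and the `format_boxes` helper are all assumptions made for the example.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Base64-encode an image file for the GPT-4V API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def format_boxes(annotations: list[dict]) -> str:
    """Render LVIS-style annotations (category + bbox) as plain text.

    Each annotation is assumed to look like
    {"category": "dog", "bbox": [x, y, w, h]} -- a hypothetical schema.
    """
    return "\n".join(f"{a['category']}: {a['bbox']}" for a in annotations)


def ask_gpt4v(image_path: str, prompt: str) -> str:
    """Send one image plus a text prompt to GPT-4V and return the reply."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Conversational QA data: image + prompt only, no annotations.
qa_pairs = ask_gpt4v("example.jpg",
                     "Generate question-answer pairs about this image.")

# Detailed descriptions: image + LVIS box annotations embedded in the prompt.
boxes = [{"category": "dog", "bbox": [12, 34, 120, 200]}]
desc_prompt = (
    "Describe the image in detail. Object annotations (category: [x, y, w, h]):\n"
    + format_boxes(boxes)
)
description = ask_gpt4v("example.jpg", desc_prompt)
```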
The LVIS dataset only has category and bbox info; do you use captions (from COCO) when generating instructions?