Hello, thanks for your great work. May I ask where I can find answers to these questions to evaluate my model? 🤥

Hi there, as described in our paper, our benchmark consists of 420 open-ended questions. To conduct an evaluation, you'll need to use GPT-4V, feeding it an image, a question, and the pair of answers you intend to compare.
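For reference, here is a minimal sketch of what such a pairwise GPT-4V judging call might look like with the OpenAI Python SDK. The model name, prompt wording, and file paths are illustrative assumptions, not the benchmark's official evaluation script:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def judge_pair(image_path: str, question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4V which of two candidate answers better addresses the question.

    The prompt below is a placeholder, not the benchmark's official judging prompt.
    """
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Given the image, which answer is better? Reply with 'A', 'B', or 'Tie', "
        "followed by a brief justification."
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; use whichever GPT-4V endpoint you have access to
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"},
                    },
                ],
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content


# Example usage (hypothetical image and answers):
# verdict = judge_pair("example.jpg", "What is the person holding?",
#                      "A red umbrella.", "A blue backpack.")
# print(verdict)
```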