[Open] UltraEval opened this issue 6 days ago
Hi @UltraEval,
Which model are you trying to reproduce the results on?
Hi @UltraEval,
Here is the prompt that gave us the best performance most of the time:
text_content = f"""{question} Select one option from the provided choices.\n{choices}."""
This allows the model to get better performance on our string-matching eval metric. Please let us know if you have any more doubts.
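For reference, here is a minimal sketch of how such a prompt could be assembled for one question. The build_prompt helper and the lettered option formatting are illustrative assumptions, not our exact evaluation code:

# Minimal sketch (not the exact evaluation code): assemble the prompt described
# above for one multiple-choice question. The lettered formatting of the
# options is an assumption for illustration.
def build_prompt(question: str, choices: list[str]) -> str:
    # e.g. "(A) cat (B) dog (C) bird"
    labels = "ABCDEFGHIJ"
    formatted_choices = " ".join(
        f"({labels[i]}) {choice}" for i, choice in enumerate(choices)
    )
    text_content = f"""{question} Select one option from the provided choices.\n{formatted_choices}."""
    return text_content

# Example usage
print(build_prompt("Which animal can be heard in the clip?", ["cat", "dog", "bird"]))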
Thanks for your reply!
To answer your question, we tested Qwen-Audio-Chat and Qwen2-Audio-Instruct with the prompt
text_content = f"""{question} Select one option from the provided choices.\n{choices}."""
When we tested on Test-mini, Qwen-Audio-Chat scores 36.9%, which is lower than the declared 43.1%. Conversely, Qwen2-Audio-Instruct scores 54.0%, which is higher than the declared 49.20% 😂😂.
Here are the detailed results: qwen_chat_mmau.json, qwen2_mmau.json
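A string-matching check roughly along these lines can be run over those files to verify the numbers. This is a sketch only; the field names "model_prediction" and "answer" are assumptions and may differ from the actual JSON structure:

import json

# Rough sketch of a string-matching accuracy computation over a dumped result
# file. The field names "model_prediction" and "answer" are assumptions;
# adjust them to match the actual JSON layout.
def string_match_accuracy(path: str) -> float:
    with open(path, "r") as f:
        records = json.load(f)
    hits = 0
    for record in records:
        prediction = record["model_prediction"].strip().lower()
        gold = record["answer"].strip().lower()
        # Count a hit if the gold choice appears anywhere in the model output.
        if gold in prediction:
            hits += 1
    return hits / len(records)

print(f"Test-mini accuracy: {string_match_accuracy('qwen2_mmau.json'):.1%}")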
The paper declares the best result for each model with a prompt. Can you release the prompts? We cannot reproduce the experimental results.