OFA-Sys / AIR-Bench

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

Request for Complete Test Script for Qwen2-Audio on AIR Bench #3

Open whwu95 opened 4 weeks ago

whwu95 commented 4 weeks ago

Hi,

I'm currently trying to replicate the performance of Qwen2-Audio on AIR-Bench. However, I noticed that the AIR-Bench repository doesn't provide the complete test script; it only includes the inference script and the GPT-4 evaluation generation script.

Could you please clarify how the scores for the Speech, Sound, Music, and Mixed Audio metrics are obtained? It would be very helpful if you could provide the complete test script for these metrics.

Thank you for your assistance!

qyang1021 commented 4 weeks ago

First, you need to download the evaluation dataset that I made public in the issues. Second, you need to get the outputs of the model under evaluation from the inputs in the dataset (here I recommend running inference with the official Qwen2-Audio repo). Finally, once you have the responses, you can use the code I provided to get the GPT evaluation score.
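Roughly, the loop looks like this (a minimal sketch; the meta-file path and the "path"/"question" field names are assumptions rather than the exact AIR-Bench schema, and run_qwen2_audio is a placeholder for the official Qwen2-Audio inference code):

import json

# Minimal sketch of the three steps above. The meta-file name and field names
# ("path", "question") are assumptions, not the exact AIR-Bench schema, and
# run_qwen2_audio stands in for the official Qwen2-Audio inference code.

def run_qwen2_audio(audio_path, question):
    # Replace with inference from the official Qwen2-Audio repository.
    return "model response placeholder"

with open("Chat/speech_meta.json") as f:  # hypothetical path to the released meta file
    samples = json.load(f)

responses = []
for sample in samples:
    answer = run_qwen2_audio(sample["path"], sample["question"])
    responses.append({**sample, "response": answer})

# Save the responses in whatever format the GPT evaluation script expects.
with open("qwen2_audio_responses.json", "w") as f:
    json.dump(responses, f, ensure_ascii=False, indent=2)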

whwu95 commented 3 weeks ago

Thank you for your reply. Actually, I was able to get the GPT score after modifying some of the request code. However, after completing step 2 in the score_chat.py script, there doesn't seem to be a third step that generates the final scores for the Speech, Sound, Music, and Mixed Audio metrics.

Could you please share the complete script used to obtain these metrics?

whwu95 commented 3 weeks ago

I tried writing the metrics code myself, using Assistant 2's score as the model's score. Below are my results:

-------------------- Evaluation --------------------
speech_dialogue_QA 385 6.916883116883117
speech_QA 387 7.529715762273902
sound_QA 388 6.688144329896907
sound_generation_QA 99 7.151515151515151
music_QA 391 6.57544757033248
music_generation_analysis_QA 99 7.121212121212121
speech_and_sound_QA 193 6.4974093264248705
speech_and_music_QA 196 5.428571428571429
total sample: 2138
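The per-dataset numbers above come from code along these lines (a sketch; the judgment-file name and the "task_name"/"score" fields are my assumptions about the GPT-judge output, not a documented schema):

import json
from collections import defaultdict

# Sketch of the per-dataset averaging. Assumes each judgment stores its two
# scores as a [Assistant 1, Assistant 2] pair, with Assistant 2 being the
# evaluated model.

with open("gpt4_judgments.json") as f:  # hypothetical file name
    judgments = json.load(f)

data = defaultdict(list)
for item in judgments:
    # Assistant 2 is the evaluated model, so its score is kept.
    data[item["task_name"]].append(float(item["score"][1]))

print("-" * 20, "Evaluation", "-" * 20)
for task, scores in sorted(data.items()):
    print(task, len(scores), sum(scores) / len(scores))
print("total sample:", sum(len(v) for v in data.values()))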

Following Table 2 of the paper, I attempted to merge the 8 categories into 4 categories as follows.

merged_data = {
    'speech': data['speech_dialogue_QA'] + data['speech_QA'],
    'sound': data['sound_QA'],
    'music': data['music_QA'],
    'Mixed Audio': data['speech_and_sound_QA'] + data['speech_and_music_QA'],
}
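With data holding the per-dataset score lists from the sketch above, the merged numbers are then the means of the concatenated lists, i.e. sample-weighted averages (a sketch, reusing the merged_data dict above):

print("-" * 20, "Merged Evaluation", "-" * 20)
total_sum, total_n = 0.0, 0
for category, scores in merged_data.items():
    # Averaging the concatenated lists gives a sample-weighted mean,
    # e.g. Speech = (385 * 6.9169 + 387 * 7.5297) / 772 ≈ 7.224.
    print(category, len(scores), sum(scores) / len(scores))
    total_sum += sum(scores)
    total_n += len(scores)
print(total_n, "Sample Average:", total_sum / total_n)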

However, I found that the results do not align with those in the paper, mainly due to a significant difference in the Mixed Audio category (5.95 vs. 6.77). The differences in the other categories are more acceptable.

-------------------- Merged Evaluation --------------------
Speech 772 7.224093264248705
Sound 388 6.688144329896907
Music 391 6.57544757033248
Mixed Audio 389 5.958868894601542
1940 Sample Average: 6.732474226804124
-------------------- Official Qwen2-audio Results --------------------
Speech 800 7.18
Sound 400 6.99
Music 400 6.79
Mixed Audio 400 6.77
2000 Sample Average: 6.93

Could you help me check if there's an issue with the way I merged the categories? Or do you have any suggestions?

qyang1021 commented 3 weeks ago

As you did, the final score is the average over each dataset. Thanks for the tip; I have added a simple summary code. The difference in the Qwen2-Audio scores you measured (mainly Mixed Audio) comes from the performance degradation caused by the model conversion to Hugging Face; in the official Qwen2-Audio GitHub repo you can see, next to the paper's table, a table with the scores after the HF conversion. Of course, this is still different from your scores: you forgot to swap the positions of Assistant 1 and Assistant 2 (as noted in this repo, for fairness) and take the average of the two runs as the final result.

whwu95 commented 3 weeks ago

Thank you for your response. You mentioned: "you forgot to swap the positions of Assistant 1 and Assistant 2 (as noted in this repo, for fairness) and take the average of the two runs as the final result." Are you referring to the GPT judge phase or to the calculation of the final score?

speech: Sum=772, Win_Rate=0.20854922279792745, gpt4_avg_score=8.2279792746114, llm_avg_score=7.224093264248705
sound: Sum=487, Win_Rate=0.19096509240246407, gpt4_avg_score=8.17659137577002, llm_avg_score=6.782340862422998
music: Sum=490, Win_Rate=0.22653061224489796, gpt4_avg_score=8.104081632653061, llm_avg_score=6.685714285714286
speech_and_sound: Sum=193, Win_Rate=0.20207253886010362, gpt4_avg_score=8.44041450777202, llm_avg_score=6.4974093264248705
speech_and_music: Sum=196, Win_Rate=0.1377551020408163, gpt4_avg_score=8.566326530612244, llm_avg_score=5.428571428571429

I used your cal_score.py script, and the results matched my calculations, so I believe the order should be correct. However, in your script, "sound" includes 'sound_generation_QA', and "music" includes 'music_generation_analysis_QA', which seems inconsistent with the paper.

Additionally, does "Mixed Audio" refer to the combination of speech_and_music_QA and speech_and_sound_QA, or just speech_and_sound_QA? The latter seems closer to the "Mixed Audio" results you provided.

qyang1021 commented 3 weeks ago

In the GPT judge phase, the two models' responses should be swapped to eliminate GPT's position bias. At line 24 of my script, 'sound' includes 'sound_QA' and 'sound_generation_QA'; Mixed-Audio is the average of speech_and_music_QA and speech_and_sound_QA.
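A minimal sketch of that swap, assuming Assistant 1 is the reference (GPT-4) answer and Assistant 2 is the evaluated model, with gpt4_judge standing in for the actual request code in score_chat.py:

def gpt4_judge(question, answer_1, answer_2):
    # Placeholder for the GPT-4 request in score_chat.py; assumed to return
    # (score for Assistant 1, score for Assistant 2).
    raise NotImplementedError

def judge_with_swap(question, reference_answer, model_answer):
    # Run 1: model answer in the Assistant 2 position.
    _, model_score_a = gpt4_judge(question, reference_answer, model_answer)
    # Run 2: positions swapped to cancel GPT's position bias.
    model_score_b, _ = gpt4_judge(question, model_answer, reference_answer)
    # The final per-sample score is the average of the two runs.
    return (model_score_a + model_score_b) / 2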
