LiZhangMing opened 5 months ago
Hi, I've encountered this issue as well. I think the authors might be using this benchmark code: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge, but in my testing (fine-tuning with the hyperparameters from the authors' paper), the results differ significantly: LLaMA1-7B scores 3.6 on MT-Bench. I'm not sure where the problem lies.
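For reference, this is roughly how I ran the evaluation with the linked `fastchat/llm_judge` scripts. This is a sketch of the standard MT-Bench workflow; the model path and model ID are placeholders for my fine-tuned checkpoint, and exact flags may differ depending on your FastChat version:

```shell
# From the FastChat repo root, inside fastchat/llm_judge.
# Step 1: generate the model's answers to the 80 MT-Bench questions.
# --model-path points at the fine-tuned checkpoint (placeholder path).
python gen_model_answer.py \
    --model-path /path/to/finetuned-llama1-7b \
    --model-id llama1-7b-finetuned

# Step 2: have the judge model (GPT-4 by default) grade the answers.
# Requires OPENAI_API_KEY to be set in the environment.
python gen_judgment.py \
    --model-list llama1-7b-finetuned

# Step 3: print the aggregated MT-Bench score.
python show_result.py \
    --model-list llama1-7b-finetuned
```

If the authors used a different judge model or prompt version, that alone could shift scores, so it would help to know their exact evaluation setup.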