LiZhangMing opened 5 months ago
Hi, I've encountered this issue as well. I think the authors might be using this benchmark code: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge, but in my testing (fine-tuning with the hyperparameters from the authors' paper), the results differ significantly: LLaMA1-7B scores 3.6 on MT-Bench. I'm not sure where the problem lies.
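For reference, this is roughly how I ran the evaluation with the linked `fastchat/llm_judge` scripts. This is a sketch of the standard MT-Bench workflow; the model path and model ID are placeholders for my fine-tuned checkpoint, and exact flags may differ depending on your FastChat version:

```shell
# From the FastChat repo root, inside fastchat/llm_judge.
# Step 1: generate the model's answers to the 80 MT-Bench questions.
# --model-path points at the fine-tuned checkpoint (placeholder path).
python gen_model_answer.py \
    --model-path /path/to/finetuned-llama1-7b \
    --model-id llama1-7b-finetuned

# Step 2: have the judge model (GPT-4 by default) grade the answers.
# Requires OPENAI_API_KEY to be set in the environment.
python gen_judgment.py \
    --model-list llama1-7b-finetuned

# Step 3: print the aggregated MT-Bench score.
python show_result.py \
    --model-list llama1-7b-finetuned
```

If the authors used a different judge model or prompt version, that alone could shift scores, so it would help to know their exact evaluation setup.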