ashmalvayani opened this issue 8 months ago
Please note that when I changed the model to google/gemma-2b, the accuracy was computed correctly, so I think the problem is specific to a few of the open-source models. Can you point to any corrections in the codebase that would handle them?
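In case it helps triage: a quick way to see why gemma-2b works while some other open-source models fail is to inspect each tokenizer's pad/eos configuration. This is only a diagnostic sketch, not code from this repo, and the Phi-2 and MobiLlama Hub IDs below are my guesses:

```python
# Diagnostic sketch: models whose tokenizer defines an eos token but no pad
# token are the ones that typically break inside generate() when batching.
from transformers import AutoTokenizer

# Hub IDs for Phi-2 and MobiLlama are assumptions; google/gemma-2b is gated,
# so it may require `huggingface-cli login` first.
for model_id in ["google/gemma-2b", "microsoft/phi-2", "MBZUAI/MobiLlama-1B"]:
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    print(model_id, "pad:", tok.pad_token_id, "eos:", tok.eos_token_id)
```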
Thank you for reaching out and for your efforts in reproducing the results.
You can indeed use the lm-evaluation-harness script to evaluate the dataset, but please note that the results may differ from ours. Our custom script is closer to the "HELM" style of evaluation.
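To make the distinction concrete, here is a minimal, hypothetical sketch of the two scoring styles; this is neither our script nor lm-evaluation-harness internals, and gpt2 stands in purely so the snippet runs anywhere. The harness typically ranks the log-likelihood of each answer choice, while a HELM-style evaluation generates free text and parses the answer letter out of it, and the two can disagree on the same model:

```python
# Illustrative contrast only -- hypothetical code, not this repo's script or
# lm-evaluation-harness internals.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
choices = [" A", " B", " C", " D"]

# Style 1 (lm-evaluation-harness-like): pick the choice whose tokens get the
# highest log-likelihood as a continuation of the prompt. (Real harnesses
# handle tokenizer boundary effects more carefully than this.)
def loglikelihood(choice):
    ids = tok(prompt + choice, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    cont = ids[0, n_prompt:]                      # continuation token ids
    return logprobs[0, n_prompt - 1:-1].gather(-1, cont[:, None]).sum().item()

pred_rank = max(choices, key=loglikelihood).strip()

# Style 2 (HELM-like): free-form generation, then parse the answer letter.
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=5,
                     pad_token_id=tok.eos_token_id)  # note the pad fix here too
gen = tok.decode(out[0], skip_special_tokens=True)[len(prompt):]
pred_gen = next((c for c in "ABCD" if c in gen), None)

print(pred_rank, pred_gen)  # the two styles can legitimately disagree
```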
Regarding the evaluation with Jais-13b:
For the errors encountered with models like Phi-2 and MobiLlama-1B:
We appreciate your understanding and patience as we work through these challenges.
Thank you for your response.
Is there a reason for such a difference in accuracy? lm-evaluation-harness is often treated as a reference benchmark, and its ammlu task credits AceGPT's AMMLU dataset as a contributor, so I expected the results to be identical.
Were the numbers quoted in the paper produced with the same evaluation script provided in the eval folder? Do you have the inference scripts for them? It would be great if you could provide those.
The errors are the same as in the screenshot I shared; they are mainly of the form "if eos_token_id is defined, make sure that pad_token_id is also defined". It looks like the issue is in the generate function. People have solved it by passing pad_token_id=tokenizer.eos_token_id to generate(). But since you're using accelerator.unwrap_model(self.model).generate() and the tokenizer isn't passed as an argument, I'm not sure whether that fix should go there. Can you please take a quick look at it and fix it for models like Phi-2, MobiLlama, etc.?
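One possible patch, sketched under the assumption that the model is loaded once before the eval loop; the Hub ID is illustrative and this is not your repo's actual code. Setting pad_token_id on the model's generation config at load time means every later generate() call inherits it, even though the tokenizer isn't in scope at the call site:

```python
# Sketch of a possible fix (not the repo's actual code): give the model a
# pad_token_id once at load time, so generate() stops complaining even when
# the tokenizer is out of scope at the call site.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hub ID; same idea for MobiLlama-1B
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token  # reuse eos as pad for batching
if model.generation_config.pad_token_id is None:
    # eos_token_id can be a list on some models; take the first entry then.
    eos = model.generation_config.eos_token_id
    model.generation_config.pad_token_id = eos[0] if isinstance(eos, list) else eos

# Downstream, the untouched call no longer needs the tokenizer:
#   accelerator.unwrap_model(self.model).generate(**inputs)
```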
Thanks for the great work. I'm trying to reproduce the results and am facing the following errors:
Can I use the lm-evaluation-harness script instead of yours to evaluate the results? When I used lm-harness's ammlu task, I got 34.1 accuracy compared to your 37. What could account for the difference?
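For reference, this is roughly how that harness number can be reproduced via lm-eval's Python API; the ammlu task name, shot count, and model ID are my assumptions and depend on the harness version:

```python
# Assumes lm-eval >= 0.4, where simple_evaluate and the ammlu task group
# exist (CLI equivalent: lm_eval --model hf --tasks ammlu ...).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=FreedomIntelligence/AceGPT-7B",  # stand-in model ID
    tasks=["ammlu"],
    num_fewshot=0,  # assumed shot count; match whatever your script uses
)
print(results["results"])
```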
How can I use this script to evaluate another model?
i. When I changed the model to jais-13b, it gave 0% accuracy on AMMLU (all the responses are empty strings); a standalone check for this is sketched at the end of this post.
ii. On any other model, such as Phi-2 or MobiLlama-1B, I get the following error:
Below are the changes I made to config.yaml:
And in ArabicMMLU_few_shots.sh, I changed the model ID to Phi-2B-base. Can you please tell me how to solve this?
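Here is the standalone jais-13b check mentioned in point i above. The Hub ID and the AMMLU-style prompt are placeholders/assumptions, and this is not your eval code; if this prints a non-empty answer, the empty strings likely come from the harness's prompt or decoding path rather than from the model itself:

```python
# Standalone generation check for jais-13b, outside the eval harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inception-mbzuai/jais-13b"  # assumed Hub ID; may have moved orgs
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"  # placeholder
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8, pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens; slicing the full decoded string by
# the prompt's character length is a common source of accidentally empty answers.
answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(repr(answer))
```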