FreedomIntelligence / AceGPT

Apache License 2.0

Evaluation reproducing issues #12

Open ashmalvayani opened 6 months ago

ashmalvayani commented 6 months ago

Thanks for the great work. I'm trying to reproduce the results and am facing the following issues:

  1. Can I use the lm-evaluation-harness script instead of yours to evaluate the results? When I used lm-harness's ammlu task, I got 34.1 accuracy compared to your 37. What could explain the difference? (A rough sketch of how I invoked the harness is at the end of this comment.)

  2. How can I use this script to evaluate other models? i. When I changed the model to jais-13b, it gave 0% accuracy on Ammlu (all the responses are empty strings). ii. On other models such as Phi-2 and MobiLlama-1B, I get the following error:

    [screenshot: error traceback]

Below are the changes I made to config.yaml:

[screenshot: config.yaml changes]

In ArabicMMLU_few_shots.sh, I changed the model id to Phi-2B-base. Can you please tell me how to resolve this?
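
For reference, here is roughly how I invoked the harness. This is a minimal sketch: the task name ammlu is the one from lm-evaluation-harness, but the model id, few-shot setting, and batch size are placeholders for my actual configuration, and the exact API may differ between lm-eval versions.

```python
# Rough sketch of the lm-evaluation-harness run (lm-eval 0.4-style Python API).
# Model id, num_fewshot, and batch_size are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=FreedomIntelligence/AceGPT-7B,dtype=float16",
    tasks=["ammlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-subject and aggregate accuracies
```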

ashmalvayani commented 6 months ago

Please note that when I changed the model to google/gemma-2b, the accuracy was computed correctly. I think the problem is specific to a few of the open-source models. Can you point to any corrections in the codebase that would handle them?

hhwer commented 6 months ago

Thank you for reaching out and for your efforts in reproducing the results.

  1. You can indeed use the lm-evaluation-harness script to evaluate the dataset, but please note that the results may differ from ours. Our custom script is closer to "HELM".

  2. Regarding the evaluation with Jais-13b:

    • I suspect you are running Jais in fp16. We found that Jais needs to be run in fp32. However, we have recently also hit problems when trying to reproduce our own evaluation results for Jais: even in fp32, it tends to reply with something like "I can't answer" rather than following our prompt format. We suspect this behaviour may be caused by its safety training. (A sketch of loading Jais in fp32 follows after this list.)
    • For reasons we have not yet identified, we have not been able to reproduce our earlier Jais results ourselves.
  3. For the errors encountered with models like Phi-2 and MobiLlama-1B:

    • Could you please share the specific error messages you're encountering? Detailed error reports would greatly help us understand the problem more precisely. Our resources are currently limited, so it may take some time for machines to become idle before we can look into this.
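
A minimal sketch of what we mean by loading Jais in fp32. The checkpoint name core42/jais-13b-chat is an assumption on our side; adjust it to whichever Jais checkpoint you are evaluating.

```python
# Sketch: load Jais in full precision. Jais ships custom modeling code,
# hence trust_remote_code=True; the checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "core42/jais-13b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # fp16 produced empty/degenerate answers in our runs
    device_map="auto",
    trust_remote_code=True,
)
```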

We appreciate your understanding and patience as we work through these challenges.

ashmalvayani commented 6 months ago

Thank you for your response.

  1. Is there a reason for such a difference in accuracy? lm-evaluation-harness is widely regarded as a standard benchmark, and its ammlu task credits AceGPT's ammlu dataset as its source. I expected the results to be identical.

  2. Were the numbers quoted in the paper produced with the same evaluation script provided in the eval folder? Do you have the inference scripts for them? It would be great if you could share those.

  3. The errors are the same as in the screenshot I shared; they are mainly of the form "If 'eos_token_id' is defined, make sure that 'pad_token_id' is also defined". It looks like the issue is with the generate function. Others have solved it by passing pad_token_id=tokenizer.eos_token_id to the generate call. But since you're using accelerator.unwrap(self.model).generate() and the tokenizer isn't passed as an argument, I'm not sure where this should go. Can you please take a quick look and fix it for models like Phi-2, MobiLlama, etc.?

hhwer commented 6 months ago
  1. The difference between harness and HELM can be found here.
  2. There are some mistakes in the numbers in the paper, and we will update them on arXiv after NAACL. The AceGPT results can be reproduced with the script in this repository and summarized with the function. We are still trying to find the script for Jais.
  3. I thought we had defined pad_token_id here; maybe you can try passing pad_token_id=tokenizer.eos_token_id to the generate function.
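
Something like the following, as a quick standalone check outside our evaluation code (a minimal sketch; the model id and prompt are just placeholders):

```python
# Sketch of the workaround: Phi-2 defines eos_token_id but no pad_token_id,
# so we pass one explicitly to generate(). Model id and prompt are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Question: 2 + 2 = ?\nAnswer:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    pad_token_id=tokenizer.eos_token_id,  # avoids the "pad_token_id is not defined" error
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```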
ashmalvayani commented 6 months ago
  1. Yes, although it's defined here, the error still pops up once execution reaches the transformers generation code. The generate function isn't passed the tokenizer as an argument, so it looks like this would require a few changes.
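
One change that might avoid touching the generate call at all: resolve pad_token_id once at load time, where both the model and the tokenizer are in scope, so that accelerator.unwrap(self.model).generate() picks it up from the model's generation config without needing the tokenizer. A minimal sketch under that assumption; the names below are illustrative rather than the repository's actual loading code.

```python
# Sketch: set pad_token_id on the model at load time so generate() never needs the tokenizer.
# Model id is illustrative; the repo's actual loading code may look different.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # or MobiLlama-1B, etc.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding
if model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = tokenizer.eos_token_id  # used by generate()
```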