wang-debug closed this issue 4 days ago
Thank you for your interest in our work!
Yes, in the demo we directly download model results. If you want to generate results for the datasets we used in the paper, please check the instructions for each dataset here: https://huggingface.co/tinyBenchmarks. If you are working with a new dataset, you will have to define your own function and criteria to create y.
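To make the last point concrete: for a custom dataset, y is just the binary correctness vector of your model on the examples, kept in the dataset's canonical order. A minimal sketch, where the predictions and gold answers are hypothetical placeholders and the pass/fail criterion is simple exact match:

```python
import numpy as np

# Hypothetical model predictions and gold answers for a custom dataset,
# kept in the dataset's canonical order.
preds = ["B", "A", "D", "C", "A"]
golds = ["B", "C", "D", "C", "B"]

# y is the binary correctness vector that the tinyBenchmarks
# estimation expects (1.0 = correct, 0.0 = incorrect).
y = np.array([p == g for p, g in zip(preds, golds)], dtype=float)
print(y)  # [1. 0. 1. 1. 0.]
```

Exact match is just one possible criterion; for generative tasks you would swap in whatever scoring rule fits your dataset.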
Hi, thank you very much for your response. I have followed the instructions in your Hugging Face tinyBenchmarks repository. However, during testing, I found that the estimated results differ between the test dataset downloaded directly from the Hugging Face LLM leaderboard (fig. 2) and the locally tested dataset (fig. 1). Should this difference be resolved? I suspect it might be a version issue with the LM-harness.
Hello!!
After open-sourcing our code, we realized that different versions of LM-harness produce different correctness vectors. The vectors are usually similar, though. If you check the "About" tab on the leaderboard webpage, they specify the version they used. You can try using that version. Please let me know if that solves it!
I followed the lm-harness version specified on the LLM leaderboard, but there still seems to be a gap.
Hey @wang-debug,
happy to hear that you are testing out tinyMMLU. :tada:
I think the issue here is that the old lm-harness shuffles the data points, while the latest version does not (see line 213 in the evaluator.py script).
However, the post-hoc calculations in tinyBenchmarks rely on the correct ordering of the data points in your results vector (which explains the extremely bad estimations you got). To obtain the same results as with the huggingface data, you have to order your results vector in the same way as the data points are ordered in tinyMMLU.
Lmk if there is something unclear about this.
Thank you very much for your suggestion. After removing the randomness, the gap has indeed narrowed, but a gap still remains. Is this acceptable?
Hey again,
It is good to see that you got a reduction in error. :)
We looked into this ourselves and found that evaluations from the eval-harness are not entirely deterministic (even though every intuition says they should). We got different results on different hardware using the same code.
Here is some code where you can check it with scores from the MMLU subscenario anatomy:
```python
from datasets import load_dataset
import json

MODEL_NAME = 'UCLA-AGI/zephyr-7b-sft-full-SPIN-iter0'
hf_path, hf_model_name = MODEL_NAME.split('/')
output_file_harness = 'hendrycksTest-anatomy_write_out_info.json'
subscenario_name = 'harness_hendrycksTest_anatomy_5'

# Load the leaderboard results and the local harness scores
lb_results = load_dataset(f'open-llm-leaderboard/details_{hf_path}__{hf_model_name}', subscenario_name)['latest']
lb_acc_norms = [item['acc_norm'] for item in lb_results['metrics']]
with open(output_file_harness, 'r') as file:
    harness_data = json.load(file)
harness_acc_norm = [float(item['acc_norm']) for item in harness_data]

# Compare scores example by example
score_match = [score == lb_acc_norms[i] for i, score in enumerate(harness_acc_norm)]
print(f'{sum(score_match)}/{len(score_match)} scores match')
```
For this specific example, I got 130/135 matching scores.
PS: we now indicate the version of the evaluation harness to use with the tinyBenchmarks on huggingface, thanks to your issue!
Thank you very much for your work! I'd like to ask: how can I evaluate a local Hugging Face model? I see that you directly download each model's results from the Open LLM Leaderboard, but how should I generate the predictions (y) for a local model? Is there a convenient pipeline available?
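For reference, my current understanding from the harness docs is that a local checkpoint can be passed via the `hf` model backend, something like the command below, but I'm not sure this is the intended pipeline (the task name and flags here are my assumption):

```shell
# Assumed lm-eval invocation for a local checkpoint;
# --log_samples writes per-example results needed to build y.
lm_eval --model hf \
  --model_args pretrained=/path/to/local/model \
  --tasks tinyMMLU \
  --log_samples \
  --output_path results/
```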
Looking forward to your reply.