felipemaiapolo / tinyBenchmarks

Evaluating LLMs with fewer examples
MIT License

Local model estimation #2

Closed wang-debug closed 4 days ago

wang-debug commented 4 months ago

Thank you very much for your work! I'd like to ask how I can evaluate a local HuggingFace model. I see that you directly download each model's results on the test data from the Open LLM Leaderboard, but how should I generate the predictions (y) for a local model? Is there a convenient pipeline available?

Looking forward to your reply.

felipemaiapolo commented 4 months ago

Thank you for your interest in our work!

Yes, in the demo we directly download model results. If you want to generate your own results for the datasets we worked with in the paper, please check the instructions for each dataset here: https://huggingface.co/tinyBenchmarks. If you are working with a new dataset, you will have to define your own function and criteria to create y.
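For illustration, creating y could look roughly like the sketch below; the field names, the scoring rule, and the model_answer inference function are placeholders for whatever your task needs, not part of tinyBenchmarks:

import numpy as np

# Hypothetical sketch: build a binary correctness vector y for a custom dataset.
# `examples` is your evaluation set and `model_answer` is whatever inference
# routine you use for your local HuggingFace model; both are placeholders.
def make_correctness_vector(examples, model_answer):
    y = []
    for ex in examples:
        pred = model_answer(ex["question"])            # your own inference call
        y.append(1.0 if pred == ex["answer"] else 0.0) # your own correctness criterion
    return np.array(y)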

wang-debug commented 4 months ago

Hi, thank you very much for your response. I have followed the instructions in your HuggingFace tinyBenchmarks repository. However, during testing, I found that the estimated results differ between the test data downloaded directly from the HuggingFace LLM leaderboard (fig 2) and the data I evaluated locally (fig 1). Is this difference something that should be resolved? I suspect it might be a version issue with the LM-harness.

[Screenshots, 2024-02-28: locally tested results (fig 1) and leaderboard results (fig 2)]
felipemaiapolo commented 4 months ago

Hello!!

After open-sourcing our code, we realized that different versions of the LM-harness produce different correctness vectors. The vectors are usually similar, though. If you check the "About" tab on the leaderboard webpage, they specify the version they used. You can try using that version. Please let me know if that solves the issue!
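As a quick sanity check, something like this can print the harness version installed locally (assuming it is installed under the lm_eval distribution name), so you can compare it with the version listed on the "About" tab:

import importlib.metadata

# Print the locally installed harness version (assumes the package is installed
# under the distribution name "lm_eval"); compare it against the version
# specified on the leaderboard's "About" tab.
print(importlib.metadata.version("lm_eval"))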

wang-debug commented 4 months ago

I followed the lm-harness version specified by the LLM leaderboard, but there still seems to be a gap.

[Screenshot, 2024-02-29: results obtained with the leaderboard's lm-harness version]
LucWeber commented 3 months ago

Hey @wang-debug,

happy to hear that you are testing out tinyMMLU. :tada:

I think the issue here is that the old lm-harness shuffles the data points, while the latest version does not (see line 213 in the evaluator.py script).

However, the post-hoc calculations in tinyBenchmarks rely on the correct ordering of the data points in your results vector (which explains the extremely bad estimates you got). To obtain the same results as with the HuggingFace data, you have to order your results vector the same way the data points are ordered in tinyMMLU.
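For concreteness, the reordering could look roughly like the sketch below; the dataset split and the key names are assumptions about tinyMMLU and about your local results format, so adapt them to your setup:

from datasets import load_dataset

def align_to_tinymmlu(local_results):
    # Reorder per-example results to match tinyMMLU's example order.
    # Assumes each item in `local_results` carries the question text under
    # 'question' and its score under 'acc_norm' (placeholder key names).
    tiny_mmlu = load_dataset('tinyBenchmarks/tinyMMLU', split='test')
    order = {q: i for i, q in enumerate(tiny_mmlu['question'])}
    aligned = sorted(local_results, key=lambda r: order[r['question']])
    return [float(r['acc_norm']) for r in aligned]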

Lmk if there is something unclear about this.

wang-debug commented 3 months ago

Thank you very much for your suggestion. After removing the randomness, the gap has indeed narrowed, but it still exists. Is this acceptable?

[Screenshot, 2024-03-06: remaining gap after removing the shuffling]
LucWeber commented 3 months ago

Hey again,

It is good to see that you got a reduction in error. :)

We looked into this ourselves and found that evaluations from the eval-harness are not entirely deterministic (even though every intuition says they should be). We got different results on different hardware using the same code.

Here is some code where you can check it with scores from the MMLU subscenario anatomy:

from datasets import load_dataset
import json

MODEL_NAME = 'UCLA-AGI/zephyr-7b-sft-full-SPIN-iter0'

hf_path, hf_model_name = MODEL_NAME.split('/')
# Per-example output file written by your local lm-harness run
output_file_harness = 'hendrycksTest-anatomy_write_out_info.json'
subscenario_name = 'harness_hendrycksTest_anatomy_5'

# Load the per-example acc_norm scores published on the Open LLM Leaderboard
lb_results = load_dataset(f'open-llm-leaderboard/details_{hf_path}__{hf_model_name}', subscenario_name)['latest']
lb_acc_norms = [item['acc_norm'] for item in lb_results['metrics']]

# Load the per-example acc_norm scores from the local harness run
with open(output_file_harness, 'r') as file:
    harness_data = json.load(file)
harness_acc_norm = [float(item['acc_norm']) for item in harness_data]

# Compare the two score lists element-wise
score_match = [score == lb_acc_norms[i] for i, score in enumerate(harness_acc_norm)]

print(f'{sum(score_match)}/{len(score_match)} scores match')

For this specific example, I got 130/135 matching scores.

PS: we now indicate the version of the evaluation harness to use with tinyBenchmarks on HuggingFace, thanks to your issue!