bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Add a new dataset Mercury #238

Closed Elfsong closed 1 month ago

Elfsong commented 1 month ago
accelerate launch --main_process_port 30000 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
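
A quick way to sanity check the run afterwards: the aggregated metrics land in the file passed to --metric_output_path, and with --save_generations the raw generations are written to a JSON file as well. The short Python snippet below only loads and pretty-prints that metrics file; the file name comes from the command above, and no particular key layout is assumed.

import json

# Load the metrics file produced by --metric_output_path in the command above.
with open("starcoder2-7b-mercury-result.json") as f:
    results = json.load(f)

# Pretty-print whatever the harness wrote (scores plus the run configuration).
print(json.dumps(results, indent=2))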
Elfsong commented 1 month ago

@SivilTaram FYI

Elfsong commented 1 month ago

@loubnabnl Thank you so much for reviewing this code :)

> did you make sure the current implementation matches the scores reported in your paper for one of the public LLMs?

Yes. The scores reported in our paper are based on this implementation. We are also working on publishing a public leaderboard page.

> can you add some documentation about how to use the benchmark in the docs https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs

Sure. The instructions have been added. See this commit.
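
For readers who want to see roughly what such a task addition looks like without opening the diff: benchmarks in this harness are implemented as Task subclasses and registered under a task name (here mercury, which the --tasks flag above selects). The sketch below is illustrative only. It assumes the bigcode_eval.base.Task interface with get_dataset, get_prompt, get_reference, postprocess_generation, and process_results; the dataset id, field names, stop words, and the scoring stub are placeholders, not the actual Mercury implementation merged in this PR.

from datasets import load_dataset
from bigcode_eval.base import Task  # module path assumed; older releases used lm_eval.base


class Mercury(Task):
    # Hypothetical sketch of a Mercury task; see the merged code for the real one.
    DATASET_PATH = "Elfsong/Mercury"  # Hugging Face dataset id (assumed)

    def __init__(self):
        super().__init__(
            stop_words=["\nclass ", "\ndef ", "\nif __name__"],  # placeholder stop words
            requires_execution=True,  # generated code is executed, hence --allow_code_execution
        )

    def get_dataset(self):
        return load_dataset(self.DATASET_PATH, split="test")

    def get_prompt(self, doc):
        return doc["prompt"]  # placeholder field name

    def get_reference(self, doc):
        return doc["test"]  # placeholder field name

    def postprocess_generation(self, generation, idx):
        # Strip the prompt so only the model's completion is evaluated.
        prompt = self.get_prompt(self.get_dataset()[idx])
        return generation[len(prompt):]

    def process_results(self, generations, references):
        # The real task runs the generations against the tests and reports
        # Mercury's scores; this stub only marks where that logic goes.
        raise NotImplementedError("scoring logic lives in the merged task implementation")

The new class also has to be registered in the harness's task registry (bigcode_eval/tasks/__init__.py) under the name mercury so that --tasks mercury resolves to it.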