bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Add a new dataset Mercury #238

Closed. Elfsong closed this issue 5 months ago.

Elfsong commented 6 months ago
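Example command for evaluating bigcode/starcoder2-7b (loaded in 4-bit) on the new mercury task, generating 5 samples per problem at temperature 0.2, with code execution enabled and both the generations and the metrics saved: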
accelerate launch --main_process_port 30000 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
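For reviewers unfamiliar with the harness layout: new benchmarks are added as a task class under bigcode_eval/tasks/ and exposed through the task registry, so that --tasks mercury resolves to the new class. The snippet below is only a minimal sketch of that pattern, assuming the generic Task interface from bigcode_eval.base; the class name, dataset id, split and field names, and stop words are placeholders, not the actual Mercury implementation in this PR.

# Minimal sketch of a new task plugging into the harness (assumed interface;
# names below are placeholders, not the real Mercury task code).
from bigcode_eval.base import Task


class Mercury(Task):
    # Hugging Face dataset id (placeholder)
    DATASET_PATH = "Elfsong/Mercury"

    def __init__(self):
        # requires_execution=True because scoring runs the generated code
        super().__init__(
            stop_words=["\nclass ", "\ndef ", "\nprint("],
            requires_execution=True,
        )

    def get_dataset(self):
        # Evaluation split; self.dataset is loaded by the base class
        # (split name is a placeholder)
        return self.dataset["test"]

    def get_prompt(self, doc):
        # Text fed to the model for one problem
        return doc["prompt"]

    def get_reference(self, doc):
        # Test cases used to check a generated solution
        return doc["test"]

    def postprocess_generation(self, generation, idx):
        # Keep only the completion after the prompt before execution
        prompt = self.get_prompt(self.get_dataset()[idx])
        return generation[len(prompt):]

    def process_results(self, generations, references):
        # Execute generations against references and return a metric dict
        ...

The class also needs an entry in the registry in bigcode_eval/tasks/__init__.py so the --tasks flag can find it.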
Elfsong commented 6 months ago

@SivilTaram FYI

Elfsong commented 5 months ago

@loubnabnl Thank you so much for reviewing this code :)

> Did you make sure the current implementation matches the scores reported in your paper for one of the public LLMs?

Yes. The scores reported in our paper are based on this implementation. We are also working on publishing a public leaderboard page.

> Can you add some documentation about how to use the benchmark in the docs? https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs

Sure. The instructions have been added. See this commit.