bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Add a new dataset Mercury #238

Closed Elfsong closed 1 month ago

Elfsong commented 1 month ago
accelerate launch --main_process_port 30000 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
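
A quick way to sanity check the run afterwards: the aggregated metrics land in the file passed to --metric_output_path, and with --save_generations the raw generations are written to a JSON file as well. The short Python snippet below only loads and pretty-prints that metrics file; the file name comes from the command above, and no particular key layout is assumed.

import json

# Load the metrics file produced by --metric_output_path in the command above.
with open("starcoder2-7b-mercury-result.json") as f:
    results = json.load(f)

# Pretty-print whatever the harness wrote (scores plus the run configuration).
print(json.dumps(results, indent=2))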
Elfsong commented 1 month ago

@SivilTaram FYI

Elfsong commented 1 month ago

@loubnabnl Thank you so much for reviewing this code :)

> did you make sure the current implementation matches the scores reported in your paper for one of the public LLMs?

Yes. The scores reported in our paper are based on this implementation. We are also working on publishing a public leaderboard page.

> can you add some documentation about how to use the benchmark in the docs https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs

Sure. The instructions have been added. See this commit.
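
For readers who want to see roughly what such a task addition looks like without opening the diff: benchmarks in this harness are implemented as Task subclasses and registered under a task name (here mercury, which the --tasks flag above selects). The sketch below is illustrative only. It assumes the bigcode_eval.base.Task interface with get_dataset, get_prompt, get_reference, postprocess_generation, and process_results; the dataset id, field names, stop words, and the scoring stub are placeholders, not the actual Mercury implementation merged in this PR.

from datasets import load_dataset
from bigcode_eval.base import Task  # module path assumed; older releases used lm_eval.base


class Mercury(Task):
    # Hypothetical sketch of a Mercury task; see the merged code for the real one.
    DATASET_PATH = "Elfsong/Mercury"  # Hugging Face dataset id (assumed)

    def __init__(self):
        super().__init__(
            stop_words=["\nclass ", "\ndef ", "\nif __name__"],  # placeholder stop words
            requires_execution=True,  # generated code is executed, hence --allow_code_execution
        )

    def get_dataset(self):
        return load_dataset(self.DATASET_PATH, split="test")

    def get_prompt(self, doc):
        return doc["prompt"]  # placeholder field name

    def get_reference(self, doc):
        return doc["test"]  # placeholder field name

    def postprocess_generation(self, generation, idx):
        # Strip the prompt so only the model's completion is evaluated.
        prompt = self.get_prompt(self.get_dataset()[idx])
        return generation[len(prompt):]

    def process_results(self, generations, references):
        # The real task runs the generations against the tests and reports
        # Mercury's scores; this stub only marks where that logic goes.
        raise NotImplementedError("scoring logic lives in the merged task implementation")

The new class also has to be registered in the harness's task registry (bigcode_eval/tasks/__init__.py) under the name mercury so that --tasks mercury resolves to it.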