bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Suggest tasks for the Evaluation Harness #16

Closed harm-devries closed 1 year ago

harm-devries commented 2 years ago

Creating an Evaluation Harness for code LLMs

We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!

Existing list

Please take a look at the existing sheet of evaluation benchmarks here.

Contribute

Please use the following template to suggest new tasks for the Evaluation Harness.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| HumanEval | https://github.com/openai/human-eval | 164 | Python | Yes |
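Benchmarks in the HumanEval family are typically scored with the unbiased pass@k estimator introduced in the HumanEval paper (linked above): generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them passed the tests."""
    if n - c < k:
        # Every size-k draw must contain at least one correct sample.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 samples, 1 correct: pass@1 = 0.5
print(pass_at_k(2, 1, 1))
```

The per-problem estimates are then averaged over the benchmark (164 problems for HumanEval).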

Here's the Markdown snippet that you can copy/paste:

|Name|Link|Number of samples|Languages|Available on the HF Hub|
|:-|:-|:-|:-|:-|
| | | | | |
hajipour commented 2 years ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| Python Programming Puzzles (P3) | https://github.com/microsoft/PythonProgrammingPuzzles | 397 | Python | No |
arjunguha commented 2 years ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| MultiPL-E | https://huggingface.co/datasets/nuprl/MultiPL-E | ~2,898 | C++, C#, D, Go, Java, Julia, JavaScript, Lua, PHP, Perl, Ruby, R, Racket, Rust, Scala, Bash, Swift, TypeScript | Yes |
moyix commented 2 years ago

These datasets are aimed at evaluating the security of LLM-generated code, and come from the papers "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" and "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques".

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| AsleepAtTheKeyboard | https://zenodo.org/record/5225651 | 89 | C, Python, Verilog | No |
| SecurityEval | https://github.com/s2e-lab/SecurityEval | 130 | Python | No |
PhungVanDuy commented 2 years ago
This dataset comes from DeepMind's AlphaCode paper and evaluates code generation models on competitive programming problems.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| CodeContests | https://huggingface.co/datasets/deepmind/code_contests | train (13328), dev (117), test (165) | Python, C++ | Yes |
danishcontractor commented 2 years ago

Idea: an eval/experiment that could perhaps be interesting:
(i) Use instruction prompts from https://instructions.apps.allenai.org/
(ii) Use BigCode to generate pseudo code for these instructions in few-shot settings [Q: Could it? -- maybe this needs another eval task]
(iii) Assess if BigCode can outperform existing LLMs in a setup like this [https://arxiv.org/pdf/2210.07128.pdf] -- this paper does not do step (ii) and relies on intermediate representations (e.g. https://aclanthology.org/2021.findings-emnlp.184.pdf)

JeanKaddour commented 1 year ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| DS-1000 (Data Science Code Generation) | https://ds1000-code-gen.github.io/ | 1000 | Python | No |
shrivastavadisha commented 1 year ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| CodeXGLUE | https://github.com/microsoft/CodeXGLUE | 14 datasets for 10 diversified code intelligence tasks (code-code, code-text, text-code, text-text); train/val/test splits of 1k-100k samples each | Java, C++, Python, JavaScript, C, PHP, Ruby, Go, C# | Not sure |