bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Suggest tasks for the Evaluation Harness #16

Closed harm-devries closed 1 year ago

harm-devries commented 2 years ago

Creating an Evaluation Harness for code LLMs

We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!

Existing list

Please take a look at the existing sheet of evaluation benchmarks here.

Contribute

Please use the following template to suggest new tasks for the Evaluation Harness.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| HumanEval | https://github.com/openai/human-eval | 164 | Python | Yes |
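Benchmarks in the HumanEval family are typically scored with the unbiased pass@k estimator introduced in the HumanEval paper (linked above): generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them passed the tests."""
    if n - c < k:
        # Every size-k draw must contain at least one correct sample.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 samples, 1 correct: pass@1 = 0.5
print(pass_at_k(2, 1, 1))
```

The per-problem estimates are then averaged over the benchmark (164 problems for HumanEval).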

Here's the Markdown snippet that you can copy/paste:

|Name|Link|Number of samples|Languages|Available on the HF Hub|
|:-|:-|:-|:-|:-|
| | | | | |
hajipour commented 2 years ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| Python Programming Puzzles (P3) | https://github.com/microsoft/PythonProgrammingPuzzles | 397 | Python | No |
arjunguha commented 2 years ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| MultiPL-E | https://huggingface.co/datasets/nuprl/MultiPL-E | ~2,898 | C++, C#, D, Go, Java, Julia, JavaScript, Lua, PHP, Perl, Ruby, R, Racket, Rust, Scala, Bash, Swift, TypeScript | Yes |
moyix commented 2 years ago

These datasets are aimed at evaluating the security of LLM-generated code, and come from the papers "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" and "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques".

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| AsleepAtTheKeyboard | https://zenodo.org/record/5225651 | 89 | C, Python, Verilog | No |
| SecurityEval | https://github.com/s2e-lab/SecurityEval | 130 | Python | No |
PhungVanDuy commented 2 years ago
This dataset comes from DeepMind's AlphaCode paper and evaluates code generation models on competitive programming problems.

| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| CodeContests | https://huggingface.co/datasets/deepmind/code_contests | train (13328), dev (117), test (165) | Python, C++ | Yes |
danishcontractor commented 2 years ago

Idea: an eval/experiment that could perhaps be interesting:
(i) Use instruction prompts from https://instructions.apps.allenai.org/
(ii) Use BigCode to generate pseudo code for these instructions in few-shot settings [Q: Could it? -- maybe this needs another eval task]
(iii) Assess if BigCode can outperform existing LLMs in a setup like this [https://arxiv.org/pdf/2210.07128.pdf] -- this paper does not do step (ii) and relies on intermediate representations (e.g. https://aclanthology.org/2021.findings-emnlp.184.pdf)

JeanKaddour commented 1 year ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| DS-1000 (Data Science Code Generation) | https://ds1000-code-gen.github.io/ | 1000 | Python | No |
shrivastavadisha commented 1 year ago
| Name | Link | Number of samples | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|
| CodeXGLUE | https://github.com/microsoft/CodeXGLUE | 14 datasets for 10 diversified code intelligence tasks (code-code, code-text, text-code, text-text); train/val/test splits of 1k-100k samples each | Java, C++, Python, JavaScript, C, PHP, Ruby, Go, C# | Not sure |