Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
Python Programming Puzzles (P3) | https://github.com/microsoft/PythonProgrammingPuzzles | 397 | Python | No
Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
MultiPL-E | https://huggingface.co/datasets/nuprl/MultiPL-E | ~2,898 | C++, C#, D, Go, Java, Julia, JavaScript, Lua, PHP, Perl, Ruby, R, Racket, Rust, Scala, Bash, Swift, TypeScript | Yes
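Since MultiPL-E is already on the Hub, it can be pulled straight from the `datasets` library. A minimal sketch, assuming the per-language config naming (e.g. `humaneval-cpp`) and the `prompt` field described on the dataset card:

```python
# Sketch only: the config name "humaneval-cpp" and the "prompt" field are
# assumptions taken from the MultiPL-E dataset card and may need adjusting.
from datasets import load_dataset

multipl_e = load_dataset("nuprl/MultiPL-E", "humaneval-cpp", split="test")
print(len(multipl_e))          # number of C++ problems
print(multipl_e[0]["prompt"])  # signature + docstring the model should complete
```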
These datasets evaluate the security of code generated by an LLM and come from the papers "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" and "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques".
Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
AsleepAtTheKeyboard | https://zenodo.org/record/5225651 | 89 | C, Python, Verilog | No
SecurityEval | https://github.com/s2e-lab/SecurityEval | 130 | Python | No
This dataset comes from DeepMind's AlphaCode paper and is used to evaluate code generation models on competitive programming problems.

Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
CodeContests | https://huggingface.co/datasets/deepmind/code_contests | train (13,328) / dev (117) / test (165) | Python, C++ | Yes
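Since CodeContests is also on the Hub, a minimal loading sketch could look like the following (streaming to avoid downloading the full dataset; the split and field names are assumptions from the dataset card):

```python
# Sketch only: the "valid" split and the "name"/"description" fields are
# assumptions from the deepmind/code_contests dataset card.
from datasets import load_dataset

# Streaming avoids materializing the full multi-GB dataset locally.
contests = load_dataset("deepmind/code_contests", split="valid", streaming=True)

for problem in contests:
    print(problem["name"])
    print(problem["description"][:200])  # first part of the problem statement
    break
```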
Idea: an eval/experiment that could perhaps be interesting:
(i) Use instruction prompts from https://instructions.apps.allenai.org/
(ii) Use BigCode to generate pseudo code for these instructions in a few-shot setting [Q: could it? -- maybe this needs another eval task]
(iii) Assess whether BigCode can outperform existing LLMs in a setup like https://arxiv.org/pdf/2210.07128.pdf -- that paper does not do step (ii) and relies on intermediate representations (e.g. https://aclanthology.org/2021.findings-emnlp.184.pdf).
A rough sketch of step (ii) is shown below.
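Purely illustrative sketch of step (ii) with a Hub checkpoint: the few-shot pairs, the generation settings, and the `bigcode/santacoder` checkpoint are placeholders, not part of the proposal above.

```python
# Hypothetical few-shot prompt turning natural-language instructions into
# pseudo code. Example pairs and the checkpoint are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

FEW_SHOT = """Instruction: Return the largest of two numbers.
Pseudo code:
    if a > b then return a else return b

Instruction: Count the vowels in a string.
Pseudo code:
    set count to 0
    for each character c in the string:
        if c is a vowel then increment count
    return count
"""

def build_prompt(instruction: str) -> str:
    return f"{FEW_SHOT}\nInstruction: {instruction}\nPseudo code:\n"

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", trust_remote_code=True)

inputs = tokenizer(build_prompt("Reverse a linked list."), return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```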
Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
DS-1000 Data Science Code Generation | https://ds1000-code-gen.github.io/ | 1,000 | Python | No
Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
CodeXGLUE | https://github.com/microsoft/CodeXGLUE | 14 datasets for 10 diversified code intelligence tasks (code-code, text-code, code-text, text-text); train/val/test splits for each dataset range from ~1k to ~100k samples | Java, C++, Python, JavaScript, C, PHP, Ruby, Go, C# | Not sure
Creating an Evaluation Harness for code LLMs
We are working on an Evaluation Harness that covers a wide array of coding tasks and programming languages. We'd appreciate your input!
Existing list
Please take a look at the existing sheet of evaluation benchmarks here.
Contribute
Please use the following template to suggest new tasks for the Evaluation Harness.
Here's the Markdown snippet that you can copy/paste (it mirrors the table format used in the entries above):
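```
Name | Link | Number of samples | Languages | Available on the HF Hub
---|---|---|---|---
<dataset name> | <link> | <number of samples> | <languages> | <Yes/No>
```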