RealCode_eval is a benchmark for execution-based evaluation of LLM code generation on real GitHub repositories.
RealCode is a dataset of 219 Python functions\* from 22 GitHub repositories published between June and August 2023. All of these functions are covered by tests in their respective repositories.

\* Our term "function" also includes methods of classes.
The task for a model in RealCode_eval is to write the body of a function declared in a file within one of the repositories. The benchmark supplies the model with the rest of the file or even the entire repository as context. A generation is deemed correct if the number of tests passed with the generated body matches the precomputed number of passed tests for the repository. The Pass@k metric (from the Codex paper) is used for evaluation.
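For reference, here is a minimal Python sketch of the unbiased Pass@k estimator from the Codex paper; in RealCode_eval a sample counts as correct when its generated body passes the precomputed number of tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n -- number of generated samples for a task
    c -- number of correct samples (here: generations whose body passes
         the precomputed number of tests for the repository)
    k -- number of samples drawn
    """
    if n - c < k:
        # every possible draw of k samples contains at least one correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```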
Every repository in RealCode has dependencies and therefore requires a properly configured environment. We use Conda to create a separate environment for each repository.
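Conceptually, evaluation runs each repository's test suite inside that repository's own environment; a rough sketch under assumed names (not the benchmark's actual code) might look like:

```python
import subprocess
from pathlib import Path

def run_repo_tests(env_name: str, repo_path: Path) -> bool:
    """Illustrative only: run pytest inside the repository's dedicated
    Conda environment via `conda run` and report whether the suite passed.
    The real evaluator compares the number of passed tests against a
    precomputed reference count rather than a simple pass/fail."""
    result = subprocess.run(
        ["conda", "run", "-n", env_name, "pytest", "-q"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```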
> [!NOTE]
> These results were obtained on the pre-release version of the dataset, which contained two more functions (221 instead of 219).
LM mode:

| model | size | Pass@1 |
|---|---|---|
| starcoder | 1b | 0.3873 |
| starcoder | 7b | 0.4814 |
| codellama | 7b | 0.4760 |
| codellama | 13b | 0.4841 |
| codellama | 34b | 0.4932 |
| phi1 | 1b | 0.3529 |
| mistral | 7b | 0.4208 |
| deepseek-coder | 1.3b | 0.4144 |
| deepseek-coder | 5.7bmqa | 0.4669 |
| deepseek-coder | 6.7b | 0.4914 |
| deepseek-coder | 33b | 0.4932 |
Infill mode:

| model | size | Pass@1 |
|---|---|---|
| codellama | 7b | 0.4941 |
| codellama | 13b | 0.5339 |
| deepseek-coder | 1.3b | 0.3113 |
| deepseek-coder | 5.7bmqa | 0.5330 |
| deepseek-coder | 6.7b | 0.4832 |
| deepseek-coder | 33b | 0.5484 |
| starcoder | 1b | 0.4506 |
| starcoder | 7b | 0.5149 |
| starcoder | 15b | 0.5248 |
> [!NOTE]
> An "oracle" that takes the maximum Pass@1 for each function across the configurations presented in the LM and Infill tables would score Pass@1 = 0.7085.
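A rough sketch of how such an oracle score could be computed, assuming each run is available as a mapping from task id to per-task Pass@1 (a hypothetical format, not the benchmark's actual output schema):

```python
def oracle_pass_at_1(runs: list[dict[str, float]]) -> float:
    """Best-of-configurations Pass@1: for every task, take the maximum
    Pass@1 achieved across the given runs, then average over tasks."""
    tasks = set.intersection(*(set(run) for run in runs))
    return sum(max(run[t] for run in runs) for t in tasks) / len(tasks)
```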
Repository-level mode\*, 15k tokens in context, 13.5k in left context for infill:

| model | generator_mode | size | Pass@1 |
|---|---|---|---|
| deepseek | lm | 1.3b | 0.5438 |
| deepseek | lm | 5.7bmqa | 0.5601 |
| deepseek | infill | 5.7bmqa | 0.5891 |
| deepseek | lm | 6.7b | 0.5954 |
| deepseek | infill | 6.7b | 0.5809 |

\* Both the files imported by the current file and the files that import the current file are added to the left context.
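For illustration only, a conceptual sketch (assumed names, not the benchmark's actual `import_copy` parser) of how files imported by the target file could be located inside the repository:

```python
import ast
from pathlib import Path

def imported_repo_files(target_file: Path, repo_root: Path) -> list[Path]:
    """Conceptual sketch: resolve `import`/`from ... import` statements in
    the target file to .py files inside the same repository, so their
    contents can be prepended to the left context."""
    tree = ast.parse(target_file.read_text())
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            candidate = repo_root / (module.replace(".", "/") + ".py")
            if candidate.is_file():
                found.append(candidate)
    return found
```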
Prerequisites: Conda and enough free disk space and inodes for the repository environments. Check free space and inodes with:

```bash
df -hi
```
Install the requirements in your main environment:

```bash
pip install -r requirements.txt
```
Clone the repositories and build the environments (takes about an hour):

```bash
cd prepare_data
python run.py
cd ..
```
Check the installation:

```bash
pytest tests/test_evaluator.py
```
> [!NOTE]
> The number of passed tests in the repositories may vary depending on your system. If this test fails on your system, feel free to open an issue; we need your feedback to create a more stable version of the benchmark.
Run the benchmark, for example:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py +model=codeparrot generation_params.max_new_tokens=512 max_context_length=500
```
> [!WARNING]
> Generated code is executed without any isolation! Use at your own risk!
Results are saved to the `./results/` folder.
Other examples:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py +model=codellama size=7b max_context_length=1024

CUDA_VISIBLE_DEVICES=0 python main.py +model=starcoder size=3b generator_mode=infill max_context_length=1024

CUDA_VISIBLE_DEVICES=0 python main.py +model=starcoder size=3b max_context_length=1000 left_context_ratio=3
```
To evaluate a local checkpoint stored at, e.g., `/downloaded/checkpoints/my_ckpt`:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
    +model=local model_base_path=/downloaded/checkpoints model_short_name=my_ckpt max_context_length=1024
```
Repository-level evaluation, as in the repository-level table above:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
    +model=deepseek size=1.3b +context_parser=import_copy \
    generator_mode=infill max_context_length=14000 left_context_ratio=19
```
Sweep over several configurations with Hydra multirun:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
    +model=codellama size=7b,13b generator_mode=lm,infill max_context_length=2048 --multirun
```
See `config/config.yaml` for other options.
The model is loaded with `device_map='auto'`; if you wish to use specific GPUs, set `CUDA_VISIBLE_DEVICES`, as in the examples.

If you find RealCode_eval useful, please consider giving a star to the repositories used for evaluation:

- https://github.com/Jakob-98/openai-functools
- https://github.com/biobootloader/mentat
- https://github.com/causalens/cai-causal-graph
- https://github.com/modelscope/modelscope-agent
- https://github.com/simonmesmith/agentflow
- https://github.com/defog-ai/sql-eval
- https://github.com/Wyvern-AI/wyvern
- https://github.com/danielbeach/tinytimmy
- https://github.com/a-r-r-o-w/stablefused
- https://github.com/langchain-ai/permchain
- https://github.com/NullPyDev/beholder
- https://github.com/opencopilotdev/opencopilot
- https://github.com/AgentOps-AI/agentops
- https://github.com/TengHu/ActionWeaver
- https://github.com/fynnfluegge/doc-comments.ai
- https://github.com/Tinny-Robot/DimSense
- https://github.com/mljar/plotai
- https://github.com/juliendenize/eztorch
- https://github.com/yihong0618/epubhv
- https://github.com/simonw/llm-cluster
- https://github.com/Pennyw0rth/NetExec
- https://github.com/Vaultexe/server