CodeEditorBench

Apache License 2.0


This is the formal repo for paper: "CodeEditorBench: Evaluating Code Editing Capability of Large Language Models"

📢 News: We are currently testing and supplementing new experimental results, and will optimize the entire evaluation process. Stay tuned for updates!


Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, a pioneering evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development.

We curated diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluating 17 LLMs revealed that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem type and prompt sensitivity. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners in the field.


Quick Start

Set Environment

Download Data

Our datasets are available on CodeEditorBench.

To organize the datasets, create a folder named data (mkdir data) and move the downloaded datasets into this data/ folder.

Download Models

Before running inference with open models, make sure you have downloaded all of them from HuggingFace.

We suggest using huggingface-cli to accelerate the download process.

huggingface-cli download --resume-download deepseek-ai/deepseek-coder-33b-instruct --local-dir ./model/deepseek-coder-33b-instruct


We use vLLM for inference with open models. You can simply run bash to perform inference with all the open models we support. Make sure you have created the output folders first.

mkdir -p greedy_result/{code_debug,code_translate,code_polishment,code_switch}
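The brace expansion in the command above is a bash feature and may not work in other shells. As a minimal sketch, the same folder layout could be created from Python. The function name `make_output_dirs` is hypothetical and not part of this repository; only the folder names are taken from the command above.

```python
from pathlib import Path

# Scenario folder names copied from the mkdir command above.
SCENARIOS = ["code_debug", "code_translate", "code_polishment", "code_switch"]


def make_output_dirs(root="greedy_result"):
    """Create one sub-folder per scenario under root and return the paths.

    Hypothetical helper: it only reproduces the effect of `mkdir -p`.
    """
    created = []
    for name in SCENARIOS:
        path = Path(root) / name
        # parents=True and exist_ok=True together behave like `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for folder in make_output_dirs():
        print(folder)
```

Running the helper more than once is safe, since existing folders are left untouched.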

Here is a demo code snippet used to explain the hyperparameters.

python \
    --base_model "$base_model" \
    --dataset "$dataset" \
    --input_data_dir "./data/" \
    --output_data_dir "./greedy_result/" \
    --batch_size 64 \
    --num_of_sequences 1 \
    --num_gpus 8 \
    --prompt_type "zero" \
    --start_idx 0 \
    --end_idx -1
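To make the flags in the demo command easier to reason about, here is a rough sketch of how such arguments could be parsed with `argparse`. It is not the repository's actual code: the function names `build_parser` and `select_range`, the help strings, the defaults, and especially the reading of `--end_idx -1` as "through the last example" are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the demo command."""
    p = argparse.ArgumentParser(description="Sketch of the inference flags")
    p.add_argument("--base_model", required=True, help="model name or local path")
    p.add_argument("--dataset", required=True, help="scenario to run (assumed)")
    p.add_argument("--input_data_dir", default="./data/")
    p.add_argument("--output_data_dir", default="./greedy_result/")
    p.add_argument("--batch_size", type=int, default=64)
    p.add_argument("--num_of_sequences", type=int, default=1)
    p.add_argument("--num_gpus", type=int, default=8)
    p.add_argument("--prompt_type", default="zero")
    p.add_argument("--start_idx", type=int, default=0)
    p.add_argument("--end_idx", type=int, default=-1)
    return p


def select_range(items, start_idx, end_idx):
    """Slice the examples to process.

    Assumption: end_idx == -1 means "through the last item".
    """
    if end_idx == -1:
        return items[start_idx:]
    return items[start_idx:end_idx]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Under these assumptions, `--start_idx` and `--end_idx` would let a long run be split into chunks or resumed partway through a dataset.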

Remember that to fully understand these hyperparameters, you should consult the source code of

Filter Result

We provide an initial filtering script. The raw outputs of code LLMs are usually not pure code, but our OJ system requires pure code for evaluation. Because output formats differ from model to model, filtering is quite challenging, and this script is only intended for the models we evaluate (or models of the same series). We will further improve the script's extensibility in the future.

You can use the inference script as follows:


Please note that the paths of the files to be processed are hard-coded on line 237 of the file, and the path of the output directory on line 243. If your files are in other paths, you need to modify them accordingly.
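To illustrate the kind of filtering this section describes, the sketch below pulls the first Markdown-fenced code block out of a model response and falls back to the raw text when no fence is present. This is an assumption about one common output format, not the logic of the provided filtering script; the names `FENCE`, `FENCE_RE`, and `extract_code` are hypothetical.

```python
import re

# Three backticks, built programmatically so this example stays readable.
FENCE = "`" * 3

# A fenced block: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(FENCE + r"[^\n`]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(model_output):
    """Return the first fenced code block, or the stripped text if none."""
    match = FENCE_RE.search(model_output)
    if match:
        return match.group(1).strip()
    # No fence found: assume the whole response is already code.
    return model_output.strip()


if __name__ == "__main__":
    reply = "Here is the fix:\n" + FENCE + "python\nprint(1)\n" + FENCE + "\nDone."
    print(extract_code(reply))
```

Real model outputs vary more than this (several blocks, unclosed fences, explanations mixed into the code), which is why a per-model filter such as the one shipped here is needed.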


Evaluation is performed within Docker. To perform evaluation on CodeEditorBench, please refer to Evaluation for more details.

The evaluation module is built on secondary development of HUSTOJ, so the content within the evaluation module adheres to the GPL-2.0 license.


We propose evaluating LLMs across four scenarios that capture different code editing capabilities: code debug, code translate, code polish, and code requirement switch. The figure depicts model performance across the four scenarios available in CodeEditorBench_Plus in a radial plot, highlighting how relative differences between models change across the scenarios. We also report the zero-shot performance of open-source and closed-source models on CodeEditorBench_Plus, measured by win_rate.


      title={CodeEditorBench: Evaluating Code Editing Capability of Large Language Models}, 
      author={Jiawei Guo and Ziming Li and Xueling Liu and Kaijing Ma and Tianyu Zheng and Zhouliang Yu and Ding Pan and Yizhi LI and Ruibo Liu and Yue Wang and Shuyue Guo and Xingwei Qu and Xiang Yue and Ge Zhang and Wenhu Chen and Jie Fu},
