The Zero-Shot Replication Framework is a tool designed to replicate zero-shot results from recent academic papers or model reports. Additionally, it aims to extend evaluations to better understand the strengths and weaknesses of various approaches. The framework currently supports OpenAI, Anthropic, and HuggingFace models.
To better understand these results, please check the notes below.
| Category | gpt-3.5-turbo-0301 | gpt-3.5-turbo-0613 | claude-2 | gpt-4-0314 | gpt-4-0613 | gpt-4 Baseline | Sources |
|---|---|---|---|---|---|---|---|
| **Standard Bench** | | | | | | | |
| HumanEval | 67.0 | 61.5 | 65.2 | 86.0 | 84.1 | 67.0 | [1] |
| HumanEval+ | 59.1 | 54.2 | 54.9 | 80.5 | 74.4 | N/A | |
| MATH | 35.4 | 37.2 | 17.6 | 51.6 | 50.3 | 42.2 | [3] |
| **LeetCodeSparks** | | | | | | | [1,2] |
| Easy | 60.0 | 76.2 | 52.4 | 76.2 | 61.2 | 68.2-75.6 | [1,2]* |
| Medium | 15.0 | 22.0 | 9.8 | 19.5 | 31.7 | 26.7-40.0 | [1,2]* |
| Hard | 0.0 | 0.0 | 0.0 | 4.6 | 13.6 | 6.6-10.7 | [1,2]* |
| **LeetCode100** | | | | | | | |
| Easy | 83.0 | 80.0 | 73.0 | 91.0 | 88.0 | N/A | |
| Medium | 16.0 | 16.0 | 16.0 | 26.0 | 21.0 | N/A | |
| Hard | 1.0 | 3.0 | 2.0 | 6.0 | 6.0 | N/A | |
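Assuming the HumanEval-style scores above are pass@1 percentages, the standard way to compute pass@k from sampled completions is the unbiased estimator introduced with the HumanEval benchmark. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn per problem and c of them
    passed all unit tests."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 5 pass, pass@1 equals the raw pass rate:
print(pass_at_k(10, 5, 1))  # 0.5
```

Averaging this estimator over all problems in a dataset yields the reported benchmark score.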
| Category | code-llama-34b | wizard-coder-34b | phind-v2-34b |
|---|---|---|---|
| **Standard Bench** | | | |
| HumanEval | 56.7 | 69.5 | 75.0 |
| HumanEval+ | 48.2 | 60.3 | 70.1 |
| **LeetCodeSparks** | | | |
| Easy | 33.3 | 42.9 | 52.4 |
| Medium | 2.4 | 12.2 | 7.3 |
| Hard | 0.0 | 0.0 | 0.0 |
| **LeetCode100** | | | |
| Easy | 53.0 | 68.0 | 63.0 |
| Medium | 3.0 | 9.0 | 5.0 |
| Hard | 0.0 | 0.0 | 3.0 |
```bash
# Repository setup
git clone https://github.com/your-username/zero-shot-replication.git
cd zero-shot-replication
git submodule update --init --recursive

# Install dependencies
poetry install
```
The available optional extras are:

- `vllm_support`: For vLLM functionality, required for the WizardCoder model.
- `automata`: For Automata agent evaluations.
- `python-leetcode`: For LeetCode evaluations.
- `evalplus`: For HumanEval and HumanEval+ evaluations.
- `quantized_support`: For running 4-bit or 8-bit models.

Note: setting `torch==2.0.1` sometimes causes issues with CUDA environment initialization on remote machines. One workaround is to first install `torch==2.0.0` (which requires commenting out the vllm dependency), and to then increment the torch version and uncomment vllm. This may resolve some user issues.
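The torch workaround described above can be sketched as follows (a rough sequence, assuming a standard Poetry setup; the vllm entry in `pyproject.toml` is edited by hand between steps):

```shell
# 1. Comment out the vllm dependency in pyproject.toml, then pin torch down:
poetry add torch==2.0.0

# 2. Restore (uncomment) the vllm dependency in pyproject.toml,
#    then bump torch back to the required version:
poetry add torch==2.0.1
poetry install
```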
For additional features, you can install the optional dependencies:

```bash
poetry install -E <extra_name>
```

- `vllm_support`: transformers must currently be installed from git by hand.
- `automata`
- `python-leetcode`
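For example, to enable WizardCoder support via vLLM (extra name taken from the list above):

```shell
# Install the project together with the vllm_support extra
poetry install -E vllm_support
```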
You can run the zero-shot replication by executing the `runner.py` file with various command-line arguments:

```bash
poetry run python runner.py --provider openai --dataset human-eval --model gpt-4-0613 --temperature 0.7
```
- `--provider`: Which provider to use for zero-shot completions (default: "openai").
- `--dataset`: Which dataset to run on (default: "human-eval").
- `--model`: Model name to load from the provider (default: "gpt-3.5-turbo").
- `--temperature`: Temperature parameter for the provided model (default: 0.7).
- `--output_file_name`: Filename to override the default output file name.

To see the explicit commands used to generate the reported results, check out `commands.md`.
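The flag names and defaults above can be mirrored with a standard `argparse` parser. This is a sketch of how such a CLI could be defined, not the actual `runner.py` implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults taken from the README; the parser
    # structure itself is an assumption, not the project's code.
    parser = argparse.ArgumentParser(description="Zero-shot replication runner")
    parser.add_argument("--provider", default="openai",
                        help="Provider to use for zero-shot completions")
    parser.add_argument("--dataset", default="human-eval",
                        help="Dataset to run on")
    parser.add_argument("--model", default="gpt-3.5-turbo",
                        help="Model name to load from the provider")
    parser.add_argument("--temperature", type=float, default=0.7,
                        help="Temperature parameter for the provided model")
    parser.add_argument("--output_file_name", default=None,
                        help="Override the default output file name")
    return parser

# Parse a sample invocation; unspecified flags fall back to their defaults.
args = build_parser().parse_args(["--provider", "openai", "--model", "gpt-4-0613"])
print(args.model, args.dataset, args.temperature)
```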
This project is licensed under the Apache-2.0 License.