Hi, we have just uploaded HumanEval+ to Hugging Face: https://huggingface.co/datasets/evalplus/humanevalplus in a format that is 100% compatible with the original HumanEval. I am thinking of having it supported in bigcode-evaluation-harness as well, given the HumanEval-compatible format.
```python
# Load HumanEval+ from the Hugging Face Hub
from datasets import load_dataset

dataset = load_dataset("evalplus/humanevalplus")
```
We also manually tested the validity and got exactly the same scores as running evalplus directly.
A few notes regarding integration:
- Increase the timeouts a bit, since there are more tests: empirically I can reproduce the results with a maximum of 8-10s of testing time per task.
- The testing code requires numpy as an additional package in the execution environment.
- We are also going to support MBPP+ in a compatible format soon.
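To illustrate the timeout and numpy notes above, here is a minimal sketch of per-task execution; `run_task` and its wiring are hypothetical, not the harness's actual API:

```python
import os
import subprocess
import sys
import tempfile

# Hedged sketch (not the harness's real implementation): run one candidate
# solution against its HumanEval+ test suite in a fresh subprocess with a
# generous per-task timeout. HumanEval+ has many more tests than HumanEval,
# hence 8-10s per task rather than the usual few seconds; the subprocess
# environment must also have numpy installed, since some tests import it.
def run_task(solution_code: str, test_code: str, entry_point: str,
             timeout: float = 10.0) -> bool:
    # HumanEval-style harness: the test code defines check(candidate).
    program = "\n".join([solution_code, test_code, f"check({entry_point})"])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # treat a timed-out task as failed
    finally:
        os.unlink(path)
```

For a real run, `solution_code`, `test_code`, and `entry_point` would come from the dataset fields of each HumanEval+ task plus the model's completion.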