Hi, we have just uploaded HumanEval+ to Hugging Face: https://huggingface.co/datasets/evalplus/humanevalplus in a format that is 100% compatible with the original HumanEval. I am thinking of having it supported in bigcode-evaluation-harness as well, given the HumanEval-compatible format.
```python
# Load HumanEval+ from the Hugging Face Hub
from datasets import load_dataset

dataset = load_dataset("evalplus/humanevalplus")
```
We also manually tested the validity and got exactly the same scores as running evalplus directly.
A few notes regarding integration:
- Increase the timeouts a bit, since there are more tests: empirically I can reproduce the results with a maximum of 8-10s of testing time per task.
- The testing code requires numpy as an additional package in the execution environment.
- We are also going to support MBPP+ in a compatible format soon.
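To illustrate the timeout and numpy notes above, here is a minimal sketch of per-task execution; `run_task` and its wiring are hypothetical, not the harness's actual API:

```python
import os
import subprocess
import sys
import tempfile

# Hedged sketch (not the harness's real implementation): run one candidate
# solution against its HumanEval+ test suite in a fresh subprocess with a
# generous per-task timeout. HumanEval+ has many more tests than HumanEval,
# hence 8-10s per task rather than the usual few seconds; the subprocess
# environment must also have numpy installed, since some tests import it.
def run_task(solution_code: str, test_code: str, entry_point: str,
             timeout: float = 10.0) -> bool:
    # HumanEval-style harness: the test code defines check(candidate).
    program = "\n".join([solution_code, test_code, f"check({entry_point})"])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # treat a timed-out task as failed
    finally:
        os.unlink(path)
```

For a real run, `solution_code`, `test_code`, and `entry_point` would come from the dataset fields of each HumanEval+ task plus the model's completion.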