Hi, to use an instruction version of the HumanEval prompt, you can use the HumanEvalSynthesize task (the one used for instruction models in the leaderboard when evaluating on Python), for example:
accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalsynthesizepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
To change how the instruction prompt is built, you can update the --prompt argument; check the code for the list of options (i.e. the transformations we apply to HumanEval prompts to make them instruction-friendly).
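For intuition, here is a rough sketch of what such a transformation looks like. This is not the actual harness code and the function and option names are hypothetical; it only illustrates how a prompt style can wrap the original HumanEval problem into an instruction-style prompt:

# Hypothetical sketch (not the actual harness code) of how a --prompt option
# could wrap a HumanEval problem into an instruction-style prompt.

def build_prompt(instruction: str, context: str, prompt_style: str) -> str:
    """Combine a natural-language instruction with the code context
    (function signature) according to the chosen prompt style."""
    if prompt_style == "octocoder":
        # Question/Answer chat format used for OctoCoder-style models
        return f"Question: {instruction}\n\nAnswer:\n{context}"
    if prompt_style == "instruct":
        # plain instruction followed by the code to complete
        return f"{instruction}\n\n{context}"
    raise ValueError(f"Unknown prompt style: {prompt_style}")

In the harness the style is selected with the --prompt flag, as in the command above.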
Thanks @loubnabnl, do we have to specify an instruction token for this task? Much appreciated.
Also, would you mind sharing the exact settings to replicate the CodeLlama-Instruct performance? Thank you so much.
If your model uses different tokens, you'll need to build a new prompt and update the code. See this PR, which added the CodeLlama prompt: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/130/files
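As a rough idea of what that involves (a hedged sketch, not the exact diff of the PR above), supporting a model like CodeLlama-Instruct mostly means adding a template that wraps the instruction in its chat tokens; check the PR for the exact spacing and template used:

# Sketch only: a prompt template for a model whose chat format uses different
# tokens, e.g. CodeLlama-Instruct's [INST] ... [/INST] wrapper.

def build_codellama_prompt(instruction: str, context: str) -> str:
    """Wrap a HumanEval instruction in CodeLlama-Instruct chat tokens."""
    # The instruction goes between [INST] and [/INST]; the code context
    # (function signature to complete) follows the closing tag.
    return f"[INST] {instruction} [/INST] {context}"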
Ah, I just want to replicate the performance of codellama-instruct on the HF leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard. Do you know what config/args the evaluation was run with? Also, is the reported number for "humanevalsynthesize-python"? Thanks.
You can use:
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt codellama \
--save_generations_path generations_humanevalsynthesizepython_codellama.json \
--metric_output_path evaluation_humanevalsynthesizepython_codellama.json \
--max_length_generation 2048 \
--precision fp16
Thank you, another minor detail: on the leaderboard HF says this is the setting they use, "All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50.", whereas here you use greedy decoding, max_length 2048, and fp16; is this correct? Much appreciated.
The displayed models were indeed evaluated with that setting, but we've found greedy decoding to give results close to top-p sampling with 50 samples, so you can use greedy to speed up the evaluation. Note that HumanEvalSynthesize requires a sequence length of 2048, not 512.
In the https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard evaluation, the model submission form has an option that differentiates base vs. instruction-tuned models. Is there such an option here in bigcode-evaluation-harness? Am I correct to assume that instruction-tuned models need to be evaluated with an instruction prompt? Much appreciated.