bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Evaluation of instruct model #158

Closed phqtuyen closed 7 months ago

phqtuyen commented 8 months ago

In the https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard model submission form, there is an option that differentiates base vs. instruction-tuned models. Is there such an option here in bigcode-evaluation-harness? Am I correct to assume that instruction-tuned models need to be evaluated with an instruction prompt? Much appreciated.

loubnabnl commented 8 months ago

Hi, to use an instruction version of the HumanEval prompts, you can use the HumanEvalSynthesize task (the one used for instruction models in the leaderboard when evaluating on Python). For example:

accelerate launch main.py \
--model bigcode/octocoder  \
--tasks humanevalsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalsynthesizepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16

To change how the instruction prompt is built, you can update the --prompt argument; check the code for the list of options (i.e., the transformations we apply to the HumanEval prompts to make them instruction-friendly). A simplified sketch of the idea is below.
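Roughly speaking, each --prompt option wraps the problem's natural-language instruction in a template the model was instruction-tuned on. Here is a minimal, hypothetical Python sketch of that idea (simplified templates and a made-up helper name, not the harness's actual code, which lives in the task file):

# Hypothetical, simplified sketch of what a --prompt option does: wrap the
# instruction of a HumanEval problem in a template that an instruction-tuned
# model expects. The real templates live in the task code.

def build_instruction_prompt(instruction: str, context: str, style: str) -> str:
    """instruction: natural-language task description; context: imports plus the function signature."""
    if style == "octocoder":
        # Approximate OctoCoder-style Question/Answer template
        return f"Question: {instruction}\n\nAnswer:\n{context}"
    if style == "continue":
        # No wrapping: behaves like the original completion-style HumanEval
        return context
    raise ValueError(f"Unknown prompt style: {style}")


print(build_instruction_prompt(
    instruction="Write a function that returns the sum of a list of numbers.",
    context="def sum_list(numbers):\n",
    style="octocoder",
))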

phqtuyen commented 8 months ago

Thanks @loubnabnl, do we have to specify an instruction token for this task? Much appreciated.

phqtuyen commented 8 months ago

Also, do you mind sharing the exact settings to replicate the CodeLlama-Instruct performance? Thank you so much.

loubnabnl commented 8 months ago

If your model uses different tokens, you'll need to build a new prompt and update the code. See this PR for adding the codellama prompt: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/130/files
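In rough terms (a hypothetical sketch, not the actual diff from that PR), such a change adds a branch that wraps the instruction in the model's own chat tokens:

# Rough, hypothetical sketch of the kind of change made when adding a new
# prompt style: a branch that wraps the instruction in the model's own chat
# tokens. CodeLlama-Instruct, for instance, expects an [INST] ... [/INST] turn.

def build_prompt(instruction: str, context: str, style: str) -> str:
    if style == "codellama":
        # Approximate CodeLlama-Instruct chat format
        return f"[INST] {instruction.strip()} [/INST] {context}"
    if style == "my-chat-model":  # hypothetical model with its own special tokens
        return f"<|user|>\n{instruction.strip()}\n<|assistant|>\n{context}"
    raise ValueError(f"Unknown prompt style: {style}")


print(build_prompt(
    instruction="Write a function that reverses a string.",
    context="def reverse(s):\n",
    style="codellama",
))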

phqtuyen commented 8 months ago

Ah, I just want to replicate the performance of CodeLlama-Instruct on the HF leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard. Do you know what config/args the evaluation was run with? Also, is the reported number for "humanevalsynthesize-python"? Thanks.

loubnabnl commented 8 months ago

You can use:

accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf  \
--tasks humanevalsynthesize-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt codellama \
--save_generations_path generations_humanevalsynthesizepython_codellama.json \
--metric_output_path evaluation_humanevalsynthesizepython_codellama.json \
--max_length_generation 2048 \
--precision fp16

phqtuyen commented 8 months ago

Thank you. One more minor detail: on the leaderboard, HF says this is the setting they use: "All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50." Here you use greedy decoding and fp16 instead; is this correct? Much appreciated.

loubnabnl commented 8 months ago

The displayed models were indeed evaluated with that setting, but we've found greedy decoding to give results close to top-p sampling with 50 samples, so you can use greedy to speed up the evaluation. Note that HumanEvalSynthesize requires a sequence length of 2048, not 512.
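For context, the score reported from sampled generations is a pass@1 estimate; here is a minimal sketch of the unbiased pass@k estimator from the Codex/HumanEval paper (written with numpy for illustration, not the harness's own implementation):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Codex/HumanEval paper).
    n = samples generated per problem, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With n_samples=50, pass@1 averages over 50 sampled completions per problem;
# greedy decoding scores one deterministic completion instead, which in practice
# tends to land close to the sampled estimate.
print(pass_at_k(n=50, c=17, k=1))  # ~0.34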