abacaj / code-eval

Run evaluation on LLMs using human-eval benchmark
MIT License

Performance of llama-2 #10

Closed: junzhang-zj closed this issue 9 months ago

junzhang-zj commented 11 months ago

Why am I getting low scores on llama-2-13b (pass@1: 3.05%, pass@10: 19.51%)? Are you applying any special prompts in this setup, or are the scores related to batch decoding? My setup requires generating the samples sequentially, so I can't perform batch decoding.

abacaj commented 11 months ago

That seems pretty low for pass@1. This is how I loaded the 7B llama-2 model:

    import torch
    from transformers import LlamaTokenizer, LlamaForCausalLM

    # TOKEN is a Hugging Face access token with access to the Llama 2 weights
    tokenizer = LlamaTokenizer.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        use_auth_token=TOKEN,
    )

    # Load in bfloat16 on the GPU, put the model in eval mode, and compile it
    model = torch.compile(
        LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-hf",
            torch_dtype=torch.bfloat16,
            use_auth_token=TOKEN,
        )
        .eval()
        .to("cuda")
    )

No changes were made to the prompts found in the current eval_llama.py file.

junzhang-zj commented 11 months ago

Thanks, I need to check my code carefully. Also, is the sampling temperature 0.2? In my test, I set it to 0.8 to achieve better results (the scores above), so 0.2 won't work for my setting.

abacaj commented 11 months ago

The recommended temperatures from the paper were 0.2 for pass@1, 0.6 for pass@10, and 0.8 for pass@100.
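
For concreteness, a minimal sketch of passing those settings to model.generate, assuming the tokenizer and model loaded earlier in the thread; `prompt` stands for one HumanEval prompt string and the other values are illustrative, not taken from this repo:

    # Illustrative sampling setup; `tokenizer`, `model`, and `prompt` are assumed
    # to come from the loading code earlier in the thread.
    temperature = 0.2          # use 0.6 when estimating pass@10, 0.8 for pass@100
    n_samples = 10             # completions drawn per task

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=0.95,            # common choice for code sampling, not stated in this thread
        max_new_tokens=512,
        num_return_sequences=n_samples,
    )
    completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)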

BaoBaoGitHub commented 11 months ago

Thanks for your hard work! The prompt and hyper-parameters confused me a lot. But could the temperature be 0.1 instead of 0.2, according to Table 21 of the Llama 2 paper?

nicoladainese96 commented 10 months ago

@junzhang-zj have you checked the pad token for the model? It took me a while to figure out that setting tokenizer.pad_token = tokenizer.eos_token breaks performance in batch generation. Instead, try setting tokenizer.pad_token = '[PAD]' and see if it makes any difference.
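
For batched generation, a minimal sketch of that workaround (assuming the tokenizer and model loaded earlier in the thread; `prompts` stands for a list of HumanEval prompt strings):

    # Reported above to hurt batched generation:
    # tokenizer.pad_token = tokenizer.eos_token

    # Suggested workaround: use a dedicated pad token instead.
    tokenizer.pad_token = "[PAD]"
    # Left padding is a common extra precaution for causal-LM batching
    # (not part of the original suggestion).
    tokenizer.padding_side = "left"

    batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(
        **batch,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512,
    )
    completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)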

junzhang-zj commented 10 months ago

@nicoladainese96 I found that the code I generated was missing the leading '\t', so I added it to all the answers and the results made sense. I'm sorry for not updating my solution in time. I will also try your solution. Thank you very much!
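
A small post-processing sketch of that fix (a hypothetical helper; `completions` stands for the raw model outputs for one task):

    def prepend_missing_indent(completions):
        # HumanEval prompts end inside a function body, so the first generated
        # line must be indented; prepend a tab when the model drops it.
        return [c if c.startswith(("\t", " ")) else "\t" + c for c in completions]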

Cooperx521 commented 9 months ago

@junzhang-zj Hi, I encountered the same issue while testing with llama2, so I added a '\t' to each output. However, what puzzles me is that the code in this repository doesn't seem to add a '\t' to the llama output by default, yet the results appear to be quite satisfactory. Have you identified the reason for this?

junzhang-zj commented 9 months ago

@Cooperx521 I haven't checked the reason carefully yet.