That seems pretty low for pass@1; this is how I loaded the 7B Llama-2 model:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

# TOKEN is the Hugging Face access token for the gated Llama-2 weights
tokenizer = LlamaTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_auth_token=TOKEN,
)
model = torch.compile(
    LlamaForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
        use_auth_token=TOKEN,
    )
    .eval()
    .to("cuda")
)
No changes were made to the prompts found in the current eval_llama.py file.
Thanks, I need to check my code carefully. Also, is the sampling temperature 0.2? In my test I set it to 0.8 to get better results (the scores above), so 0.2 won't work for my setting.
The recommended temperatures from the paper were 0.2 for pass@1, 0.6 for pass@10, and 0.8 for pass@100.
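For context, a minimal sampling sketch using the pass@1 temperature; prompt, top_p, and max_new_tokens are illustrative assumptions, not the repo's exact eval settings:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.2,    # 0.6 for pass@10, 0.8 for pass@100
        top_p=0.95,
        max_new_tokens=256,
    )
# Strip the prompt tokens and keep only the generated completion
completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)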
Thanks for your hard work! The prompt and hyper-parameters confused me a lot! But could the temperature be 0.1 instead of 0.2, according to Table 21 of the Llama 2 paper?
@junzhang-zj have you checked the pad token for the model? It took me a while to figure out that setting tokenizer.pad_token = tokenizer.eos_token breaks performance in batch generation. Instead, try setting tokenizer.pad_token = '[PAD]' and see if it makes any difference.
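A minimal sketch of that pad-token setup for batched generation; prompts, the sampling arguments, and the left-padding choice are my assumptions, not part of the original suggestion:

# Sketch of the suggested pad-token fix for batched generation
tokenizer.pad_token = "[PAD]"      # instead of tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"    # left-pad so generation continues from the real prompt tokens
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, do_sample=True, temperature=0.2, max_new_tokens=256)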
@nicoladainese96 I found that the code I generated was missing the '\t' at the beginning of the output, so I added it to all the answers and the results made sense. Sorry for not updating my solution in time. I will also try your solution. Thank you very much!
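For anyone hitting the same issue, a minimal sketch of that workaround; completions is a hypothetical list of raw model outputs, not a variable from this repo:

# Workaround sketch: prepend a tab so the generated body is indented
# under the HumanEval function signature before evaluation.
completions = ["\t" + c for c in completions]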
@junzhang-zj Hi, I encountered the same issue while testing with Llama 2, so I added a '\t' to each output. However, what puzzles me is that, from reviewing the code in this repository, it doesn't seem to add a '\t' to the Llama output by default, yet the results appear to be quite satisfactory. Have you identified the reason for this?
@Cooperx521 I haven't checked the reason carefully yet.
Why am I getting low scores on llama-2-13b (pass@1: 3.05%, pass@10: 19.51%)? Are you applying any additional prompt formatting to this setup, or are the scores related to batch decoding? My setup requires generating the samples sequentially, so I can't perform batch decoding.