chunniunai220ml opened this issue 1 month ago
Hi! You should use the Instruct version of Llama3 8B and set max_length to 8k. Please use the chat template of Llama3.
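For reference, here is a minimal sketch of applying the Llama 3 chat template via Hugging Face Transformers; the model ID and message contents below are illustrative assumptions, not the exact code from this repo:

```python
from transformers import AutoTokenizer

# Illustrative gated model ID; requires Hugging Face access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Document is as follows. ...\nQuestion: ..."},
]

# Renders the <|begin_of_text|>/<|start_header_id|> template and appends
# the assistant header so generation starts at the assistant turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```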
Hi,
Thank you for providing the code and tips for reproducing the LLaMA 3 results!
I modified the LLaMA 2 code based on your suggestions:
message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context
message += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
The results I got for the six tasks are:

| Llama3-8b | TOEFL | QuALITY | Coursera | SFiction | GSM | CodeU |
|---|---|---|---|---|---|---|
| Your Results | 82.89 | 64.85 | 53.77 | 69.53 | 79.00 | 2.22 |
| My Reproduction | 81.04 | 61.88 | 52.62 | 71.09 | 29.00 | 4.44 |
Results on most datasets are within an acceptable gap of yours, but the GSM result I got is much worse. Could you please help me check whether my prompt is the same as yours? Or do you have any other suggestions for reproducing the results (such as tuning the decoding hyperparameters)? Thank you very much.
Hi! I suggest using:

```python
message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context + "\nAnswer:"
```
The `Question:` and `Answer:` pair is needed to achieve high performance on math tasks.
Thanks for your reply.
I found that the role special tokens have to be added to all the examples in GSM, e.g.:
```python
context = document + "\n\n" + inst
context = context.replace(
    "Question:",
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:"
)
context = context.replace(
    "Let's think step by step",
    "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)
message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += context
```
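To make the effect of these replacements concrete, here is a self-contained toy illustration; the `document` and `inst` contents are made up (the real GSM context holds many in-context CoT examples):

```python
# Hypothetical toy data standing in for the real GSM few-shot context.
document = "Question: 2 + 3 = ?\nLet's think step by step 2 + 3 = 5. The answer is 5."
inst = "Question: 4 + 7 = ?\nLet's think step by step"

context = document + "\n\n" + inst
context = context.replace(
    "Question:",
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:"
)
context = context.replace(
    "Let's think step by step",
    "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

# Every "Question:" now opens a user turn and every "Let's think step by step"
# hands off to an assistant turn, so each in-context example is wrapped
# in role tokens rather than only the final question.
print(context)
```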
Then the accuracy will be 78!
There is also the other option of not using any chat format at all:

```python
message = sys_prompt + "\n" + context
```

In this case, the model acts like a pre-trained language model and keeps outputting self-curated questions and answers after the CoT and the answer for the original question. If we parse the first answer that the model generates (which is already done in the current code), the accuracy is 80.
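For the parsing step, something like the following sketch works, assuming the model follows the few-shot convention of ending each solution with "The answer is N"; this is an illustration, not the exact parser in the repo:

```python
import re

def parse_first_answer(generation: str) -> str:
    """Keep only the model's first solution (before it starts inventing
    new 'Question:' blocks), then extract the final number."""
    first_solution = generation.split("Question:")[0]
    match = re.search(r"answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)",
                      first_solution, flags=re.IGNORECASE)
    return match.group(1).replace(",", "") if match else ""

print(parse_first_answer(
    "2 + 3 = 5. The answer is 5.\n\nQuestion: made-up follow-up ..."
))  # -> 5
```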
I cannot reproduce the Llama3-8B result following your advice; I only got:

```
{'exact_match': 53.9604, 'num_predicted': 202, 'mean_prediction_length_characters': 1.0, 'LEval_score': 53.9604, 'display_keys': ['exact_match'], 'display': [53.9604]}
```

Here is my command:

```
python Baselines/llama2-chat-test.py --metric exam_eval --task_name quality --max_length 4k
```

and my change to llama2-chat-test.py:

```python
elif args.metric == "exam_eval":
    context = "Document is as follows. {document} \nQuestion: {inst}. Please directly give the answer without any additional output or explanation "
    # was: B_INST + B_SYS + sys_prompt + E_SYS + context + E_INST
    message = "<|begin_of_text|>" + sys_prompt
    message += "\nAnswer:"  # NOTE: context is never appended to message here
```