OpenLMLab / LEval

[ACL'24 Oral] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark

failed reproduce llama3-8b result #14

Open chunniunai220ml opened 1 month ago

chunniunai220ml commented 1 month ago

I cannot reproduce the llama3-8b result following your advice. I just got {'exact_match': 53.9604, 'num_predicted': 202, 'mean_prediction_length_characters': 1.0, 'LEval_score': 53.9604, 'display_keys': ['exact_match'], 'display': [53.9604]}

Here is my command:

python Baselines/llama2-chat-test.py --metric exam_eval --task_name quality --max_length 4k

and I changed llama2-chat-test.py as follows:

elif args.metric == "exam_eval":
    context = "Document is as follows. {document} \nQuestion: {inst}. Please directly give the answer without any additional output or explanation "

message = "<|begin_of_text|>" + sys_prompt  # B_INST + B_SYS + sys_prompt + E_SYS + context + E_INST
message += "\nAnswer:"

ChenxinAn-fdu commented 1 month ago

Hi! You should use the Instruct version of Llama3 8B and set max_length to 8k. Please use the chat template of Llama3.
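For example, one way to apply that template is to let the tokenizer render it instead of concatenating the special tokens by hand (just a minimal sketch; the model id and variable names here are illustrative, not the exact code in the repo):

from transformers import AutoTokenizer

# Model id assumed here; use whichever local path / HF id you evaluate with.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sys_prompt = "..."   # the L-Eval system prompt for the task
context = "..."      # the document + question string built by the script

# Render the Llama3 chat template and end with the assistant header,
# so the model continues directly with its answer.
message = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": context},
    ],
    tokenize=False,
    add_generation_prompt=True,
)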

ylsung commented 5 days ago

Hi,

Thank you for providing the code and tips for reproducing the LLaMA 3 results!

I modified the LLaMA 2 code based on your suggestions:

  1. Use the LLaMA3-Instruct model
  2. Set the max_length to 8k
  3. Use the llama3 template (as shown below)
    message = ""
    message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
    message += "\n" + sys_prompt
    message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
    message += "\n" + context
    message += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

The results I got for the six tasks are

Llama3-8b        TOEFL   QuALITY   Coursera   SFiction   GSM     CodeU
Your Results     82.89   64.85     53.77      69.53      79.00   2.22
My Reproduction  81.04   61.88     52.62      71.09      29.00   4.44

The results on most datasets are within an acceptable gap of yours, but the GSM100 result I got is surprisingly bad. Could you please help me check whether my prompt is the same as yours? Or do you have any other suggestions for reproducing the results (such as tuning the decoding hyperparameters)? Thank you very much.

ChenxinAn-fdu commented 5 days ago

Hi! I suggest using:

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context + "\nAnswer:"

The Question and Answer pair is needed to achieve high performance on math tasks.

ylsung commented 3 days ago

Thanks for your reply.

I found that the role special tokens have to be added to all the few-shot examples in GSM100, like this:

context = document + "\n\n" + inst

# Open a user turn before every few-shot question
context = context.replace(
    "Question:",
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:"
)

# Close the user turn and start an assistant turn before every CoT answer
context = context.replace(
    "Let's think step by step",
    "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += context

Then the accuracy will be 78!
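For reference, a toy snippet like this (the question and answer strings are made up, not taken from the dataset) shows what those replacements do: each few-shot example becomes a user turn holding the question and an assistant turn holding the CoT and answer.

example = (
    "Question: Tom has 3 apples and buys 2 more. How many does he have? "
    "Let's think step by step\n3 + 2 = 5. The answer is 5."
)
example = example.replace(
    "Question:",
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:"
)
example = example.replace(
    "Let's think step by step",
    "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)
print(example)
# <|eot_id|><|start_header_id|>user<|end_header_id|>
# Question: Tom has 3 apples and buys 2 more. How many does he have? Let's think step by step
# <|eot_id|><|start_header_id|>assistant<|end_header_id|>
# 3 + 2 = 5. The answer is 5.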

There is also another option: not using any chat format at all,

message = sys_prompt + "\n" + context

In this case, the model acts like a pre-trained language model and keeps outputting self-curated questions and answers after the CoT and answer for the original question. If we parse out only the first answer that the model generates (which the current code already does), the accuracy is 80.
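For what it's worth, a simplified stand-in for that parsing step (just a sketch of the idea, not the repo's exact post-processing) could look like:

import re

def parse_first_answer(generation: str) -> str:
    # Without the chat template the model keeps producing new self-curated
    # Q/A pairs, so keep only the first "The answer is ..." it emits.
    match = re.search(r"The answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", generation)
    return match.group(1).replace(",", "") if match else ""

print(parse_first_answer("3 + 2 = 5. The answer is 5.\nQuestion: next made-up question ..."))  # -> 5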