OpenLMLab / LEval

[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark

How to Reproduce Results on Llama3-8b? #13

Closed: Ocean-627 closed this issue 4 months ago

Ocean-627 commented 4 months ago

Excellent work! I noticed that the README provides results for Llama3-8b. However, I used meta-llama/Meta-Llama-3-8B-Instruct with llama2-chat-test.py and replaced LlamaTokenizer with AutoTokenizer, but I couldn't reproduce the results shown in the table. Could you please provide the reproduction code and commands for achieving the results on Llama3-8b? Thank you very much!

ChenxinAn-fdu commented 4 months ago

Hi! You should follow the Llama 3 chat format (see Meta's Llama 3 model card for the template):

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ L-Eval system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ L-Eval long context + question }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nAnswer:
```
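A minimal sketch of assembling that template in Python (the function and argument names here are illustrative, not from the L-Eval codebase):

```python
# Illustrative helper, not from the repository: builds the Llama-3 chat
# prompt described above from the L-Eval system prompt, long document,
# and question.
def build_llama3_prompt(sys_prompt: str, document: str, question: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{sys_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{document}\n{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\nAnswer:"
    )
```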

If you still cannot reproduce the results, please kindly leave a comment.

Ocean-627 commented 4 months ago

Thank you for your kind reply!

chunniunai220ml commented 3 months ago

@ChenxinAn-fdu I cannot reproduce the Llama3-8b result following your advice. I only got:

```
{'exact_match': 53.9604, 'num_predicted': 202, 'mean_prediction_length_characters': 1.0, 'LEval_score': 53.9604, 'display_keys': ['exact_match'], 'display': [53.9604]}
```

Here is my command:

```
python Baselines/llama2-chat-test.py \
    --metric exam_eval \
    --task_name quality \
    --max_length 4k
```

and I changed llama2-chat-test.py as follows:

```python
elif args.metric == "exam_eval":
    context = "Document is as follows. {document} \nQuestion: {inst}. Please directly give the answer without any additional output or explanation "
    message = B_INST + B_SYS + sys_prompt + E_SYS + context + E_INST
    message = "<|begin_of_text|>" + sys_prompt
    message += "\nAnswer:"
```
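Note that the snippet as pasted appears to drop both the Llama-3 header tokens and the `context`, which could explain the score gap. For comparison, here is one way the Llama-3 prompt could be produced with the tokenizer's built-in chat template instead of hand-concatenated strings. This is a hedged sketch, assuming a transformers version recent enough to ship the Llama-3 chat template; `sys_prompt` and `context` stand in for the variables from the snippet above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sys_prompt = "..."  # L-Eval system prompt from the script
context = "..."     # document + question, formatted as above

messages = [
    {"role": "system", "content": sys_prompt},
    {"role": "user", "content": context},
]
# add_generation_prompt=True appends the assistant header tokens so the
# model starts generating right after them; "\nAnswer:" is added manually
# to match the template suggested earlier in this thread.
message = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
message += "\nAnswer:"
```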