Closed · matouk98 closed this issue 2 months ago
Hi, there are two differences: (1) RLHFlow/LLaMA3-iterative-DPO-final is not our final official checkpoint; we are still working on releasing the official one and waiting on internal approvals. (2) The evaluation of MT-Bench follows this conversation template:
```python
from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

register_conv_template(
    Conversation(
        name="myllama3",
        system_template="<|start_header_id|>system<|end_header_id|>\n{system_message}",
        system_message="You are CMB, an AI assistant known for its intelligence and expertise across all fields of knowledge. You are designed to provide detailed and helpful responses to a wide range of user inquiries, ensuring clarity and accuracy in every interaction.",
        roles=["<|start_header_id|>user<|end_header_id|>", "<|start_header_id|>assistant<|end_header_id|>"],
        sep_style=SeparatorStyle.CHATML,
        sep="<|eot_id|>",
        stop_token_ids=[128009, 128001, 128006],  # <|eot_id|>, <|end_of_text|>, <|start_header_id|>
    )
)
```
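For reference, the template above should render single-turn prompts roughly as in the plain-Python sketch below. This is an illustrative re-implementation, not FastChat code, and the exact whitespace may differ slightly from FastChat's CHATML renderer; `render_prompt` and `SYSTEM` are names invented here.

```python
# Sketch of the prompt string the "myllama3" template should produce,
# mirroring the Llama 3 special tokens used in the Conversation above.
SYSTEM = (
    "You are CMB, an AI assistant known for its intelligence and expertise "
    "across all fields of knowledge. You are designed to provide detailed and "
    "helpful responses to a wide range of user inquiries, ensuring clarity "
    "and accuracy in every interaction."
)

def render_prompt(user_message: str, system_message: str = SYSTEM) -> str:
    """Build a single-turn Llama-3-style prompt ending at the assistant header."""
    return (
        f"<|start_header_id|>system<|end_header_id|>\n{system_message}<|eot_id|>\n"
        f"<|start_header_id|>user<|end_header_id|>\n{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )

print(render_prompt("Compose an engaging travel blog post about a recent trip."))
```

The trailing assistant header with no message is what cues the model to generate its turn; the `stop_token_ids` in the template then cut generation at `<|eot_id|>`.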
Thanks a lot for your response! With the new template, I get a score of 8.28 for LLaMA3-iterative-DPO-final; is that expected? By the way, would it be possible to provide the generation code for AlpacaEval? I tried to use the inference code in step 3.1 but am not sure how to apply the chat template. Thank you!
Hi, I think that is reasonable, as RLHFlow/LLaMA3-iterative-DPO-final is more similar to the concise version.
For me, I use vLLM for AlpacaEval inference, which is faster and more convenient. But the reported numbers were computed with Hugging Face generation; I will double-check the evaluation setup behind the reported numbers.
What does "computed by Hugging Face generation" mean? Could you please give a reference link or code?
You can refer to evaluate_from_model at https://github.com/tatsu-lab/alpaca_eval.
I double-checked the evaluation with my collaborator, and the reported numbers were also computed with vLLM. I will retrieve the code and provide more details 😂.
Hi, I am trying to evaluate the model RLHFlow/LLaMA3-iterative-DPO-final on MT-Bench. I use the inference environment in the README and follow the scripts from https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#mt-bench. However, I only get a score of 8.09. Can I get some suggestions on how to debug, or would it be possible to share the evaluation code? Thanks a lot!