Closed · matouk98 closed this issue 2 months ago
Hi, there are two differences: (1) RLHFlow/LLaMA3-iterative-DPO-final is not our final official checkpoint; we are still working on releasing the official one and waiting on internal approvals. (2) The evaluation of MT-Bench follows this conversation template:
```python
from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

register_conv_template(
    Conversation(
        name="myllama3",
        system_template="<|start_header_id|>system<|end_header_id|>\n{system_message}",
        system_message="You are CMB, an AI assistant known for its intelligence and expertise across all fields of knowledge. You are designed to provide detailed and helpful responses to a wide range of user inquiries, ensuring clarity and accuracy in every interaction.",
        roles=["<|start_header_id|>user<|end_header_id|>", "<|start_header_id|>assistant<|end_header_id|>"],
        sep_style=SeparatorStyle.CHATML,
        sep="<|eot_id|>",
        stop_token_ids=[128009, 128001, 128006],  # <|eot_id|>, <|end_of_text|>, <|start_header_id|>
    )
)
```
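For reference, the template above should render single-turn prompts roughly as in the plain-Python sketch below. This is an illustrative re-implementation, not FastChat code, and the exact whitespace may differ slightly from FastChat's CHATML renderer; `render_prompt` and `SYSTEM` are names invented here.

```python
# Sketch of the prompt string the "myllama3" template should produce,
# mirroring the Llama 3 special tokens used in the Conversation above.
SYSTEM = (
    "You are CMB, an AI assistant known for its intelligence and expertise "
    "across all fields of knowledge. You are designed to provide detailed and "
    "helpful responses to a wide range of user inquiries, ensuring clarity "
    "and accuracy in every interaction."
)

def render_prompt(user_message: str, system_message: str = SYSTEM) -> str:
    """Build a single-turn Llama-3-style prompt ending at the assistant header."""
    return (
        f"<|start_header_id|>system<|end_header_id|>\n{system_message}<|eot_id|>\n"
        f"<|start_header_id|>user<|end_header_id|>\n{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )

print(render_prompt("Compose an engaging travel blog post about a recent trip."))
```

The trailing assistant header with no message is what cues the model to generate its turn; the `stop_token_ids` in the template then cut generation at `<|eot_id|>`.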
Thanks a lot for your response! With the new template, I get a score of 8.28 for LLaMA3-iterative-DPO-final; is that expected? By the way, would it be possible to provide the generation code for AlpacaEval? I tried to use the inference code in step 3.1 but am not sure how to apply the chat template. Thank you!
Hi, I think that is reasonable, as RLHFlow/LLaMA3-iterative-DPO-final is more similar to the concise version.
For me, I use vLLM for AlpacaEval inference, which is faster and more convenient. But the reported numbers were computed with Hugging Face generation; I will double-check the evaluation setup behind the reported numbers.
What does "computed by Hugging Face generation" mean? Could you please give a reference link or code?
You can refer to evaluate_from_model at https://github.com/tatsu-lab/alpaca_eval.
I double-checked the evaluation with my collaborator, and the reported numbers were also computed with vLLM. I will retrieve the code and provide more details 😂.
Hi, I am trying to evaluate the model RLHFlow/LLaMA3-iterative-DPO-final on MT-Bench. I use the inference environment in the README and follow the scripts from https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#mt-bench. However, I only get a score of 8.09. Can I get some suggestions on how to debug, or would it be possible to share the evaluation code? Thanks a lot!