RLHFlow / RLHF-Reward-Modeling

Recipes to train reward model for RLHF.
https://rlhflow.github.io/
Apache License 2.0

Training and evaluating the pair_pm model #21

Open t-sifanwu opened 1 month ago

t-sifanwu commented 1 month ago

Hi,

I have replicated the training and evaluation for the pair_pm model, but I haven't achieved the results reported in Table 2 of the paper. The best results I obtained were with pm_models/llama3-8b-it_bs128_lr1e-5/checkpoint-1306:

Chat: 63.55
Chat Hard: 63.27
Safety: 82.59
Reasoning: 53.53

The main difference I've noticed in your script is that the base_model in your pair_pm/llama3-8b-it.yaml is /home/wx/axtool/models/llama3_it_with_padding_token. However, I couldn't find this model on Hugging Face or anywhere else, so I trained the pair_pm model with meta-llama/Meta-Llama-3-8B-Instruct instead.

Another difference is in eval_reward_bench_pm.py. Similarly, you are using /home/cyeab/axtool/models/llama3_it_427_update for tokenizer and tokenizer_plain, while I used meta-llama/Meta-Llama-3-8B-Instruct instead.
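Concretely, I substituted the paths like this (the plain chat template below is illustrative, not your exact one):

```python
from transformers import AutoTokenizer

# Public checkpoint standing in for the local path
# /home/cyeab/axtool/models/llama3_it_427_update used in eval_reward_bench_pm.py.
model_path = "meta-llama/Meta-Llama-3-8B-Instruct"

# Tokenizer that formats the comparison prompt with the Llama 3 chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Second tokenizer with a plain "Role: content" template for rendering the two
# candidate responses inside the prompt (this template is a placeholder).
tokenizer_plain = AutoTokenizer.from_pretrained(model_path)
tokenizer_plain.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] | capitalize }}: {{ message['content'] }}\n"
    "{% endfor %}"
)
```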

Could you please share the llama3_it_with_padding_token and llama3_it_427_update models with me? Additionally, could you provide details on how you trained them?

Thank you!

WayXG commented 1 month ago

I think the llama3 with padding is obtained by adding a pad token to the original Llama 3 model. This can be done by running the pair-pm/prepare_model.py script; I did so and the resulting model is as expected.

axolotl will mask some tokens and stop their gradients, and the model's padding token should be set appropriately to get the expected performance, I think.
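A rough sketch of that recipe (the standard transformers approach; pair-pm/prepare_model.py may differ in details):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Llama 3 ships without a pad token; register one and resize the embeddings
# so that padding positions can be masked out correctly during training.
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

tokenizer.save_pretrained("./llama3_it_with_padding_token")
model.save_pretrained("./llama3_it_with_padding_token")
```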

t-sifanwu commented 1 month ago

Thanks for your reply! I have another question about training the bradley-terry-rm models. In bradley-terry-rm/llama3_rm.py, you use the dataset "hendrydong/preference_700K"; is that the same as the mix2 you mentioned in the paper?

WeiXiongUST commented 1 month ago

Yes, you can use hendrydong/preference_700K and the script we provide to process it into the pairwise preference format!
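For reference, a minimal sketch of that conversion, assuming the dataset's "chosen"/"rejected" fields are chat-format message lists that share a prefix and differ in the final assistant turn (check the dataset card and the provided script for the exact schema):

```python
from datasets import load_dataset

ds = load_dataset("hendrydong/preference_700K", split="train")

def to_pairwise(example):
    # Assumed schema: "chosen" and "rejected" are message lists with the same
    # conversation prefix and different final assistant responses.
    return {
        "context_messages": example["chosen"][:-1],
        "response_A": example["chosen"][-1]["content"],
        "response_B": example["rejected"][-1]["content"],
        "label": "A",  # response A (the chosen one) is preferred
    }

pairwise = ds.map(to_pairwise, remove_columns=ds.column_names)
print(pairwise[0])
```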

t-sifanwu commented 1 month ago

Thanks for your reply! Since the provided data-processing script takes input in the standard format, would it be possible to share the script that extracts the pairs? For example, the script that transforms the original UltraFeedback 63k dataset into the 340k standard-format RLHF dataset.
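Roughly, I imagine something like the following, with field names assumed from the openbmb/UltraFeedback dataset card (not necessarily your actual script):

```python
from itertools import combinations
from datasets import load_dataset

ds = load_dataset("openbmb/UltraFeedback", split="train")

pairs = []
for example in ds:
    # Each prompt comes with several scored completions; every pair of
    # completions with distinct overall scores yields one chosen/rejected pair.
    for a, b in combinations(example["completions"], 2):
        if a["overall_score"] == b["overall_score"]:
            continue
        chosen, rejected = (a, b) if a["overall_score"] > b["overall_score"] else (b, a)
        pairs.append({
            "prompt": example["instruction"],
            "chosen": chosen["response"],
            "rejected": rejected["response"],
        })
```

With four completions per prompt this yields up to six pairs per prompt before tie filtering.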

WeiXiongUST commented 1 month ago

Hi, you can check the datasets we provide in the Hugging Face RLHFlow organization. We provide the processing script for each dataset in its dataset card.