Open Loong435 opened 4 months ago
@YangRui2015 could you look into this?
I tried to reproduce your Gemma-2B reward model training and found that the reward model architecture fine-tuned with internlm2 has an output head of size 1. However, when I downloaded your GRM-Gemma-2B-sftreg reward model, its final linear layer outputs two values. While debugging BT (Bradley-Terry) model training, I confirmed that the final linear layer of the reward model trained by your code also outputs a single value, and that the training script passes 'chosen' and 'rejected' through the model separately to obtain individual reward values for the loss calculation. Could you explain how your GRM-Gemma-2B-sftreg reward model was trained? From my evaluation, it appears the two output values correspond to a 'chosen' score and a 'rejected' score.
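For context, the pairwise loss the question describes (scoring 'chosen' and 'rejected' separately, then combining them) is the standard Bradley-Terry reward-model objective. A minimal pure-Python sketch of that loss is below; this is illustrative only and not the repository's actual training code, whose implementation may differ:

```python
import math

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The reward model scores the chosen and rejected responses
    independently (each forward pass outputs a single scalar reward);
    the loss then pushes r_chosen above r_rejected.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) rewritten for numerical stability
    return math.log1p(math.exp(-margin))

# The loss shrinks as the chosen reward pulls ahead of the rejected one:
print(bt_loss(2.0, 1.0))  # smaller loss: chosen already preferred
print(bt_loss(1.0, 2.0))  # larger loss: ranking is wrong
```

Note that under this objective each response still yields one scalar reward; the "two values" only arise because chosen and rejected are scored in separate forward passes (or in a batched pass of both inputs).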
Hi, the model Ray2333/GRM-Gemma-2B-sftreg outputs only one value and does not follow the original AutoModelForSequenceClassification class. It seems you may not have loaded it correctly. Please refer to the example here for the correct loading procedure.