OpenBMB / UltraFeedback

A large-scale, fine-grained, diverse preference dataset (and models).

Training details for reproducing UltraCM #9


huchinlp commented 9 months ago

Thank you so much for sharing the data. It's very helpful for the RLHF community!

I found some hyper-parameters for training UltraCM in your paper, but I still have a few questions:

  1. How do you prepare the training examples? It seems that the instruction, the completion, the feedback, and the overall score are filled into the ultracm_instruction_template defined on your demo page, but I'm not sure. (My current guess is in the first sketch after this list.)
  2. How is the loss calculated? Did you mask the input content (the instruction and completion) so that the loss covers only the feedback and score? (See the second sketch below for what I mean.)
  3. Did you compare tuning the critique model from an SFT model versus from a pretrained checkpoint?
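
To make question 1 concrete, here is a minimal sketch of how I am currently assembling examples. The template string, the field names, and the score formatting are all my own guesses, not the actual ultracm_instruction_template:

```python
# Minimal sketch of my current guess at example preparation.
# NOTE: this template is a hypothetical stand-in, NOT the real
# ultracm_instruction_template from the demo page.
TEMPLATE_GUESS = (
    "### Instruction\n{instruction}\n\n"
    "### Completion\n{completion}\n\n"
    "### Feedback\n"
)

def build_example(record: dict) -> dict:
    """Split one record into a prompt (the conditioning text) and a
    target (the text the model should learn to generate)."""
    prompt = TEMPLATE_GUESS.format(
        instruction=record["instruction"],  # assumed field name
        completion=record["completion"],    # assumed field name
    )
    # Assumed formatting: critique text, then the overall score.
    target = f"{record['feedback']}\nOverall Score: {record['overall_score']}"
    return {"prompt": prompt, "target": target}
```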
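
And for question 2, this is the kind of masking I have in mind: the prompt positions in the labels are set to -100 so that cross-entropy ignores them and only the feedback/score tokens are trained on. The tokenizer checkpoint here is just a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder tokenizer for illustration; I assume the real run uses
# the base model's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_with_masking(prompt: str, target: str, max_len: int = 2048) -> dict:
    """Tokenize prompt + target, masking the prompt so the loss is
    computed only on the feedback/score tokens."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + target_ids + [tokenizer.eos_token_id])[:max_len]
    # -100 is the ignore_index of PyTorch's CrossEntropyLoss, so the
    # instruction + completion contribute nothing to the loss.
    labels = ([-100] * len(prompt_ids) + target_ids
              + [tokenizer.eos_token_id])[:max_len]

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": [1] * len(input_ids),
    }
```

Is this roughly what you did, or is the loss applied over the full sequence?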

Thanks again for your efforts!