GanjinZero / RRHF

[NIPS2023] RRHF & Wombat

The generation config for evaluation #39

Closed stevie1023 closed 11 months ago

stevie1023 commented 11 months ago

Hi there, I'm writing to ask about the generation config (temperature, max_new_tokens, etc.) used for the reward score and PPL calculation, since I'm having difficulty reproducing the results given in the paper (Table 2).

Thanks for your help.

Yuanhy1997 commented 11 months ago

For evaluation, we just use greedy decoding with max new tokens set to 128. For PPL calculation, we use GPT2-medium as the scoring model, as stated in the paper, together with the perplexity script from Hugging Face. For reward scores, we use the model Dahoas/gptj-rm-static from the Hugging Face hub.
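For anyone reproducing this setup, the greedy decoding part can be sketched with the Hugging Face `transformers` generation config. This is a minimal sketch based on the comment above; the model/tokenizer loading and scoring wiring are assumptions, not the authors' exact script.

```python
from transformers import GenerationConfig

# Greedy decoding with at most 128 new tokens, per the comment above.
gen_config = GenerationConfig(
    do_sample=False,    # greedy: always pick the highest-probability token
    max_new_tokens=128,
)

# The generations would then be scored for PPL with "gpt2-medium" and for
# reward with "Dahoas/gptj-rm-static" (model names from this thread), e.g.:
#   outputs = model.generate(**inputs, generation_config=gen_config)
```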

GanjinZero commented 11 months ago

If you cannot align with our results, you can post your results with your generation config.

stevie1023 commented 11 months ago

Thanks for your response, I'll reset the config parameters and try it again.

stevie1023 commented 11 months ago

Hi, thanks for your reply; my question about the generation config has been solved, but I am still confused about the scores in Table 9. Are the scores given by ChatGPT using the same prompt as for the Wombat training data in Appendix F, which would mean the total score for each test sample should be in the range [0, 20]? Or do you evaluate Wombat with other prompts? Thanks a lot in advance.

GanjinZero commented 11 months ago

> Are the scores given by ChatGPT using the same prompt as for the Wombat training data in Appendix F, which would mean the total score for each test sample should be in the range [0, 20]? Or do you evaluate Wombat with other prompts?

We generate the training data with the prompt in Appendix F, and evaluate using the Vicuna test set and the prompts from Vicuna.

stevie1023 commented 11 months ago

Hi, thanks for your reply; my question has been resolved. Best,