Closed stevie1023 closed 11 months ago
For evaluation, we simply use greedy sampling with max_new_tokens set to 128. For PPL calculation, we use GPT2-medium as the scoring model, as described in the paper, together with the PPL script from Hugging Face. For reward scores, we use the Dahoas/gptj-rm-static model from the Hugging Face hub.
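For reference, perplexity is just the exponential of the mean negative log-likelihood per token; the per-token log-probabilities would in practice come from GPT2-medium via the Hugging Face script mentioned above. A minimal sketch of the formula (the helper name `perplexity` and the toy log-probs are illustrative, not from the repo):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: natural-log probabilities assigned by the scoring
    model (e.g. GPT2-medium) to each generated token.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: if the model assigns probability 0.25 to every token,
# the perplexity is 4 (the model is "choosing among 4 options").
print(perplexity([math.log(0.25)] * 10))
```

Lower perplexity under the scoring model means the generations are more fluent according to GPT2-medium.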
If you cannot match our results, please post your results along with your generation config.
Thanks for your response, I'll reset the config parameters and try it again.
Hi, thanks for your reply; my question about the generation config has been resolved, but I am still confused about the scores in Table 9. Are the scores given by ChatGPT using the same prompt as for the Wombat training data in Appendix F, which would mean the total score for each test sample should be in the range [0, 20]? Or did you evaluate Wombat with other prompts? Thanks a lot in advance.
We generate training data using the prompt in Appendix F, and evaluate on the Vicuna test set with Vicuna's prompts.
Hi, thanks for your reply and my question has been resolved. Best,
Hi there, I'm writing to ask about the generation config (temperature, max_new_tokens, etc.) for reward score and PPL calculation, since I'm having difficulty reproducing the results given in the paper (Table 2).
Thanks for your help.