Open Silin159 opened 1 year ago
I think you can select a checkpoint from around 100,000 training steps using the validation data. The number in the paper is out of date; the new result is a bit lower, at 0.84 BLEU. BTW, CCD is a pretty bizarre dataset: models easily overfit the training data, and the outputs actually require commonsense knowledge. (DiffuSeq only achieved a BLEU of around 1, which means the outputs barely correlate with the labels.)
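To make the checkpoint selection concrete, here is a minimal sketch, assuming you have already computed one validation score per saved checkpoint. The function name `select_checkpoint`, the 20,000-step window, and the example scores are all hypothetical; this is not code from the repository.

```python
def select_checkpoint(val_scores, target_step=100_000, window=20_000):
    """Return the training step with the best validation score near target_step.

    val_scores maps a checkpoint's training step to its validation metric
    (higher is better, e.g. validation BLEU). The window size is an
    assumption, not a value given by the authors.
    """
    # Keep only checkpoints within +/- window steps of the target.
    candidates = {
        step: score
        for step, score in val_scores.items()
        if abs(step - target_step) <= window
    }
    if not candidates:
        raise ValueError("no checkpoints fall inside the selection window")
    # Pick the step whose validation score is highest.
    return max(candidates, key=candidates.get)


# Hypothetical validation scores, for illustration only.
example_scores = {60_000: 0.90, 90_000: 0.70, 100_000: 0.80, 110_000: 0.75}
best_step = select_checkpoint(example_scores)
```

Here the 60,000-step checkpoint has the highest score but lies outside the window, so it is not considered.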
Hi, I tried to use your script (ccd.sh) to reproduce the Table 1 results on the Commonsense Conversation Dataset, but my reproduced results (BLEU: 0.154, Rouge-L: 6.38) are far below your reported values (BLEU: 1.02, Rouge-L: 8.59). Could you check whether the hyperparameters in ccd.sh are the optimal ones that you used? It would also help if you could provide the evaluation scripts for computing BLEU and Rouge-L (currently, the inference_scripts only save the test outputs and do not report any metrics, if I ran them correctly). Besides, are there any model checkpoints and test outputs available?
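Until an official evaluation script is shared, a rough sanity check of Rouge-L can be done along the following lines. This is only a sketch under stated assumptions: whitespace tokenization, one reference per output, the balanced F-measure, and a score scaled to 0-100. The authors' script may differ in any of these, so the numbers are not guaranteed to be comparable with Table 1. All function names here are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Sentence-level Rouge-L F1 with whitespace tokenization (an assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def corpus_rouge_l(hypotheses, references):
    """Average sentence-level Rouge-L F1 over a test set, scaled to 0-100."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not hypotheses:
        return 0.0
    scores = [rouge_l_f1(h, r) for h, r in zip(hypotheses, references)]
    return 100.0 * sum(scores) / len(scores)
```

Running `corpus_rouge_l` on the saved test outputs and the gold responses gives a single number to compare across checkpoints, even if its absolute value differs from the paper's.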