Yuanhy1997 / SeqDiffuSeq

Text Diffusion Model with Encoder-Decoder Transformers for Sequence-to-Sequence Generation [NAACL 2024]
https://arxiv.org/abs/2212.10325

Issues reproducing Table 1 results on the Commonsense Conversation Dataset (CCD) #13

Open Silin159 opened 1 year ago

Silin159 commented 1 year ago

Hi, I tried to use your script (ccd.sh) to reproduce the Table 1 results on the Commonsense Conversation Dataset, but my reproduced results (BLEU: 0.154, Rouge-L: 6.38) are far below your reported values (BLEU: 1.02, Rouge-L: 8.59). Could you check whether the hyperparameters in ccd.sh are the optimal ones you used? It would also help if you could provide the evaluation scripts for computing BLEU and Rouge-L (currently, if I ran it correctly, the inference_scripts only save the test outputs and do not report any metrics). Besides, are any model checkpoints and test outputs available?

Yuanhy1997 commented 1 year ago

I think you can select among the checkpoints around 100000 training steps using the validation data. The number in the paper is out of date; the new result is a bit lower, at 0.84 BLEU. BTW, CCD is a pretty bizarre dataset: models easily overfit the training data, and the outputs actually require commonsense knowledge. (DiffuSeq only achieved a BLEU of around 1, which means the outputs barely correlate with the labels.)
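
For readers who want to try the checkpoint selection described above, here is a minimal sketch. It is not the repository's evaluation script: the function names (`lcs_length`, `rouge_l_f1`, `corpus_rouge_l`, `select_checkpoint`) are hypothetical, tokenization is plain whitespace splitting, and ROUGE-L is computed as an unweighted F1 over the longest common subsequence, so the absolute numbers will probably not match the paper's. It only illustrates scoring each checkpoint's validation outputs and keeping the best one.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis: str, reference: str) -> float:
    """Sentence-level ROUGE-L as plain F1 (assumption: whitespace tokens, beta = 1)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def corpus_rouge_l(hyps: List[str], refs: List[str]) -> float:
    """Average sentence-level ROUGE-L over aligned hypothesis/reference pairs."""
    if not refs or len(hyps) != len(refs):
        raise ValueError("hyps and refs must be non-empty and the same length")
    return sum(rouge_l_f1(h, r) for h, r in zip(hyps, refs)) / len(refs)


def select_checkpoint(val_outputs: Dict[int, List[str]], val_refs: List[str]) -> int:
    """Return the training step whose validation outputs score highest.

    val_outputs maps a training step (e.g. 90000, 100000, 110000) to the
    generated validation responses for that checkpoint; how those outputs
    are loaded from disk is left to the caller.
    """
    scores = {step: corpus_rouge_l(outs, val_refs) for step, outs in val_outputs.items()}
    return max(scores, key=scores.get)
```

To use it, generate validation outputs for several checkpoints near 100000 steps, pass them to `select_checkpoint` keyed by step, and run the test set only on the step it returns. For numbers comparable to published results, a standard metric implementation should replace the simplified scorer here.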