Closed: so-hyeun closed this 3 years ago
Yes, the scores are the test F1 values taken at the point where validation F1 is highest.
Yes, exactly. If you look at Section 5.1 and Figure 12 in our paper, you will notice that performance depends heavily on the batch size. On DailyDialog, a smaller batch size results in poorer performance. I would also suggest running each experiment several times and averaging the results to get numbers closer to ours.
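As a rough illustration of the averaging suggestion, the sketch below summarizes the test F1 from several runs. It assumes each run differs only in its random seed; `summarize_runs` and the scores passed to it are hypothetical, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run test F1 scores."""
    if not scores:
        raise ValueError("need at least one run")
    # The sample standard deviation is undefined for a single run,
    # so report 0.0 in that case.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Made-up per-run scores for illustration only.
avg_f1, std_f1 = summarize_runs([58.0, 59.0, 60.0])
```

Reporting the spread alongside the mean makes it easier to judge whether a gap to the published number is within run-to-run variation.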
Thanks for the kind and quick reply.
Hi. I have a question about reproducing the reported performance on DailyDialog.
1) In the photo, are the @Best Valid F1 values the test F1 values at the point where validation F1 is highest?
2) I trained the model with batch size = 1 because of limited computing power. Could this be the cause of the difference between the performance in the paper (59.50) and my result (57.5)?