ZHAOTING / dialog-processing

NLG and NLU for dialogue processing
Apache License 2.0

Reproduce the performance of RUBER-unref in paper #4

Closed: ddehun closed this issue 3 years ago

ddehun commented 4 years ago

Hi @ZHAOTING!

I tried to reproduce the performance of the RUBER-unref model in your ACL 2020 paper using your dataset, but I failed.
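(To make sure we are talking about the same model, here is my understanding of the unreferenced scorer, following the original RUBER paper; a schematic sketch in PyTorch with illustrative names, not this repository's actual code.)

```python
# Schematic RUBER unreferenced scorer: encode query and reply with GRUs,
# combine them with a bilinear "quadratic" term, and score with an MLP.
# Illustrative only; not this repository's implementation.
import torch
import torch.nn as nn

class UnrefScorer(nn.Module):
    def __init__(self, emb_dim=300, hidden=512):
        super().__init__()
        self.query_enc = nn.GRU(emb_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.reply_enc = nn.GRU(emb_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.bilinear = nn.Bilinear(2 * hidden, 2 * hidden, 1)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, query_emb, reply_emb):
        # query_emb, reply_emb: (batch, seq_len, emb_dim) word embeddings
        _, q = self.query_enc(query_emb)        # final states: (2, B, H)
        _, r = self.reply_enc(reply_emb)
        q = torch.cat([q[0], q[1]], dim=-1)     # (B, 2H)
        r = torch.cat([r[0], r[1]], dim=-1)
        quad = self.bilinear(q, r)              # (B, 1)
        return self.mlp(torch.cat([q, r, quad], dim=-1)).squeeze(-1)
```

In the original paper this scorer is trained with randomly sampled negative replies under a margin ranking loss, so no human score labels are needed.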

More specifically, I hope to reproduce the RUBER-unref model (i.e., the variant that excludes the ground-truth reference) reported in the right part of Table 1, which shows about .43 Pearson and .39 Spearman correlation.
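(For clarity, this is how I measure those correlations between model scores and human ratings; a minimal sketch using scipy, with illustrative placeholder data rather than the actual annotations.)

```python
# Correlate automatic scores with human ratings; placeholder data only.
from scipy.stats import pearsonr, spearmanr

model_scores = [0.91, 0.12, 0.55, 0.78, 0.33]  # e.g., RUBER-unref outputs
human_ratings = [4.5, 1.0, 3.0, 4.0, 2.5]      # e.g., mean annotator scores

pearson_r, pearson_p = pearsonr(model_scores, human_ratings)
spearman_r, spearman_p = spearmanr(model_scores, human_ratings)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.3g})")
```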

I tried both my own custom implementations and this repository, but none of them reaches a similar performance.

When I train RUBER with this code, the best performance is about 0.21 Pearson and 0.25 Spearman correlation, and by epoch 8 the learning rate had decayed to 1e-7 (the stopping condition).
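(For context, the stopping behavior I observed follows a decay-to-floor pattern; a minimal sketch of that pattern, assuming PyTorch's ReduceLROnPlateau rather than this repository's exact trainer, with `evaluate` as a hypothetical placeholder.)

```python
# Decay the learning rate when validation stalls and stop once it drops
# below a floor. Assumes ReduceLROnPlateau; not this repo's exact trainer.
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=1)

def evaluate(model):
    # hypothetical validation step; returns a correlation-like metric
    return 0.2

for epoch in range(30):
    scheduler.step(evaluate(model))
    if optimizer.param_groups[0]["lr"] < 1e-7:  # the 1e-7 floor I hit
        print(f"stopping at epoch {epoch}")
        break
```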

I used the command line below for training. The hyperparameters are the same as the values given in the appendix of the paper.

```
python -m tasks.response_eval.train_unsupervised --model ruber --corpus dd --tokenizer ws --enable_log True --save_model True --batch_size 30 --init_lr 0.0001 --n_epochs 30
```

Could you give me some tips to improve the performance of the RUBER-unref model? Even when I replace the [word embedding + GRU] encoder with [frozen BERT + mean pooling], following this paper, the best correlation is only about 0.2.
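(For concreteness, this is roughly what I mean by the BERT variant; a minimal sketch assuming the HuggingFace transformers API, not the exact code I ran.)

```python
# [frozen BERT + mean pooling] sentence encoder; sketch only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()
for p in bert.parameters():  # freeze all BERT weights
    p.requires_grad = False

def encode(sentences):
    """Mean-pool the last hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state  # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

context_vec = encode(["how are you ?"])
response_vec = encode(["i am fine , thanks ."])
```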

Thanks!

ZHAOTING commented 4 years ago

In my experience, the RUBER and ADEM models are somewhat unstable to train.

1) Make sure you have initialized the model with an HRED model pretrained on the response generation task (see the checkpoint-loading sketch after this list).

2) I would suggest trying different random seeds (e.g., with the argument "--seed 42"), since different seeds gave me quite different results.

3) BTW, I have been using the floor (speaker) encoder the whole time (with the argument "--floor_encoder rel"), so I don't know whether that is a factor.
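Regarding 1), here is what initializing from a pretrained checkpoint typically looks like; a minimal sketch assuming a standard PyTorch state dict, where the class, layer sizes, and path are illustrative rather than this repository's actual interface.

```python
# Copy shared weights from an HRED checkpoint into a RUBER-style model.
# Illustrative stand-ins only; not this repository's classes or paths.
import torch
import torch.nn as nn

class TinyRuber(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(300, 512, batch_first=True)  # shared with HRED
        self.score_head = nn.Linear(512, 1)                # RUBER-only

ruber = TinyRuber()
state = torch.load("hred_pretrained.pt", map_location="cpu")  # illustrative path
# strict=False copies overlapping (shared) weights and leaves the
# RUBER-only head at its random initialization.
missing, unexpected = ruber.load_state_dict(state, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```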