Open nxpeng9235 opened 1 year ago
Hi, thanks for your interest! My best guess is that this is an optimization difference between training on multiple machines and accumulating gradients on a single machine. For T5-base, we used multiple GPUs, and I honestly can't remember the exact configs we used.
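As a toy illustration of one way the two setups can diverge (this is not the repository's training code; the function names, the model, and the data below are made up for the example): averaging per-micro-batch gradients matches the full-batch gradient only when every micro-batch contributes the same number of examples. With uneven micro-batches, which is roughly analogous to token-averaged losses over sequences of different lengths, the accumulated gradient is weighted differently.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches):
    """Average the per-micro-batch gradients, as naive accumulation does."""
    return sum(mse_grad(w, mb) for mb in micro_batches) / len(micro_batches)


# Hypothetical data and starting weight, chosen only for illustration.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

full = mse_grad(w, data)                                   # one large batch
equal_split = accumulated_grad(w, [data[:2], data[2:]])    # 2 + 2 examples
uneven_split = accumulated_grad(w, [data[:1], data[1:]])   # 1 + 3 examples

print(full, equal_split, uneven_split)
```

Here `equal_split` agrees with `full`, while `uneven_split` does not. Whether this effect explains the gap reported in this issue is an assumption; other factors (learning-rate scaling, shuffling across workers, random seeds) could also contribute.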
Hi,
Congratulations on having this concise and solid work accepted at EMNLP 2021! I am following your research and trying to reproduce the experimental results from the original paper using your code. However, I have had trouble matching the reported JGA scores.
My experiments were all on MultiWOZ v2.2, with domain and slot descriptions. Here are my hyperparameter settings and corresponding results.
I am wondering whether there are any other tricks for achieving better results. If so, would you be willing to share them? Much appreciated! Looking forward to your reply :-D
Best