awslabs / pptod

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System (ACL 2022)
https://arxiv.org/abs/2109.14739
Apache License 2.0

Issues with E2E modelling #2

Closed NLP-hua closed 2 years ago

NLP-hua commented 2 years ago

Hi, thanks for releasing your code.

I am following your work, but I have run into a problem with the end-to-end modelling.

(1) I pre-trained the pptod checkpoints using the script in the Pretraining folder (pretraining_pptod_small.sh). (2) I then trained the E2E model following the instructions in E2E_TOD/sh_folder/small/training/pptod_small_train_full_training.sh.

However, when evaluating the model on the test set, it only achieves 82.9/72.4/18.93/97.08 (Inform/Success/BLEU/Combined Score), which is merely comparable to the T5-small (Plug-and-Play) result shown in Table 6. It seems the pptod pre-training is not having any effect?
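
(For reference, the Combined Score above is the standard MultiWOZ metric, Combined = (Inform + Success) * 0.5 + BLEU; a quick sketch just to make the numbers explicit:)

```python
# Combined Score on MultiWOZ: (Inform + Success) * 0.5 + BLEU
def combined_score(inform: float, success: float, bleu: float) -> float:
    return (inform + success) * 0.5 + bleu

# The pptod-small result reported in Table 2 of the paper:
print(combined_score(87.80, 75.30, 19.89))  # -> 101.44
```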

Is there anything I might have done wrong?

yxuansu commented 2 years ago

Hi, thank you for your question. Actually, the distributions of the test set and the dev set are a little bit dissimilar. Therefore, to validate the model performance, I suggest first changing line 203 of learn.py from 'dev' to 'test' so that you can directly observe the intermediate training results on the test set.
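
The change is just a one-token switch; roughly speaking (a sketch only, since the actual statement at that line in learn.py uses the repo's own variable name -- the point is only the 'dev' -> 'test' swap):

```python
# learn.py, around line 203 (hypothetical sketch; only the split string changes)
eval_split = 'test'   # was: eval_split = 'dev'
```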

Could you try to change that line and re-train pptod-small? Looking forward to your further updates. :)

NLP-hua commented 2 years ago

Hi, I've tried what you suggested and changed line 203 of learn.py from 'dev' to 'test'. However, I got results similar to those on the 'dev' set: the performance on the test set is 85.5/75.0/18.38/98.63, which is still a little bit below the numbers reported in Table 2 (87.80/75.30/19.89/101.44).

In fact, due to memory constraints, my settings are: --gradient_accumulation_steps 4 --number_of_gpu 2 --batch_size_per_gpu 16

The effective batch size is 4 x 2 x 16 = 128. Do gradient_accumulation_steps or batch_size_per_gpu affect the performance that much? Do you have any suggestions?
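
(Just to make the arithmetic explicit, the effective batch size is the product of the three flags:)

```python
# Effective batch size = gradient_accumulation_steps * number_of_gpu * batch_size_per_gpu
grad_accum_steps, num_gpus, batch_size_per_gpu = 4, 2, 16
print(grad_accum_steps * num_gpus * batch_size_per_gpu)  # -> 128
```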

NLP-hua commented 2 years ago

Furthermore, I also downloaded the pre-trained pptod_small checkpoint using the script download_pptod_small.sh and then fine-tuned the E2E model, but I still could not obtain results comparable to those reported in Table 2.

yxuansu commented 2 years ago

Personally, I think some variation in the model results is expected, and a combined score of 98.63 seems comparable to 101.44. I suggest fine-tuning the model from scratch 2 or 3 times (with different random seeds); you should then be able to get a result closer to the one reported in the paper. Actually, I just ran the experiment again and got a similar result. Hope this helps. Looking forward to your update!

yxuansu commented 2 years ago

By the way, I ran the experiments with the provided script in the repo. I am not sure whether gradient_accumulation_steps or batch_size_per_gpu would make much difference in training.

NLP-hua commented 2 years ago

Thanks for your reply. After fine-tuning the model 3 or 4 times, I finally got comparable results: 87.1/77.1/18.05/100.15. I have also tried another 2 random seeds (557 and 42), and the results were slightly lower than 100.15. Do you have any suggestions about the seed?
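
For reference, I fix the seed in the usual way before training (a generic PyTorch-style sketch, not the repo's exact code):

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Fix the common sources of randomness: Python, NumPy, and PyTorch (CPU + GPU).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(557)  # one of the seeds tried above
```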

yxuansu commented 2 years ago

I am glad to hear that you are able to replicate the results. Actually, I did not investigate the choice of random seed very much. If you find any interesting observations about random seeds, please let me know. Feel free to contact me if you have any further questions!