awslabs / pptod

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System (ACL 2022)
https://arxiv.org/abs/2109.14739
Apache License 2.0

About domain overlapping #15

Closed Monstarrr closed 2 years ago

Monstarrr commented 2 years ago

Hi! I am really interested in your work. However, I noticed that the preprocessing script does not remove data that overlaps between the pre-training and fine-tuning sets. Will these overlapping domains affect the results of the low-resource experiments? Looking forward to your reply!

yxuansu commented 2 years ago

Hi @Monstarrr,

Thank you for your interest in our work.

During pre-training, we use a collection of datasets that excludes the ones we test on (i.e. MultiWOZ 2.0 and 2.1). (1) We assume that the pre-training corpus contains no data that is exactly the same as the data in MultiWOZ. (2) In addition, our underlying assumption is that the pre-training corpus is, to some extent, similar to the end tasks in terms of data distribution. Therefore, we can acquire a model with good generalization ability.
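If you want to sanity-check assumption (1) yourself, a minimal sketch like the one below counts utterances shared verbatim between a pre-training corpus and MultiWOZ. The file names and JSON schema here are hypothetical and are not part of the pptod codebase:

```python
# Hypothetical sketch (not from the pptod preprocessing scripts): check whether
# any pre-training dialogue turn exactly matches a MultiWOZ turn.
import json

def load_turns(path):
    """Load a JSON file shaped like [{"turns": ["utterance", ...]}, ...];
    this schema is assumed purely for illustration."""
    with open(path, "r", encoding="utf-8") as f:
        dialogues = json.load(f)
    # Normalize lightly so trivial whitespace/case differences do not hide matches.
    return {turn.strip().lower() for dialog in dialogues for turn in dialog["turns"]}

def exact_overlap(pretrain_path, multiwoz_path):
    """Return the set of utterances that appear verbatim in both corpora."""
    return load_turns(pretrain_path) & load_turns(multiwoz_path)

if __name__ == "__main__":
    overlap = exact_overlap("pretraining_dialogues.json", "multiwoz_dialogues.json")
    print(f"{len(overlap)} exactly matching utterances found")
```

Note this only detects exact duplicates; domain-level similarity (assumption (2)) is expected and is not something such a check would, or should, remove.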

Please let me know if you have further questions :-)