ChunyuanLI / Optimus

Optimus: the first large-scale pre-trained VAE language model

DailyDialogue dataset #13

Open Rabona17 opened 3 years ago

Rabona17 commented 3 years ago

Where can I get the preprocessed DailyDialog dataset used with the SpaceFusion pretraining code? Any suggestions on how to preprocess the original DailyDialog data would be appreciated. Thanks!

ChunyuanLI commented 3 years ago

I don't have the SpaceFusion pre-training code. For the DailyDialog dataset, we keep a dialogue history of fixed sequence length. We tried to follow the setting of the original paper:

https://github.com/golsun/SpaceFusion

Rabona17 commented 3 years ago

Thanks. So where can I get the DailyDialog dataset you used in run_dialog_spacefusion.sh (../data/datasets/dailydialog_data/train.txt)? Or should I preprocess it myself?

ChunyuanLI commented 3 years ago

I'm afraid you have to pre-process it on your own.

Rabona17 commented 3 years ago

Sure. Since SpaceFusion doesn't provide any preprocessing code for DailyDialog, what criteria did you use for src and trgt? In other words, what procedure did you use to split the original DailyDialog data into src and trgt? Thanks in advance!
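
For anyone who ends up preprocessing the data themselves, here is a rough sketch of one way to build src/trgt pairs with a fixed-length history. It is not the procedure used by the Optimus authors, which this thread does not describe. It assumes the original DailyDialog release format, where each line of dialogues_text.txt is one dialogue with utterances separated by `__eou__`. The function name `dialogue_to_pairs`, the `max_history_tokens` value, and the ` EOS ` turn separator are hypothetical choices for illustration, so check them against what run_dialog_spacefusion.sh actually expects.

```python
def dialogue_to_pairs(line, max_history_tokens=32, turn_sep=" EOS "):
    """Turn one DailyDialog dialogue line into (src, trgt) pairs.

    Assumptions (not confirmed in this thread):
      - utterances in `line` are separated by "__eou__"
      - every utterance after the first is used as a target
      - the source is all preceding turns joined by `turn_sep`,
        truncated from the left so that only the most recent
        `max_history_tokens` whitespace tokens are kept
    """
    # Split into utterances and drop empty pieces (the line ends with __eou__).
    turns = [t.strip() for t in line.split("__eou__") if t.strip()]

    pairs = []
    for i in range(1, len(turns)):
        # Join the history, then tokenize on whitespace for truncation.
        history_tokens = turn_sep.join(turns[:i]).split()
        # Keep only the most recent tokens (fixed-length history).
        src = " ".join(history_tokens[-max_history_tokens:])
        pairs.append((src, turns[i]))
    return pairs


def format_pairs(pairs, field_sep="\t"):
    """Render pairs as one 'src<TAB>trgt' line each (hypothetical layout)."""
    return [f"{src}{field_sep}{trgt}" for src, trgt in pairs]
```

Applying `dialogue_to_pairs` to each line of dialogues_text.txt and writing the output of `format_pairs` to train.txt would give one example per line. Whether the training script wants tab-separated fields or a different layout is something to verify in its data loader.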