autoliuweijie / K-BERT

Source code of K-BERT (AAAI2020)
https://ojs.aaai.org//index.php/AAAI/article/view/5681

pre-training corpus #80

Open Humorloos opened 2 years ago

Humorloos commented 2 years ago

Hello @autoliuweijie, thank you for your amazing and inspiring work!

I would like to pre-train a K-BERT model on an English-language corpus. To make this work, I am trying to get train_and_validate() to run with args.target set to "bert". With this setting, BertDataLoader is used to load the data, but I am not sure what exact format the dataset file at dataset_path must have. From the code, I can see that it has to be a pickle file, but I am having trouble constructing one that works with the data loader.
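
For concreteness, here is a minimal sketch of one way such a pickle file could be laid out: a stream of records, each pickled separately into the same file and read back until end of file. Everything in it is an assumption made for illustration and not something confirmed from the repository: the helper names write_records and read_records, the file name dataset.pt, the four-field record layout, and the token ids are all hypothetical. Whether BertDataLoader expects this layout is exactly the open question.

```python
import os
import pickle
import tempfile


def write_records(path, records):
    """Write each record to the file as its own consecutive pickle frame."""
    with open(path, "wb") as f:
        for record in records:
            pickle.dump(record, f)


def read_records(path):
    """Yield records one at a time until the file is exhausted."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # pickle.load raises EOFError once no frames are left.
                break


# Hypothetical record layout (an assumption, not taken from K-BERT):
# (token ids, masked-LM targets, next-sentence label, segment ids).
# Using 0 to mean "no prediction at this position" is also an assumption.
example_records = [
    ([101, 7592, 103, 102], [0, 0, 2088, 0], 1, [1, 1, 1, 1]),
    ([101, 103, 2003, 102], [0, 2023, 0, 0], 0, [1, 1, 1, 1]),
]

# Hypothetical file name in a temporary directory.
dataset_path = os.path.join(tempfile.mkdtemp(), "dataset.pt")
write_records(dataset_path, example_records)
loaded = list(read_records(dataset_path))
```

Reading the records back with the same loop a loader would use is a quick way to check that a hand-built file round-trips before pointing dataset_path at it.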

It would be very helpful to have access to the data file originally used for pre-training. Could you provide a link or instructions on how to construct it myself?