dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0
2.56k stars 538 forks

Data for BERT, OpenAI GPT-2, BigBird (MT-DNN), ERNIE #638

Open szha opened 5 years ago

szha commented 5 years ago

Hi all. Let's use this issue to discuss approaches and challenges for reproducing large transformer-based models such as GPT-2 (#592), BigBird (MT-DNN) (#633), and ERNIE, as well as for exploring their limits. We can also use this thread to track and pool resources that people find.

Data

hankcs commented 5 years ago

The ERNIE team has released their preprocessed task corpora, though they didn't release the corpora used for pre-training. Maybe Chinese Gigaword is a good substitute?

szha commented 5 years ago

Here's a cleaned-up corpus of stories from Trinh et al. (https://arxiv.org/abs/1806.02847): https://console.cloud.google.com/storage/browser/commonsense-reasoning/reproduce/stories_corpus?pli=1

hankcs commented 5 years ago

Here are several large-scale raw corpora for Chinese: https://github.com/brightmart/nlp_chinese_corpus
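Whichever raw corpus ends up being used, it will need basic cleanup before tokenization. A minimal sketch of that kind of pass is below; the `min_chars` threshold and exact-match deduplication are illustrative assumptions, not steps taken from any of the papers above:

```python
# Sketch of a minimal cleanup pass for a raw pre-training corpus:
# normalize whitespace, drop very short fragments, and deduplicate
# exact repeats. Thresholds are illustrative assumptions only.
import re


def clean_corpus(lines, min_chars=16):
    """Yield cleaned, deduplicated lines from a raw text corpus."""
    seen = set()
    for line in lines:
        # Collapse runs of whitespace and strip leading/trailing spaces.
        text = re.sub(r"\s+", " ", line).strip()
        if len(text) < min_chars:
            continue  # too short to be a useful training sentence
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        yield text


raw = [
    "  The   quick brown fox jumps over the lazy dog.  ",
    "The quick brown fox jumps over the lazy dog.",  # duplicate once normalized
    "short",  # dropped: below min_chars
    "Pre-training corpora need cleanup before tokenization.",
]
cleaned = list(clean_corpus(raw))
print(cleaned)
```

Real pipelines would add language filtering, document-level dedup, and sentence segmentation on top of this, but the shape is the same: stream lines in, normalize, filter, emit.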