Hi there,
Thanks for your contributions to Chinese NLP.
I have a question: how much corpus did you use? 10GB, 15GB, 20GB? And how many tokens are in the corpus?
Looking forward to your reply. Thank you!
OK, I found the answer in the paper:
"We train our models on open-source, large-scale raw text: Chinese Wikipedia and a part of WuDaoCorpus. The training data contains 200GB of cleaned text covering different domains."
Best wishes!
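In case it helps others with the token-count part of the question: the paper reports corpus size in GB, but you can estimate the token count yourself by tokenizing a sample of the corpus and extrapolating from its tokens-per-byte ratio. Below is a minimal sketch assuming the Hugging Face transformers library and the bert-base-chinese tokenizer; the tokenizer name and sample file path are illustrative assumptions, not what the authors actually used:

```python
# Rough token-count estimate: tokenize a sample file, measure the
# tokens-per-byte ratio, and extrapolate to the full corpus size.
from transformers import AutoTokenizer

# Assumed tokenizer; substitute the one the model actually uses.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def estimate_total_tokens(sample_path: str, corpus_bytes: int) -> int:
    """Extrapolate total tokens from a sample's tokens-per-byte ratio."""
    with open(sample_path, "rb") as f:
        raw = f.read()
    text = raw.decode("utf-8", errors="ignore")
    sample_tokens = len(tokenizer.tokenize(text))
    return int(sample_tokens / len(raw) * corpus_bytes)

# Example: extrapolate from a local sample to the reported 200GB corpus.
print(estimate_total_tokens("sample.txt", corpus_bytes=200 * 1024**3))
```

As a rough sanity check: Chinese text takes about 3 bytes per character in UTF-8, and BERT-style Chinese tokenizers are close to character-level, so 200GB of mostly-Chinese text should land on the order of tens of billions of tokens.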