fastnlp / CPT

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

How much corpus have you used? #59

Closed ghost closed 1 year ago

ghost commented 1 year ago

Hi there,

Thanks for your contributions to Chinese NLP.

I have a question: how much corpus did you use? 10GB, 15GB, 20GB? And how many tokens does the corpus contain?

Hoping for your generous reply. Thank you!

ghost commented 1 year ago


OK, I found it in the paper:

We train our models on open-source large-scale raw text: Chinese Wikipedia and a part of WuDaoCorpus.
The training data contains 200GB of cleaned text from different domains.

Best wishes!
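
For anyone who also wants a rough token count for their own corpus, here is a minimal sketch (not from the paper) that tokenizes a local text file with the CPT tokenizer from the Hugging Face hub; the file path `corpus.txt` and the running total are assumptions for illustration only.

```python
# Sketch: estimate the number of tokens in a local corpus using the CPT tokenizer.
# Assumes a plain-text corpus with one document per line at "corpus.txt" (hypothetical path).
from transformers import BertTokenizer

# CPT uses a BERT-style tokenizer, per the repository README.
tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-base")

total_tokens = 0
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        # Count subword tokens per document and accumulate the total.
        total_tokens += len(tokenizer.tokenize(line.strip()))

print(f"Approximate token count: {total_tokens}")
```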