didi / ChineseNLP

Datasets and SOTA results for every field of Chinese NLP
https://chinesenlp.xyz
Language Modeling update with a new Common-Crawl derived open source corpus for future use. #19

amittai closed 4 years ago

amittai commented 4 years ago

Added a section for Clue Corpus 2020, a CommonCrawl-derived monolingual Chinese corpus intended for LM pre-training. There are no LM-specific perplexity results for it yet. Access to the full corpus is by email request, not public download. A smaller corpus is available for direct download, but its data comes from https://github.com/brightmart/nlp_chinese_corpus