Added a section for Clue Corpus 2020, a CommonCrawl-derived monolingual Chinese corpus intended for LM pre-training. There are no LM-specific perplexity results for it yet. Access to the full corpus is via email, not public download. A smaller corpus is available for direct download, but its data comes from https://github.com/brightmart/nlp_chinese_corpus