| Corpus | Size | Tokens | Vocabulary Size | Description |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017) http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | We build the large corpus by merging the above corpora. |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China. |
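Hello, would it be possible to obtain the training corpora shown in the list above?
I'm a student and don't have the resources to collect this data myself, so I'm taking the liberty of asking. Thank you!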