Embedding / Chinese-Word-Vectors

100+ Chinese Word Vectors (over a hundred pre-trained Chinese word vectors)
Apache License 2.0
11.82k stars 2.32k forks

Is it possible to obtain the training corpus resources? #96

Closed: shenxuhui closed this issue 4 years ago

shenxuhui commented 5 years ago

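Hello, would it be possible to obtain the training corpora shown in the list below?

I'm a student and don't have the resources to collect this data myself, so I hope you don't mind my asking. Thank you!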

| Corpus | Size | Tokens | Vocabulary Size | Description |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017) http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | We build the large corpus by merging the above corpora. |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China. |

shenshen-hungry commented 4 years ago

43