brightmart / nlp_chinese_corpus

Large Scale Chinese Corpus for NLP
MIT License
9.39k stars 1.54k forks

Would the author be interested in scaling the data up to the terabyte level as a next step? #36

Open ArrogantL opened 4 years ago

ArrogantL commented 4 years ago

Common Crawl contains more than seven years of web crawl data, including raw web page data, extracted metadata, and extracted text. It includes a large amount of Chinese text that could be extracted.

[1] Buck C, Heafield K, Van Ooyen B. N-gram Counts and Language Models from the Common Crawl[C]//LREC. 2014, 2: 4.
[2] Smith J R, Saint-Amand H, Plamada M, et al. Dirt cheap web-scale parallel text from the common crawl[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1374-1383.
[3] Spiegler S. Statistics of the common crawl corpus 2012[R]. Technical report, SwiftKey, 2013.
[4] Mühleisen H, Bizer C. Web Data Commons-Extracting Structured Data from Two Large Web Corpora[J]. LDOW, 2012, 937: 133-145.
[5] Bizer C, Eckert K, Meusel R, et al. Deployment of rdfa, microdata, and microformats on the web–a quantitative analysis[C]//International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013: 17-32.
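As a rough illustration of what pulling Chinese text out of such extracted text might involve, here is a minimal sketch that keeps only lines made up mostly of Chinese characters. The function names, the 0.5 threshold, and the use of the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as the test are all assumptions made for illustration; none of them come from this thread or from Common Crawl's own tooling.

```python
def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in U+4E00..U+9FFF."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def keep_chinese_lines(lines, threshold=0.5):
    """Return the lines whose Chinese-character ratio is at least `threshold`.

    `threshold` is a hypothetical default; a real pipeline would tune it.
    """
    return [line for line in lines if chinese_ratio(line) >= threshold]


if __name__ == "__main__":
    sample = ["hello world", "大规模中文语料", "ab中文"]
    for line in keep_chinese_lines(sample):
        print(line)
```

This character-range check cannot tell Chinese apart from Japanese text written in kanji, so a real pipeline would probably pair it with a proper language identifier and deduplication.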

brightmart commented 4 years ago

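I'm interested. Could you add me through the QQ group so we can work on this together? You are welcome to join the group 中文预训练模型transform (Chinese pre-trained models, transform). Group number: 836811304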