Common Crawl包含了超过7年的网络爬虫数据集,包含原始网页数据、元数据提取和文本提取。
里面包含有大量中文文本以供提取
[1]Buck C, Heafield K, Van Ooyen B. N-gram Counts and Language Models from the Common Crawl[C]//LREC. 2014, 2: 4.
[2]Smith J R, Saint-Amand H, Plamada M, et al. Dirt cheap web-scale parallel text from the common crawl[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1374-1383.
[3]Spiegler S. Statistcs of the common crawl corpus 2012[R]. Technical report, SwiftKey, 2013.
[4]Mühleisen H, Bizer C. Web Data Commons-Extracting Structured Data from Two Large Web Corpora[J]. LDOW, 2012, 937: 133-145.
[5]Bizer C, Eckert K, Meusel R, et al. Deployment of rdfa, microdata, and microformats on the web–a quantitative analysis[C]//International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013: 17-32.
Common Crawl包含了超过7年的网络爬虫数据集,包含原始网页数据、元数据提取和文本提取。 里面包含有大量中文文本以供提取 [1]Buck C, Heafield K, Van Ooyen B. N-gram Counts and Language Models from the Common Crawl[C]//LREC. 2014, 2: 4. [2]Smith J R, Saint-Amand H, Plamada M, et al. Dirt cheap web-scale parallel text from the common crawl[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1374-1383. [3]Spiegler S. Statistcs of the common crawl corpus 2012[R]. Technical report, SwiftKey, 2013. [4]Mühleisen H, Bizer C. Web Data Commons-Extracting Structured Data from Two Large Web Corpora[J]. LDOW, 2012, 937: 133-145. [5]Bizer C, Eckert K, Meusel R, et al. Deployment of rdfa, microdata, and microformats on the web–a quantitative analysis[C]//International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013: 17-32.