commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Integrate the new Common Crawl News dataset #63

Open sylvinus opened 7 years ago

sylvinus commented 7 years ago

http://commoncrawl.org/2016/10/news-dataset-available/

We should make sure the new dataset works with the current Common Crawl source.

chaconnewu commented 7 years ago

The CC news dataset currently has some formatting issues, and the team is fixing them: https://github.com/commoncrawl/news-crawl/issues/11