commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Updated Common Crawl to Feb 2016 crawl #36

Closed vanhalt closed 8 years ago

vanhalt commented 8 years ago

This PR aims to fix issue #24 by updating Common Crawl to the February 2016 crawl (CC-MAIN-2016-07). After the change, the local data directory looks like this:

ls -R local-data/
common-crawl

local-data//common-crawl:
crawl-data     warc.paths.txt

local-data//common-crawl/crawl-data:
CC-MAIN-2016-07

local-data//common-crawl/crawl-data/CC-MAIN-2016-07:
segments

local-data//common-crawl/crawl-data/CC-MAIN-2016-07/segments:
1454701145519.33

local-data//common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33:
warc

local-data//common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc:
CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
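
For reference, the listing above follows the pattern `crawl-data/<crawl ID>/segments/<segment>/warc/<file>`. Here is a minimal sketch of splitting such a path into its parts, assuming that layout holds; `parse_warc_path` is a hypothetical helper written for illustration and is not part of cosr-back.

```python
import posixpath


def parse_warc_path(path):
    """Split a Common Crawl WARC path into crawl ID, segment, and file name.

    Assumes the layout seen in the listing above:
    .../crawl-data/<crawl ID>/segments/<segment>/warc/<file>
    """
    # normpath collapses the doubled slash that `ls -R local-data/` prints.
    parts = posixpath.normpath(path).split("/")
    # The crawl ID is the component right after "crawl-data".
    start = parts.index("crawl-data")
    # Unpacking fails with ValueError if the path is too short.
    crawl_id, segments_dir, segment, kind, filename = parts[start + 1:start + 6]
    if segments_dir != "segments" or kind != "warc":
        raise ValueError("unexpected layout: %s" % path)
    return {"crawl": crawl_id, "segment": segment, "file": filename}


if __name__ == "__main__":
    example = (
        "local-data//common-crawl/crawl-data/CC-MAIN-2016-07/segments/"
        "1454701145519.33/warc/"
        "CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz"
    )
    print(parse_warc_path(example))
```

Because the helper raises `ValueError` on anything that does not match, a change in the upstream path layout would show up as an explicit failure rather than a silently wrong crawl ID.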

sylvinus commented 8 years ago

Thanks @vanhalt! For future reference, it looks like Common Crawl may change their URL layout soon so it may be a bit more complicated next time.