commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0
323 stars 35 forks

amazing dataset! #26

Closed · randomgambit closed 6 years ago

randomgambit commented 6 years ago

Hello there,

I just discovered your news-crawler and I think this is an amazing idea!

Sorry if this is a very simple question, but is it possible to somehow download slices of the news-crawl data (possibly based on a keyword/regex/domain) without resorting to Amazon AWS?

The idea is that I already have a very large cluster at my disposal, so I would rather work with the raw data directly on my local cluster.

What do you think? Thank you for your help!

sebastian-nagel commented 6 years ago

The CC-NEWS WARC files (under s3://commoncrawl/crawl-data/CC-NEWS/) are organized by time (fetch time, which corresponds to publication time with some limitations). To get slices by URL or content you would need to run a filter over the entire data set.
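Note that the same paths are also served over plain HTTPS, so filtering does not strictly require AWS access. Below is a minimal sketch of such a filter, assuming the public HTTPS endpoint https://data.commoncrawl.org/ and the warcio and requests Python packages; the WARC filename is a hypothetical example, not a real listing entry:

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Hypothetical example file; real paths can be listed under
# crawl-data/CC-NEWS/<year>/<month>/ on the same host.
WARC_URL = ("https://data.commoncrawl.org/crawl-data/CC-NEWS/"
            "2018/08/CC-NEWS-20180801023953-00328.warc.gz")

def filter_by_domain(warc_url, domain):
    """Stream one CC-NEWS WARC file and yield (url, payload) pairs
    whose target URI contains the given domain string."""
    resp = requests.get(warc_url, stream=True)
    resp.raise_for_status()
    # ArchiveIterator decompresses the .warc.gz stream record by record,
    # so the whole file never needs to be held in memory.
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        if url and domain in url:
            yield url, record.content_stream().read()

for url, html in filter_by_domain(WARC_URL, 'bbc.co.uk'):
    print(url, len(html))
```

Filtering a larger slice means running this over the per-month file listings; the filter itself (domain, regex, keyword) is cheap, and the download bandwidth is the bottleneck.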