Closed randomgambit closed 6 years ago
The CC-NEWS WARC files (on s3://commoncrawl/crawl-data/CC-NEWS/) are organized by time (fetch time, which corresponds to publication time with some limitations). To get slices by URL or content, you would need to run a filter over the entire data set.
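As a rough sketch of what such a filter could look like once the WARC files are on your own cluster: the snippet below assumes you already have some way to iterate over records as `(url, text)` pairs (for example via a WARC reading library). The function names `host_matches` and `filter_records` and the record shape are hypothetical and only for illustration, not part of news-crawler.

```python
import re
from urllib.parse import urlsplit


def host_matches(url, domains):
    """Return True if the URL's host equals or is a subdomain of any listed domain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_records(records, domains=None, pattern=None):
    """Yield only the (url, text) pairs that pass the domain and regex filters.

    `records` is assumed to be an iterable of (url, text) tuples extracted
    from the WARC files; how they are read is left out of this sketch.
    """
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    for url, text in records:
        if domains and not host_matches(url, domains):
            continue
        if regex and not regex.search(text):
            continue
        yield url, text
```

Since the files are organized by time only, every record in the chosen time range has to pass through a filter like this; it can be run independently per WARC file, which should map well onto a cluster.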
Hello there,
I just discovered your news-crawler and I think this is an amazing idea! Sorry if this is a very simple question, but is it possible to somehow download slices of the news-crawling data (possibly based on a keyword/regex/domain) without resorting to Amazon AWS?
The idea is that I already have a very large cluster at my disposal, so I would rather work with the raw data directly on my local cluster.
What do you think? Thank you for your help.