Opened by martintoreilly 5 years ago
Apache Tika is also interesting (it's Java-based).
@dongpng spotted that Common Crawl has a news dataset with crawler code.
I know that we've talked about Common Crawl at some point, but I just came across a specific news crawl by them, and I wasn't sure yet if you have seen this:
http://commoncrawl.org/2016/10/news-dataset-available/
Example data (one GB) is here: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/crawl-data/CC-NEWS/2019/01/CC-NEWS-20190108135010-00104.warc.gz
I did a quick count of the hosts of the websites included in this sample, and they include various news sites, e.g. www.nytimes.com (21), www.wsj.com (30) and www.cnn.com (62). There are definitely sites that are not covered by them, and I don't think the content has been cleanly extracted from the crawl, but I wanted to forward this in case there is crawling code and/or data that we could reuse. Their code is available at https://github.com/commoncrawl/news-crawl/
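For reference, the kind of quick host count described above can be done with the standard library alone by scanning the `WARC-Target-URI` headers in the (gzipped) WARC file. This is a rough sketch, not how the count above was necessarily produced; `count_hosts` is a hypothetical helper name, and a real pipeline should use a proper WARC parser such as warcio and filter to response records only:

```python
import gzip
from collections import Counter
from urllib.parse import urlparse

def count_hosts(lines):
    """Count hosts seen in WARC-Target-URI headers.

    `lines` is any iterable of decoded text lines, e.g. from
    gzip.open(path, "rt", errors="ignore"). Counts are approximate:
    request and metadata records carry the same header, so this
    over-counts relative to a response-only tally.
    """
    hosts = Counter()
    for line in lines:
        if line.startswith("WARC-Target-URI:"):
            host = urlparse(line.split(":", 1)[1].strip()).netloc
            if host:
                hosts[host] += 1
    return hosts

# Usage against the sample file linked above (filename illustrative):
# with gzip.open("CC-NEWS-20190108135010-00104.warc.gz", "rt", errors="ignore") as fh:
#     print(count_hosts(fh).most_common(10))
```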
I've been planning to separate our crawl and article extraction steps as part of the "productionising" of the crawl pipeline at the start of the SPF work. For this I am planning to run the crawl using a crawler with a more fully featured user interface, so this looks like a good starting point. We should also consider asking the British Library to manage an ongoing crawl for us (I believe they run their own instance of Heritrix).
The Common Crawl news crawl looks like it used RSS feeds as its article source, which could be a good option for us to add as an alternative seed / start page for our crawl (though a quick check of the Washington Post Politics RSS feed shows it doesn't have (m)any more articles than the initial Politics page on the main site, which has more available behind a JavaScript "more" button).
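Pulling seed URLs from an RSS feed is straightforward with the standard library. A minimal sketch, assuming an RSS 2.0 feed string is already fetched (the function name `rss_article_links` is made up for illustration; fetching the feed over HTTP is left out):

```python
import xml.etree.ElementTree as ET

def rss_article_links(rss_xml):
    """Pull <item><link> URLs out of an RSS 2.0 feed string.

    These links could seed the crawl frontier instead of (or as well
    as) section start pages.
    """
    root = ET.fromstring(rss_xml)
    return [item.findtext("link").strip()
            for item in root.iter("item")
            if item.findtext("link")]
```

Note this only sees whatever the publisher chooses to put in the feed, which (as with the Washington Post example above) may be no more than the section front page shows.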
Whichever crawl option we go for, I think generating WARC files from the crawl and having the article extractor process those WARC files is the best way to let our contributions be integrated easily with other web-processing pipeline elements.
Just saw this: https://newsapi.org which seems extremely close to what we're currently doing.
Crossref PDF extractor: https://www.crossref.org/labs/pdfextract/
@andeElliott Spotted Spidermon for monitoring Scrapy crawlers (originally captured in issue #174).
Google Puppeteer as a tool for headless Chrome based automated browsing.
Could be a useful alternative to Selenium?
Might also be worth looking at the Archival Acid Test (article | code)
New browser based crawl project from webrecorder folk: https://github.com/webrecorder/browsertrix
Also worth looking at the UK Web Archive GitHub at https://github.com/ukwa
The BL run this, and we should definitely talk to Andy Jackson there before we do a refactor.
List of candidate packages