alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles into a web archive
MIT License

Review existing packages that do article extraction / crawl news sites #30

Open martintoreilly opened 5 years ago

martintoreilly commented 5 years ago

List of candidate packages

jemrobinson commented 5 years ago

Apache Tika is also interesting (it's Java-based).

martintoreilly commented 5 years ago

@dongpng spotted that Common Crawl has a news dataset with crawler code. Quoting their message:

I know that we've talked about Common Crawl at some point, but I just came across a specific news crawl by them, and I wasn't sure yet if you have seen this:

http://commoncrawl.org/2016/10/news-dataset-available/

Example data (one GB) is here: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/crawl-data/CC-NEWS/2019/01/CC-NEWS-20190108135010-00104.warc.gz

I did a quick count of the hosts of the websites included in this sample, and they include various news sites, e.g. www.nytimes.com (21), www.wsj.com (30) and www.cnn.com (62). There are definitely sites that are not covered by them, and I don't think the content has been cleanly extracted from the crawl, but I wanted to forward this in case there is crawling code and/or data that we could reuse. Their code is available at https://github.com/commoncrawl/news-crawl/
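For reference, a minimal sketch (not part of the original message) of the kind of quick host count described above, using the warcio package to read the CC-NEWS sample WARC file:

```python
# Count how many response records each host contributes to a CC-NEWS WARC file.
from collections import Counter
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

host_counts = Counter()
with open("CC-NEWS-20190108135010-00104.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only count page captures, not request or metadata records.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            if url:
                host_counts[urlparse(url).netloc] += 1

print(host_counts.most_common(20))
```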

I've been planning to separate our crawl and article extraction steps as part of the "productionising" of the crawl pipeline at the start of the SPF work. For this I am planning to run the crawl using a crawler with a more fully featured user interface, so this looks like a good starting point. We should also consider asking the British Library to manage an ongoing crawl for us (I believe they use their own instance of Heritrix).

The Common Crawl news crawl looks like it used RSS feeds as its article source, which could be a good option for us to add as an alternative seed / start page for our crawl (though a quick check of the Washington Post Politics RSS feed shows it doesn't have (m)any more articles than the initial Politics page on the main site, which has more available behind a JavaScript "more" button).
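A hedged sketch of what seeding from an RSS feed might look like, assuming we use the feedparser package; the Washington Post Politics feed URL below is illustrative only:

```python
# Turn an RSS feed into a list of candidate start URLs for the article crawl.
import feedparser

feed = feedparser.parse("http://feeds.washingtonpost.com/rss/politics")

# Each entry's link becomes a candidate seed URL alongside our existing start pages.
seed_urls = [entry.link for entry in feed.entries if "link" in entry]
print(f"{len(seed_urls)} candidate article URLs from the feed")
```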

Whichever crawl option we go for, I think generating WARC files from the crawl and having the article extractor process these WARC files is the best option, as it allows our contributions to be integrated most easily with other web-processing pipeline elements.
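A minimal sketch of that split, under the assumption that the crawler only writes WARC files and a separate extractor consumes them; warcio reads the records, and readability-lxml stands in here for whichever article-extraction package we eventually choose:

```python
# Read response records from a WARC file and extract cleaned article HTML.
from readability import Document  # readability-lxml, used as a placeholder extractor
from warcio.archiveiterator import ArchiveIterator


def extract_articles(warc_path):
    """Yield (url, title, cleaned article HTML) for each captured page."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="replace")
            doc = Document(html)
            yield url, doc.title(), doc.summary()
```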

jemrobinson commented 5 years ago

Just saw this: https://newsapi.org which seems extremely close to what we're currently doing.
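For context, a hedged sketch of how it would be queried, assuming the v2 top-headlines endpoint and an API key (NEWSAPI_KEY is a placeholder); it appears to return article metadata and URLs rather than full text, so we would still need our own extraction step:

```python
# Fetch recent headline metadata from NewsAPI for a single source.
import requests

resp = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"sources": "bbc-news", "apiKey": "NEWSAPI_KEY"},
)
resp.raise_for_status()
for article in resp.json().get("articles", []):
    print(article["publishedAt"], article["title"], article["url"])
```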

martintoreilly commented 5 years ago

Crossref PDF extractor: https://www.crossref.org/labs/pdfextract/

martintoreilly commented 5 years ago

@andeElliott spotted Spidermon for monitoring Scrapy crawlers (originally captured in issue #174).
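A hedged sketch of wiring Spidermon into our Scrapy settings, using the standard Spidermon extension settings; the monitor suite path is a hypothetical module we would still need to write:

```python
# settings.py additions to enable Spidermon for the existing Scrapy spiders.
SPIDERMON_ENABLED = True

EXTENSIONS = {
    "spidermon.contrib.scrapy.extensions.Spidermon": 500,
}

# Checks to run when each spider closes, e.g. a minimum-items-scraped monitor.
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "misinformation.monitors.SpiderCloseMonitorSuite",  # hypothetical module
)
```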

martintoreilly commented 5 years ago

Google Puppeteer as a tool for headless Chrome-based automated browsing.

Could be a useful alternative to Selenium?
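To keep the comparison in Python, a hedged sketch using pyppeteer (an unofficial Python port of Puppeteer) to fetch a rendered page, roughly matching what we do with Selenium today; the URL is illustrative only:

```python
# Drive headless Chrome via pyppeteer and return the rendered page HTML.
import asyncio

from pyppeteer import launch


async def fetch_rendered_html(url):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html


html = asyncio.run(fetch_rendered_html("https://www.washingtonpost.com/politics/"))
print(len(html))
```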

martintoreilly commented 5 years ago

Might also be worth looking at the Archival Acid Test (article | code)

martintoreilly commented 5 years ago

New browser-based crawl project from the Webrecorder folk: https://github.com/webrecorder/browsertrix

martintoreilly commented 5 years ago

Also worth looking at the UK Web Archive GitHub at https://github.com/ukwa

The BL run this and we should definitely talk to Andy Jackson there before we do a refactor.