Link Checker - Technical Proposal

richardhallett commented 6 years ago

A proposal for the technical implementation of the link checker, ideally considering the needs of various PID service providers to enable common adoption of one solution.

richardhallett commented 6 years ago

Aims:

Basic 404 checks
Basic metadata identifier checks
Generic across PID providers: a provider only needs to supply the links to check.
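
To make the first two aims concrete, here is a minimal sketch of what a per-link check could look like once a landing page has been fetched. It is not taken from the proposal: the function name check_link, the result keys, and the set of meta tag names are all assumptions for illustration, and a real landing page may expose its identifier in other ways.

```python
from html.parser import HTMLParser

# Hypothetical set of <meta name="..."> values that may carry a PID.
IDENTIFIER_META_NAMES = {"citation_doi", "dc.identifier", "dcterms.identifier"}


class MetaIdentifierParser(HTMLParser):
    """Collects the content of <meta> tags whose name looks like an identifier."""

    def __init__(self):
        super().__init__()
        self.identifiers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in IDENTIFIER_META_NAMES and content:
            self.identifiers.append(content.strip())


def check_link(pid, status_code, html):
    """Return a small result dict for one checked link (hypothetical shape)."""
    parser = MetaIdentifierParser()
    parser.feed(html)
    # The PID counts as found if any identifier value ends with it, so both
    # "10.1234/abc" and "https://doi.org/10.1234/abc" match.
    pid_found = any(
        value.lower().endswith(pid.lower()) for value in parser.identifiers
    )
    return {
        "pid": pid,
        "http_ok": status_code < 400,   # basic 404-style check
        "pid_in_metadata": pid_found,   # basic metadata identifier check
    }
```

Keeping the check a pure function of the PID, status code and page body means it can be tested without any network access and called from whichever crawler ends up doing the fetching.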

Technology:

Python 3 - Main programming language
Redis - For links to check and generic link result storage
Scrapy - Crawler framework

Scrapy Spider

Scrapy is a general-purpose web crawling framework written in Python. It was primarily built to scrape individual sites, gathering links and data as it moves through a site. We will use Scrapy to follow links to PID landing pages, scrape them, and perform some initial analysis of the data.

Reasons to use:

Python support - A familiar language for many people.
Politeness - Has built-in logic for crawl delays, obeying robots.txt, and similar.
Extensible - Middleware can be hooked in at various points.
Backed by an organisation - While open source and community maintained, it is backed by https://scrapinghub.com/, who provide their own services on top.
Asynchronous - Built on top of Twisted for asynchronous requests.
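
As an illustration of the politeness point, the fragment below sketches the kind of Scrapy project settings that could be used. The setting names are standard Scrapy settings, but the values and the user agent string are assumptions chosen for the example, not part of the proposal.

```python
# settings.py (sketch): politeness-related Scrapy settings.

# Hypothetical user agent; it should identify the crawler and a contact point.
USER_AGENT = "pid-link-checker (+https://example.org/contact)"

# Respect the robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Wait at least this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Limit parallel requests per domain so a single host is not overloaded.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to the observed response latency.
AUTOTHROTTLE_ENABLED = True
```

Since the links to check are likely to be spread across many repositories, a per-domain limit is probably more relevant here than a global one.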

Redis

We will use Redis as a lightweight store, both for pushing links to be checked and for storing the results. Ideally we will reuse the scrapy-redis plugin, as it largely supports the workflow we want to achieve. A slight modification will probably be needed so that the PID is passed through alongside the link for later identification. PID providers are not just concerned about the link to check but also need to know how the PID behaves within the link.
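
One possible way to carry the PID alongside the link is to queue a small JSON document instead of a bare URL. The sketch below only covers encoding and decoding such a job; the field names and helper functions are hypothetical, and how the payload is wired into the crawler is left open.

```python
import json


def encode_job(pid, url):
    """Serialise one link-check job so it can be stored as a Redis list item."""
    return json.dumps({"pid": pid, "url": url}, sort_keys=True)


def decode_job(raw):
    """Parse a queued job back into (pid, url).

    Redis clients commonly return bytes, so both bytes and str are accepted.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    job = json.loads(raw)
    if "pid" not in job or "url" not in job:
        raise ValueError("job must contain both 'pid' and 'url'")
    return job["pid"], job["url"]
```

A provider would then push the encoded string onto a Redis list (for example with LPUSH under an agreed key name), and the crawler side would decode each item before requesting the URL, keeping the PID attached to the eventual result.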

Reasons to use:

Simple deployment - Redis is easy to deploy, either via Docker hosts or build scripts.
Hosted options - Numerous cloud providers, AWS included, offer their own hosted, scalable Redis clusters.

richardhallett commented 6 years ago

An implementation based upon the above proposal was created at https://github.com/datacite/pidcheck
