Closed by richardhallett 6 years ago
Scrapy is a generic web crawler written in Python. It was primarily built to scrape individual sites, gathering links and data as it crawls through a site. We will use Scrapy to follow links to PID landing pages, scrape them, and perform some initial analysis of the data.
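To make "initial analysis" a bit more concrete, here is a minimal sketch of the kind of per-page extraction a crawler callback could run on a landing page. It deliberately uses only the Python standard library rather than Scrapy's own selectors, and the names `LandingPageParser`, `analyse_landing_page` and the `pid_in_page` check are hypothetical illustrations, not part of the proposal or of pidcheck.

```python
from html.parser import HTMLParser


class LandingPageParser(HTMLParser):
    """Collects hyperlinks and the page title from landing page HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def analyse_landing_page(pid, html):
    """Return a small result record for one landing page (hypothetical shape)."""
    parser = LandingPageParser()
    parser.feed(html)
    return {
        "pid": pid,
        "title": parser.title.strip(),
        "links": parser.links,
        # Assumed check: does the PID string appear anywhere on the page?
        "pid_in_page": pid.lower() in html.lower(),
    }
```

In a real spider the HTML would come from the response object passed to the parse callback; the record returned here is only one possible shape for what gets stored.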
Reasons to use:
We will use Redis as a lightweight store, both for queuing links to be checked and for storing the results. Ideally we will re-use the ScrapyRedis plugin, as it largely supports the workflow we want to achieve. It will probably need a slight modification, because we need to pass a PID through with each link for later identification. PID providers are not just concerned about the link to check but also need to know how the PID behaves within the link.
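The modification described above amounts to queuing a small record instead of a bare URL. The following is a rough sketch of that idea, assuming a JSON payload with `pid` and `url` fields. The field names and helper functions are hypothetical, and a `collections.deque` stands in for a Redis list so the example runs without a server (`appendleft` and `pop` play the roles of `LPUSH` and `RPOP`).

```python
import json
from collections import deque


def encode_job(pid, url):
    """Serialise a link-check job so the PID travels with the URL."""
    return json.dumps({"pid": pid, "url": url})


def decode_job(raw):
    """Recover (pid, url) from a queued payload."""
    data = json.loads(raw)
    return data["pid"], data["url"]


def push_job(queue, pid, url):
    # Stand-in for LPUSH on a Redis list
    queue.appendleft(encode_job(pid, url))


def pop_job(queue):
    # Stand-in for RPOP; returns None when the queue is empty
    if not queue:
        return None
    return decode_job(queue.pop())


queue = deque()
push_job(queue, "10.1234/abc", "https://doi.org/10.1234/abc")
push_job(queue, "10.1234/def", "https://doi.org/10.1234/def")
first = pop_job(queue)  # jobs come back in the order they were pushed
```

Because the PID is part of the payload, whatever result the crawler writes back can be keyed by PID rather than by URL, which matters when a link redirects before reaching the landing page.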
Reasons to use:
An implementation based upon the above proposal was created at https://github.com/datacite/pidcheck
A proposal for the technical implementation of the link checker, ideally considering the needs of various PID service providers to enable common adoption of one solution.