datacite / freya

Issues and milestones for the FREYA project

Link Checker - Technical Proposal #6

Closed richardhallett closed 6 years ago

richardhallett commented 6 years ago

A proposal for the technical implementation of the link checker, ideally considering the needs of various PID service providers to enable common adoption of one solution.

richardhallett commented 6 years ago

Aims:

Technology:

Scrapy Spider

Scrapy is a generic web crawler written in Python. It was primarily built to scrape individual sites, gathering links and data as it traverses a site. We will use Scrapy to follow links to PID landing pages, scrape them, and perform some initial analysis of the data.
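As a rough illustration of what "initial analysis" of a landing page could mean, here is a minimal sketch using only the Python standard library. It is not Scrapy code and not the proposed implementation: the function name `analyse_landing_page`, the result fields, and the idea of checking whether the PID string appears on the page are all assumptions made for this example. In a Scrapy spider, logic like this would sit in the callback that receives the response.

```python
from html.parser import HTMLParser


class _LandingPageParser(HTMLParser):
    """Collects text, plus `content` and `href` attribute values, from a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Meta tags and links often carry the identifier in an attribute.
        for name, value in attrs:
            if name in ("content", "href") and value:
                self.chunks.append(value)

    def handle_data(self, data):
        self.chunks.append(data)


def analyse_landing_page(pid, status, html):
    """Return a small result record for one checked PID.

    The record shape is hypothetical; it only illustrates keeping the PID
    together with what was observed at its landing page.
    """
    parser = _LandingPageParser()
    parser.feed(html)
    parser.close()
    haystack = " ".join(parser.chunks).lower()
    return {
        "pid": pid,
        "http_status": status,
        # Case-insensitive check: does the PID appear anywhere on the page?
        "pid_on_page": pid.lower() in haystack,
    }
```

For example, calling `analyse_landing_page("10.1234/abc", 200, page_html)` on a page whose meta tags mention that identifier would report `pid_on_page` as true, while an error page without it would report false.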

Reasons to use:

Redis

We will use Redis as a lightweight store, both for queuing links to be checked and for storing the results. Ideally we will reuse the scrapy-redis plugin, as it largely supports the workflow we want to achieve. It will probably need a slight modification so that a PID can be passed through alongside each link for later identification: PID providers are concerned not just with the link being checked but also with how the PID behaves within the link.
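To make the "pass a PID through with the link" idea concrete, here is a minimal sketch of one possible queue payload. Everything in it is an assumption for illustration: the JSON field names `pid` and `url`, the helper names, and the `InMemoryQueue` class, which merely stands in for a Redis list so the example runs without a Redis server.

```python
import json
from collections import deque


def encode_job(pid, url):
    """Serialise a check request so the PID travels with the link."""
    return json.dumps({"pid": pid, "url": url})


def decode_job(raw):
    """Recover (pid, url) from a payload produced by encode_job."""
    job = json.loads(raw)
    return job["pid"], job["url"]


class InMemoryQueue:
    """Stand-in for a Redis list: push() mimics LPUSH, pop() mimics RPOP."""

    def __init__(self):
        self._items = deque()

    def push(self, raw):
        self._items.appendleft(raw)

    def pop(self):
        # Returns None when the queue is empty, as RPOP does on an empty list.
        return self._items.pop() if self._items else None


# Usage: a provider pushes jobs, the crawler pops them in the same order.
queue = InMemoryQueue()
queue.push(encode_job("10.1234/abc", "https://doi.org/10.1234/abc"))
pid, url = decode_job(queue.pop())
```

With a real Redis client the same strings could be pushed to and popped from a list key. The point of the sketch is only that the crawler receives the PID along with the URL and can attach it to the stored result.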

Reasons to use:

richardhallett commented 6 years ago

An implementation based upon the above proposal was created at https://github.com/datacite/pidcheck