datacite / freya

Issues and milestones for the FREYA project

Link Check - Validity / Health #4

Closed richardhallett closed 6 years ago

richardhallett commented 6 years ago

Given the differences between content providers and how content is presented, we need to be able to determine if the content linked to by a PID is considered correct and 'healthy'. So the primary question is "How do we determine what is a healthy PID?"

Questions:

richardhallett commented 6 years ago

A link can be healthy or unhealthy in various ways; it may show various symptoms of what's wrong with it, but each individual symptom doesn't necessarily mean it's actually dead.

Stages of checking

  1. Did the request time out?
  2. Did we get any HTTP errors, i.e. 404?
  3. Request the body contents and parse them for metadata, validating that the DOI matches.
  4. Parse the body contents to look for potential errors.
  5. Parse the body contents for DOI matches, i.e. plain string matches.

For an initial pass, steps 1 and 2 could, depending on the client, cover a large proportion of potentially unhealthy links. For a more detailed analysis we go to step 3 and beyond.
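
A rough sketch of how the staged checks might look (assuming Python and the requests library; the function name, return shape and timeout are illustrative, not project code):

```python
import requests

def check_link(url, doi, timeout=10):
    """Run the staged checks against a landing page URL (illustrative sketch)."""
    # Stage 1: did the request time out?
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.Timeout:
        return {"stage": 1, "error": "request timed out"}
    except requests.RequestException as exc:
        return {"stage": 1, "error": str(exc)}

    # Stage 2: did we get an HTTP error, e.g. 404?
    if response.status_code >= 400:
        return {"stage": 2, "error": "HTTP %d" % response.status_code}

    # Stages 3 and 4 (metadata extraction and validation) are omitted here.

    # Stage 5: does the DOI appear anywhere in the body as a plain string?
    if doi.lower() not in response.text.lower():
        return {"stage": 5, "error": "DOI not found in body"}

    return {"stage": None, "error": None}  # no symptoms found
```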

Classification of health

If we classify into a traffic light system of Green, Yellow, Red, it separates out the probably dead links from the merely possible ones. This could be more granular, but at the cost of complexity. Any health report that has run the granular stage checks (like above) would give details of the error, e.g. "404 not found" or "the DOI found in the body didn't match".
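
One possible mapping from the staged results onto the traffic lights could look like the sketch below (the thresholds are just a suggestion, not agreed behaviour):

```python
def classify(result):
    """Map a staged check result onto Green / Yellow / Red (illustrative)."""
    if result["error"] is None:
        return "green"              # no symptoms found
    if result["stage"] in (1, 2):
        return "red"                # timeouts and HTTP errors: probably dead
    return "yellow"                 # softer symptoms, e.g. DOI not found in body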

Going stale

Health also includes links gradually going stale. An additional potential marker is a last-checked date: we define some threshold X, and anything not checked for longer than X is considered a potentially unhealthy link.
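
For example (the 90-day threshold below is just a placeholder for "X"):

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=90)  # placeholder value for "X"

def is_stale(last_checked: datetime) -> bool:
    """A link not checked for longer than MAX_AGE is flagged as potentially unhealthy."""
    return datetime.utcnow() - last_checked > MAX_AGE
```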

mfenner commented 6 years ago

For parsing the landing page for DOI and metadata, see https://doi.org/10.1101/097196. Specifically:

The latter will often need a link checker that understands JavaScript.

And it means we need GET requests and can't do HEAD.
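
One way to handle landing pages that only expose the DOI and metadata once JavaScript has run would be a headless browser; a minimal sketch using Playwright (my assumption for illustration, not something the project has settled on):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_page(url: str) -> str:
    """Return the landing page HTML after JavaScript has executed (illustrative sketch)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```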

mfenner commented 6 years ago

Pangaea has implemented link headers, which provide more information in HEAD requests, and in a standardized way:

curl -I https://doi.pangaea.de/10.1594/PANGAEA.804876

See http://signposting.org/ for background info.
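
For what it's worth, requests already parses the Link header, so picking up signposting relations from a HEAD request could be as simple as this sketch (the relation names such as "cite-as" and "describedby" come from the signposting conventions):

```python
import requests

response = requests.head("https://doi.pangaea.de/10.1594/PANGAEA.804876")

# requests exposes the parsed Link header as a dict keyed by the "rel" value,
# e.g. signposting relations such as "cite-as" or "describedby"
for rel, link in response.links.items():
    print(rel, link["url"])
```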

kjgarza commented 6 years ago

What happens with content that is changed over time, does this constitute an unhealthy link?

This is the part where I think we would have accuracy issues (i.e. many false positives) and an increase in code complexity if we rely on web scraping alone. We only care whether the metadata differs from our database or from the previous state of the landing page; we do not care when (A) the HTML changed or (B) the CSS changed. Additionally, we have a problem of diversity, which I think ranks up the complexity of our code: there are 1,400 data centres, each with at least one repository. That is more than 1,400 different types of pages, more than 1,400 different profiles. Both aspects, I think, should make us consider additional/complementary approaches to web scraping.
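
In other words, rather than diffing the raw HTML we could compare only the extracted metadata fields against what we already hold; a rough sketch (the field names here are hypothetical):

```python
FIELDS_WE_CARE_ABOUT = ("doi", "title", "creators", "publication_year")  # hypothetical field list

def metadata_changed(stored: dict, extracted: dict) -> bool:
    """True if any tracked metadata field differs; HTML/CSS-only changes are ignored."""
    return any(stored.get(f) != extracted.get(f) for f in FIELDS_WE_CARE_ABOUT)
```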

richardhallett commented 6 years ago

Based on a meeting with FREYA WP2 members, some additional important points were raised beyond the above.

In general I think the agreed outcome was to go with something like what I mentioned in my comment above regarding stages, doing basic checks of:

If PID service providers then want to go beyond this, they can.