datacite / freya

Issues and milestones for the FREYA project

Link Check - Validity / Health #4

Closed richardhallett closed 6 years ago

richardhallett commented 6 years ago

Given the differences between content providers and how content is presented, we need to be able to determine if the content linked to by a PID is considered correct and 'healthy'. So the primary question is "How do we determine what is a healthy PID?"

Questions:

richardhallett commented 6 years ago

A link can be healthy or unhealthy in various ways; it may show various symptoms of what's wrong with it, but each individual symptom doesn't necessarily mean it's actually dead.

Stages of checking

  1. Did the request time out?
  2. Did we get any HTTP errors, i.e. 404?
  3. Request the body contents and parse them for metadata, validating that the DOI matches.
  4. Parse the body contents to look for potential errors.
  5. Parse the body contents for DOI matches, i.e. plain string matches.

For an initial pass, steps 1 and 2 could, depending on the client, cover a large proportion of potentially unhealthy links. For a more detailed analysis we go to step 3 and beyond.
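
A rough sketch of how the staged checks might look (assuming Python and the requests library; the function name, return shape and timeout are illustrative, not project code):

```python
import requests

def check_link(url, doi, timeout=10):
    """Run the staged checks against a landing page URL (illustrative sketch)."""
    # Stage 1: did the request time out?
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.Timeout:
        return {"stage": 1, "error": "request timed out"}
    except requests.RequestException as exc:
        return {"stage": 1, "error": str(exc)}

    # Stage 2: did we get an HTTP error, e.g. 404?
    if response.status_code >= 400:
        return {"stage": 2, "error": "HTTP %d" % response.status_code}

    # Stages 3 and 4 (metadata extraction and validation) are omitted here.

    # Stage 5: does the DOI appear anywhere in the body as a plain string?
    if doi.lower() not in response.text.lower():
        return {"stage": 5, "error": "DOI not found in body"}

    return {"stage": None, "error": None}  # no symptoms found
```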

Classification of health

If we classify into a traffic light system of Green, Yellow, Red, it separates out the probably dead links from the merely possible ones. This could be more granular, but at the cost of complexity. Any health report that has run the granular stage checks (like above) would give details of the error, e.g. "404 not found" or "the DOI found in the body didn't match".
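
One possible mapping from the staged results onto the traffic lights could look like the sketch below (the thresholds are just a suggestion, not agreed behaviour):

```python
def classify(result):
    """Map a staged check result onto Green / Yellow / Red (illustrative)."""
    if result["error"] is None:
        return "green"              # no symptoms found
    if result["stage"] in (1, 2):
        return "red"                # timeouts and HTTP errors: probably dead
    return "yellow"                 # softer symptoms, e.g. DOI not found in body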

Going stale

Health also includes links gradually going stale. An additional potential marker is a last-checked date: we define some threshold X, and anything not checked for longer than X is considered a potentially unhealthy link.
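
For example (the 90-day threshold below is just a placeholder for "X"):

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=90)  # placeholder value for "X"

def is_stale(last_checked: datetime) -> bool:
    """A link not checked for longer than MAX_AGE is flagged as potentially unhealthy."""
    return datetime.utcnow() - last_checked > MAX_AGE
```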

mfenner commented 6 years ago

For parsing the landing page for DOI and metadata, see https://doi.org/10.1101/097196. Specifically:

The latter will often need a link checker that understands JavaScript.

And it means we need GET requests and can't do HEAD.
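
One way to handle landing pages that only expose the DOI and metadata once JavaScript has run would be a headless browser; a minimal sketch using Playwright (my assumption for illustration, not something the project has settled on):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_page(url: str) -> str:
    """Return the landing page HTML after JavaScript has executed (illustrative sketch)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```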

mfenner commented 6 years ago

Pangaea has implemented link headers, which provide more information in HEAD requests, and in a standardized way:

curl -I https://doi.pangaea.de/10.1594/PANGAEA.804876

See http://signposting.org/ for background info.
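
For what it's worth, requests already parses the Link header, so picking up signposting relations from a HEAD request could be as simple as this sketch (the relation names such as "cite-as" and "describedby" come from the signposting conventions):

```python
import requests

response = requests.head("https://doi.pangaea.de/10.1594/PANGAEA.804876")

# requests exposes the parsed Link header as a dict keyed by the "rel" value,
# e.g. signposting relations such as "cite-as" or "describedby"
for rel, link in response.links.items():
    print(rel, link["url"])
```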

kjgarza commented 6 years ago

What happens with content that is changed over time, does this constitute an unhealthy link?

This is the part where I think we would have accuracy issues (i.e. many false positives) and an increase in code complexity if we rely on web scraping alone. We only care whether the metadata differs from our database or from the previous state of the landing page; we do not care when (A) the HTML changed or (B) the CSS changed. Additionally, we have a problem of diversity, which I think ranks up the complexity of our code: there are 1,400 data centres, each with at least one repository. That is more than 1,400 different types of pages, more than 1,400 different profiles. Both aspects, I think, should make us consider additional/complementary approaches to web scraping.
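
In other words, rather than diffing the raw HTML we could compare only the extracted metadata fields against what we already hold; a rough sketch (the field names here are hypothetical):

```python
FIELDS_WE_CARE_ABOUT = ("doi", "title", "creators", "publication_year")  # hypothetical field list

def metadata_changed(stored: dict, extracted: dict) -> bool:
    """True if any tracked metadata field differs; HTML/CSS-only changes are ignored."""
    return any(stored.get(f) != extracted.get(f) for f in FIELDS_WE_CARE_ABOUT)
```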

richardhallett commented 6 years ago

Based on a meeting with FREYA WP2 members, some additional important points were raised beyond the above.

In general I think the agreed outcome was to go with something like what I mentioned in my comment above regarding stages, doing basic checks of:

If PID service providers then want to go beyond this, they can.