A link can be healthy or unhealthy in various ways; it may show various symptoms of what is wrong with it, but each potential symptom doesn't necessarily mean the link is actually dead.
For an initial pass, steps 1 and 2 could, depending on the client, cover a large proportion of potentially unhealthy links. For a more detailed analysis we go to step 3 and beyond.
If we classify links into a traffic light system of Green, Yellow, Red, this separates out the probably-dead links from the merely suspect ones. This could be more granular, but at the cost of complexity. Any health report that has run the granular stage checks (like above) would give details of the error, e.g. "404 Not Found" or "DOI found in body didn't match".
Health also includes links gradually going stale; an additional potential marker is a last-checked date, where we define that anything greater than X is too long and the link is considered potentially unhealthy.
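As a rough sketch of how the traffic-light classification and the staleness marker might combine (the status-code mapping and the 90-day threshold here are assumptions, not agreed values):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)  # assumed value for "X"; not an agreed threshold

@dataclass
class LinkCheck:
    http_status: int         # status code from the last request
    doi_found_on_page: bool  # did the landing page expose the expected DOI?
    last_checked: datetime   # timestamp (tz-aware) of the last completed check

def classify(check: LinkCheck) -> str:
    """Map check results onto the Green/Yellow/Red traffic light."""
    if check.http_status in (404, 410):
        return "red"     # strong evidence the link is dead
    if check.http_status >= 400 or not check.doi_found_on_page:
        return "yellow"  # a symptom, but not necessarily dead
    if datetime.now(timezone.utc) - check.last_checked > STALE_AFTER:
        return "yellow"  # gone stale: we no longer know its state
    return "green"
```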
For parsing the landing page for DOI and metadata, see https://doi.org/10.1101/097196. Specifically:
```html
<meta name="DC.identifier" content="https://doi.org/10.5061/dryad.q447c/3">

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.5061/dryad.q447c/3",
  "additionalType": "DataFile",
  "name": "Sci-Hub publisher DOI prefixes",
  "author": [
    {
      "@type": "Person",
      "name": "Alexandra Elbakyan",
      "givenName": "Alexandra",
      "familyName": "Elbakyan"
    },
    {
      "@type": "Person",
      "@id": "https://orcid.org/0000-0003-1247-7941",
      "name": "John Bohannon",
      "givenName": "John",
      "familyName": "Bohannon"
    }
  ],
  "description": "Data scraped from the CrossRef website which can be used to replicate the analysis of downloads by publisher.",
  "license": "http://creativecommons.org/publicdomain/zero/1.0",
  "version": "1",
  "keywords": "open access, scientific communication, Global",
  "datePublished": "2016",
  "schemaVersion": "http://datacite.org/schema/kernel-3",
  "publisher": {
    "@type": "Organization",
    "name": "Dryad Digital Repository"
  },
  "provider": {
    "@type": "Organization",
    "name": "DataCite"
  }
}
</script>
```
The latter will often need a link checker that understands JavaScript, and it means we need GET requests and can't get by with HEAD alone.
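A minimal sketch of that kind of check, assuming `requests` and BeautifulSoup are acceptable dependencies (pages that only inject the JSON-LD via JavaScript would still need a headless browser instead):

```python
import json
import requests
from bs4 import BeautifulSoup

def extract_dois(landing_url: str) -> set[str]:
    """GET the landing page and collect DOIs from DC.identifier meta
    tags and schema.org JSON-LD blocks."""
    html = requests.get(landing_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    dois = set()

    # <meta name="DC.identifier" content="https://doi.org/...">
    for meta in soup.find_all("meta", attrs={"name": "DC.identifier"}):
        dois.add(meta.get("content", ""))

    # <script type="application/ld+json"> with an "@id" of https://doi.org/...
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@id"):
            dois.add(data["@id"])

    return {d for d in dois if d}

# e.g. flag as Yellow if the expected DOI is not in extract_dois(url)
```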
Pangaea has implemented link headers, which provide more information in HEAD requests, and in a standardized way:
```
curl -I https://doi.pangaea.de/10.1594/PANGAEA.804876
```
See http://signposting.org/ for background info.
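Where a repository implements Signposting, the `Link` header can be checked with a cheap HEAD request and no HTML parsing at all; `requests` already parses it into `response.links`. A minimal sketch (the `cite-as` relation is the one Signposting uses to point back at the DOI):

```python
import requests

def signposting_rels(url: str) -> dict[str, str]:
    """HEAD the URL and map each Link-header rel to its target URL."""
    resp = requests.head(url, allow_redirects=True, timeout=30)
    # requests exposes the parsed Link header as resp.links: {rel: {"url": ..., "rel": ...}}
    return {rel: info["url"] for rel, info in resp.links.items()}

rels = signposting_rels("https://doi.pangaea.de/10.1594/PANGAEA.804876")
print(rels.get("cite-as"))  # should resolve back to the DOI if the link is healthy
```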
What happens with content that changes over time? Does this constitute an unhealthy link?
This is the part where I think we would have accuracy issues (i.e. many false positives) and an increase in code complexity if we rely on web scraping only. We only care whether the metadata differs from our database or from the previous state of the landing page; we do not care when (A) the HTML changed or (B) the CSS changed. Additionally, we have a problem of diversity, which I think drives up the complexity of our code: there are 1,400 data centers, each with at least one repository. That means more than 1,400 different types of pages, more than 1,400 different profiles. Both aspects, I think, should make us consider additional/complementary approaches to web scraping.
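One way to keep false positives down is to diff only the extracted metadata against our database record, so HTML or CSS changes can never trigger an alert. A hypothetical sketch; the field list and normalisation are illustrative, not a spec:

```python
FIELDS = ("@id", "name", "datePublished", "version")  # illustrative subset

def metadata_drift(stored: dict, scraped: dict) -> dict[str, tuple]:
    """Return only the fields whose values differ between our database
    record and what the landing page currently exposes; changes to the
    page's markup or styling never show up here."""
    def norm(v):
        return " ".join(str(v).split()).lower() if v is not None else None
    return {
        f: (stored.get(f), scraped.get(f))
        for f in FIELDS
        if norm(stored.get(f)) != norm(scraped.get(f))
    }
```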
Based on a meeting with Freya WP2 members, some additional important points were raised beyond the above.
In general I think the agreed outcome was to go with something like the staged approach I mentioned in my comment above, doing basic checks of:
If PID service providers then want to go beyond this, they can.
Given the differences between content providers and how content is presented, we need to be able to determine if the content linked to by a PID is considered correct and 'healthy'. So the primary question is "How do we determine what is a healthy PID?"
Questions: