datacite / freya

Issues and milestones for the FREYA project

Link Check - Frequency #3

Closed richardhallett closed 6 years ago

richardhallett commented 6 years ago

Checking PIDs (e.g. DOIs) has to be done regularly in order to maintain reliable confidence in their health. One of the main considerations is that different content providers have varying capacity for how often you can request data from them. In general we can categorise the work into two parts: "Resolving content" and "Parsing content".

Resolving Content

This is when we need to go off to the linked content and obtain it, i.e. with an HTTP request. For resolving the content we should use a sensible delay to avoid any negative effects on the providers' servers. It is, however, reasonable to hit different servers in parallel.

It is also important to consider how often we need to re-check for link-rot.
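A minimal sketch of what this resolving step could look like, assuming a fixed per-host delay and threads for parallelism (the 5-second delay, the HEAD request, and the function names are illustrative choices, not decided behaviour):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests

PER_HOST_DELAY = 5  # seconds between requests to the same server (illustrative)


def check_host(urls):
    """Resolve every link for one host sequentially, pausing between requests."""
    results = []
    for url in urls:
        try:
            response = requests.head(url, allow_redirects=True, timeout=30)
            results.append((url, response.status_code))
        except requests.RequestException as exc:
            results.append((url, str(exc)))
        time.sleep(PER_HOST_DELAY)
    return results


def check_links(urls):
    """Group links by host and resolve different hosts in parallel."""
    by_host = {}
    for url in urls:
        by_host.setdefault(urlparse(url).netloc, []).append(url)
    with ThreadPoolExecutor(max_workers=max(len(by_host), 1)) as pool:
        futures = [pool.submit(check_host, host_urls) for host_urls in by_host.values()]
        return [result for future in futures for result in future.result()]
```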

Parsing content

This is when we have the data returned from the linked content and need to process it to determine how healthy it is. For parsing the content we don't need to consider any delays; we just need to do it as fast as possible. Depending on the scope of the data parsing, this process could still be time intensive.
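As a rough sketch of what "determining how healthy it is" might mean, assuming some placeholder criteria (the status-code cut-off and soft-404 markers below are illustrative, not an agreed checklist):

```python
SOFT_404_MARKERS = ("page not found", "doi not found")  # illustrative heuristics


def assess_health(status_code, body):
    """Classify one resolved link; the criteria here are placeholders, not agreed rules."""
    if status_code >= 400:
        return "broken"   # hard failure such as 404 or 500
    if any(marker in body.lower() for marker in SOFT_404_MARKERS):
        return "suspect"  # 200 response whose content looks like an error page
    return "healthy"
```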

Questions

richardhallett commented 6 years ago

My thoughts are mostly on the technical nature of how we'll implement this, it would be good to understand any additional use cases that might affect frequency.

Delay

Some numbers have been discussed previously; to give an idea, here are some examples of different delays.

| No. Links | Delay Per Request (s) | Time To Complete (Hours) | Time To Complete (Days) |
| --- | --- | --- | --- |
| 1,000 | 5 | 1.38 | 0.05 |
| 1,000,000 | 5 | 1,388 | 58 |
| 1,000 | 2 | 0.5 | 0.02 |
| 1,000,000 | 2 | 555 | 23 |
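For reference, the figures above are just links multiplied by delay for a single server, e.g.:

```python
def time_to_complete(num_links, delay_seconds):
    """Sequential time for one server: number of links multiplied by the per-request delay."""
    hours = num_links * delay_seconds / 3600
    return round(hours, 1), round(hours / 24, 1)


print(time_to_complete(1_000_000, 5))  # (1388.9, 57.9) -> the 1,388 hour / 58 day row above
```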

I think 5 seconds, while it means a considerable amount of time to complete for one server, is probably a reasonable starting point.

Adaptive Delay

Another option is to use adaptive delays, based on the latency of the requests we are making to a particular server. A potential technical solution, and a description of what a web crawler does, is Scrapy's AutoThrottle: https://doc.scrapy.org/en/1.4/topics/autothrottle.html#std:setting-AUTOTHROTTLE_TARGET_CONCURRENCY. The general idea is to use request latency to detect whether the server is getting busier: if the server is choking on our requests, we gradually back off based on previous request times.

We might want to use this in combination with a fixed delay.
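A rough sketch of a fixed floor combined with a latency-based back-off, modelled loosely on the AutoThrottle idea linked above (the default delays and the target concurrency of 1 are assumptions for illustration):

```python
class AdaptiveDelay:
    """Per-server delay that backs off when request latency rises.

    Loosely follows the AutoThrottle approach: the target delay is the observed
    latency divided by a target concurrency, smoothed against the previous delay
    and clamped to a fixed floor/ceiling. All numbers are illustrative defaults.
    """

    def __init__(self, start_delay=5.0, min_delay=2.0, max_delay=60.0,
                 target_concurrency=1.0):
        self.delay = start_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.target_concurrency = target_concurrency

    def update(self, latency, ok=True):
        """Feed in the latency of the last request; returns the delay to sleep before the next."""
        target = latency / self.target_concurrency
        new_delay = (self.delay + target) / 2          # smooth the adjustment
        if not ok:
            new_delay = max(new_delay, self.delay)     # never speed up after a failed response
        self.delay = min(max(new_delay, self.min_delay), self.max_delay)
        return self.delay
```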

Additional questions:

Technical considerations

kjgarza commented 6 years ago

Are we considering that every link should be checked at least three times? I mean, can we take for granted that when we get a 404 it is actually a 404 due to link rot rather than because the datacentre was temporarily down? Or the opposite, that if we get a 200 it is consistently (within the realms of our capacity) an OK response? If not, then we should consider using a 2-out-of-3 positive tests approach.

If that's the case, the frequency numbers get more complicated:

1,000,000 links, 5 s delay: 3,472 hr (144 days)
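If we did go with a 2-out-of-3 rule, a minimal sketch could look like this (where `resolve` is a placeholder for whatever actually performs the check, and in practice the attempts would be spaced out in time):

```python
def two_out_of_three(url, resolve, attempts=3):
    """Re-check a link up to three times and report the majority outcome.

    `resolve` is a placeholder callable returning True for an OK response
    and False otherwise.
    """
    successes = 0
    for attempt in range(attempts):
        if resolve(url):
            successes += 1
        # stop early once the outcome can no longer change
        if successes == 2 or (attempt + 1 - successes) == 2:
            break
    return "ok" if successes >= 2 else "broken"
```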

richardhallett commented 6 years ago

Based on discussions with the WP2 working group, it was agreed that we should focus on building something realistic that gives us a good amount of coverage. This ties in with frequency because, as mentioned above, checking every resource is potentially a massive undertaking and there are numerous factors that come into play when checking all links.

Initially the goal is to limit checking to a smaller sample of resources from different datacentres; the frequency can then potentially be a full recheck every day. This can be scaled up, and the considerations for doing so can depend on the PID service provider.
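As an illustration of the kind of daily sampling this implies (the sample size of 100 and the grouping by datacentre are assumptions, and could vary per provider):

```python
import random


def daily_sample(links_by_datacentre, per_datacentre=100):
    """Pick a small random sample of links from each datacentre for today's recheck."""
    sample = []
    for links in links_by_datacentre.values():
        sample.extend(random.sample(links, min(per_datacentre, len(links))))
    return sample
```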