ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Handle HTTP errors during download gracefully #156

Open sedimentation-fault opened 7 years ago

sedimentation-fault commented 7 years ago

An HTTP server may throw a lot of HTTP errors in a user agent's face - you can see a list of them at List of HTTP status codes (Wikipedia). IMHO a program like getpapers should be able to handle ALL of them - but I resist the temptation to file a bug for each one that has been left unhandled. Instead, I view this thread as the collective place to discuss them all.

It's not only "connection reset by peer" (see https://github.com/ContentMine/getpapers/issues/155) errors that I am concerned about, though. It's the whole gamut of HTTP errors that I wish to see handled in a graceful way.

What about, for example, the error:

500 Internal Server Error

or

429 Too Many Requests

which I do encounter every now and then? In the case of the latter: is there a "backoff strategy" in place, i.e. does getpapers retry after a growing time interval (e.g. after 2, 4, 8, 16, 32 seconds...)? That is, 2 seconds after the first occurrence, 4 seconds after the second one in a row, 8 seconds after the third consecutive one, and so on?

You can't hammer on a web server and expect it to "just work" - getpapers has to be as flexible as a human at handling HTTP errors. Pure retrying is not enough. You need an exponential backoff strategy for every error that is not permanent (most 5xx errors and some 4xx ones too, like the 429 error above).
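
To make the idea concrete, here is a minimal TypeScript sketch of such a retry loop (illustrative only - the status list, the delays and the `fetchWithBackoff` name are my own, not anything getpapers currently does):

```typescript
// Sketch of an exponential-backoff retry loop around HTTP requests.
const RETRYABLE = new Set([429, 500, 502, 503, 504]); // transient statuses

async function fetchWithBackoff(url: string, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (!RETRYABLE.has(res.status) || attempt >= maxRetries) {
      return res; // success, a permanent error, or retries exhausted
    }
    // 2, 4, 8, 16, 32 seconds... but honour Retry-After if the server sent it
    const retryAfter = Number(res.headers.get('retry-after')) * 1000;
    const delay = retryAfter || 2000 * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Honouring the Retry-After header that many servers send along with a 429 is part of being "as flexible as a human".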

And - believe me - ALL errors in the list do happen if you download often enough - as one does with getpapers.

sedimentation-fault commented 7 years ago

Another example: what happens if you had to interrupt getpapers with CTRL-C?

I have an intermittent internet connection - the router may simply reconnect (as a result of some other operation, an accident, or just the provider's daily change of a dynamic IP address) with no notice to a client like getpapers. And what does getpapers do in such a case? It hangs indefinitely - there is NO timeout! I have seen it hang for hours! In such cases, only CTRL-C helps.
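
A hard timeout would at least turn such a hang into an error that the retry logic above could handle. A sketch using AbortController (the 60-second limit and the `fetchWithTimeout` name are arbitrary choices of mine):

```typescript
// Abort a request that gets no response within a fixed window instead of
// hanging forever. Note: this bounds the time until the response headers
// arrive; reading the body would need its own guard in a real implementation.
async function fetchWithTimeout(url: string, ms = 60_000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // don't leave the timer pending after completion
  }
}
```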

What is the status of the files downloaded so far then? Especially, how do you check the integrity of the file that getpapers was downloading when it received the CTRL-C interrupt? Do you compare sizes somehow? I doubt it.
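
Comparing the size on disk against the Content-Length the server advertises would be the cheapest check. A sketch (it assumes the server actually sends that header, which not all do):

```typescript
import { promises as fs } from 'node:fs';

// Cheap integrity check: does the file on disk match the advertised size?
async function looksComplete(path: string, url: string): Promise<boolean> {
  const head = await fetch(url, { method: 'HEAD' });
  const len = head.headers.get('content-length');
  if (len === null) return false; // can't tell without the header
  const actual = (await fs.stat(path)).size;
  return actual === Number(len);
}
```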

You need to resume gracefully from the point where you stopped. But for this to work, the web server has to support "byte ranges". If it does not, it will NOT respond with

206 Partial Content

-- in which case getpapers should delete the partially downloaded file and retry from scratch. And what do you do if the file was wholly downloaded in a previous run? Do you try to re-download it completely or partially? In the latter case, you will get a

416 Range Not Satisfiable

error - which in practice means "your local copy is already at least as large as the remote file: either you already have the whole file, or some other file with this name that is larger than (or at least as large as) it".
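
Putting the three cases together, the resume logic would have to look roughly like this (a TypeScript sketch under the assumptions above; it buffers the body in memory for brevity, where a real implementation would stream to disk):

```typescript
import { promises as fs } from 'node:fs';

// Resume a partial download via a Range request, falling back to a full
// restart when the server does not support byte ranges.
async function resumeDownload(url: string, path: string): Promise<void> {
  const offset = (await fs.stat(path).catch(() => ({ size: 0 }))).size;
  const res = await fetch(url, { headers: { Range: `bytes=${offset}-` } });

  if (res.status === 206) {
    // Server honoured the range: append only the missing tail.
    await fs.appendFile(path, Buffer.from(await res.arrayBuffer()));
  } else if (res.status === 200) {
    // No byte-range support: throw the partial file away and start over.
    await fs.writeFile(path, Buffer.from(await res.arrayBuffer()));
  } else if (res.status === 416) {
    // Local copy is already at least as large as the remote one; verify it
    // by other means (size, PDF validity) instead of re-downloading blindly.
  } else {
    throw new Error(`unexpected HTTP status ${res.status} for ${url}`);
  }
}
```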

I would like to know that all this is handled by getpapers gracefully - and have peace of mind while downloading.

sedimentation-fault commented 7 years ago

Yet another question in this context: suppose I did a query that got me a "result set" of X papers. Next time I try the same query the result set may be different for various reasons - a previous bug, or a change in the provider's search algorithms, to name just two. Let's call the new result set Y.

If you need a concrete example, see my 'mathematical economics' query in https://github.com/ContentMine/getpapers/issues/140 - how do you go about it if next time getpapers returns 31 results, instead of 1, maybe after a tweak or resolution of some bug? What happens to the 1 result I got previously? Will I end up with 32 results then (possibly 1 too many)?

How does getpapers go about mixing X and Y in the same output directory? Does it replace X with Y? Does it simply add Y to X? Does the user have any control over the policy?

This is related to the above, in the sense that it is a "canonical" generalization: in the previous post, we had a query that was interrupted and restarted - here we have a query that is re-executed later, after having finished normally once.

sedimentation-fault commented 7 years ago

Whoever is assigned to this issue should not miss the discussion at https://github.com/ContentMine/getpapers/issues/157 where some concrete suggestions and insights are offered.

sedimentation-fault commented 7 years ago

...what do you do if the file was wholly downloaded in a previous run?

To answer my own question: this is a bug that I have now resolved in https://github.com/ContentMine/getpapers/issues/158 (see "Do not redownload papers if they are already there" under "Solution").

This just checks for file presence though - file integrity (read: PDF validity) is a totally different story.
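
The cheapest plausibility check I can think of - far from full validation - is to verify that the file starts with the `%PDF-` magic bytes and carries an `%%EOF` marker near its end (a sketch; it reads the whole file into memory, which is fine for typical paper sizes):

```typescript
import { promises as fs } from 'node:fs';

// Cheap PDF plausibility check (not a full parse): a well-formed PDF starts
// with "%PDF-" and has an "%%EOF" marker in its trailer region.
async function looksLikeValidPdf(path: string): Promise<boolean> {
  const buf = await fs.readFile(path);
  const head = buf.subarray(0, 5).toString('latin1');
  const tail = buf.subarray(-1024).toString('latin1'); // trailer region
  return head === '%PDF-' && tail.includes('%%EOF');
}
```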