Open sedimentation-fault opened 7 years ago
To get past this error, one can try executing curl (or a curl wrapper that handles HTTP errors gracefully) directly, as shown in my workaround at https://github.com/ContentMine/getpapers/issues/152.
EDIT: For the current version of my curl (or curl wrapper) workaround, see https://github.com/ContentMine/getpapers/issues/157.
It works: you will get past this error, only to encounter a plethora of new ones...
I'll have a look and see if I can replicate this.
Is this occurring during the metadata or PDF download stage? To be honest, little work has been done on arxiv recently (due to lack of interest), but I'd be keen to make it work nicely for you.
Slightly cleaner error handling would be nice. Unfortunately, reporting "connection reset by peer" is ultimately the best we can do, though it should come without a stack trace and only after a few retries.
In the past I've found it worth checking that your wireless card isn't falling asleep. Alternatively, if you're using our virtual machine image, there seems to be a bug in the VirtualBox drivers on some platforms that causes this.
This occurs during pdf download. No wifi or virtual images in use here.
Connection reset is quite a common error. Start hammering any web server and, after a few hundred "200 OK"s, I bet you'll get a "connection reset by peer" error. arxiv even seems quite tolerant, resetting the connection only after the first thousand or so downloads...
You should be able to replicate the error by using my example with the math.DG category and its 24000+ papers.
Note that this error prevents me from going past the first 1000-1500 papers. The
if (err) throw err;
line in the downloadURL function of download.js causes getpapers to stop as soon as it encounters it. Retrying is futile.
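One way to avoid the abort described above is to record the failure and let the remaining downloads continue. This is only a minimal sketch, assuming a Node-style completion callback; `failedDownloads` and `handleDownloadResult` are hypothetical names and are not taken from getpapers' download.js:

```javascript
// Hypothetical sketch: collect failed URLs instead of throwing.
const failedDownloads = [];

// Node-style completion handler: returns true on success, false on failure.
function handleDownloadResult(err, url) {
  if (err) {
    // Keep enough detail to retry or report later; do not abort the run.
    failedDownloads.push({
      url,
      code: err.code || 'UNKNOWN',
      message: err.message,
    });
    return false;
  }
  return true;
}
```

With something like this in place, a single reset would cost one paper rather than the whole run, and the collected list could be reported or retried at the end.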
At 7% of downloading all 24818 papers of the math.DG category of arxiv.org with:
getpapers --api 'arxiv' --query 'cat:math.DG' --outdir arxiv/math.DG -p
I got:
The same error occurs after rerunning the same command as above, only at a different place (4% (1019/24818))...and it happens again and again, always at different places.
Reason
It seems that the ubiquitous "connection reset by peer" error is left unhandled by getpapers. This error is too common (I would almost say normal) to be thrown unhandled at the user.
Solution
Do something, maybe along the lines of https://github.com/Vexera/retry-stream/blob/master/index.js
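To make the suggestion concrete, here is a rough sketch of retrying with exponential backoff. It is not the retry-stream implementation and not getpapers code: `isTransient`, `backoffDelay`, `withRetry` and the default values are all assumptions for illustration. It relies only on the fact that Node reports "connection reset by peer" with the error code `ECONNRESET`.

```javascript
// Hypothetical helpers; names and defaults are illustrative assumptions.
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'EPIPE']);

// True for errors that are usually worth retrying.
function isTransient(err) {
  return Boolean(err && TRANSIENT_CODES.has(err.code));
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run an async task, retrying transient failures up to `retries` times.
async function withRetry(task, { retries = 5, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (!isTransient(err) || attempt >= retries) throw err;
      await new Promise((resolve) =>
        setTimeout(resolve, backoffDelay(attempt, baseMs))
      );
    }
  }
}
```

A caller would wrap each download, for example `withRetry(() => fetchPdf(url))`, where `fetchPdf` stands for whatever promise-returning download function is in use (again hypothetical). Only after the retries are exhausted would the error surface, ideally as a short message rather than a stack trace.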