ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

HTTP error 429 - Too Many Requests #157

Closed sedimentation-fault closed 7 years ago

sedimentation-fault commented 7 years ago

Although I pledged in https://github.com/ContentMine/getpapers/issues/156 to resist the temptation of opening a new bug for each and every HTTP error I encounter, this one happens so often that it deserves special attention.

What happened so far

In my attempt to avoid the showstopper ECONNRESET error (see https://github.com/ContentMine/getpapers/issues/155), I applied my workaround described in https://github.com/ContentMine/getpapers/issues/152 to let my own curl wrapper do the work:

I commented out the original code in /usr/lib/node_modules/getpapers/lib/download.js:

// //   rq = requestretry.get({url: url,
// //                    fullResponse: false,
// //                    headers: {'User-Agent': config.userAgent},
// //                    encoding: null
// //                   });
//   rq = requestretry.get(Object.assign({url: url, fullResponse: false}, options));
//   rq.then(handleDownload)
//   rq.catch(throwErr)

and appended this:

  // Alternative method: use 'exec' to run 'mycurl -o ...'
  // Compose the mycurl command
  var mycurl = 'mycurl -o \'' + base + rename + '\' \'' + url + '\'';
  log.debug('Executing: ' + mycurl);
  // execute mycurl using child_process' exec function
  var child = exec(mycurl, function(err, stdout, stderr) {
      // if (err) throw err;
      if (err) {
        log.error(err);
      }
      // else console.log(rename + ' downloaded to ' + base);
      else {
        // log.info(stdout);
        console.log(stdout);
        log.debug(rename + ' downloaded to ' + base);
      }
  });
  nextUrlTask(urlQueue);

Here, mycurl is just my own curl wrapper - it catches curl errors and implements various strategies depending on the error, the server, my daily mood and other obscure factors. ;-)

NOTE: You will also need to add something like

// Commented. Has issues with unhandled ECONNRESET errors.
// var requestretry = require('requestretry')
var exec = require('child_process').exec

at the top of download.js.

The problem now

My above 'hack around' (as @tarrow calls it in https://github.com/ContentMine/getpapers/issues/152) works smoothly - but every now and then (like every 10 downloads or so), it catches a 429 Too Many Requests error:

curl: (22) The requested URL returned error: 429 Too Many Requests
(curl --location --fail --progress-bar --connect-timeout 100 --max-time 300 -C - -o PMC3747277/fulltext.pdf http://europepmc.org/articles/PMC3747277?pdf=render)

    at ChildProcess.exithandler (child_process.js:206:12)
    at emitTwo (events.js:106:13)
    at ChildProcess.emit (events.js:191:7)
    at maybeClose (internal/child_process.js:877:16)
    at Socket.<anonymous> (internal/child_process.js:334:11)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at Pipe._handle.close [as _onclose] (net.js:498:12)

My curl wrapper catches this and it indeed retries a few times - but it seems that a more elaborate strategy is needed (most notably: a longer sleep interval between retries). The frequency of this error indicates that getpapers is hammering the server too fast.

I have not seen any way to throttle requests from getpapers (the keyword phrase associated with error 429 is "rate limit"). I therefore strongly suggest introducing such an option - otherwise, the user has to run the script multiple times, not knowing for sure whether subsequent runs will correct failed downloads of previous runs (see https://github.com/ContentMine/getpapers/issues/156).

blahah commented 7 years ago

@sedimentation-fault yes, we need a rate-limit. A lot of the issues with ArXiv are caused by their strict rate-limit requirements that we are currently violating, and it seems EUPMC have now introduced restrictions too. As responsible web citizens we should be rate-limiting anyway.

sedimentation-fault commented 7 years ago

Adding rate-limit capability implies being able to sleep for a few seconds to slow things down. Thus, a starting point is to program a sleep function - something that, as I was appalled to learn, is far from trivial in JavaScript! :scream:

The following might be of help in this direction: JavaScript version of sleep().

blahah commented 7 years ago

@sedimentation-fault javascript generally works differently than most languages - it doesn't support sleep because it's asynchronous. Instead of sleep, it has something called an event loop and the ability to do something after a given time using setTimeout and/or setInterval.

This might seem unintuitive, but it's what allows nodeJS (and browsers) to be able to run many different logical threads without blocking the user interface.
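A sleep-like pause can be built on top of setTimeout without blocking the event loop. What follows is only a minimal sketch, not code from getpapers; the names sleep and pauseThen are made up for illustration.

```javascript
// Hypothetical helper (not part of getpapers): returns a Promise that
// resolves after roughly `ms` milliseconds. Nothing is blocked while
// waiting; the event loop keeps servicing other callbacks.
function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

// Hypothetical usage: wait `ms`, then run `task` and pass on its result.
function pauseThen(ms, task) {
  return sleep(ms).then(task);
}
```

Calling pauseThen(3000, startNextDownload) would start a (hypothetical) startNextDownload function about three seconds later, while other pending work carries on in the meantime.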

tarrow commented 7 years ago

I'll give this a look later this week.

I think this error is caused by using the curl wrapper. (But I can't be confident without spending some time looking). Currently the throttling when retrying is handled by the requestretry module.

There is no rate limit (again: as far as I know) between sequential requests; however, all the requests for metadata should run in series, not in parallel. This was acceptable according to the people at EuPMC (when I asked ca. 1 year ago). I need to look at the policy of arxiv.

For downloading papers we currently use a pool of 10 connections (previously we used as many as possible). We could look at bringing this down or even having them in series with a minimum time before getting the next paper.
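To make the idea of a smaller pool with a minimum pause concrete, here is a rough sketch. It is not the code in download.js; runPool, worker and the parameter names are assumptions chosen for illustration.

```javascript
// Hypothetical sketch (not getpapers code): run `worker(item, cb)` over
// `items` with at most `concurrency` tasks in flight. After a task
// finishes, its slot waits `delayMs` before taking the next item.
function runPool(items, worker, concurrency, delayMs, done) {
  var queue = items.slice();
  var active = 0;
  var finished = false;

  function startNext() {
    if (queue.length === 0) {
      // Report completion exactly once, when the last task has ended.
      if (active === 0 && !finished) {
        finished = true;
        done();
      }
      return;
    }
    active += 1;
    worker(queue.shift(), function () {
      active -= 1;
      if (delayMs > 0) {
        setTimeout(startNext, delayMs);
      } else {
        startNext();
      }
    });
  }

  var slots = Math.min(concurrency, queue.length);
  if (slots === 0) {
    done();
    return;
  }
  for (var i = 0; i < slots; i++) {
    startNext();
  }
}
```

With concurrency set to 1 this degenerates to strictly serial downloads separated by delayMs, which is the "in series with a minimum time" variant mentioned above.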

I think we probably don't want to be using curl in the master version of the code (because it harms portability). Can you comment if you manage to trigger HTTP err 429 when you're using code from master?

sedimentation-fault commented 7 years ago

The problem is that, with the master code, I don't get that far because of https://github.com/ContentMine/getpapers/issues/155.

The 429 error does not affect the metadata phase. It invariably happens during paper (PDF) download.

If you really use 10 connections to fire up just as many paper downloads asynchronously, then this could explain it. In my eyes, this is simply asking for trouble. At best, the number of concurrent connections to any given provider should be configurable by the user - preferably through a command-line option.

In the medium term, you might want to implement an options object for each of the few providers - containing the number of connections, timeouts, retries, maybe proxies, passwords etc. for that provider. Put them into some configuration (include?) file with sensible defaults and let the user change the values at will. Include the file at the right place in your code and there is no need to tweak options objects again.

You will have to do it in a security-aware way though - otherwise you will see your software featured on the front page of next day's security advisories all over the world. "Privilege escalation vulnerability in ContentMine"... :smiling_imp:
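A per-provider options object with defaults could look roughly like the sketch below. Every name and number in it is a placeholder chosen for illustration; none of it reflects getpapers' real options or any provider's published policy.

```javascript
// Placeholder defaults, for illustration only.
var DEFAULTS = { connections: 2, retries: 3, timeoutMs: 300000, delayMs: 1000 };

// Placeholder per-provider settings; a real file would hold agreed values.
var PROVIDERS = {
  eupmc: { connections: 2 },
  arxiv: { connections: 1, delayMs: 5000 }
};

// Merge order: built-in defaults < per-provider settings < user overrides.
function optionsFor(provider, userOptions) {
  var providerOptions = Object.prototype.hasOwnProperty.call(PROVIDERS, provider)
    ? PROVIDERS[provider]
    : {};
  return Object.assign({}, DEFAULTS, providerOptions, userOptions || {});
}
```

A call such as optionsFor('arxiv', { retries: 10 }) keeps the provider's connection and delay settings and only replaces the retry count, so the user's wishes always win over the shipped defaults.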

sedimentation-fault commented 7 years ago

I changed

for(var i=0; i<10; i++) {

to

for(var i=0; i<2; i++) {

in /usr/lib/node_modules/getpapers/lib/download.js and gave it a try. This should reduce the connection number to 2. However, it just got worse: after a few successful downloads, I started getting "403 (Forbidden)" errors. And it is clearly a matter of NOT sleeping between downloads! :disappointed:

blahah commented 7 years ago

We definitely don't want to use a curl wrapper in the main code - nodeJS has extremely robust and battle-tested networking libs. Whatever is making things sometimes work with curl, we need to figure out what that is and configure the node requests to work the same way. I speak from experience when I say that trying to depend on command-line utilities is a great way to commit yourself to endless use support :)

blahah commented 7 years ago

@sedimentation-fault if you're testing against ArXiv, they will block you if you repeatedly query them without complying with their rate limit and other requirements. That might explain the 403 errors.

tarrow commented 7 years ago

While I would like to work on this today, realistically it is going to be a few days before I have time.

I think you should remember that you're not starting from a "clean" point each time, because arxiv have probably now temporarily marked you as a little bit non-compliant.

We should try and work out what is an acceptable rate limit for arXiv. I have looked and can't see one published. They actually recommend bulk getting papers from a "downloader pays" S3 bucket. I'm not sure what they consider bulk. See: https://arxiv.org/help/bulk_data

Might be worth contacting them via their contact us page and finding out what they think is acceptable.

sedimentation-fault commented 7 years ago

@blahah ,

nodeJS has extremely robust and battle-tested networking libs

you have not seen my own battle-tested curl wrapper... :smiling_imp: But let's not start a flame war curl (and its wrappers) vs. nodeJS (and its friends).

Whatever is making things sometimes work with curl, we need to figure out what that is and configure the node requests to work the same way.

That's exactly the point! And what makes my curl wrapper more effective than getpapers right now? It's what I suggest in https://github.com/ContentMine/getpapers/issues/156 - it handles ALL possible HTTP errors according to my strategies.

The best way to do this IMO is this: use some kind of "strategy pattern", i.e. don't try to handle each and every HTTP error in Wikipedia's List of HTTP status codes individually (say in a huge case statement inside some error handler), but rather define a few classes of errors and a strategy per class.

More precisely: if you ask yourself what you want to do for error X, for every X in the error list, you will soon realize that you DON'T want to react differently on each and every error. You will rather build 2-3 strategies that you will want to apply every now and then. Examples:

Write all this down. Present a strategy paper. Discuss it in your group. Don't start hacking hastily - this is a mine field. Sleep a few times over this.

NOTE: :warning: The strategies may even depend on content provider, in accordance to the provider's TOS. You have to keep this in mind.
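The "few classes of errors, one strategy per class" idea might be sketched as follows. The class names, the status-to-class mapping and the wait times are purely illustrative assumptions; they are neither the strategies of my wrapper nor anything getpapers implements.

```javascript
// Illustrative strategies, one per error class (all values are assumptions).
var STRATEGIES = {
  ok: { action: 'accept' },
  throttled: { action: 'retry', waitMs: 60000 },
  transient: { action: 'retry', waitMs: 5000 },
  permanent: { action: 'skip' }
};

// Map an HTTP status code to one of the few classes above.
function classify(status) {
  if (status >= 200 && status < 300) return 'ok';
  if (status === 403 || status === 429 || status === 503) return 'throttled';
  if (status === 408 || status >= 500) return 'transient';
  return 'permanent';
}

// Look up what to do for a given status code.
function strategyFor(status) {
  return STRATEGIES[classify(status)];
}
```

The error handler then needs only one branch per class instead of one per status code, and a provider-specific table could replace STRATEGIES where the TOS demand it.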

The guiding principle in designing your strategies should be:

DON'T WAKE UP THE WATCHDOGS!

It should NOT be:

We are responsible web citizens!

The problem with responsibility is: everybody uses it as an excuse, but nobody (from those responsible) really cares. Besides, no matter what the moral side of the story is, if you do wake up their watchdogs, you lose - ever tried to resolve a captcha presented to you by Cloudflare with NodeJS? ;-)

When you are ready, implement your scheme:

Give the user as much control over the strategies as possible. If the user says "I want to retry 10 times per paper, not the default 3", let him do it. If he says "I am willing to wait 1 minute between retries, I have time!" - let him do so!

To answer your indirect question "what makes [my] curl wrapper[s] work": it is exactly the graceful handling of ALL HTTP errors. It does all the above - and some more! :sunglasses:

sedimentation-fault commented 7 years ago

I closed my above post with:

It does all the above - and some more!

"What do you mean by 'some more'? You are just showing off!" - you might think.

No I am not. Here are a few hints:

That's my notion of "don't wake up the watchdogs!". :wink:

blahah commented 7 years ago

@sedimentation-fault for getpapers, we specifically want to avoid that kind of behaviour. It is intended to provide access to services provided by good-faith providers. If people want to bypass reasonable limitations in those services, they will have to do it without our help.

https://github.com/ContentMine/quickscrape on the other hand is designed to scrape where there is no reasonable service provided by the publisher. That would be a better place to provide user agent/referer spoofing, delay randomisation, and other tactics to avoid triggering blocks by bad-faith providers.

And I should add, all those things can easily be done in nodeJS - still no need for curl :)

sedimentation-fault commented 7 years ago

But what if "good-faith" content providers do not pose "reasonable limitations"? What if their limitations are artificial, subjective and hostile?

Personally, I consider it a casus belli if a web server (especially one that supposedly operates in the public interest) sends me a 403 at the slightest hint of automatic downloading. That really pisses me off!

I never, ever thought anything negative of arxiv.org - until yesterday. You know, in this case it's really Either you are with us, or against us!...

Sorry.

sedimentation-fault commented 7 years ago

Let me come back to the programming details: I have found the reason for the 429 errors! Look at this:

ps aux | grep mycurl | wc -l
712

There are 712 curl wrapper instances from my box trying to download from a single provider (EUPMC in this case) right now! That's horrible! What an embarrassment!

It seems that getpapers fires the downloader and forgets it (fire-and-forget, a.k.a. asynchronous, or non-blocking execution)! I've read somewhere about the differences between exec and execSync in the child_process module. Will try that and report back...

tarrow commented 7 years ago

Yep, that is definitely a problem; one that we also had around a year ago. I'm not sure how you altered your code exactly (or what is in your wrapper) but you probably want to use something like the "handleDl" callback we use. You'll see around line 104 in download.js that we don't start the next item in the queue until we've got the previous one.

I.e. getpapers as we have it in master doesn't fire-and-forget; I think that is an alteration introduced by your adaptation to use curl.

You probably want to implement something similar in your fork. Again, I'm sorry I can't work on this in detail today but you should look at: https://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback

You want to have the next curl called by the callback you pass to exec for the current curl. It is a bit of a learning curve to go from a very procedural language to JS. Typically you don't want to use Sync stuff if you can avoid it since it blocks the whole application. It's nicer to use callbacks or promises to control what order things happen in.
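As a rough illustration of chaining through the callback (a sketch only, not the code in download.js): the function below starts each command from inside the callback of the previous one. downloadInSeries and its parameters are hypothetical names; run defaults to Node's execFile and exists only so the sketch can be tried without a network.

```javascript
var execFile = require('child_process').execFile;

// Hypothetical sketch: run `command` once per entry of `argLists`, strictly
// one after another. Failures are collected instead of aborting the queue.
function downloadInSeries(command, argLists, done, run) {
  run = run || execFile;
  var queue = argLists.slice();
  var failures = [];

  function next() {
    if (queue.length === 0) {
      done(failures);
      return;
    }
    var args = queue.shift();
    // execFile takes the arguments as an array, so no shell quoting of
    // file names or URLs is needed.
    run(command, args, function (err) {
      if (err) {
        failures.push({ args: args, error: err });
      }
      next(); // only now is the following download started
    });
  }

  next();
}
```

With the wrapper from this thread, one entry of argLists would be something like ['-o', 'PMC3747277/fulltext.pdf', 'http://europepmc.org/articles/PMC3747277?pdf=render']. At most one child process exists at any time, yet the application is never blocked.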

sedimentation-fault commented 7 years ago

Changed

var exec = require('child_process').exec

to:

var execSync = require('child_process').execSync

then replaced

var child = exec(mycurl, function(err, stdout, stderr) {

with:

var child = execSync(mycurl, function(err, stdout, stderr) {

and reran my EUPMC query. This time, it runs much more sanely - it does not give me the impression of hammering the web server (I don't see the counter of downloaded papers go up like a rocket, but rather one paper after another, as I expect from my connection speed, local circumstances etc.).

Looks very good - no errors, no embarrassments...Just 2 curl wrapper instances running (not 712!) - as expected from my change of

for(var i=0; i<10; i++) {

to

for(var i=0; i<2; i++) {

a few posts above.

For you this means: if the master code

rq = requestretry.get(...)

in download.js is an asynchronous operation (which I suspect it is), you will have to find a way to make it synchronous, otherwise you risk the 429 error.

I am currently testing this with EUPMC. Will go on with arxiv and report - but I already have the bad feeling that I was too quick to count arxiv on my enemy side...

sedimentation-fault commented 7 years ago

@tarrow , I just saw your comment. Yes, I am learning! :smile:

sedimentation-fault commented 7 years ago

@tarrow ,

EUPMC downloads go very smoothly with the execSync version. So smoothly that I don't even dare to change it. :smile:

Besides, I fail to imagine the difference in user experience between my current


  // Alternative method: use 'exec' to run 'mycurl -o ...'
  // Compose the mycurl command
  var mycurl = 'mycurl -o \'' + base + rename + '\' \'' + url + '\'';
  log.debug('Executing: ' + mycurl);

  // execute mycurl using child_process' exec function
  // use the execSync version
  var child = execSync(mycurl, function(err, stdout, stderr) {
     console.log(stdout);
     console.log(stderr);
     if (err) {
       log.error(err);
     }
     else {
       log.debug(rename + ' downloaded to ' + base);
     }
  });
  nextUrlTask(urlQueue);

and a version with exec and a callback that calls nextUrlTask(urlQueue)... Looks pretty identical to me... As a user, I am supposed to sit there, stare at the display and watch how one paper after another passes the wire and lands on my disk. :smile: No more "interaction" expected at that stage. At the end, I am happy if all goes smoothly, without embarrassing errors and - that's important from the user friendliness point of view - without retries due to failed downloads. Whether it takes two hours, instead of one, is totally uninteresting to me. I have time. :smile:

Next thing in the queue is arxiv with the 24000 papers of the math.DG category. I will let you know how this goes.

sedimentation-fault commented 7 years ago

The synchronous curl wrapper version I gave above was not...ehm, let's say it was not the best. :smile: For what it was, it worked impressively well!

After reading some docs, including the link given by @tarrow above, and after some experimentation and inquiry, I settled for this new version, which I am currently trying with the arxiv API:

File: /usr/lib/node_modules/getpapers/lib/download.js

Start of file:

// Commented. Has issues with unhandled ECONNRESET errors.
// var requestretry = require('requestretry')
var execSync = require('child_process').execSync

Function downloadURL: comment out the standard (master) code:


// //   rq = requestretry.get({url: url,
// //                    fullResponse: false,
// //                    headers: {'User-Agent': config.userAgent},
// //                    encoding: null
// //                   });
//   rq = requestretry.get(Object.assign({url: url, fullResponse: false}, options));
//   rq.then(handleDownload)
//   rq.catch(throwErr)

and add:

  // Alternative method: use 'exec' to run 'mycurl -o ...'
  // Compose the mycurl command
  var mycurl = 'mycurl -o \'' + base + rename + '\' \'' + url + '\'';
  log.debug('Executing: ' + mycurl);

  // Execute mycurl using child_process' exec function.
  // Synchronous version.
  // 
  // You must wrap it in try-catch in order to catch the error
  // object that WILL be thrown if the command exits with
  // non-zero exit code! If you don't catch that error,
  // this script will STOP!
  try {
    // var child = execSync(mycurl, {stdio:[0,1,2]});
    // 'inherit' is equivalent to [0,1,2], but more intuitive.
    var child = execSync(mycurl, {stdio:'inherit'});
  } catch (err) {
    // err.status  has the exit code
    // err.message has the message
    // err.stderr  has the stderr
    log.error('mycurl exit code: ' + err.status);
  }

This last part is a small gem! It shows synchronous execution of a curl wrapper (mycurl) with full, direct, immediate output to the terminal (including progress bars, colors...), and it catches the error that is thrown when the child process exits with a non-zero exit code (something that some people might think is impossible in synchronous mode).
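For anyone who wants to try the try-catch behaviour without my wrapper, here is a small self-contained sketch. It uses execFileSync (the sibling of execSync that takes the arguments as an array) and a child Node process as a stand-in for a failing download; runSync is a made-up name. Note that the Sync functions take no callback, so errors can only be handled via try-catch.

```javascript
var execFileSync = require('child_process').execFileSync;

// Hypothetical helper: run a command synchronously, let its output go
// straight to this terminal, and return its exit code instead of throwing.
function runSync(file, args) {
  try {
    execFileSync(file, args, { stdio: 'inherit' });
    return 0;
  } catch (err) {
    // err.status holds the child's exit code when it exited on its own.
    return err.status;
  }
}

// Stand-in for a failing download: a child Node process that exits with
// code 22, the code curl reported for the 429 response earlier in this thread.
var code = runSync(process.execPath, ['-e', 'process.exit(22)']);
console.log('child exit code: ' + code);
```

Because the error is caught, the calling script keeps running and can decide what to do with the exit code, for example log it and move on to the next paper.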

Other than that I now have to wait for 1000+ files to throw a

416 Range Not Satisfiable

as described in https://github.com/ContentMine/getpapers/issues/158, all seems to work fine.

Conclusion

This "too many requests" error was the result of my using a curl wrapper asynchronously and without proper serialization of HTTP requests in the callback.

(NOTE: I was forced to use my own curl wrapper due to ECONNRESET errors, see https://github.com/ContentMine/getpapers/issues/155)

Now that I do it synchronously as above, the number of child download processes has gone down from many hundreds to just a few - and so has the number of connections. Accordingly, this error has gone.

Therefore, if you don't hear from me, it means you may close this issue.

sedimentation-fault commented 7 years ago

The synchronous curl-wrapper workaround above works like a charm - it has been running for 1.5 days, has handled all kinds of HTTP errors gracefully, is at 70% and still going! I suggest it to anybody as a temporary (or even permanent) solution to HTTP errors that getpapers cannot (yet) handle, and to the developers as a hack that can help in getting more information about the inner workings of the HTTP connection.

Thank you all for your tips and great help! :+1:

sedimentation-fault commented 7 years ago

Some notes about execution are in order:

Remember, we are talking about downloading 24818 papers of the math.DG category of arxiv.org with

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir arxiv/math.DG -p

Downloads finished. So far so good. However:

If you can cope with the above, you will come to a happy end! It works! :smile:

P.S.: The size of this download is 9GB. Given Amazon S3 pricing of $0.04/GB (Amazon S3 is suggested by arxiv for bulk downloads, meaning downloads of their tarballs containing the whole 270GB of their papers), I incurred a cost of $0.36 to arxiv. Seen from a different point of view, this low cost also means that the whole 270GB of all their papers would cost $10.80 to download without any of the above problems, in three tarballs, from the arxiv S3 service - which might explain the lack of interest in this kind of bulk downloading through getpapers. Of course, getpapers offers much more than just downloading (querying, for example, or even metadata processing through its arxiv_results.json file - to name a few).

blahah commented 7 years ago

@sedimentation-fault as already mentioned, getpapers will not support subverting reasonable limits put in place by the organisations running the APIs we wrap. ArXiv is a free service for researchers, funded through philanthropic grants, and is more efficient with their spending than almost any other publishing platform. Please do not change your IP to avoid their rate-limiting - that's offloading the cost of your download onto them, which they have explicitly said they cannot afford (whereas you have pointed out that you think the cost is reasonable - so why not pay it?). They are not some multinational corporation that makes insane profits year after year at the cost of the public purse (like Elsevier). They push the whole of society forward through their work.

Note also that your ArXiv accounting assumes glacier storage which they will almost certainly not be using - they are most likely on the standard plan, and will have hundreds or thousands of well meaning academics (plus many less well-meaning entities) trying to scrape, crawl, or otherwise mass download their content every day.

There are many bad actors in the publishing system, and we make a point of knowing who they are and never working with them. The organisations we do work with are the ones that (a) deserve all of our support and (b) cannot afford to be exploited.

If, as seems likely on the basis of your helpful and detailed engagement, you're interested in driving this technology forward, your insights into how to bypass unreasonable limits put in place by bad actors would be welcome over at quickscrape. Figuring out how to enable research while minimising the harm of those organisations is a huge challenge, and one we'd really welcome your help with.

Any technical details about how to bypass limits put in place by the providers supported in this repo will be removed.

petermr commented 7 years ago

Thanks both, I think that directing some of this energy and technology to quickscrape would be really valuable.

getpapers is a tool to maximize the efficiency of extracting content from willing organizations. There are an increasing number of good players who expose APIs and want people to use them responsibly. (I've been on the Project Advisory Board of EuropePMC for 10 years and seen this from the other side - they aim to support high volumes of downloads and we work with them. @tarrow frequently contacts them with problems and they respect this and respond.) Note that I and other members of the ContentMine community have frequent contact with many repositories (arXiv, HAL, CORE, etc.) and work with them to resolve problems. But as @blahah says, it's underfunded compared with the investment that rich publishers make in non-open systems.

By contrast quickscrape aims to scrape web pages to which the user has legal access (I stress this). Many publishers do not provide an API and some that do have unacceptable terms and conditions. quickscrape has been designed to take a list of URLs (or resolved DOIs) and download the content from a web site. This should only be done when you believe this is legal. The problem is that the sites often use dynamic HTML / Javascript, contain lots of "Publisher Junk" and change frequently. If you have a list of (say) 1000 URLs then it may well contain 50 different publishers. There is a generic scraper which works well for many, but for some it's necessary to write bespoke scrapers.

A typical (and valuable) use of quickscrape is in conjunction with Crossref (who we are friends with). Crossref contains metadata from publishers (often messy) and the ability to query, but does not itself have the full text. So a typical workflow (which I spent a lot of time running last year) is:

This is really valuable for papers which are not in a repository. It's a very messy business as there are frequent "hangs" and unexpected output or none. @tarrow worked hard to improve it but there is still a lot of work to be done.

If you are interested in this PLEASE liaise with @blahah - he wrote it and knows many of the issues.

sedimentation-fault commented 7 years ago

@petermr , thank you for the details on quickscrape. I must confess I haven't run it even once, as I have concentrated on getpapers so far.

@blahah , while I agree with you in principle, I think it is a bit of an overreaction to classify IP address changing as a "black hat" practice. My IP changes every 24 hours - even more often, at the discretion of my Internet provider. Is that change also "bad"? Besides, I am old enough to remember a not-so-distant past when I was changing my IP address each time I was using my modem to get onto the Internet. That's normal, by the way - you fire up your modem, get a new IP, do your work and disconnect - because you don't want to stay connected all the time (you would have to pay for this). I thus had times when I was getting a new IP every 10 minutes. If that was not bad practice back then, why should it be considered as such now?

For arxiv and my download habits/connection/circumstances (i.e. for the truly low volume I am using, compared to all youtubers and such out there), I would have to change my IP every 20 minutes/half an hour to keep getpapers going - so what?

As far as costs are concerned, do you really think those $0.36 hurt them?

On the other side, you are right - why not use their paying service? Comparing the days and months one would need to get it all for free - and all the problems along the way - this is really competitive. I didn't use it because I hadn't checked it until yesterday. Maybe I will use it in the future, I really find it very tempting.

BUT: I am definitely NOT interested in ALL arxiv papers - and their service is just that, all or nothing. You can't cherry-pick your papers there. You cannot select by category. With getpapers you can - that's one of the reasons I used that, instead of S3.

I really don't think that hurts anybody. I was the first that got very upset when I realized that something (that I thought was getpapers) was hammering the web server with excessively high frequency - see my very first post in this thread under "The problem now".

sedimentation-fault commented 7 years ago

Maybe my answer above will be interpreted as meaning that I "don't care", only because I ask "so what?". Let me clarify that I do care - in fact A LOT.

Here's why: for the very simple reason that it is NOT in the interest of any downloader to kill the web server he downloads from! On the contrary, if the web server thrives, the downloader thrives too. :smile:

On the other side, if you trespass some limit, the web server WILL tell you - that's what HTTP errors like 403 (Forbidden), 429 (Too Many Requests) and the like are there for. My philosophy is: if you get one of those, slow down a bit and retry. If it's OK, the web server will let you download - if not, it will give you an error again. Pure trial and error.

I don't see anything wrong with that, no matter what downloader we are talking about, be it getpapers or quickscrape. You cannot know (and implement!) the TOS of every web server out there - but every web server knows its own TOS and will let you know when you are acting against them (at least in theory it should be so). By implementing what I call "elastic sleep", you can easily implement a flexible rate-limiting strategy that adapts to every TOS:

Elastic Sleep

Start with a sensible default sleep interval between downloads. If two consecutive downloads give you a 200 (OK) message, try decreasing it by some sensible amount or percentage (e.g. cut it in half). If you get errors, increase it by another sensible amount or percentage (e.g. double or triple it). This way you will soon settle for a sleep interval that is acceptable to the web server you are talking to.

There is no point in trying with some fixed values just because someone thinks they implement the TOS of some provider. What is acceptable and what not may change in the course of time - even during the day. It would be too much work to try to hit that moving target. Elastic sleep lets you achieve this with minimum programming work - and maximum flexibility.
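A minimal sketch of elastic sleep, under stated assumptions: the interval is halved after two consecutive successful downloads and doubled after any error (the example factors from the description above), and it is clamped between a floor and a ceiling that I add here only so it cannot collapse to zero or grow without bound. createElasticSleep and its option names are hypothetical.

```javascript
// Hypothetical sketch of "elastic sleep": keeps the current sleep interval
// and adapts it to the HTTP status of each finished download.
function createElasticSleep(options) {
  var delay = options.startMs;
  var okStreak = 0;

  return {
    // Interval to wait before the next download.
    current: function () {
      return delay;
    },
    // Feed in the status of the last download; returns the new interval.
    record: function (status) {
      if (status >= 200 && status < 300) {
        okStreak += 1;
        if (okStreak >= 2) {
          // Two successes in a row: cut the interval in half.
          delay = Math.max(options.minMs, Math.floor(delay / 2));
          okStreak = 0;
        }
      } else {
        // Any error: double the interval and start counting again.
        okStreak = 0;
        delay = Math.min(options.maxMs, delay * 2);
      }
      return delay;
    }
  };
}
```

A downloader would call record(status) after every response and wait current() milliseconds before the next request; with, say, startMs 1000, minMs 250 and maxMs 8000, a run of errors pushes the interval up to 8 seconds and a run of successes brings it back down step by step.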

sedimentation-fault commented 7 years ago

...and if you are looking for ideas, be it for quickscrape, or getpapers, have a look at JDownloader and its (free and open) source code. :wink: