Rate limit EPMC page scrape requests

richard-jones commented 7 years ago

From EPMC:

Our front-end does employ a penalty monitor. E.g. requests of the form:

http://europepmc.org/articles/PMC4337530

The penalty scales depending on the duration window for the requests, but at its strictest blocks a user at >=20 requests per minute.

emanuil-tolev commented 7 years ago

So we need to make < 1 request every 3 seconds. I suggest we go for something safer like 1 request every 5 seconds.
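The arithmetic behind those two figures can be checked with a small sketch. The function names here are illustrative only; the 20 requests per minute threshold is the one EPMC quoted above.

```javascript
// Convert between a requests-per-minute ceiling and a per-request interval.
function minIntervalMs(perMinute) {
  return 60000 / perMinute;
}

function requestsPerMinute(intervalMs) {
  return 60000 / intervalMs;
}

const blockThreshold = 20; // from EPMC: blocked at >= 20 requests per minute

console.log(minIntervalMs(blockThreshold)); // 3000 (ms): requests must be spaced further apart than this
console.log(requestsPerMinute(5000));       // 12 requests per minute at one request every 5 seconds
```

At one request every 5 seconds the rate is 12 per minute, which leaves a margin of 8 requests per minute below the strictest threshold.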

We do know exactly when we're hitting their API and when we're hitting their UI - there are separate functions handling this in the codebase. Implementing a throttle specifically for UI requests might be a bit tricky, considering we have potentially multiple workers asking for their HTML.

@markmacgillivray perhaps something that all workers can inspect is in order, like a value last_epmc_html_access in Mongo. Then each worker can check whether now - last_epmc_html_access < 5 seconds before making a request, and if so, wait for the remaining 5 seconds - (now - last_epmc_html_access) and try again.
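A minimal sketch of that check, under stated assumptions: the `store` object is an in-memory stand-in for the shared Mongo value, and `msToWait` and `tryAcquire` are hypothetical names, not functions from the codebase. The wait returned is the remainder of the 5 second gap, that is, 5 seconds minus the time already elapsed since the last access.

```javascript
const MIN_GAP_MS = 5000; // one request every 5 seconds, as suggested above

// How long a worker should wait before the next EPMC UI request (0 means "go ahead").
function msToWait(nowMs, lastAccessMs, minGapMs = MIN_GAP_MS) {
  if (lastAccessMs === null || lastAccessMs === undefined) return 0; // no request recorded yet
  const elapsed = nowMs - lastAccessMs;
  return elapsed >= minGapMs ? 0 : minGapMs - elapsed;
}

// In-memory stand-in for the shared store (an assumption; the comment proposes Mongo).
const store = { last_epmc_html_access: null };

// Check the shared value and, if the gap has passed, record this access.
function tryAcquire(nowMs) {
  const wait = msToWait(nowMs, store.last_epmc_html_access);
  if (wait === 0) store.last_epmc_html_access = nowMs;
  return wait; // otherwise wait this many ms and retry
}
```

With several workers, the check and the update would need to happen as one atomic operation on the shared store, otherwise two workers could both see an expired gap and both proceed. The sketch does not show that part.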

I do foresee some potential trouble here, like if we have 3 workers, then no. 1 and no. 2 get to make all the EPMC UI requests, and no. 3 just sits there waiting for a long time, potentially tripping our "process stuck" alerts.
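One possible way to soften that starvation worry, offered as an assumption rather than anything the thread settled on, is to have each worker reserve the next free time slot instead of polling and retrying. Every worker then gets a bounded wait in arrival order. `nextSlotMs` is an in-memory stand-in for a shared value and `reserveSlot` is a hypothetical name.

```javascript
const GAP_MS = 5000; // spacing between EPMC UI requests

let nextSlotMs = 0; // earliest time at which the next request may be sent

// Reserve the next free slot and return how long the caller must wait for it.
function reserveSlot(nowMs) {
  const slot = Math.max(nowMs, nextSlotMs); // now, or the end of the queue
  nextSlotMs = slot + GAP_MS;               // push the queue end back by one gap
  return slot - nowMs;
}
```

Three workers arriving at the same instant would get waits of 0, 5 and 10 seconds respectively, so none of them is locked out indefinitely. As with any shared counter, the read and the update would have to be atomic across workers.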

markmacgillivray commented 7 years ago

How we implement a throttle was not the main question - and it is not hard to solve. What we do need to know is, was it the UI requests we sent that triggered the block? This was not clear from your previous investigation. Also, Richard had specific questions in the email thread:

"We hit the UI when we are looking for AAM information, right, and that information is not available in the fulltext XML. Can we check that the functionality is working correctly (i.e. it doesn't hit the UI if it doesn't need to), and otherwise look to send such requests through a throttle that will keep us beneath their limits.

Could you also send me/point me to the information we get only from the UI, and I'll see if they are open to adding it to the XML."

@emanuil-tolev can you answer the above questions so that @richard-jones can follow up with them again, and then we will know if a throttle is even necessary.

emanuil-tolev commented 7 years ago

> What we do need to know is, was it the UI requests we sent that triggered the block? This was not clear from your previous investigation.

EPMC indicate it can only be UI requests that trigger the block. Presumably it is then the UI requests that triggered the block. Do we need to establish this any more firmly than that?

> Also, Richard had specific questions in the email thread

Apologies for the delay in answering these, @richard-jones. Some take a fair bit more to answer than just looking at the code, e.g. checking we don't hit their UI if we don't need to. The code reads correctly, but coming up with ways of trying to break it is a bit more time-consuming. Maybe we could consider a few unit tests of the form "expect HTTP GET to http://europepmc.org/test to be attempted" if we want to ensure something about our integration with a particular API. I've seen Jasmine tests which offer a simple way of asserting an HTTP call was made, and I'm sure they're easily available with Meteor, though I've not checked.

> "We hit the UI when we are looking for AAM information, right, and that information is not available in the fulltext XML. Can we check that the functionality is working correctly (i.e. it doesn't hit the UI if it doesn't need to), and otherwise look to send such requests through a throttle that will keep us beneath their limits."

It does appear to be working correctly from what I can tell of reading and testing manually.

> Could you also send me/point me to the information we get only from the UI, and I'll see if they are open to adding it to the XML

We get a great deal of information out of the API. We only resort to the UI for:

Attempting to find out the licence in EPMC. But, we only scrape their UI after attempting to get the licence information from the Fulltext XML (which comes from their REST API).

Attempting to figure out Author Manuscript status. We first attempt to get this from the Fulltext XML. Failing that, we look for one of these 4 strings in the HTML version on their UI:

    var s1 = 'Author Manuscript; Accepted for publication in peer reviewed journal';
    var s2 = 'Author manuscript; available in PMC';
    var s3 = 'logo-nihpa.gif';
    var s4 = 'logo-wtpa2.gif';

That's it, those are the only 2 cases and only after the XML has been attempted.
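The fallback order described above can be sketched as follows. This is an illustration under assumptions, not the actual Lantern code: the function names, the injected `fetchHtml` callback, the exact case-sensitive substring match, and the `epmc_xml` label are all hypothetical. Only the four strings come from the comment.

```javascript
// The four marker strings quoted in the comment above.
const AUTHOR_MANUSCRIPT_MARKERS = [
  'Author Manuscript; Accepted for publication in peer reviewed journal',
  'Author manuscript; available in PMC',
  'logo-nihpa.gif',
  'logo-wtpa2.gif',
];

// True if any marker occurs in the HTML (exact substring match is an assumption).
function hasAuthorManuscriptMarker(html) {
  return AUTHOR_MANUSCRIPT_MARKERS.some((marker) => html.includes(marker));
}

// Use the XML answer if there is one; only otherwise request the UI page.
// `fetchHtml` is injected so the UI is never contacted when the XML sufficed.
function authorManuscriptStatus(fromXml, fetchHtml) {
  if (fromXml !== null && fromXml !== undefined) {
    return { isAuthorManuscript: fromXml, source: 'epmc_xml' }; // label is an assumption
  }
  return {
    isAuthorManuscript: hasAuthorManuscriptMarker(fetchHtml()),
    source: 'epmc_html',
  };
}
```

Passing the page fetch in as a callback keeps the "UI only as a last resort" rule visible in one place, and makes it straightforward to confirm that the UI is not requested when the XML already gave an answer.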

markmacgillivray commented 7 years ago

What I mean by "was it the UI requests that triggered the block" is - yes, EUPMC block based on requests to their UI. But if they have blocked based on requests to their UI, does that ALSO mean we are blocked from querying their API? If they block by IP, and block on any request to any domain they control, then that is possible.

So - was it a request that we sent to their UI that returned a block notification that enabled you to know we were being blocked? Or, did you see that block notification in response to an API call, that may have been blocked AFTER some UI calls caused our IP to be blocked?

This is what is still not clear.

I don't understand why we need Jasmine or unit tests to know if we are calling their API or not, or why it would be hard to test - a message log inside the function that makes the call would show when we are calling it. Even as simple as a console log then watching it would show this up. More complex, if necessary, change the URL it sends to (in test) and record whether or not that URL gets hit - or again, inside the method, ping another URL or send an email to let you know it is happening. Or just record the fact that it was hit in the provenance info about the lantern job. Anyway, all these are probably unnecessary. Just looking at the code should tell us the conditions under which it is called.
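The "message log inside the function" idea can be sketched with a small wrapper. Everything here is hypothetical: `withCallLog` and `fetchEpmcHtml` are illustrative names, and the wrapped function is a stub that makes no network request.

```javascript
// Wrap a function so every call is logged to the console and recorded.
function withCallLog(name, fn) {
  const calls = [];
  const wrapped = (...args) => {
    calls.push({ args, at: new Date().toISOString() });
    console.log(`[${name}] call #${calls.length}:`, ...args);
    return fn(...args);
  };
  wrapped.calls = calls; // could be copied into a job's provenance info
  return wrapped;
}

// Stand-in for the function that requests a page from the EPMC UI (stub, no network).
const fetchEpmcHtml = withCallLog('epmc-ui', (pmcid) => `<html>stub page for ${pmcid}</html>`);

fetchEpmcHtml('PMC4337530');
console.log(fetchEpmcHtml.calls.length); // 1
```

Watching the console output during a job, or inspecting the recorded `calls` afterwards, shows exactly when the UI path is taken.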

@richard-jones do you want to follow up with EUPMC and see about getting the necessary data added to the API, and do you think that is likely within a reasonable timeframe? Or shall I go ahead and put in some rate limiting on this?

emanuil-tolev commented 7 years ago

> What I mean by "was it the UI requests that triggered the block" is - yes, EUPMC block based on requests to their UI. But if they have blocked based on requests to their UI, does that ALSO mean we are blocked from querying their API? If they block by IP, and block on any request to any domain they control, then that is possible.
>
> So - was it a request that we sent to their UI that returned a block notification that enabled you to know we were being blocked? Or, did you see that block notification in response to an API call, that may have been blocked AFTER some UI calls caused our IP to be blocked?

Ah, I see. I'll have a look asap. We'd still need to rate limit calls to the UI either way.

> Or shall I go ahead and put in some rate limiting on this?

I'd recommend we put in the rate limiting, as the results degrade quite severely once the block kicks in, and it happens on every sheet of more than a couple of hundred rows. Wellcome definitely want it and it's easy to see why :)

markmacgillivray commented 7 years ago

But if EUPMC will put the data into the API, then we can get the data much faster than going to a rate-limited UI, so I don't think we should put in the rate limiter unless EUPMC won't add the data to the API or can't add it soon enough.

emanuil-tolev commented 7 years ago

> But if EUPMC will put the data into the API

During initial development a long time ago (still the Python version), Wellcome informed us that EPMC HTML and the EPMC API were not always up to date with each other. That's why they wanted us to put in the EPMC HTML fallback if we did not find the relevant information in the API. This has been going on for years - EPMC are very unlikely to be able to resolve the problem on the timescale we need to repair the recent degradation in the quality of Lantern results.

If their API and UI were always up to date with each other, we would never ever get licence results with source "epmc_html" since the information would have been retrieved from the API XML. However, we do get these results. The exact same licence checker function is running over both the HTML and the whole of the XML.

EDIT: Same applies to the author manuscript HTML fallback.

markmacgillivray commented 7 years ago

OK. So @richard-jones is saying he thinks he can ask EPMC if data can be added to the API. @emanuil-tolev is saying they are unlikely to do this.

It seems to me that we are as well to ask EPMC first. @richard-jones do you want to follow up with them, or do you want to agree with @emanuil-tolev that it is not worth waiting for a response from them, even if you do follow up with them?

I'll wait for an answer to the above, and then make a rate limiter if necessary, or not if not.

richard-jones commented 7 years ago

I think the answer is that either way right now we're going to need a rate limiter, because any conversation I have with EPMC will take a while, and may not result in them being able to do what we ask, and even if it does, who knows what their backlog or release cycle is like. So, if we can set up a rate limiter anyway, so that we can get the service working correctly in the short term, that would be best.

richard-jones commented 7 years ago

I've now contacted EPMC about this. It may not be possible to fix the issue completely, but they are keen to see what can be done to improve the situation.

To help them debug, it would be useful to have examples of records where we resort to the UI and get a successful result. Could you give me:

Some examples of where we obtain the licence from the EPMC UI
Some examples of where we obtain the AAM information from the EPMC UI. Since there are a couple of ways of doing this, a couple of different examples would be useful.

markmacgillivray commented 7 years ago

Some DOIs that get licence from epmc_html for me to test with:

markmacgillivray commented 7 years ago

10.1126/science.1230413 10.1126/science.1228160 10.1017/s0033291712002784 10.3109/10428194.2013.818142 10.1111/dmcn.12435 10.1097/sla.0000000000000894

markmacgillivray commented 7 years ago

There are also AAM calls that can trigger a UI lookup; I will get some examples of that soon.

markmacgillivray commented 7 years ago

Confirmed with AAM and licence calls for the above DOIs, and also with manual checks on EPMC licence and AAM lookups after running a batch job. Parts that call the EPMC UI now hold for 3.5s between calls, and then the jobs continue. Currently on dev but not on live; keeping the issue open until I move to live, and perhaps try with a really large sheet on dev first.
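The thread does not show how the 3.5s hold was implemented. As a rough sketch of one way to do it within a single process, the following serialises calls through a promise chain and sleeps between them. `createSerialThrottle` is a hypothetical name, and the injectable `sleep` parameter exists only so the behaviour can be checked without real delays.

```javascript
const HOLD_MS = 3500; // hold between EPMC UI calls, as described above

// Returns a `run(task)` function that executes tasks one at a time,
// waiting `holdMs` after each task finishes before starting the next.
function createSerialThrottle(
  holdMs = HOLD_MS,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  let chain = Promise.resolve();
  let first = true;
  return function run(task) {
    const result = chain.then(async () => {
      if (!first) await sleep(holdMs); // no hold before the very first call
      first = false;
      return task();
    });
    chain = result.catch(() => {}); // a failed task must not block later ones
    return result;
  };
}
```

A 3.5 second hold caps the rate at 60 / 3.5, about 17 requests per minute, which is under the 20 per minute threshold EPMC quoted. Because the hold here starts after the previous call completes, the real rate would be somewhat lower still.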

markmacgillivray commented 7 years ago

Done, on dev and live.

CottageLabs / LanternPM

Rate limit EPMC page scrape requests #131