emanuil-tolev closed this issue 7 years ago
Does this mean that the harvester is not currently running?
Could it be changes to the EPMC API that we need to take account of?
It looks like it gets to some point, then stops. We don't yet know what causes the error, so we have no idea at what point in each run it could occur. It could happen at the very start.
I've hit this URL which is one of the ones that was generating a 403:
It is working fine for me, so this must be an intermittent thing at EPMC. It would be useful to know if this is still happening - can we tell from the logs?
One possibility is that we're tripping a rate limiter, though I didn't think EPMC's API had one (their UI does). If this is happening reliably, I'd start by increasing the throttle setting, to see if that resolves it.
After some investigation, we have found this to only be a problem on the DOAJ machine, indicating a possible IP blacklisting. We're not sure why, as the API is not supposed to be rate limited, and we did agree to limit ourselves to a maximum of 5 requests per second in discussions with their technical people early on. We can raise the throttle if there is a rate limit, so that would be a quick fix.
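For reference, the kind of client-side throttle discussed above could look roughly like the sketch below. This is not the harvester's actual code: the `Throttle` class and its parameters are hypothetical names made up for illustration, and the only figure taken from this thread is the agreed ceiling of 5 requests per second. Raising the throttle would then mean passing a smaller `max_per_second`.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between successive requests."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        # 5 requests per second means at least 0.2 seconds between requests.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Usage (assumption: call wait() immediately before each request to the API).
throttle = Throttle(max_per_second=5)
for _ in range(3):
    throttle.wait()
    # ... send one request here ...
```

The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.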
I have contacted the EPMC helpdesk to find out what the situation is.
EPMC have responded and said that they can't see us on a blacklist. I have sent them some more diagnostic information, as this is certainly a problem that only manifests on the live server.
A possible workaround on our side is to change the IP address from which we send requests and see whether that starts working (and, if it does, whether it stops working again after a certain amount of time).
EPMC have confirmed that we were blacklisted, and they have now fixed that. I will follow up with them to find out if there's any more detail on why it happened, and whether we can do something to ensure it doesn't happen again.
Here is our TODO to finish this issue:
In addition, @emanuil-tolev and @Steven-Eardley are going to look at a process to kill any running tasks before the next task is started, to avoid the problems of multiple instances running at once.
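One way this could work is with a PID file: each run records its process ID, and the next run terminates whatever process is recorded there before it starts. The sketch below only illustrates that idea and is not the process that was eventually adopted. The function names and the `harvester.pid` path are hypothetical.

```python
import os
import signal


def stop_previous_run(pid_file):
    """Send SIGTERM to the process recorded in pid_file.

    Returns True if a signal was sent, False if there was nothing to stop.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no previous run recorded, or the file is unreadable
    if pid == os.getpid():
        return False  # never signal ourselves
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # the recorded process has already exited
    return True


def claim(pid_file):
    """Record the current process as the running task."""
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))


# Usage (assumption: called once at the start of every scheduled run).
# stop_previous_run("harvester.pid")
# claim("harvester.pid")
```

A caveat with this approach is that process IDs can be reused by the operating system, so a stale PID file could point at an unrelated process. A real implementation would want to check that the recorded process really is a harvester run before signalling it.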
This task was completed as part of an overall review of the operations of the harvester.