DOAJ / doaj

The Directory of Open Access Journals - website and directory software
Apache License 2.0

DOAJ Harvesting from EPMC: 403 Forbidden errors #1226

Closed by emanuil-tolev 7 years ago

emanuil-tolev commented 7 years ago
App created at  05:30:01 21-02-2017
ES Index Already Exists; host:http://localhost port:9200 db:doajharvester
Not Initialising from document - ES Type+Mapping already exists for state
App initialised at  05:30:01 21---------------------------------------------------------------------------------
INFO in workflow [/home/cloo/harvester/src/harvester/service/workflow.py:10]:
Harvesting for Account:15449173
--------------------------------------------------------------------------------
/home/cloo/harvester/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:315: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/home/cloo/harvester/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
/home/cloo/harvester/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
--------------------------------------------------------------------------------
INFO in workflow [/home/cloo/harvester/src/harvester/service/workflow.py:19]:
Account:15449173 has 13 issns to harvest for
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
INFO in workflow [/home/cloo/harvester/src/harvester/service/workflow.py:53]:
Processing ISSN:1553-7390 for Account:15449173
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
INFO in workflow [/home/cloo/harvester/src/harvester/service/workflow.py:69]:
Processing ISSN:1553-7390 for Account:15449173 with Plugin:epmc Since:2016-10-21T00:00:00Z
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in client [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/modules/epmc/client.py:118]:
Requesting EPMC metadata from http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:74]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 resulted in status 403, attempt 1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:82]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 backing off for 2 seconds
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:74]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 resulted in status 403, attempt 2
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:82]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 backing off for 4 seconds
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:74]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 resulted in status 403, attempt 3
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:82]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 backing off for 8 seconds
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:74]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 resulted in status 403, attempt 4
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:82]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 backing off for 16 seconds
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:74]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 resulted in status 403, attempt 5
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:82]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 backing off for 30 seconds
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:74]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 resulted in status 403, attempt 6
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in http [/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/lib/http.py:82]:
Request to http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:"1553-7390" OPEN_ACCESS:"y" UPDATE_DATE:2016-10-21 sort_date:"y"&resulttype=core&format=json&page=1&pageSize=1000 backing off for 30 seconds
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
INFO in workflow [/home/cloo/harvester/src/harvester/service/workflow.py:78]:
Exception Processing ISSN:1553-7390 for Account:15449173 
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
INFO in workflow [/home/cloo/harvester/src/harvester/service/workflow.py:86]:
Saved state record for ISSN:1553-7390 for Account:15449173
--------------------------------------------------------------------------------
02-2017
https://doaj.org/api/v1/search/journals/username%3A%2215449173%22?page=1&pageSize=100
https://doaj.org/api/v1/search/journals/username%3A%2215449173%22?page=2&pageSize=100
Traceback (most recent call last):
  File "/home/cloo/harvester/src/harvester/service/runner.py", line 8, in <module>
    workflow.HarvesterWorkflow.process_account(account_id)
  File "/home/cloo/harvester/src/harvester/service/workflow.py", line 25, in process_account
    HarvesterWorkflow.process_issn(account_id, issn)
  File "/home/cloo/harvester/src/harvester/service/workflow.py", line 71, in process_issn
    for article, lhd in p.iterate(issn, lh):
  File "/home/cloo/harvester/src/harvester/service/models/epmc.py", line 45, in iterate
    for record in client.EuropePMC.complex_search_iterator(query, throttle=throttle):   # also throttle paging requests
  File "/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/modules/epmc/client.py", line 100, in iterate
    results = cls.query(query_string, page=page, page_size=page_size)
  File "/home/cloo/harvester/src/harvester/magnificent-octopus/octopus/modules/epmc/client.py", line 124, in query
    raise EuropePMCException(resp)
octopus.modules.epmc.client.EuropePMCException: <Response [403]>

richard-jones commented 7 years ago

Does this mean that the harvester is not currently running?

Could it be changes to the EPMC API that we need to take account of?

emanuil-tolev commented 7 years ago

It looks like the harvester runs up to some point and then stops. We don't yet know what causes the error, so we can't say at what point in a run it occurs; it could happen at the very start.

richard-jones commented 7 years ago

I've requested this URL, which is one of those that was returning a 403:

http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=ISSN:%221553-7390%22%20OPEN_ACCESS:%22y%22%20UPDATE_DATE:2016-10-21%20sort_date:%22y%22&resulttype=core&format=json&page=1&pageSize=1000

It works fine for me, so this must be an intermittent problem at EPMC. It would be useful to know whether it is still happening. Can we tell from the logs?
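One way to answer that from the logs is to count the status codes the HTTP layer reports. The sketch below is a hypothetical helper, not part of the harvester; it assumes only the line format visible in the debug output above ("Request to ... resulted in status 403, attempt 1").

```python
import re
from collections import Counter

# Matches the debug lines seen in the log above, for example:
#   Request to <url> resulted in status 403, attempt 1
STATUS_LINE = re.compile(
    r"^Request to (?P<url>.+) resulted in status (?P<status>\d{3}), "
    r"attempt (?P<attempt>\d+)$"
)


def count_statuses(lines):
    """Count the HTTP status codes reported in harvester log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[int(match.group("status"))] += 1
    return counts
```

Passing an open log file to `count_statuses` (for example, one file per nightly run) would show whether the 403 responses are still occurring and roughly how often.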

One possibility is that we're tripping a rate limiter, though I didn't think EPMC's API had one (their UI does). If this is happening reliably, I'd start by increasing the throttle setting, to see if that resolves it.
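For context on the retries: the delays in the log above (2, 4, 8, 16, 30 and 30 seconds) are consistent with a doubling back-off capped at 30 seconds. The sketch below reproduces that schedule; the parameter names and defaults are inferred from the log, not taken from octopus/lib/http.py, so treat them as assumptions.

```python
def backoff_delays(attempts, initial=2, factor=2, cap=30):
    """Return the delay in seconds to wait after each failed attempt.

    The defaults are inferred from the log output (2, 4, 8, 16, 30, 30);
    they are an assumption about the harvester's settings.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= factor
    return delays
```

The whole retry sequence spans only about a minute and a half, so it would not outlast a block that persists for longer than that. That is one reason to look at the throttle rather than the retry settings.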

richard-jones commented 7 years ago

After some investigation, we have found that this is only a problem on the DOAJ machine, which suggests that our IP may have been blacklisted. We're not sure why, as the API is not supposed to be rate limited, and we agreed to limit ourselves to a maximum of 5 requests per second in discussions with their technical people early on. If there is a rate limit, we can raise the throttle, so that would be a quick fix.
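To illustrate what the throttle does, here is a minimal sketch of a client-side limiter that enforces a minimum interval between requests. The `Throttle` class is hypothetical and is not the harvester's or octopus's implementation; the only figure taken from this thread is the agreed ceiling of 5 requests per second.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock  # injectable so the class can be tested without sleeping
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

With `Throttle(5)`, calling `wait()` before each request keeps the rate at or below 5 per second. Raising the throttle as discussed above would correspond to a smaller `max_per_second`.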

I have contacted the EPMC helpdesk to find out what the situation is.

richard-jones commented 7 years ago

EPMC have responded and said that they can't see us on a blacklist. I have sent them some more diagnostic information, as this is certainly a problem that only manifests on the live server.

A possible workaround on our side is to change the IP address from which we send requests and see whether they start working (and, if so, whether they stop working again after a certain amount of time).

richard-jones commented 7 years ago

EPMC have confirmed that we were blacklisted, and they have now fixed that. I will follow up with them to find out whether there is any more detail on why, and whether we can do something to ensure it doesn't happen again.

richard-jones commented 7 years ago

Here is our TODO to finish this issue:

richard-jones commented 7 years ago

In addition, @emanuil-tolev and @Steven-Eardley are going to look at a process to kill any running tasks before the next task is started, to avoid the problems of multiple instances running at once.
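One common way to do this is a PID file: each run signals whatever process the file names and then records its own PID. The sketch below is an assumption about how that could look, not the process that was eventually adopted; the function names and the PID file location are hypothetical, and it is POSIX-only.

```python
import os
import signal


def stop_previous(pid_file):
    """Send SIGTERM to the process recorded in pid_file, if it still exists.

    Returns True if a signal was sent, False if there was nothing to stop.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no previous run recorded, or the file is unreadable
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # stale PID file: that process has already exited
    return True


def claim(pid_file):
    """Record the current process as the running harvester."""
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))
```

A run would call `stop_previous(path)` and then `claim(path)` before harvesting. Note that this does not wait for the old process to exit, and a reused PID could in principle name an unrelated process, so a real deployment would want extra checks.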

richard-jones commented 7 years ago

This task was completed as part of an overall review of the operations of the harvester.