CERNDocumentServer / harvesting-kit

A kit containing various utilities and scripts related to content harvesting used in Invenio Software (http://invenio-software.org) instances such as INSPIRE (http://inspirehep.net) and SCOAP3 (http://scoap3.org)
GNU General Public License v2.0

no timeout in requests.get() #159

Closed tsgit closed 7 years ago

tsgit commented 7 years ago

harvesting hangs forever on

                    session = requests.session()
                    url = 'http://www.sciencedirect.com/science/article/pii'\
                          + path.split('/')[-1]
                    r = session.get(url)

https://github.com/inspirehep/harvesting-kit/blob/master/harvestingkit/elsevier_package.py#L598-L613

because no timeout is specified.
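A minimal sketch of the same request with an explicit timeout, so a stalled server raises `ReadTimeout` instead of hanging forever (the 30-second default and the helper names here are arbitrary choices, not from harvesting-kit; the URL construction just mirrors the snippet above):

```python
import requests

def build_article_url(path):
    # URL construction copied from the snippet above
    return ('http://www.sciencedirect.com/science/article/pii'
            + path.split('/')[-1])

def fetch_article_page(path, timeout=30):
    # timeout covers both the connect and the read phase; a hung
    # server now raises requests.exceptions.ReadTimeout
    session = requests.session()
    return session.get(build_article_url(path), timeout=timeout)
```

The caller can then catch `requests.exceptions.Timeout` and skip or retry the record instead of blocking the whole harvest.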

I also do not understand the purpose of calling get_publication_information() https://github.com/inspirehep/harvesting-kit/blob/master/harvestingkit/elsevier_package.py#L577

in

https://github.com/inspirehep/inspire/blob/master/bibtasklets/bst_consyn_harvest.py#L267

without a path.

Specifying a timeout raises ReadTimeout instead of hanging:

In [5]: url = 'http://www.sciencedirect.com/science/article/pii'

In [6]: session = requests.session()

In [7]: r = session.get(url, timeout=20)
---------------------------------------------------------------------------
ReadTimeout                               Traceback (most recent call last)
<ipython-input-49-835119a4da3b> in <module>()
----> 1 r = session.get(url, timeout=20)

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/sessions.pyc in get(self, url, **kwargs)
    499 
    500         kwargs.setdefault('allow_redirects', True)
--> 501         return self.request('GET', url, **kwargs)
    502 
    503     def options(self, url, **kwargs):

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    486         }
    487         send_kwargs.update(settings)
--> 488         resp = self.send(prep, **send_kwargs)
    489 
    490         return resp

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
    607 
    608         # Send the request
--> 609         r = adapter.send(request, **kwargs)
    610 
    611         # Total elapsed time of the request (approximately)

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
    497                 raise SSLError(e, request=request)
    498             elif isinstance(e, ReadTimeoutError):
--> 499                 raise ReadTimeout(e, request=request)
    500             else:
    501                 raise

ReadTimeout: HTTPConnectionPool(host='www.sciencedirect.com', port=80): Read timed out. (read timeout=20)

T.

david-caro commented 7 years ago

About the get_publication_info case in bst_consyn_harvest.py: it looks to me like it tries to get the publication date so it can skip records that are not new enough (there is a threshold date before which records are skipped).

The hang seems related to the User-Agent of the request. Connecting with:

curl 'http://www.sciencedirect.com/science/article/pii' -H 'Host: www.sciencedirect.com' -H 'User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'   --compressed -vv

works, while without the User-Agent header:

curl 'http://www.sciencedirect.com/science/article/pii' -H 'Host: www.sciencedirect.com'   --compressed -vv

it just hangs, so that looks like something on their side :/

david-caro commented 7 years ago

Quick fix: add a User-Agent header. I tried a random one and it worked too, so maybe any value will do?
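A sketch of that fix, assuming the session is set up once before harvesting (the User-Agent string below is the Firefox one from the curl test above, and per the comment any browser-like value may do; keeping a timeout as well is a safety net in case the server still stalls):

```python
import requests

# Arbitrary browser-like User-Agent; the exact string likely does not matter
USER_AGENT = ('Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:50.0) '
              'Gecko/20100101 Firefox/50.0')

def make_session():
    # Headers set on the session are sent with every subsequent request
    session = requests.session()
    session.headers.update({'User-Agent': USER_AGENT})
    return session

# usage, with a timeout kept as a second line of defence:
# r = make_session().get(url, timeout=20)
```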