CERNDocumentServer / harvesting-kit

A kit containing various utilities and scripts related to content harvesting used in Invenio Software (http://invenio-software.org) instances such as INSPIRE (http://inspirehep.net) and SCOAP3 (http://scoap3.org)
GNU General Public License v2.0

adds timeout and User-Agent to requests #160

Closed. tsgit closed this 7 years ago

tsgit commented 7 years ago

Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>

tsgit commented 7 years ago

I note that legacy prod is on version 0.6.7 of harvestingkit (from PyPI), while the GitHub version is 0.6.8?

coveralls commented 7 years ago

Coverage decreased (-0.06%) to 67.417% when pulling 01bbadbcb6c7bd6c4b858b184deffc2537b3bfa3 on tsgit:request_timeout into ab97853d041b59f1a70432c2a3d15b93a6f8bc2c on inspirehep:master.

tsgit commented 7 years ago

https://github.com/inspirehep/harvesting-kit/issues/159

tsgit commented 7 years ago

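The User-Agent header is necessary for some publishers, e.g. Elsevier. Without it the request below runs into a read timeout; with it the server answers with a 200: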

In [2]: s=requests.session()

In [3]: s.get('http://www.sciencedirect.com/science/article/pii', timeout=60)
---------------------------------------------------------------------------
ReadTimeout                               Traceback (most recent call last)
<ipython-input-3-16a8a01c12d0> in <module>()
----> 1 s.get('http://www.sciencedirect.com/science/article/pii', timeout=60)

/usr/lib/python2.6/site-packages/requests/sessions.pyc in get(self, url, **kwargs)
    485 
    486         kwargs.setdefault('allow_redirects', True)
--> 487         return self.request('GET', url, **kwargs)
    488 
    489     def options(self, url, **kwargs):

/usr/lib/python2.6/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    473         }
    474         send_kwargs.update(settings)
--> 475         resp = self.send(prep, **send_kwargs)
    476 
    477         return resp

/usr/lib/python2.6/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
    583 
    584         # Send the request
--> 585         r = adapter.send(request, **kwargs)
    586 
    587         # Total elapsed time of the request (approximately)

/usr/lib/python2.6/site-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
    477                 raise SSLError(e, request=request)
    478             elif isinstance(e, ReadTimeoutError):
--> 479                 raise ReadTimeout(e, request=request)
    480             else:
    481                 raise

ReadTimeout: HTTPConnectionPool(host='www.sciencedirect.com', port=80): Read timed out. (read timeout=60)

In [5]: s.get('http://www.sciencedirect.com/science/article/pii', headers={'user-agent': 'HarvestingKit/0.6.7'}, timeout=60)
Out[5]: <Response [200]>
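
A minimal sketch of how the two settings from the session above could be applied together. The names `default_headers`, `fetch` and `DEFAULT_TIMEOUT` are hypothetical and are not taken from harvestingkit; only the `HarvestingKit/<version>` header format and the 60-second timeout come from the session above.

```python
DEFAULT_TIMEOUT = 60  # seconds, the value used in the session above


def default_headers(version):
    """Build headers identifying the client (hypothetical helper).

    The version string is supplied by the caller, e.g. "0.6.7".
    """
    return {"User-Agent": "HarvestingKit/%s" % version}


def fetch(url, version, timeout=DEFAULT_TIMEOUT):
    """GET a URL with an explicit User-Agent and timeout (hypothetical helper)."""
    # Third-party dependency, imported here so that default_headers()
    # can be used and tested without requests being installed.
    import requests

    session = requests.Session()
    # Headers set on the session are sent with every request it makes.
    session.headers.update(default_headers(version))
    # requests has no session-wide timeout setting, so it is passed per call.
    return session.get(url, timeout=timeout)
```

Calling `fetch('http://www.sciencedirect.com/science/article/pii', '0.6.7')` would correspond to `In [5]` above. Without an explicit `timeout`, requests waits indefinitely for a server that never answers, which is why both settings matter here.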

coveralls commented 7 years ago

Coverage increased (+0.04%) to 67.518% when pulling cb36e6da3ccdeaa605e26a88a8f25ca6f1b30d4a on tsgit:request_timeout into ab97853d041b59f1a70432c2a3d15b93a6f8bc2c on inspirehep:master.

tsgit commented 7 years ago

Added basic tests for harvestingkit.utils.make_user_agent(), which takes its info (e.g. the package version) via pkg_resources.
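
For illustration, a sketch of how a User-Agent string might be derived from package metadata. This is not the actual `harvestingkit.utils.make_user_agent()`, whose signature and output are not shown in this thread. The function names, the `"unknown"` fallback and the distribution name `HarvestingKit` are assumptions.

```python
def package_version(dist_name):
    """Return the installed version of dist_name, or "unknown".

    The fallback covers both a missing pkg_resources module and a
    distribution that is not installed.
    """
    try:
        import pkg_resources  # shipped with setuptools
        return pkg_resources.get_distribution(dist_name).version
    except Exception:
        return "unknown"


def make_user_agent_sketch(dist_name="HarvestingKit"):
    """Build a "<name>/<version>" User-Agent string (hypothetical stand-in)."""
    return "%s/%s" % (dist_name, package_version(dist_name))
```

A basic test along these lines only needs to check the shape of the string, e.g. that `make_user_agent_sketch()` starts with `HarvestingKit/`, since the exact version depends on what is installed.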