CERNDocumentServer / harvesting-kit

A kit containing various utilities and scripts related to content harvesting used in Invenio Software instances such as INSPIRE and SCOAP3.
GNU General Public License v2.0

no timeout in requests.get() #159

Closed tsgit closed 7 years ago

tsgit commented 7 years ago

Harvesting hangs forever on

    session = requests.session()
    url = ''\
          + path.split('/')[-1]
    r = session.get(url)

because no timeout is specified.

I do not understand the purpose of get_publication_information() when called without a path.

Specifying a timeout makes the request fail with ReadTimeout instead of hanging:

In [5]: url = ''

In [6]: session = requests.session()

In [7]: r = session.get(url, timeout=20)
ReadTimeout                               Traceback (most recent call last)
<ipython-input-49-835119a4da3b> in <module>()
----> 1 r = session.get(url, timeout=20)

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/sessions.pyc in get(self, url, **kwargs)
    500         kwargs.setdefault('allow_redirects', True)
--> 501         return self.request('GET', url, **kwargs)
    503     def options(self, url, **kwargs):

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    486         }
    487         send_kwargs.update(settings)
--> 488         resp = self.send(prep, **send_kwargs)
    490         return resp

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
    608         # Send the request
--> 609         r = adapter.send(request, **kwargs)
    611         # Total elapsed time of the request (approximately)

/scratch/venvs/invenio-legacy/lib/python2.7/site-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
    497                 raise SSLError(e, request=request)
    498             elif isinstance(e, ReadTimeoutError):
--> 499                 raise ReadTimeout(e, request=request)
    500             else:
    501                 raise

ReadTimeout: HTTPConnectionPool(host='', port=80): Read timed out. (read timeout=20)
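
A minimal sketch of how the harvesting code could guard against this hang. The helper name `fetch_with_timeout` and the 20-second default are assumptions made for illustration, not harvesting-kit code. It passes `timeout` to `session.get()` and turns a `requests.exceptions.Timeout` (the base class of both `ConnectTimeout` and `ReadTimeout`) into a `None` return, so the caller can decide whether to skip or retry:

```python
import requests


def fetch_with_timeout(session, url, timeout=20):
    """Return the response for `url`, or None if the server does not answer in time.

    Hypothetical helper for illustration; the name and the 20-second
    default are assumptions, not harvesting-kit code.
    """
    try:
        # Without `timeout`, requests waits indefinitely for the server.
        return session.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connect and read timeouts.
        return None
```

Note that `timeout` bounds the wait for the connection and for each read from the socket, not the total download time.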


david-caro commented 7 years ago

So about the get_publication_info case, it looks to me like it tries to get the publication date so that it can skip the publication if it is not new enough (there is a threshold date before which publications are skipped).
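
The threshold check described above could look roughly like the following. This is only a sketch of the described behaviour, not the actual harvesting-kit logic; the function name `is_new_enough` and the use of `datetime.date` values are assumptions:

```python
from datetime import date


def is_new_enough(publication_date, threshold):
    """Return True if the publication should be harvested.

    Hypothetical helper: publications dated before `threshold` are skipped.
    """
    return publication_date >= threshold
```

A caller would then skip any publication for which this returns False.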

The timeout seems to be related to the User-Agent header of the request. Connecting with:

curl '' -H 'Host:' -H 'User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0' --compressed -vv

works, while the same request without it:

curl '' -H 'Host:' --compressed -vv

just hangs. That looks like a problem on their side :/

david-caro commented 7 years ago

Quick fix: add a User-Agent header. I tried a random one and it worked too, so maybe any will do?
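
A minimal sketch of that quick fix with requests, assuming a hypothetical helper `make_session` and a placeholder User-Agent string (neither is taken from harvesting-kit). Setting the header on the session means every request made through it carries the header:

```python
import requests


def make_session(user_agent="harvesting-kit"):
    """Build a session that sends an explicit User-Agent on every request.

    Hypothetical helper; the default value is a placeholder assumption.
    """
    session = requests.Session()
    # Session-level headers are merged into each request sent via this session.
    session.headers.update({"User-Agent": user_agent})
    return session
```

Combining this with a `timeout` argument on each `session.get()` call would still be advisable, since the header only works around this one server's behaviour.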