tsgit closed this 7 years ago
Regarding the get_publication_information call in bst_consyn_harvest.py: it looks to me like it tries to get the publication date so that it can skip the record if it is not recent enough (there is a threshold date, and anything published before it is skipped).
The timeout seems to be related to the User-Agent header of the request. Connecting with:
curl 'http://www.sciencedirect.com/science/article/pii' -H 'Host: www.sciencedirect.com' -H 'User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0' --compressed -vv
works, whereas the same request without that header:
curl 'http://www.sciencedirect.com/science/article/pii' -H 'Host: www.sciencedirect.com' --compressed -vv
just hangs, which looks like a problem on their side :/
Quick fix: add a User-Agent header. I tried a random one and it worked too, so maybe any will do?
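A minimal sketch of that quick fix, assuming Python 3 and the standard library's urllib.request (the harvesting code itself may use a different HTTP client). The helper name build_request and the DEFAULT_USER_AGENT constant are hypothetical; the header value is simply the one from the curl command above.

```python
import urllib.request

# Assumption: the Firefox string from the curl test above. According to the
# comment, a random User-Agent string worked as well.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:50.0) "
    "Gecko/20100101 Firefox/50.0"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header.

    Hypothetical helper for illustration; it only builds the request and
    does not open any network connection.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The returned object can be passed to urllib.request.urlopen in place of the bare URL.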
Harvesting hangs forever at
https://github.com/inspirehep/harvesting-kit/blob/master/harvestingkit/elsevier_package.py#L598-L613
because no timeout is specified.
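For illustration, a minimal sketch of a request that cannot hang indefinitely, assuming Python 3 and urllib.request rather than whatever client the linked lines actually use. The function name fetch_with_timeout, the 60-second default, and the injectable opener parameter are all hypothetical choices, not part of harvesting-kit.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=60, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None on failure or timeout.

    Hypothetical helper: `timeout` is in seconds (60 is an arbitrary
    default), and `opener` defaults to urllib.request.urlopen but can be
    replaced, for example in tests.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (urllib.error.URLError, socket.timeout):
        # Connection errors surface as URLError; a stalled read raises
        # socket.timeout. Either way the caller gets None instead of a hang.
        return None
```

Returning None lets the caller decide whether to skip the record or retry; raising an exception instead would be an equally reasonable design.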
I do not understand the purpose of get_publication_information()
(https://github.com/inspirehep/harvesting-kit/blob/master/harvestingkit/elsevier_package.py#L577)
in https://github.com/inspirehep/inspire/blob/master/bibtasklets/bst_consyn_harvest.py#L267
when called without a path
specifying a timeout:
T.