adsabs / adsabs-dev-api

Developer API service description and example client code

pdf downloads with curl fail with captcha message #64

qtast closed this issue 3 years ago

qtast commented 4 years ago

I have set up an API token as recommended and followed the guide to download PDFs directly. For arXiv PDFs, this works:

curl -H "Authorization: Bearer $token" 'https://ui.adsabs.harvard.edu/link_gateway/2005ApJ...618..426M/EPRINT_PDF' -L -o 't.pdf'

However, for the published paper, this command:

curl -H "Authorization: Bearer $token" 'https://ui.adsabs.harvard.edu/link_gateway/2005ApJ...618..426M/PUB_PDF' -L -o 't.pdf'

"downloads" a PDF that is actually HTML text, with warnings that my activity is suspected to be that of a robot:

We apologize for the inconvenience... ...but your activity and behavior on this site made us think that you are a bot. Note: A number of things could be going on here. If you are attempting to access this site using an anonymous Private/Proxy network, please disable that and try accessing site again. Due to previously detected malicious behavior which originated from the network you're using, please request unblock to site. Please solve this CAPTCHA to request unblock to the website

That shouldn't be happening since I've registered for the token, right? Never dreamed of doing anything "malicious" apart from this sort of thing, which has been working for years via my own shell scripts (without token until recently). I never bulk downloaded anything. In any case, what is the purpose of registering for the token, if I'm still seen as a robot?
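Since the server can hand back an HTML page saved under a .pdf name, a script can tell the two cases apart by looking at the first bytes of the file: a real PDF starts with "%PDF-". The following is only a minimal sketch; the helper name is_pdf is made up for illustration, and it assumes a head that supports -c (GNU and BSD versions do).

```shell
#!/bin/sh
# is_pdf FILE: succeed only if FILE exists and starts with the PDF
# magic bytes "%PDF-". Hypothetical helper, not part of ADS or curl.
is_pdf() {
    [ -f "$1" ] || return 1
    # Read the first 5 bytes and compare them with the PDF signature.
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Example use after a download:
#   if is_pdf t.pdf; then
#       echo "t.pdf looks like a PDF"
#   else
#       echo "t.pdf is not a PDF (possibly an HTML CAPTCHA or error page)"
#   fi
```

This only checks the file signature, so it catches the CAPTCHA and error pages described here but does not prove the PDF is complete or undamaged.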

Thanks!

ttshimiz commented 3 years ago

I am also having the exact same problem! It seems to occur only when I try to download the PUB_PDF for ApJ or MNRAS papers. While the ApJ papers return the same CAPTCHA request described above, the MNRAS papers simply return: "<!DOCTYPE html> Error: The requested resource does not exist."

qtast commented 3 years ago

My issue was fully resolved a while ago with the help of Sergi Blanco-Cuaresma at ADS via adshelp@cfa.harvard.edu. In short, the curl call needs to include a User-Agent header.

In more detail, adapted from Sergi's comments:

This is not an ADS issue. The endpoint that I am contacting is open and does not require a token. The controls are set by the publisher on their website, and ADS has nothing to do with them.

The following example curl commands now work in my specific case, where they are launched from a Linux system:

curl 'https://ui.adsabs.harvard.edu/link_gateway/2014ApJS..212....9T/PUB_PDF' -L -o 't.pdf' -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

curl 'https://ui.adsabs.harvard.edu/link_gateway/2003MNRAS.340..937T/ADS_PDF' -L -o 't.pdf' -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

curl 'https://ui.adsabs.harvard.edu/link_gateway/2003MNRAS.340..937T/PUB_PDF' -L -o 't.pdf' -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

For now, the publisher's User-Agent check is simple and browser versions are not verified, so this solution will hopefully keep working for a while.
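The three commands above differ only in the bibcode and the link type, so they could be wrapped in a small shell function. This is a rough sketch, not an official ADS client: the names ads_link_url and fetch_ads_pdf are hypothetical, the URL pattern and User-Agent string are simply copied from the commands above, and the final check assumes a head that supports -c.

```shell
#!/bin/sh
# User-Agent string copied from the commands above; this is an assumption
# that it keeps being accepted, not something the publishers guarantee.
UA='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

# ads_link_url BIBCODE LINKTYPE: print the link_gateway URL.
# Hypothetical helper following the URL pattern used in this thread.
ads_link_url() {
    printf 'https://ui.adsabs.harvard.edu/link_gateway/%s/%s\n' "$1" "$2"
}

# fetch_ads_pdf BIBCODE LINKTYPE OUTFILE (hypothetical helper):
# follow redirects with a browser-like User-Agent and save to OUTFILE.
fetch_ads_pdf() {
    # -f makes curl return non-zero on HTTP errors such as 404.
    curl -sS -f -L -H "user-agent: $UA" -o "$3" "$(ads_link_url "$1" "$2")" || return 1
    # A CAPTCHA page can still arrive with status 200, so also check
    # that the saved file starts with the PDF signature "%PDF-".
    [ "$(head -c 5 "$3" 2>/dev/null)" = "%PDF-" ]
}

# Example use:
#   fetch_ads_pdf '2003MNRAS.340..937T' PUB_PDF t.pdf || echo "download failed"
```

Keeping the User-Agent in one variable means only one line needs changing if the publisher's check ever becomes stricter.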

qtast commented 3 years ago

Dear Taro:

Please see my updated post above. If that doesn't work for you (e.g. you are not on a Linux system), please consider contacting adshelp@cfa.harvard.edu, as I mentioned above. Best wishes!

Panayiotis