Open larsgw opened 7 years ago
We have clearly made a mistake here. I imagine we're hammering them too hard, not observing a delay between requests, etc.
Probably the answer is to 1) fix it so we're playing by their rules 2) release a new version with the new version number in the UserAgent 3) let them know that we've fixed it in the new version (should they keep blocking the old one?)
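A rough sketch of what step 2 could look like, in case it helps. This is not how getpapers builds its headers today; the function names, the version number `1.2.3`, and the `mailto:` layout are all assumptions made up for illustration. The idea is only that the User-Agent carries the release version, so arXiv can tell a fixed release from the old one.

```python
import urllib.request


def build_user_agent(name: str, version: str, contact: str) -> str:
    """Return a User-Agent string that names the tool, its version and a contact.

    The "name/version (mailto:contact)" layout is an illustrative convention,
    not something arXiv or getpapers prescribes.
    """
    return f"{name}/{version} (mailto:{contact})"


def make_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the versioned User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical version number; the contact address is the one mentioned in this thread.
ua = build_user_agent("getpapers", "1.2.3", "contact@contentmine.org")
req = make_request("http://export.arxiv.org/api/query?search_query=all:electron", ua)
```

Nothing is sent over the network here; `req` would only be used once it is passed to something like `urllib.request.urlopen`.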
Also: did we get an email to contact@contentmine.org?
...or simply spoof the UserAgent header with some innocent string and move on. :-) A list of the most common UA strings can be found at: https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
I also bumped into this issue just now. A pity...
Same problem here. It's important to keep the arXiv downloader working, so I suggest:
- fix it so we're playing by their rules
For starters, crawl delays of at least 15 seconds must be introduced.
- let them know that we've fixed it in the new version (should they keep blocking the old one?)
Yes and yes. It is a bit strange that arXiv discourages automated access to /api, but this is probably (?) a bug.
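For what it's worth, the crawl delay suggested above could be sketched roughly like this. It is a minimal illustration, not getpapers code: `fetch_with_delay` and its parameters are hypothetical names, and the 15-second default is just the figure proposed in this thread. The `sleep` and `clock` arguments exist only so the pacing can be tested without actually waiting.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def fetch_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[T]:
    """Call fetch(url) for each URL, one at a time, never in parallel.

    After a request finishes, wait until at least delay_seconds have passed
    before starting the next one. No wait happens before the first request.
    """
    results: List[T] = []
    last_done: Optional[float] = None
    for url in urls:
        if last_done is not None:
            remaining = delay_seconds - (clock() - last_done)
            if remaining > 0:
                sleep(remaining)
        results.append(fetch(url))
        last_done = clock()  # measure the delay from the end of this request
    return results
```

A caller would pass its own download function as `fetch`, for example `fetch_with_delay(pdf_urls, download_one)`, where `pdf_urls` and `download_one` are whatever the caller already has.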
OK I will write to Paul.
@petermr actually, you might be better off emailing the lead software architect at arXiv (Erick Peirson). I've found him to be quite helpful & communicative: brp53@cornell.edu https://erickpeirson.github.io/
Thanks Ross.
Just to re-iterate that arXiv will block your IP (or your employer's) if you use the getpapers API as is to try and download PDFs.
May I suggest switching off the PDF download for now until a refactoring to conform with the current API guidelines is in place?
On Fri, Aug 9, 2019 at 10:39 AM Stephan Druskat notifications@github.com wrote:
Just to re-iterate that arXiv will block your IP (or your employer's) if you use the getpapers API as is to try and download PDFs.
Thank you. I'll add a comment
May I suggest switching off the PDF download for now until a refactoring to conform with the current API guidelines is in place?
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
See https://github.com/ContentMine/getpapers/issues/166#issuecomment-331714297. Note that
User-Agent: getpapers/TDM
seems to be working (for me) again (for now).