ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

arXiv.org PDFs denying access to User-Agent "getpapers/(TDM Crawler contact@contentmine.org)" #167

Open larsgw opened 7 years ago

larsgw commented 7 years ago

See https://github.com/ContentMine/getpapers/issues/166#issuecomment-331714297. Note that User-Agent: getpapers/TDM seems to be working (for me) again (for now).

tarrow commented 7 years ago

We have clearly made a mistake here. I imagine that we're hammering them too hard/not following a delay between requests etc...

Probably the answer is to 1) fix it so we're playing by their rules 2) release a new version with the new version number in the UserAgent 3) let them know that we've fixed it in the new version (should they keep blocking the old one?)

Also: did we get an email to contact@contentmine.org?
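Step 2 in the list above (a version number in the User-Agent) could be sketched as follows. This is a minimal illustration in Python, not getpapers' own code: the function name `build_user_agent`, the header layout, and the placeholder version are all assumptions made for the example.

```python
def build_user_agent(tool: str, version: str, contact: str) -> str:
    """Build a User-Agent value naming the tool, its release, and a contact.

    The layout "<tool>/<version> (TDM Crawler; <contact>)" is an assumption
    made for illustration, not the format getpapers actually sends.
    """
    version = version.strip()
    if not version:
        # Without a version, a server cannot tell a fixed release from an old one.
        raise ValueError("version must not be empty")
    return f"{tool}/{version} (TDM Crawler; {contact})"


# "X.Y.Z" is a placeholder; the real release number is not given in this thread.
headers = {
    "User-Agent": build_user_agent("getpapers", "X.Y.Z", "contact@contentmine.org")
}
```

With the release in the header, a server operator could in principle keep blocking the old string while allowing the fixed one, which is the question raised in step 3.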

sedimentation-fault commented 6 years ago

...or simply spoof the UserAgent header with some innocent string and move on. :-) A list of the most common UA strings can be found at: https://techblog.willshouse.com/2012/01/03/most-common-user-agents/

rossmounce commented 6 years ago

I also bumped into this issue just now. A pity...

merkys commented 6 years ago

Same problem here. It's important to keep the arXiv downloader working, so I suggest:

  1. fix it so we're playing by their rules

For starters, crawl delays of at least 15 seconds must be introduced.

  3. let them know that we've fixed it in the new version (should they keep blocking the old one?)

Yes and yes. It is a bit strange that arXiv discourages automated access to /api, but this is probably (?) a bug.
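The crawl delay suggested above could look roughly like this. It is a sketch in Python under stated assumptions, not a patch for getpapers: the class name `CrawlDelay` and its parameters are hypothetical, and the 15-second default simply mirrors the figure in this comment rather than any value confirmed by arXiv.

```python
import time
from typing import Callable, Optional


class CrawlDelay:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(
        self,
        min_interval: float = 15.0,  # the 15-second figure suggested above
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # when the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record the release time after any sleep, so gaps are measured
        # between actual requests rather than between calls to wait().
        self._last = self._clock()
        return slept
```

A downloader would create one `CrawlDelay()` and call `wait()` immediately before each request to the same host. The clock and sleep functions are injectable only so the behaviour can be checked without actually pausing.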

petermr commented 6 years ago

OK I will write to Paul.

rossmounce commented 6 years ago

@petermr actually, you might be better off emailing the lead software architect at arXiv (Erick Peirson). I've found him to be quite helpful & communicative: brp53@cornell.edu https://erickpeirson.github.io/

petermr commented 6 years ago

Thanks Ross.

sdruskat commented 5 years ago

Just to re-iterate that arXiv will block your IP (or your employer's) if you use the getpapers API as is to try and download PDFs.

May I suggest switching off the PDF download for now until a refactoring to conform with the current API guidelines is in place?
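One way to picture such a switch is a host check applied before any PDF request. This is only a sketch in Python: `PDF_DISABLED_HOSTS` and `pdf_download_allowed` are hypothetical names, and nothing in this thread indicates that getpapers has such a setting.

```python
from urllib.parse import urlsplit

# Hosts for which PDF download stays switched off. The entry is illustrative;
# it is an assumption for this example, not an existing getpapers option.
PDF_DISABLED_HOSTS = frozenset({"arxiv.org"})


def pdf_download_allowed(url: str, disabled_hosts=PDF_DISABLED_HOSTS) -> bool:
    """Return False if the URL's host is a disabled host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return not any(
        host == blocked or host.endswith("." + blocked)
        for blocked in disabled_hosts
    )
```

Metadata queries would be unaffected by a check like this; only the PDF fetch step would consult it and skip matching URLs.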

petermr commented 5 years ago

On Fri, Aug 9, 2019 at 10:39 AM Stephan Druskat notifications@github.com wrote:

Just to re-iterate that arXiv will block your IP (or your employer's) if you use the getpapers API as is to try and download PDFs.

Thank you. I'll add a comment

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK