ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Feature Request: Enable using institute access to get full text PDF #39

Open RamRS opened 9 years ago

RamRS commented 9 years ago

I'm not sure this can be applied to all types of institute access. My current institute uses SSO, so that might be fairly easy, but form-based SSO, such as the one at NYU (and many other institutes), might be more challenging.

blahah commented 9 years ago

@RamRS any idea how your SSO works? Does it give you a cookie for example?

RamRS commented 9 years ago

My current SSO? I'm not sure - from outside my office network, I log in with DOMAIN\user and password. Should I check if I have cookies from my office domain?

RamRS commented 9 years ago

So, I just tried accessing Nature, and the URL went from nature.com/<REST_OF_URL> to <INSTITUTE_PORTAL>:<PORT>/<REST_OF_URL>.

blahah commented 9 years ago

OK thanks - that looks like it might be tractable. How do you log in - through a form in a web interface?

RamRS commented 9 years ago

Yes, I search for a journal on a public site. When I click on the search result, it redirects to the second URL pattern seen above. I think access to that server is controlled through my institute's AD.

You know what, this same scheme worked at all my previous schools as well. NYU did this exact same <ROUTING_URL>:<PORT>/<REST_OF_URL> thing, where the first part replaces the journal website's hostname.

My grad school (a part of NYU now; the migration happened while I was in school) used to add a .databases.poly.edu suffix to the journal's URL (e.g., nature.databases.poly.edu) and did not display the routing URL or the port used.

I also think the port is a bit dynamic - as in, I'm not sure if port X maps to journal Y all the time. Maybe they are re-mapped periodically, I'm not sure.
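For anyone who wants to experiment with the two rewrite patterns described above, here is a minimal sketch in Python. It is not part of getpapers. The function names are hypothetical, and the rule for picking the journal label in the suffix case is a guess based on the single nature.databases.poly.edu example. Since the port may be re-mapped, it is passed in by the caller rather than looked up.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_with_port(url, proxy_host, port):
    """Swap the publisher host for proxy_host:port, keeping path and query.

    Mirrors the <ROUTING_URL>:<PORT>/<REST_OF_URL> pattern from the thread.
    """
    parts = urlsplit(url)
    netloc = f"{proxy_host}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def rewrite_with_suffix(url, proxy_suffix):
    """Turn e.g. www.nature.com into nature.<proxy_suffix>.

    Assumption: the journal label is the second-to-last hostname label.
    This is inferred from one example and may not hold for other hosts.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    journal = labels[-2] if len(labels) >= 2 else labels[0]
    netloc = f"{journal}.{proxy_suffix}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Calling `rewrite_with_suffix("http://www.nature.com/nature/archive/index.html", "databases.poly.edu")` would give a URL on nature.databases.poly.edu with the same path. Neither function handles the login step; they only build the routed URL.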

blahah commented 9 years ago

OK thanks - can you give me an example link with the routing url/port intact? I assume they can't be used without logging in first, but I'd like to see what HTTP headers there are.

RamRS commented 9 years ago

Sure. Here we go:

Mount Sinai:

Nature: http://eresources.library.mssm.edu:2155/nature/archive/index.html

Science: http://eresources.library.mssm.edu:2145/magazine

I tried changing the port, and it leads to different journals.

My NYU access has been revoked.
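As a rough illustration of what the two Mount Sinai links above tell us, the sketch below splits each routed URL into host, port, and path, and builds a port-to-journal table from them. The names `EXAMPLES` and `split_proxy_url` are made up for this example, and the table only reflects these two observed links. Given the earlier remark that ports may be re-mapped, it should not be treated as stable.

```python
from urllib.parse import urlsplit

# The two routed URLs quoted in the thread.
EXAMPLES = {
    "Nature": "http://eresources.library.mssm.edu:2155/nature/archive/index.html",
    "Science": "http://eresources.library.mssm.edu:2145/magazine",
}


def split_proxy_url(url):
    """Return (host, port, path) for a routed URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.port, parts.path


# Observed mapping only; the proxy may assign ports differently over time.
port_to_journal = {split_proxy_url(url)[1]: name for name, url in EXAMPLES.items()}
```

Both links share the same host and differ only in port and path, which is consistent with the port selecting the journal.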