ScholarNinja / extension

Scholar Ninja - Chrome extension. A distributed open search engine for scholarly content, based on a WebRTC DHT network
MIT License

Add support for recent Institute of Physics (IOP) journals articles #5

Closed imrehg closed 10 years ago

imrehg commented 10 years ago

Multiple journals are supported this way, since IOP hosts all of them on the same website. I have some examples of working articles: 1, 2. The scraping rules match the current formatting. Unfortunately, the formatting was different a few years back (for example, 3). Those older articles are not handled by these rules and would need to be implemented separately.

I added these changes to extractor.js only (based on what I saw in your blog post), but it looks like other areas of the code replicate the same section. I couldn't fully test this code, so I would appreciate feedback!

jure commented 10 years ago

Thanks for your contribution, Gergely!

Looks like number 3 won't be matched by this rule set, because it contains "/fulltext" in the URL. Can you reliably exclude older articles that don't match these rules by only matching "/article", like you do now? Or is that part of the URL a coincidence?
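To make the question concrete, a URL check along these lines would accept only the newer "/article" pages and skip the "/fulltext" ones. This is a minimal sketch, not the code in extractor.js: the function name `isCurrentIopArticle`, the host `iopscience.iop.org`, and the assumption that the path ends in "/article" are all illustrative.

```javascript
// Hypothetical helper, not taken from extractor.js.
// Assumption: current-format IOP pages have a path ending in "/article",
// while older-format pages end in "/fulltext".
function isCurrentIopArticle(url) {
  let parsed;
  try {
    parsed = new URL(url);
  } catch (e) {
    return false; // not a parseable absolute URL
  }
  // Assumed host name for the IOP-hosted journals.
  if (parsed.hostname !== "iopscience.iop.org") {
    return false;
  }
  // Look only at the path (query string and fragment are ignored)
  // and require "/article" as the final segment, with an optional slash.
  return /\/article\/?$/.test(parsed.pathname);
}
```

A URL-only check like this is cheap, but it only excludes older articles if the "/article" suffix really does track the newer markup.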

I'll test your additions this afternoon. I'm also trying to figure out how to reliably test these things, so it's a bit easier to play with.

imrehg commented 10 years ago

The difference is not just the URL but the entire page markup: none of the fields match for 3.

Reliable exclusion is a good question. I looked around, and so far the articles that are open access (I'm outside of academia at the moment) and have the same URL pattern have the same markup too. It might be because of the way they redesigned their website some time in the past?
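Since the older pages differ in markup and not just in URL, one possible safeguard is to accept a page only when every field in the rule set matches. The following is a rough sketch under assumptions, not the project's extractor code: `extractAllOrNothing`, the `lookup` callback, and the selectors in `exampleRules` are hypothetical placeholders, because the actual IOP selectors are not quoted in this thread.

```javascript
// Hypothetical sketch, not the project's extractor code.
// `lookup` maps a selector to the matched text, or null when nothing matches.
// In a page it could wrap document.querySelector; it is injected here
// so the logic can be exercised without a DOM.
function extractAllOrNothing(rules, lookup) {
  const result = {};
  for (const [field, selector] of Object.entries(rules)) {
    const text = lookup(selector);
    if (typeof text !== "string" || text.trim() === "") {
      // A required field is missing: treat the page as unsupported.
      return null;
    }
    result[field] = text.trim();
  }
  return result;
}

// Placeholder selectors; the real IOP markup is not shown in this thread.
const exampleRules = {
  title: "h1.example-title",
  authors: ".example-authors",
};
```

With a guard like this, a page in the older format would yield `null` instead of a partially filled record, even if its URL happened to pass the pattern check.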

Let me know if you find any cases that didn't fit. It's too bad that they don't use the same templates for the whole site.

jure commented 10 years ago

Works great, thanks for your contribution! Pushing this out with version 0.0.5.

imrehg commented 10 years ago

Thanks a lot, really appreciate it! :) It was fun to make. I will try to add more physics-related extractors, and also spread the word.

On 25 June 2014 03:41, Jure Triglav notifications@github.com wrote:

Works great, thanks for your contribution! Pushing this out with version 0.0.5.
