ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

International Journal of Systematic and Evolutionary Microbiology (IJSEM) #33

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

I hope I've done this correctly...

The figure image scraping isn't ideal. At the moment it only gets the tiny .gif. Ideally I'd want the largest version of the figure images, but this isn't available in one click from the full-text HTML page; it requires *large.jpg to be appended. I might raise that as an issue.

blahah commented 9 years ago

Looks really good. The large image problem can be solved using 'follow-ons', a feature of ScraperJSON that I have not yet documented. I'll add that to this scraper.

Also, the supplementary material element only captures the link to a page that lists the files for download rather than downloading the files themselves. This situation also requires a follow-on.

petermr commented 9 years ago

Follow-ons would be really valuable. Could I suggest Ross and me as alpha-explorers for the existing undocumented code?

In reply to Richard Smith-Unna's comment of Tue, May 12, 2015 at 11:58 PM (https://github.com/ContentMine/journal-scrapers/pull/33#issuecomment-101451909).

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

blahah commented 9 years ago

Good idea @petermr - if you look at the commit I made above (https://github.com/rossmounce/journal-scrapers/commit/1eae30094c54640cdc09dcb79dfba5669a25d89c) you can see them in action.

Basically, any element can 'follow' any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don't want the followed element to be included in the results, you add it to a followables array instead of the elements array, as shown in the example I linked.

The followed element must capture a URL.
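To make the follow-on layout described above concrete, here is a minimal sketch in Python. It is not the ScraperJSON implementation: the definition is modelled as a plain dict, the element names, selectors, and URL pattern are invented for illustration, and `check_follows` and `reported_names` are hypothetical helpers. It only shows how a "follow" key links one element to another, and how an element placed under "followables" is followed without being reported.

```python
# Hypothetical scraper definition modelled as a Python dict.
# All names, selectors, and the URL pattern are illustrative assumptions,
# not taken from the IJSEM scraper or the linked commit.
definition = {
    "url": "example\\.org",
    "followables": {
        # Followed by another element but left out of the results.
        # It must capture a URL (here, the href of a link).
        "figure_page": {
            "selector": "//a[@class='fig-link']",
            "attribute": "href",
        },
    },
    "elements": {
        "title": {"selector": "//h1", "attribute": "text"},
        # This element is evaluated on the page that "figure_page" points to.
        "figure_large": {
            "follow": "figure_page",
            "selector": "//img[@class='large']",
            "attribute": "src",
        },
    },
}


def check_follows(definition):
    """Return {follower: followed}, checking that every 'follow' target exists."""
    known = {**definition.get("followables", {}), **definition.get("elements", {})}
    links = {}
    for name, element in known.items():
        target = element.get("follow")
        if target is None:
            continue
        if target not in known:
            raise KeyError(f"{name!r} follows unknown element {target!r}")
        links[name] = target
    return links


def reported_names(definition):
    """Names that would appear in the results: 'elements' only, not 'followables'."""
    return sorted(definition.get("elements", {}))


print(check_follows(definition))
print(reported_names(definition))
```

Moving "figure_page" from "followables" into "elements" would keep the follow link intact but also include the captured link URL in the results, which is the distinction drawn in the comment above.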