ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

BMC pages changed? #42

Closed markmacgillivray closed 7 years ago

markmacgillivray commented 8 years ago

It appears that the BMC scraper looks for an XML link on BMC journal pages:

https://github.com/ContentMine/journal-scrapers/blob/master/scrapers/bmc.json

The scraper definition above includes a search for a link with the text "download xml".

However on looking at some BMC pages returned by a getpapers query, there appears to be no mention of XML at all. See:

http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2041-z

I could just fix the scraper definition, but I am logging this as an issue as I wonder if it is some policy change on the part of BMC. If so, should we look into it and ask them why?

Looking back at another biomedcentral one that we processed back on December 11th:

http://bmcemergmed.biomedcentral.com/articles/10.1186/s12873-015-0063-0

At the time we did manage to get a fulltext.xml file for it. Our daily extracts are here:

http://store.contentmine.org/daily20151211/http_www.biomedcentral.com_1471-227X_15_36/

And yet now, looking at the BMC page (both the page in the browser and examining the source code), there does not appear to be mention of an XML file at all...
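
The check described in this comment (does the page source still contain a link whose text mentions "download xml"?) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the two HTML snippets are made-up stand-ins for an old and a new BMC page, and `find_links_with_text` is a hypothetical helper, not part of the scraper framework.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


def find_links_with_text(html, needle):
    """Return the href of every link whose text contains `needle` (case-insensitive)."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    needle = needle.lower()
    return [href for href, text in parser.links if needle in text.lower()]


# Made-up page fragments, for illustration only.
old_style_page = '<p><a href="/content/xml/article.xml">Download XML</a></p>'
new_style_page = '<p><a href="/content/pdf/article.pdf">Download PDF</a></p>'

xml_links_old = find_links_with_text(old_style_page, "download xml")
xml_links_new = find_links_with_text(new_style_page, "download xml")
```

With these fragments, the first call finds the XML link and the second returns an empty list, which is the situation reported for the current BMC pages.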

blahah commented 8 years ago

Yeah, the site looks very different now to how it looked a few months ago. And I agree with you - there's no XML mentioned anywhere in the source. Contacting BMC seems like a good plan.

larsgw commented 8 years ago

I'm having some trouble as well. Articles seem to be stored on subdomains rather than www.biomedcentral.com, making the scraper fail silently when used, and supplementary material and figures have changed as well. I'm preparing a pull request with fixes for these problems, if that's not a problem.
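
A rough sketch of the subdomain issue, assuming the scraper is selected by matching the article URL against a pattern. The two patterns and the `matches` helper are hypothetical and are not copied from bmc.json:

```python
import re

# Hypothetical pattern that only accepts the www host.
WWW_ONLY = re.compile(r"^https?://www\.biomedcentral\.com/", re.IGNORECASE)

# Hypothetical pattern that accepts biomedcentral.com and any of its subdomains.
ANY_SUBDOMAIN = re.compile(
    r"^https?://([a-z0-9-]+\.)*biomedcentral\.com/", re.IGNORECASE
)


def matches(pattern, url):
    """Return True if the URL starts with a host accepted by the pattern."""
    return pattern.match(url) is not None


# URLs quoted earlier in this thread.
journal_url = "http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2041-z"

old_pattern_hits = matches(WWW_ONLY, journal_url)
new_pattern_hits = matches(ANY_SUBDOMAIN, journal_url)
```

Here the www-only pattern does not match the journal subdomain URL, so nothing would be scraped and no error would be raised, while the broader pattern matches it.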

One bigger problem is that the license isn't wrapped in an HTML element of its own. Luckily, there is now a meta tag that holds the value. The meta tags seem to exist in three forms right now: the one in the scraper, prefixed with "citation", one prefixed with "prism." and one prefixed with "dc.". I'll mostly use the "citation" one in my patch, and where I can't, I'll use "dc.".
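
The fallback order described above (prefer the "citation" form, fall back to "dc.") could look roughly like this. It is a sketch only: the tag names `citation_license` and `dc.rights` and the sample HTML are assumptions made for illustration, not names confirmed on BMC pages, and `first_meta` is a hypothetical helper.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs, keeping the first value per name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta.setdefault(name.lower(), content)


def first_meta(html, names):
    """Return the content of the first meta tag found among `names`, in order."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    for name in names:
        value = parser.meta.get(name.lower())
        if value:
            return value
    return None


# Made-up page head: only the "dc." form of the license is present here.
sample_head = (
    '<head>'
    '<meta name="citation_title" content="An example article">'
    '<meta name="dc.rights" content="CC BY 4.0">'
    '</head>'
)

# Hypothetical tag names, tried in order of preference.
license_value = first_meta(sample_head, ["citation_license", "dc.rights"])
```

Because the lookup walks the candidate names in order, adding a "prism." name as a further fallback would only mean appending it to the list.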

petermr commented 8 years ago

thanks - this is the sort of thing we want to know.

I'm hoping to normalize the different metadata schemes. Talking with CrossRef tomorrow. P.



rossmounce commented 7 years ago

I think this issue has been resolved now. bmc.json working fine for me just now.