ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Inconsistent type of URL returned for simple EUPMC search #25

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

getpapers -q extremophiles --outdir ./extremophiles

The returned fulltext_html_urls.txt file contains a list of 836 URLs. The first ones are all DOIs, but from about the 67th entry to the end of the list the URLs mysteriously switch from being DOIs to mostly being of the form: http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=EBI&pubmedid=24961215 (each of these papers does have a DOI, so it's not because they are DOI-less).

It doesn't appear to be associated with particular journals. Some journals, e.g. PLOS ONE, appear as DOIs if they were among the first 50 results returned and as PMID-based links if they were later in the list, e.g. 100th-836th.

Odd behaviour.

blahah commented 9 years ago

getpapers takes the first open access fulltext HTML url in the list of urls returned by the API. I would suggest that at some point EPMC started inserting the DOI-url as the first one in the list, but for older papers the PMC url was first.

blahah commented 9 years ago

If this behaviour is undesirable we could have a priority list of which type of URLs to favour? Or perhaps we could always favour the DOI one, otherwise use whatever is first?
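A minimal sketch of the "always favour the DOI one, otherwise use whatever is first" idea, written in Python purely for illustration. It assumes each URL entry has the shape shown in the JSON later in this thread (every value wrapped in a one-element list); `first` and `pick_url` are hypothetical names, not part of getpapers.

```python
def first(value, default=None):
    """Unwrap the one-element lists used in the result JSON."""
    if isinstance(value, list):
        return value[0] if value else default
    return default if value is None else value


def pick_url(entries, preferred_sites=("DOI",)):
    """Return the URL from the first preferred site, else the first URL listed."""
    for site in preferred_sites:
        for entry in entries:
            if first(entry.get("site")) == site:
                return first(entry.get("url"))
    # No preferred site present: fall back to whatever comes first.
    return first(entries[0].get("url")) if entries else None


entries = [
    {"site": ["Europe_PMC"], "url": ["http://europepmc.org/articles/PMC4472175"]},
    {"site": ["DOI"], "url": ["http://dx.doi.org/10.1186/s12862-015-0399-9"]},
]
print(pick_url(entries))
```

Passing a longer tuple such as `("DOI", "Europe_PMC")` would turn this into the priority list mentioned above.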

rossmounce commented 9 years ago

I would prefer DOIs myself, because CrossRef content negotiation surely makes them more useful than the EPMC URL?

markmacgillivray commented 9 years ago

I have seen this issue too, and the DOI URL would be more useful because I pass the URLs into quickscrape to get the XML direct from PLOS, and if they are the EPMC ones (as the ones I saw were) then they still can't return the XML. Which makes things harder for Norma, and so causes issues for running the daily via getpapers.

blahah commented 9 years ago

OK, so this is not so simple. The problem is that EPMC sometimes doesn't list the DOI url as open access, even for open access articles like PLOS. See for example this result...

"fullTextUrlList": [
{
  "fullTextUrl": [
    {
      "availability": [
        "Open access"
      ],
      "availabilityCode": [
        "OA"
      ],
      "documentStyle": [
        "pdf"
      ],
      "site": [
        "Europe_PMC"
      ],
      "url": [
        "http://europepmc.org/articles/PMC4472175?pdf=render"
      ]
    },
    {
      "availability": [
        "Open access"
      ],
      "availabilityCode": [
        "OA"
      ],
      "documentStyle": [
        "html"
      ],
      "site": [
        "Europe_PMC"
      ],
      "url": [
        "http://europepmc.org/articles/PMC4472175"
      ]
    },
    {
      "availability": [
        "Subscription required"
      ],
      "availabilityCode": [
        "S"
      ],
      "documentStyle": [
        "doi"
      ],
      "site": [
        "DOI"
      ],
      "url": [
        "http://dx.doi.org/10.1186/s12862-015-0399-9"
      ]
    }
  ]
}
],

This means that the open access fulltext HTML URL list can't return the DOI URLs in these cases. We could have some alternative option to allow non-open URLs in the URL list?
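To make the effect concrete, here is a rough Python sketch of the filtering applied to the three entries above. It is not the getpapers code: `html_urls` and its `open_access_only` flag are hypothetical, and it assumes that entries with a `documentStyle` of `html` or `doi` both count as fulltext HTML URLs.

```python
def entry(code, style, site, url):
    """Build one URL entry in the list-wrapped shape of the result JSON."""
    return {"availabilityCode": [code], "documentStyle": [style],
            "site": [site], "url": [url]}


# The three entries from the example result above.
RECORD_URLS = [
    entry("OA", "pdf", "Europe_PMC", "http://europepmc.org/articles/PMC4472175?pdf=render"),
    entry("OA", "html", "Europe_PMC", "http://europepmc.org/articles/PMC4472175"),
    entry("S", "doi", "DOI", "http://dx.doi.org/10.1186/s12862-015-0399-9"),
]


def html_urls(entries, open_access_only=True):
    """Collect HTML-style URLs, optionally keeping non-open-access ones."""
    urls = []
    for item in entries:
        if item["documentStyle"][0] not in ("html", "doi"):
            continue  # skip PDFs and other styles (assumed rule)
        if open_access_only and item["availabilityCode"][0] != "OA":
            continue  # the DOI entry is dropped here: it is marked "S"
        urls.append(item["url"][0])
    return urls


print(html_urls(RECORD_URLS))
print(html_urls(RECORD_URLS, open_access_only=False))
```

With the open-access filter on, only the Europe PMC URL survives; switching it off is one way the "allow non-open URLs" option could behave.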

blahah commented 9 years ago

I think the better solution here is just to take the DOIs from the result JSON and use them to construct DOI URLs. You can do this with jq and GNU tools...

This will get you all the DOIs, one per line, filter out records that had no DOI (nulls), and construct the URLs:

$ jq '.[].DOI[0]' gasteria_p/eupmc_results.json | tr -d '"'| grep -v '^null' | awk '$0="http://doi.org/"$0'
http://doi.org/10.1155/2015/529521
http://doi.org/10.3897/compcytogen.v8i1.6444
http://doi.org/10.1093/aobpla/plu029
http://doi.org/10.7150/ijbs.6427
http://doi.org/10.1371/journal.pone.0059472
http://doi.org/10.4103/0973-1296.96564
http://doi.org/10.1186/1472-6882-12-43
http://doi.org/10.1007/s00709-011-0287-0
http://doi.org/10.1007/s00709-010-0167-z
http://doi.org/10.1085/jgp.19.1.179
http://doi.org/10.1186/jbiol233
http://doi.org/10.1186/1471-2229-10-32
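
For anyone without jq to hand, roughly the same thing can be sketched in Python. This assumes, as the jq filter above does, that eupmc_results.json holds a list of records whose `DOI` field is a list; `doi_urls` and `doi_urls_from_file` are hypothetical helper names, and the sample records are made up apart from the two DOIs taken from the output above.

```python
import json


def doi_urls(records, resolver="http://doi.org/"):
    """Build one DOI URL per record, skipping records that have no DOI."""
    urls = []
    for record in records:
        dois = record.get("DOI") or []  # missing or empty DOI is skipped
        if dois:
            urls.append(resolver + dois[0])
    return urls


def doi_urls_from_file(path):
    """Load a results file (e.g. gasteria_p/eupmc_results.json) and convert it."""
    with open(path, encoding="utf-8") as handle:
        return doi_urls(json.load(handle))


# Illustrative records only; the middle one has no DOI and is dropped.
sample = [
    {"DOI": ["10.1155/2015/529521"]},
    {"title": ["record without a DOI"]},
    {"DOI": ["10.1186/jbiol233"]},
]
for url in doi_urls(sample):
    print(url)
```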