ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

No PNAS fulltext (PDF or XML) via getpapers #47

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

Very strange. It appears one can't get PNAS fulltext as either PDF or XML via getpapers! Yet the EuropePMC website clearly offers many freely available full-text articles as PDF (I'm less sure about the availability of full-text XML).

Absolutely zero fulltext downloads appear to be possible for PNAS or Science:

getpapers --query 'Journal:"PNAS" AND FIRST_PDATE:[2000-01-01 TO 2015-05-01]' -x --outdir pnas
info: Searching using eupmc API
info: Found 0 open access results
# include closed papers
getpapers --query 'Journal:"PNAS" AND FIRST_PDATE:[2000-01-01 TO 2015-05-01]' --all --outdir pnas
info: Searching using eupmc API
info: Found 57575 results

Take Busch et al as the test case: http://europepmc.org/articles/PMC4321246 Clearly available as full text for free for human eyes via EPMC as html & downloadable PDF.

# finds the paper because of the --all switch
getpapers --query 'Author:"Busch" AND FIRST_PDATE:[2015-01-01 TO 2015-02-01]' --all --outdir busch
# DOES NOT find the paper
getpapers --query 'Author:"Busch" AND FIRST_PDATE:[2015-01-01 TO 2015-02-01]' --outdir openbusch

blahah commented 9 years ago

This is the difference between free and OA. These articles are available free, but not OA (at least as classified by EPMC).

This is the fulltext url portion of the result for Busch et al:

"fullTextUrlList": [
  {
    "fullTextUrl": [
      {
        "availability": [
          "Free"
        ],
        "availabilityCode": [
          "F"
        ],
        "documentStyle": [
          "pdf"
        ],
        "site": [
          "Europe_PMC"
        ],
        "url": [
          "http://europepmc.org/articles/PMC4321246?pdf=render"
        ]
      },
      {
        "availability": [
          "Free"
        ],
        "availabilityCode": [
          "F"
        ],
        "documentStyle": [
          "html"
        ],
        "site": [
          "Europe_PMC"
        ],
        "url": [
          "http://europepmc.org/articles/PMC4321246"
        ]
      },
      {
        "availability": [
          "Subscription required"
        ],
        "availabilityCode": [
          "S"
        ],
        "documentStyle": [
          "doi"
        ],
        "site": [
          "DOI"
        ],
        "url": [
          "http://dx.doi.org/10.1073/pnas.1412514112"
        ]
      }
    ]
  }
],

At the moment getpapers will only try to get PDF/XML for OA papers, not 'free' ones as the license is unclear on these. We could add a --free argument?
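
The filtering described above can be sketched in a few lines. This is an illustrative Python sketch, not getpapers' actual code (getpapers itself is a Node.js tool): the function name `select_fulltext_urls` and the `allowed_codes` parameter are hypothetical, and the `"OA"` availability code is an assumption, since the record above only shows `"F"` and `"S"`. The sample record is a trimmed copy of the Busch et al result.

```python
# Trimmed copy of the fullTextUrlList shown above (every value is wrapped in a list).
record = {
    "fullTextUrlList": [
        {
            "fullTextUrl": [
                {
                    "availabilityCode": ["F"],
                    "documentStyle": ["pdf"],
                    "url": ["http://europepmc.org/articles/PMC4321246?pdf=render"],
                },
                {
                    "availabilityCode": ["F"],
                    "documentStyle": ["html"],
                    "url": ["http://europepmc.org/articles/PMC4321246"],
                },
                {
                    "availabilityCode": ["S"],
                    "documentStyle": ["doi"],
                    "url": ["http://dx.doi.org/10.1073/pnas.1412514112"],
                },
            ]
        }
    ]
}


def select_fulltext_urls(result, allowed_codes=("OA",), style="pdf"):
    """Return URLs whose availability code and document style both match.

    Hypothetical helper: "OA" as the open-access code is an assumption.
    """
    urls = []
    for url_list in result.get("fullTextUrlList", []):
        for entry in url_list.get("fullTextUrl", []):
            # Unwrap the single-element lists used by this result format.
            code = entry.get("availabilityCode", [None])[0]
            doc_style = entry.get("documentStyle", [None])[0]
            if code in allowed_codes and doc_style == style:
                urls.extend(entry.get("url", []))
    return urls


oa_only = select_fulltext_urls(record)                # OA only: nothing for this paper
with_free = select_fulltext_urls(record, ("OA", "F"))  # OA plus free: the PDF link
print(oa_only)
print(with_free)
```

With only the assumed `"OA"` code allowed, the Busch record yields no PDF URL, which matches the behaviour reported in this issue; allowing `"F"` as well picks up the Europe PMC PDF link.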

rossmounce commented 9 years ago

Just because something is only 'free' doesn't necessarily mean it can't be downloaded (Readcube aside). Clearly the fulltext can be downloaded from the (EPMC) website, so I would have thought getpapers should mirror that availability.

--free sounds good to me. It might confuse some people, though. Can't please everyone, I guess.

blahah commented 9 years ago

Yes, they can often be downloaded but they are not under an open license, so in many countries they can't be contentmined without permission.

Here's what I'm thinking: we have --free which, in addition to OA, will attempt to get resources from papers marked free. When --all is chosen, rather than just not trying to get PDF/XML as we currently do, we could attempt to download from whatever URLs are available. This way, if a user is at their university or using a VPN, they should be able to get a lot of content.

We could include a warning when --free or --all is used that makes it clear users need to check what they are legally allowed to do. I think we should have a guide to legality on contentmine.
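
As a rough illustration of the proposal above, the decision of whether to attempt a download could look like the following Python sketch. Everything here is hypothetical: `--free` was only a suggestion at this point, the function names and the warning text are invented for illustration, and `"OA"` as the open-access availability code is an assumption (`"F"` and `"S"` are the codes seen in the result earlier in the thread).

```python
import warnings

# Hypothetical wording; the thread only says a warning should exist.
LEGAL_WARNING = (
    "Some of these papers are not under an open license. "
    "Check what you are legally allowed to do with them."
)


def should_attempt_download(code, free=False, all_results=False):
    """Decide whether to try fetching a URL with the given availability code.

    Default: open access only (assumed code "OA").
    free:        also try resources marked free ("F").
    all_results: try whatever URLs are available, regardless of code.
    """
    if all_results:
        return True
    if free and code == "F":
        return True
    return code == "OA"


def warn_if_needed(free=False, all_results=False):
    """Emit the legal warning whenever non-OA content may be fetched."""
    if free or all_results:
        warnings.warn(LEGAL_WARNING)
        return True
    return False


# Example: a subscription-only link is skipped with --free but tried with --all.
print(should_attempt_download("S", free=True))
print(should_attempt_download("S", all_results=True))
```

Keeping the decision in one small predicate means the default stays conservative (open access only), and each flag widens what is attempted without changing what the search itself returns.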

petermr commented 9 years ago

Good idea, we may also be downloading hybrid papers.

Note that "free" and "open" are overloaded ("openwashed") so we should define them.


Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

blahah commented 9 years ago

@petermr I agree, but I think the best we can do to define them is to say that they are classified as such by the source (EPMC/arXiv/IEEE), and link to the source's explanation of the terms if it has one.