ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Inconsistent information about HTML retrieval from EPMC #66

Closed petermr closed 8 years ago

petermr commented 8 years ago

[Not sure if this is related to #57]

When running without -x or -p, the default output is HTML. A typical search of OA EPMC announces many open access papers but only "downloads" a trivial fraction of them (5%).

dhcp-10-248-131-71:junk pm286$ getpapers --query '"ursus" and PUB_YEAR:2015'  -o junk1 
info: Searching using eupmc API
info: Found 120 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 100 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4366689" had no fulltext HTML url
warn: Article with pmcid "PMC4583410" had no fulltext HTML url
warn: Article with pmcid "PMC4664997" had no fulltext HTML url
warn: Article with pmcid "PMC4631324" had no fulltext HTML url
warn: Article with pmcid "PMC4630841" had no fulltext HTML url
[snipped]

There are only 3 URLs in junk3/fulltext_html_urls.txt:

http://europepmc.org/articles/PMC4459689
http://europepmc.org/articles/PMC4413493
http://europepmc.org/articles/PMC4289921

These do indeed have fulltext HTML, but manual inspection of the others shows that they also have fulltext HTML. This may be an EPMC bug/feature, but it means that feeding fulltext_html_urls.txt to quickscrape misses almost all the papers.

In contrast, the options -x and -p seem to give the expected 100 (deduplicated) papers.
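
For anyone wanting to quantify the gap on their own run, here is a rough Python sketch. It assumes the console output was saved as text and that the warning lines look exactly like the ones quoted above. The helper names (`missing_pmcids`, `listed_pmcids`) are made up for illustration and are not part of getpapers.

```python
import re

# Pattern copied from the warning lines quoted in this issue; it is an
# assumption that other getpapers versions word the warning the same way.
WARN_RE = re.compile(r'warn: Article with pmcid "(PMC\d+)" had no fulltext HTML url')


def missing_pmcids(log_text):
    """Return the PMCIDs named in 'no fulltext HTML url' warnings, in order."""
    return WARN_RE.findall(log_text)


def listed_pmcids(url_text):
    """Return the last path segment of each non-blank URL line."""
    return [
        line.strip().rstrip("/").rsplit("/", 1)[-1]
        for line in url_text.splitlines()
        if line.strip()
    ]


# Excerpt of the log and the URL file as quoted in this issue.
log_text = """\
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4366689" had no fulltext HTML url
warn: Article with pmcid "PMC4583410" had no fulltext HTML url
warn: Article with pmcid "PMC4664997" had no fulltext HTML url
warn: Article with pmcid "PMC4631324" had no fulltext HTML url
warn: Article with pmcid "PMC4630841" had no fulltext HTML url
"""

url_text = """\
http://europepmc.org/articles/PMC4459689
http://europepmc.org/articles/PMC4413493
http://europepmc.org/articles/PMC4289921
"""

missing = missing_pmcids(log_text)
listed = listed_pmcids(url_text)
print(f"{len(listed)} URLs listed, {len(missing)} warnings in this excerpt")
```

On a full run, the warning count plus the listed count should add up to the number of unique results, which makes the shortfall easy to see.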

blahah commented 8 years ago

Confirmed that this is a bug:

~/c/g/t/ursus master ❯ jq '.[].fullTextUrlList[].fullTextUrl[] | select(.documentStyle[] == "html" and .availabilityCode[] == "OA").url[]' test_out/ursus/eupmc_results.json | wc -l
    100
~/c/g/t/ursus master ❯ cat fulltext_html_urls.txt
http://europepmc.org/articles/PMC4459689
http://europepmc.org/articles/PMC4413493
http://europepmc.org/articles/PMC4289921

Only three URLs in the URL file, but 100 OA HTML fulltext URLs in the results JSON. Investigating now. Note that in the new version of getpapers there's an option to output the HTML files directly, as already happens with XML and PDF.
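
For readers without jq, the same check can be sketched in Python. This assumes the structure implied by the jq filter above, where every field in eupmc_results.json is wrapped in a list. The function name `oa_html_urls` and the sample record are hypothetical, and the PDF URL is a placeholder. It is roughly equivalent to the jq expression, not a copy of what getpapers does internally.

```python
def oa_html_urls(results):
    """Collect URLs whose entry is marked as HTML and open access (OA).

    Mirrors the field names used in the jq filter: fullTextUrlList,
    fullTextUrl, documentStyle, availabilityCode and url, each assumed
    to hold a list of values.
    """
    urls = []
    for record in results:
        for url_list in record.get("fullTextUrlList", []):
            for entry in url_list.get("fullTextUrl", []):
                is_html = "html" in entry.get("documentStyle", [])
                is_oa = "OA" in entry.get("availabilityCode", [])
                if is_html and is_oa:
                    urls.extend(entry.get("url", []))
    return urls


# Hypothetical record in the assumed shape; real records carry many more fields.
sample_results = [
    {
        "fullTextUrlList": [
            {
                "fullTextUrl": [
                    {
                        "documentStyle": ["pdf"],
                        "availabilityCode": ["OA"],
                        "url": ["http://example.org/placeholder.pdf"],
                    },
                    {
                        "documentStyle": ["html"],
                        "availabilityCode": ["OA"],
                        "url": ["http://europepmc.org/articles/PMC4459689"],
                    },
                ]
            }
        ]
    }
]

for url in oa_html_urls(sample_results):
    print(url)
```

To run it against a real results file, load it with `json.load` and pass the resulting list to `oa_html_urls`. Comparing the length of the returned list with the line count of fulltext_html_urls.txt reproduces the 100 versus 3 mismatch shown above.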

blahah commented 8 years ago

Just to clarify - by 'new version' I mean the one I am working on that is not yet released. It will be a v1 alpha.

blahah commented 8 years ago

Fixed in version 0.4.1