ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

-k not working for EPMC query #121

Closed petermr closed 8 years ago

petermr commented 8 years ago

I ran a query on EPMC

getpapers -q parp -o parp -x -k 1000

Subsequently I wanted a smaller set, so I ran it again with a smaller -k. getpapers recognised the flag and its value (see below) but still downloaded 1000. Has it cached this value somewhere?

 getpapers -q parp -o parp200 -x -k 200
info: Searching using eupmc API
info: Found 10270 open access results
info: Limiting to 200 hits
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC4842564" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4582751" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4842580" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4577772" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4073066" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4080194" was not Open Access (therefore no XML)
info: Got XML URLs for 994 out of 1000 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (994/994) [62.4s elapsed, eta 0.0]
info: All downloads succeeded!

Note that although the log says it is limiting to 200 hits, it then reports XML URLs for 994 out of 1000 results and downloads 994 files.

larsgw commented 8 years ago

-k 250 worked for me. Maybe there is a lower bound on -k somewhere between 200 and 250, or it's a different problem. Below 250 it doesn't work for me either: with -k 10, for example, when I just wanted to test something, it downloaded 100.

> getpapers --api eupmc -q 'Abies OR Pinus OR Picea' -o . -k 250 -x
info: Searching using eupmc API
info: Found 1366009 open access results
info: Limiting to 250 hits
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC4963410" was not Open Access (therefore no XML)
info: Got XML URLs for 249 out of 250 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (249/249) [13.3s elapsed, eta 0.0]
info: All downloads succeeded!
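
The thread does not identify why the "Limiting to N hits" message and the number of downloaded files disagree. As a generic illustration only (this is not getpapers' actual code, and it is written in Python rather than the tool's own language), here is a minimal sketch of enforcing a hit limit when results arrive in whole pages. The names `collect_hits` and `fake_fetch_page`, and the page sizes used, are hypothetical. The point it shows is that results fetched a page at a time have to be trimmed to the limit afterwards, otherwise a full page can overshoot it.

```python
from typing import Callable, List


def collect_hits(fetch_page: Callable[[int, int], List[str]],
                 limit: int,
                 page_size: int = 100) -> List[str]:
    """Collect at most `limit` hits from a paged source (hypothetical helper)."""
    hits: List[str] = []
    page = 0
    while len(hits) < limit:
        batch = fetch_page(page, page_size)
        if not batch:  # the source has no more results
            break
        hits.extend(batch)
        page += 1
    # Whole pages may have pushed us past `limit`, so trim the overshoot.
    return hits[:limit]


def fake_fetch_page(page: int, page_size: int) -> List[str]:
    """Hypothetical stand-in for a search API that holds 10270 results."""
    total = 10270
    start = page * page_size
    end = min(start + page_size, total)
    return [f"hit-{i}" for i in range(start, end)]


if __name__ == "__main__":
    # A limit of 200 with a page size of 1000 fetches one full page,
    # then trims it down to 200.
    print(len(collect_hits(fake_fetch_page, limit=200, page_size=1000)))
```

Without the final slice, the first call above would return the whole 1000-item page, which resembles the symptom reported here, although that resemblance is not evidence that this is the cause.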