ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

PLOS ONE dinosaurs search inconsistency #50

Closed rossmounce closed 9 years ago

rossmounce commented 9 years ago

Why can't a metadata-only getpapers search give the user a list of dinosaur-related fulltext URLs from PLOS ONE?

(edit: same for PeerJ & eLife. Even for metadata-only searches, I would like/expect getpapers to output a fulltext_urls.txt file)

$ getpapers -q 'dinosaurs JOURNAL:"PLOS ONE"' --api eupmc -o plos_test_eupmc
info: Searching using eupmc API
info: Found 350 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 325 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4439161" had no fulltext HTML url
warn: Article with pmcid "PMC4373905" had no fulltext HTML url
... snip 322 similar warnings snip ...
$ ls
eupmc_results.json

The JSON file from the metadata-only query above contains 325 items.

Compare this with a search that adds -p: the JSON file contains 750 records, the URL file contains 18 lines, and it downloaded ~33 PDFs. Super inconsistent!

$ getpapers -q 'dinosaurs' --api eupmc -p -o pdf_test_eupmc
$ wc pdf_test_eupmc/fulltext_html_urls.txt
 18  19 778 fulltext_html_urls.txt

petermr commented 9 years ago

It's useful to give as much diagnostic output as possible. I get:

localhost:junk pm286$ getpapers -q 'dinosaurs' --api eupmc -p -o pdf_test_eupmc
info: Searching using eupmc API
info: Found 769 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 750 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4439161" had no fulltext HTML url
warn: Article with pmcid "PMC4373905" had no fulltext HTML url
warn: Article with pmcid "PMC4468865" had no fulltext HTML url
warn: Article with pmcid "PMC4452486" had no fulltext HTML url
...
warn: Article with pmcid "PMC3192393" had no fulltext PDF url
info: Downloading fulltext PDF files
Downloading files [=======================-------] 75% (eta 0.2s)
/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:460:14)
    at afterWrite (/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:340:3)
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:327:9
    at process._tickCallback (node.js:448:13)
localhost:junk pm286$ 
localhost:pdf_test_eupmc pm286$ ls -lt | wc
     733    6590   43943

localhost:pdf_test_eupmc pm286$ wc fulltext_html_urls.txt
      18      19     778 fulltext_html_urls.txt
localhost:pdf_test_eupmc pm286$ wc eupmc_results.json 
  294467  593731 7207505 eupmc_results.json

It looks like you somehow missed the crash.
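
For context, a TypeError of this kind (rather than a ReferenceError) generally means the name was in scope but did not hold a function at the moment it was called. The following is a minimal sketch of that failure class and one way to guard against it. It is not the code in eupmc.js: `handleResponse`, `handleResponseUnguarded` and `onNotFound` are hypothetical names, and treating the callback as a 404 handler is only a guess based on the name `fourohfour`.

```javascript
// Hypothetical handler: invokes an optional "not found" callback on HTTP 404.
function handleResponse(statusCode, onNotFound) {
  if (statusCode === 404) {
    // Guard: only call the callback if one was actually supplied.
    if (typeof onNotFound === 'function') {
      onNotFound();
      return 'handled';
    }
    return 'no handler';
  }
  return 'ok';
}

// Unguarded version, for comparison: throws a TypeError
// when statusCode is 404 and no callback was passed.
function handleResponseUnguarded(statusCode, onNotFound) {
  if (statusCode === 404) {
    onNotFound();
  }
  return 'ok';
}
```

Calling `handleResponseUnguarded(404)` with no second argument reproduces the same class of error as in the trace above, whereas `handleResponse(404)` returns without throwing.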

blahah commented 9 years ago

@rossmounce the reason you get different numbers of results for those two queries is that they are different queries! The first one is restricted to PLOS ONE and gets 325 unique results. The second one searches the whole of Europe PMC and gets >700 results.

If you use the same query with and without -p you get the same number of results.
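
One way to check this is to count the unique records in the eupmc_results.json file from each run. This is a rough sketch under assumptions: it assumes the file is a JSON array of records and that each record carries its identifier under an `id` field, either as a string or wrapped in a one-item array. The helper names and the paths in the usage comment are placeholders, not part of getpapers.

```javascript
const fs = require('fs');

// Assumption: each record stores its identifier under `id`,
// either as a plain string or as a one-item array.
function recordId(record) {
  const id = record.id;
  return Array.isArray(id) ? id[0] : id;
}

// Count distinct identifiers in an array of records.
function countUnique(records) {
  return new Set(records.map(recordId)).size;
}

// Read a results file from disk and count its distinct records.
function countUniqueInFile(path) {
  return countUnique(JSON.parse(fs.readFileSync(path, 'utf8')));
}

// Usage (placeholder paths for two runs of the same query):
// console.log(countUniqueInFile('run_without_p/eupmc_results.json'));
// console.log(countUniqueInFile('run_with_p/eupmc_results.json'));
```

If the two counts differ for an identical query string, that would point to a real inconsistency rather than a difference between the queries.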

The problem with the HTML URL lists is a separate issue and has been fixed in 3155ce6.