ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Too many open files #70

Open petermr opened 8 years ago

petermr commented 8 years ago
localhost:zika pm286$ getpapers -q microcephaly -o microcephaly -p -x
info: Searching using eupmc API
info: Found 2481 open access results
Retrieving results [==----------------------------] 5% (eta 251.0s)^Clocalhost:zika pm286$ 
localhost:zika pm286$ getpapers -q "microcephaly AND virus" -o microcephaly -p -x
info: Searching using eupmc API
info: Found 312 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 300 unique results identified
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC4712469" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4172451" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4287802" was not Open Access (therefore no XML)
warn: Article with pmid "25905393 did not have a PMCID (therefore no XML)
warn: Article with pmcid "PMC4174040" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC4685155" was not Open Access (therefore no XML)
warn: Article with pmid "26389232 did not have a PMCID (therefore no XML)
warn: Article with pmcid "PMC4023431" was not Open Access (therefore no XML)
info: Got XML URLs for 292 out of 300 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (292/292) [6.7s elapsed, eta 0.0]
warn: 49 downloads timed out. Retrying.
warn: Article with pmcid "PMC4712469" had no fulltext PDF url
warn: Article with pmcid "PMC4601444" had no fulltext PDF url
warn: Article with pmcid "PMC4372928" had no fulltext PDF url
warn: Article with pmcid "PMC4172451" had no fulltext PDF url
warn: Article with pmcid "PMC4339303" had no fulltext PDF url
warn: Article with pmcid "PMC4470205" had no fulltext PDF url
warn: Article with pmcid "PMC4287802" had no fulltext PDF url
warn: Article with pmcid "PMC4116582" had no fulltext PDF url
warn: Article with pmcid "PMC4234431" had no fulltext PDF url
warn: Article with pmcid "PMC3989768" had no fulltext PDF url
warn: Article with pmcid "PMC3882056" had no fulltext PDF url
warn: Article with pmcid "PMC4555914" had no fulltext PDF url
warn: Article with pmcid "PMC3788278" had no fulltext PDF url
warn: Article with pmcid "PMC3677093" had no fulltext PDF url
warn: Article with pmid "25905393" had no fulltext PDF url
warn: Article with pmcid "PMC4174040" had no fulltext PDF url
warn: Article with pmcid "PMC4685155" had no fulltext PDF url
warn: Article with pmcid "PMC3168062" had no fulltext PDF url
warn: Article with pmcid "PMC2888342" had no fulltext PDF url
warn: Article with pmcid "PMC4173406" had no fulltext PDF url
warn: Article with pmcid "PMC3593470" had no fulltext PDF url
warn: Article with pmcid "PMC3089914" had no fulltext PDF url
warn: Article with pmcid "PMC4452538" had no fulltext PDF url
warn: Article with pmcid "PMC2233781" had no fulltext PDF url
warn: Article with pmcid "PMC2989439" had no fulltext PDF url
warn: Article with pmid "26389232" had no fulltext PDF url
warn: Article with pmcid "PMC3081075" had no fulltext PDF url
warn: Article with pmcid "PMC4023431" had no fulltext PDF url
info: Downloading fulltext PDF files
Downloading files [------------------------------] 0% (1/292) [0.0s elapsed, eta 0.0]warn: 1 downloads timed out. Retrying.

path.js:309
      var path = (i >= 0) ? arguments[i] : process.cwd();
                                                   ^
Error: EMFILE, too many open files
    at Object.exports.resolve (path.js:309:52)
    at Function.sync (/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/mkdirp/index.js:68:14)
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:351:12
    at Array.forEach (native)
    at EuPmc.downloadUrls (/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:348:8)
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:228:13
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/lodash/index.js:7305:23
    at /Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:370:9
    at ClientRequest.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:177:4)
    at ClientRequest.g (events.js:180:16)
localhost:zika pm286$ 
tarrow commented 8 years ago

This looks like it may be an OS issue (OS X) rather than a Node.js one. We might have success increasing the ulimit.

blahah commented 8 years ago

This is a general Unix feature: the operating system limits open file handles so that runaway programs can't destroy the machine. We definitely don't want to raise the ulimit. Instead, we need to figure out why so many file handles are being kept open, and make sure the number open concurrently stays within a reasonable limit.

tarrow commented 8 years ago

I think the reason is tied to #58. At the moment the code spawns as many attempted downloads as there are links, without checking how many operations are already running concurrently. I think the reason we don't usually see it is that people rarely run queries to completion when they return more than 1024 papers.
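
To illustrate the idea of capping concurrent operations, here is a minimal sketch. It is not the getpapers code and not necessarily what any pending fix does. `runLimited` and `fakeDownload` are hypothetical names made up for this example, timers stand in for real HTTP requests and file writes, and it assumes a Node.js version with native Promise support (newer than the v0.10.38 in the stack trace above).

```javascript
// Hypothetical helper: apply `task` to every item, with at most `limit`
// tasks in flight at any moment. Resolves with results in input order.
function runLimited(items, limit, task) {
  return new Promise(function (resolve, reject) {
    if (!(limit >= 1)) {
      throw new Error('limit must be at least 1');
    }
    var results = new Array(items.length);
    var next = 0;       // index of the next item to start
    var active = 0;     // number of tasks currently running
    var failed = false; // set once any task rejects

    function launch() {
      if (failed) return;
      if (next >= items.length) {
        // Nothing left to start; finish once the last task has settled.
        if (active === 0) resolve(results);
        return;
      }
      // Fill the free slots, but never exceed `limit`.
      while (active < limit && next < items.length) {
        startOne(next++);
      }
    }

    function startOne(index) {
      active++;
      Promise.resolve()
        .then(function () { return task(items[index], index); })
        .then(function (value) {
          results[index] = value;
          active--;
          launch(); // a slot has been freed, so start the next item
        }, function (err) {
          failed = true; // stop starting new work and report the error
          reject(err);
        });
    }

    launch();
  });
}

// Simulated download: a timer stands in for the network and disk work.
var inFlight = 0;
var peak = 0;
function fakeDownload(url) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  return new Promise(function (resolve) {
    setTimeout(function () {
      inFlight--;
      resolve(url + ' done');
    }, 5);
  });
}

// Usage: ten placeholder URLs, never more than three "downloads" at once.
var urls = [];
for (var i = 0; i < 10; i++) urls.push('url-' + i);
runLimited(urls, 3, fakeDownload).then(function (out) {
  console.log(out.length + ' finished; peak in flight: ' + peak);
});
```

With a cap like this, the number of open sockets and file handles is bounded by the chosen limit rather than by the size of the result set, so the ulimit would not need to be raised.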

tarrow commented 8 years ago

I actually just successfully downloaded 2539 results without hitting this problem (or a timeout). This may need further investigation.

robintw commented 8 years ago

Has there been any progress on dealing with this issue, for example by limiting how many concurrent operations can take place? I'm trying to download a large number of papers (many tens of thousands) and I keep getting timeouts and crashing.

tarrow commented 8 years ago

Yes, this is fixed by #87, but that is still waiting for review. It is basically the same issue as #58, except that we are saturating the number of local file handles as well as the network connection. If you want, you could check out #87 and test whether it solves your problem.

tarrow commented 8 years ago

Also, @robintw, you mentioned in contentmine/getpapers#74 that you had possibly found a workaround for this bug. Is it different from what happens in #87?