ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Potentially unhandled rejection - invalid string length #182

Closed J-E-J-S closed 4 years ago

J-E-J-S commented 4 years ago

$ getpapers -q 'synthetic genomics' -o ~/documents/GeorgesMarvelousMiner/test_3/ -x -k 20000
info: Searching using eupmc API
info: Found 19686 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api

info: Done collecting results
info: Saving result metadata
Potentially unhandled rejection [1] RangeError: Invalid string length
    at JSON.stringify (<anonymous>)
    at EuPmc.handleSearchResults (C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\lib\eupmc.js:219:21)
    at EuPmc.completeCallback (C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\lib\eupmc.js:167:11)
    at C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\lib\eupmc.js:97:7
    at Parser.<anonymous> (C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\node_modules\xml2js\lib\parser.js:306:18)
    at emitOne (events.js:96:13)
    at Parser.emit (events.js:191:7)
    at SAXParser.onclosetag (C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\node_modules\xml2js\lib\parser.js:264:26)
    at emit (C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\node_modules\xml2js\node_modules\sax\lib\sax.js:624:35)
    at emitNode (C:\Users\James\Documents\Content_Mine\nvm\v7.10.1\node_modules\getpapers\node_modules\xml2js\node_modules\sax\lib\sax.js:629:5)

##########################################################################

The same query works fine at -k 1000, but not at -k 20000 or with no limit defined?

petermr commented 4 years ago

I can verify that an error occurs (although not the same):

I ran it with 10000 and it doesn't crash before the download (I didn't wait for the end).

MacBook-Pro-3:bugs pm286$ getpapers -q 'synthetic genomics' -o test10000 -x -k 10000 -f log10000.txt
info: Saving logs to ./log10000.txt
info: Searching using eupmc API
info: Found 19686 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
info: Limiting to 10000 hits
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC6404622" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC5741271" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC6343059" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC5563319" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC6099823" was not Open Access (therefore no XML)
warn: Article with pmcid "PMC5611562" was not Open Access (therefore no XML)
... and so on ...

With 20000 I get:

MacBook-Pro-3:bugs pm286$ getpapers -q 'synthetic genomics' -o test20000 -x -k 20000 -f log20000.txt
info: Saving logs to ./log20000.txt
info: Searching using eupmc API
info: Found 19686 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
Segmentation fault: 11

so it fails at "about the same point" (I am using MacOSX).

I don't have time to solve this. It could be (a) that 20000 is universally unacceptable, in which case the (yucky) workaround would be to chop the problem into smaller chunks, e.g. by date ranges, as sketched below; if you run the chunks with the same output dir they should simply aggregate seamlessly. Or (b) there is a corrupt entry in the download stream, which would either require identifying it and omitting it from the query or (better) modifying the code.
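Something along these lines might work (untested; I'm assuming Europe PMC's FIRST_PDATE field for date-range queries, so check the exact syntax against their search syntax docs):

getpapers -q 'synthetic genomics AND FIRST_PDATE:[2000-01-01 TO 2014-12-31]' -o chunked -x
getpapers -q 'synthetic genomics AND FIRST_PDATE:[2015-01-01 TO 2020-12-31]' -o chunked -x

Each run should stay well below the 20000-hit failure point, and both write into the same output dir.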

Sorry I personally cannot help.

J-E-J-S commented 4 years ago

Ah no worries, thanks for looking into it Peter.

petermr commented 4 years ago

Pleased to see you are using it - would be interested to know what you are doing.

The code is Rik Smith-Unna's; I think he may get alerts from here. My own inclination would be to trap the error, note it, and skip, roughly as sketched below. Any Node expert is welcome to suggest how best to do it.
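A minimal sketch of that trap-and-skip idea (hypothetical code, not the actual eupmc.js; the function name and log message are made up for illustration):

var fs = require('fs');

function saveResultMetadata(results, filename) {
  var json;
  try {
    json = JSON.stringify(results);
  } catch (err) {
    if (err instanceof RangeError) {
      // The aggregated metadata would exceed V8's maximum string
      // length, as in the trace above: note it and skip the combined
      // file instead of letting the rejection crash the run.
      console.warn('warn: result metadata too large to serialise, skipping ' + filename);
      return;
    }
    throw err; // anything else should still surface
  }
  fs.writeFileSync(filename, json);
}

The idea being that the individual per-article records could still be written even when the single aggregated eupmc_results.json is too big to build.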

tarrow commented 4 years ago

Hi @J-E-J-S,

I was able to repeat your search with the current code on GitHub without error.

Unfortunately I believe you've basically hit an out-of-memory error on your machine. You can read about it (for example) at: https://stackoverflow.com/questions/29175877/json-stringify-throws-rangeerror-invalid-string-length-for-huge-objects
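For context, the specific RangeError comes from V8's hard cap on string length, which JSON.stringify trips while building one giant output string. A minimal demonstration, independent of getpapers (the exact cap varies by Node/V8 version):

try {
  'x'.repeat(Math.pow(2, 30)); // longer than V8's maximum string length
} catch (e) {
  console.log(e.name + ': ' + e.message); // RangeError: Invalid string length
}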

Sadly, rewriting getpapers so it doesn't keep the metadata of papers to download in memory isn't trivial (see the sketch after this paragraph for the general direction). The best I can suggest is the same as Peter: try chunking into two scrapes, e.g. by date ranges in the query itself. You can read about date-range queries in EuPMC here: https://europepmc.org/searchsyntax
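For reference, the general shape of such a rewrite (a hypothetical sketch, not how getpapers currently works) is to serialise each record separately, e.g. as newline-delimited JSON, so no single JSON.stringify call has to build one enormous string:

var fs = require('fs');

function writeResultsAsNDJSON(results, filename) {
  var out = fs.createWriteStream(filename);
  results.forEach(function (record) {
    // each record is small, so this stringify never approaches
    // V8's maximum string length
    out.write(JSON.stringify(record) + '\n');
  });
  out.end();
}

This sidesteps the giant-string RangeError, though a full fix would also avoid accumulating all the records in memory in the first place.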

Alternatively, if you happen to have another machine with more RAM, you could try running it there.

Best of luck. P.S. I'm closing the ticket because the error is really a function of the machine it's run on: how much RAM it has and how big the metadata blob is.

J-E-J-S commented 4 years ago

Hi @tarrow, thanks for looking into it.