ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query

Crash when parsing JSON in large CrossRef query (116,808) #124

Open · chartgerink opened this issue 8 years ago

chartgerink commented 8 years ago

I was trying to collect a large set of metadata, and getpapers seemed to have trouble parsing the metadata. I had sufficient RAM remaining on my machine, so it might be a limitation of JSON parsing. I copied the query and error below. This is not crucial for me right now (I'll loop through the years instead of running one large query), but cutting queries into smaller ones might become problematic when even a restricted query yields a large number of results (e.g., a per-month query in 2015).

Query

getpapers --api crossref -o cr-res --filter "type:journal-article,prefix:10.1016,from-pub-date:1000,until-pub-date:1886"

Error

info: Searching using crossref API
info: Found 116808 results
info: Saving result metadata
/usr/local/lib/node_modules/getpapers/lib/crossref.js:119
  var pretty = JSON.stringify(crossref.allresults, null, 2)
                    ^

RangeError: Invalid string length
    at join (native)
    at Object.stringify (native)
    at CrossRef.handleSearchResults (/usr/local/lib/node_modules/getpapers/lib/crossref.js:119:21)
    at pageQuery (/usr/local/lib/node_modules/getpapers/lib/crossref.js:41:16)
    at /usr/local/lib/node_modules/getpapers/node_modules/crossref/index.js:92:5
    at Request._callback (/usr/local/lib/node_modules/getpapers/node_modules/crossref/index.js:31:5)
    at Request.self.callback (/usr/local/lib/node_modules/getpapers/node_modules/request/request.js:198:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (/usr/local/lib/node_modules/getpapers/node_modules/request/request.js:1082:10)

petermr commented 8 years ago

http://stackoverflow.com/questions/24153996/is-there-a-limit-on-the-size-of-a-string-in-json-with-node-js

This suggests a redesign of getpapers using streaming is necessary. As a first pass, can you block it into date ranges (say 1800-1850, 1850-1870, 1870-1888, etc.)?
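
As a minimal sketch of what such a streaming write could look like (not the actual getpapers code; it reuses crossref.allresults from crossref.js, ignores backpressure for brevity, and the variable names are illustrative):

var fs = require('fs');

// Stream the results to disk one record at a time instead of building
// one huge string with JSON.stringify(crossref.allresults, null, 2).
var out = fs.createWriteStream('crossref_results.json');
out.write('[\n');
crossref.allresults.forEach(function (result, i) {
  // Each individual record is small enough to stringify on its own.
  out.write((i > 0 ? ',\n' : '') + JSON.stringify(result, null, 2));
});
out.write('\n]\n');
out.end();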

tarrow commented 8 years ago

I would suggest that the answer is just to increase the memory available. Basically, node only allocates so much memory even if you have loads of RAM. You need to change a setting called max-old-space-size, which can be done with a command-line flag on the node executable: --max-old-space-size=<MB, e.g. 10000>.
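
For example, something like the following (the path to the getpapers entry script is an assumption and may differ between installs; the value is in megabytes, so 10000 is roughly 10 GB):

node --max-old-space-size=10000 /usr/local/lib/node_modules/getpapers/bin/getpapers.js --api crossref -o cr-res --filter "type:journal-article,prefix:10.1016,from-pub-date:1000,until-pub-date:1886"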

petermr commented 8 years ago

There is a hard limit somewhere between 18K and 116K results. Try batching into date ranges, e.g.

localhost:junk pm286$ getpapers --api crossref -o cr-res --filter "type:journal-article,prefix:10.1016,from-pub-date:1880,until-pub-date:1886"
info: Searching using crossref API
info: Found 18705 results
info: Saving result metadata
info: Full CrossRef result metadata written to crossref_results.json

My guess is that the performance is quadratic, so even if larger sets don't hit the hard limit they will slow down rapidly.

petermr commented 8 years ago

@tarrow see the SO discussion - there seems to be a node limit which you can't get round.

chartgerink commented 8 years ago

Thanks, batching is what I am doing (looping per year), but I wanted to check whether this was something that needed fixing. Apparently it is a larger-scale problem in node (which makes sense given the size of the JSON files returned from crossref).
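
For reference, a minimal sketch of what such a per-year loop could look like (assuming getpapers is on the PATH; the year range and output directory naming are just illustrative, and the filter mirrors the queries above):

var execFileSync = require('child_process').execFileSync;

// Run one bounded query per year so each metadata file stays small.
for (var year = 1800; year <= 1886; year++) {
  var filter = 'type:journal-article,prefix:10.1016,' +
    'from-pub-date:' + year + ',until-pub-date:' + year;
  execFileSync('getpapers',
    ['--api', 'crossref', '-o', 'cr-res-' + year, '--filter', filter],
    { stdio: 'inherit' });
}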

tarrow commented 8 years ago

I'm just testing pushing up the memory limits with node --max-old-space-size=70000

My conclusion is that it doesn't fix the issue. This is interesting because the way I read the SO link suggests that it throws this error because we hit the limit of the heap size. I bumped my heap size to 70GB. The RSS and VSZ of the process rose to a large number, but not higher than 70GB.

RSS    VSZ
1189760 2376404

Just under 24 GB.

It errors at the same point as you both found, on the stringify call. I'm going to look into why this happens.