chartgerink opened 8 years ago
This suggests a redesign of getpapers is necessary using streaming. As a first pass, can you block it into date ranges (say 1800-1850, 1850-1870, 1870-1888, etc.)?
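For concreteness, here is a minimal sketch of what a streaming write could look like, assuming the downloaded metadata ends up in an in-memory array (called results here, which is an assumption about getpapers' internals): each record is serialised and written individually instead of one JSON.stringify call over the whole array.

```javascript
const fs = require('fs');

// Sketch only: write the result records to disk one at a time instead of
// serialising the whole array with a single JSON.stringify call.
// `results` stands in for whatever array getpapers accumulates internally.
function writeResultsStreaming(results, path) {
  const out = fs.createWriteStream(path);
  out.write('[\n');
  results.forEach((record, i) => {
    // Each record is small, so stringifying it individually stays well
    // below any per-string or heap limits; backpressure handling is
    // omitted here for brevity.
    out.write(JSON.stringify(record));
    if (i < results.length - 1) out.write(',\n');
  });
  out.write('\n]\n');
  out.end();
}
```

A fuller redesign would stream records straight from the HTTP responses to disk so the full result set never sits in memory at all; the sketch above only removes the single giant stringify.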
I would suggest that the answer is just to up the memory available. Basically, node only allocates so much memory even if you have loads of RAM. You need to change a setting called max-old-space-size,
which can be done with a command-line flag on the node executable: --max-old-space-size=<size in MB> (e.g. 10000).
There is a hard limit between 18K and 116K results. Try batching into date ranges, e.g.
localhost:junk pm286$ getpapers --api crossref -o cr-res --filter "type:journal-article,prefix:10.1016,from-pub-date:1880,until-pub-date:1886"
info: Searching using crossref API
info: Found 18705 results
info: Saving result metadata
info: Full CrossRef result metadata written to crossref_results.json
My guess is that the performance is quadratic, so even if larger sets don't hit the hard limit they will slow down rapidly.
@tarrow see the SO discussions - there seems to be a node limit which you can't get round.
Thanks, batching is what I am doing (looping per year), but I wanted to check whether this was something that needed fixing. Apparently it is a more general problem with node at this scale (which makes sense given the size of the JSON files returned from crossref).
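For reference, the per-year loop can look roughly like the sketch below. It assumes the same crossref filter syntax as the example above and that both date bounds are inclusive; the prefix and output directory names are just placeholders.

```javascript
const { spawnSync } = require('child_process');

// Sketch of batching by year: one getpapers run per publication year,
// each writing into its own output directory. Filter syntax follows the
// crossref example earlier in this thread; prefix 10.1016 is illustrative.
for (let year = 1880; year <= 1886; year++) {
  const filter = 'type:journal-article,prefix:10.1016,' +
                 `from-pub-date:${year},until-pub-date:${year}`;
  spawnSync('getpapers',
            ['--api', 'crossref', '-o', `cr-${year}`, '--filter', filter],
            { stdio: 'inherit' });
}
```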
I'm just testing pushing up the memory limits with node --max-old-space-size=70000
My conclusion is that it doesn't fix the issue. This is interesting because the way I read the SO links suggests that the error is thrown because we hit the limit of the heap size. I bumped my heap size to 70GB. The RSS and VSZ of the process rose to a large number, but not higher than 70GB.
RSS VSZ
1189760 2376404
Just under 24 GB.
It errors at the same point as you both found, on the stringify call. I'm going to look into why this happens.
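One possibility (an assumption, not something confirmed in this thread) is that the failure is V8's maximum string length rather than heap exhaustion, which would explain why raising --max-old-space-size does not help. A sketch along these lines should reproduce it regardless of heap size; the sizes are illustrative only.

```javascript
// Sketch: if the limit is V8's maximum string length, JSON.stringify on a
// large enough array throws a RangeError ("Invalid string length") no
// matter how much heap is available. Roughly 1 GB of serialised output
// should be past that limit on 64-bit builds.
const record = { title: 'x'.repeat(1024) };        // ~1 KB per record
const results = new Array(1000000).fill(record);   // ~1 GB when serialised
try {
  JSON.stringify(results);                         // expected to throw
} catch (err) {
  console.error(err.name, err.message);            // e.g. RangeError: Invalid string length
}
```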
I was trying to collect a large set of metadata and getpapers seemed to have trouble parsing it. I had sufficient RAM remaining on my machine, so it might be a limitation of JSON parsing. I copied the query and error below. Not crucial for me right now (I'll loop through the years instead of running one large query), but cutting into smaller queries might become problematic when a restricted query also yields a large number of results (e.g., a month query in 2015).

Query

Error