Open lancecarlson opened 11 years ago
cool, I was doing this with the npm data set, which is only 30k, and at first it didn't work, then I fixed it so that it did.
https://github.com/dominictarr/JSONStream/blob/master/index.js#L76
made it work, but to be honest - I was only guessing. maybe you can make a similar hack that solves your case and still passes the tests?
I'm currently running 0.6.4 which includes that line, but I'm still having serious memory leak issues.
I did run an interesting experiment though. I think the biggest culprit is the underlying jsonparse library. When running the same data set against a custom streamer plus the parser, I observed similar memory leak behavior:
```js
var request = require('request'),
    Stream = require('stream').Stream,
    Parser = require('jsonparse');

var p = new Parser();
p.onValue = function (value) {
  console.log(value)
}

var down = new Stream()
down.writable = true
down.write = function (data) {
  p.write(data)
  return true
}
down.end = function () {
  console.log('end')
}

var host = process.env.DB_HOST
var dbName = process.env.DB_NAME
var path = '_all_docs?include_docs=true'
var url = host + '/' + dbName + '/' + path
request(url).pipe(down)
```
Memory usage steadily increases from 50 to 260 MB, then midway through it jumps to 500-600 MB. I feel like the JSON parser queues up data for something, then does something else with the queued data afterwards.
I'm going to post a ticket with that project, but do you have any ideas why this might be happening?
well, that is by design actually! jsonparse is trying to collect the whole object!
we usually want to throw out the other stuff... I don't really have time to work on this, perhaps we can run a https://www.bountysource.com/ bounty on this problem?
Do you know why it was implemented this way?
Bountysource.com looks really awesome! Browsing it now.. :)
hmm, I guess that just seemed like the right idea at the time...
BTW, I'm looking at the SAX parser gist referenced in the jsonparse README. From my initial testing, it has much better performance characteristics. Do you think it would be possible to port JSONStream to use the SAX parser instead?
hmm, you'd just have to match the paths... noting when you go into or come out of an object, and whether that takes you into a path you're interested in... (and only collecting a whole object when it matches the end of your path!)
seems reasonable - If that had been exposed at the time I probably would have done it like that...
Will be happy to merge if you were to undertake this rewrite!
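The path-matching idea above could be sketched roughly like this. This is a toy walker driving hypothetical `enter`/`leave`/`value` events (not any real SAX parser's API), just to show that only values at matching leaves need to be collected:

```javascript
// Track the current key path while walking parse events, and only
// collect a value when the path matches the selector. '*' matches any key.
function createPathMatcher (selector) {
  var path = []      // current location in the object tree
  var matched = []   // values collected at matching paths
  function atMatch () {
    return path.length === selector.length &&
      selector.every(function (p, i) { return p === '*' || p === path[i] })
  }
  return {
    enter: function (key) { path.push(key) },  // descend into a key
    leave: function () { path.pop() },         // come back out
    value: function (v) { if (atMatch()) matched.push(v) },
    matched: matched
  }
}

// Walk a plain object, driving the matcher the way a streaming parser would:
function walk (obj, m) {
  if (obj && typeof obj === 'object') {
    Object.keys(obj).forEach(function (k) {
      m.enter(k)
      walk(obj[k], m)
      m.leave()
    })
  } else {
    m.value(obj)
  }
}

var m = createPathMatcher(['rows', '*', 'doc'])
walk({ rows: [ { doc: 'a' }, { doc: 'b' } ] }, m)
console.log(m.matched) // → [ 'a', 'b' ]
```

The point is that nothing outside a matching path ever gets buffered, which is exactly what the whole-object collection in jsonparse prevents.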
Looking into this right now. Will keep you posted!
BTW, do you think it would be useful to have another fixture that has maybe 20k objects or more to more easily expose big data memory leaks?
absolutely, but it's best to generate it rather than check it in. also check out this module: https://npm.im/memwatch
this script generates a large test object:
https://github.com/dominictarr/JSONStream/blob/master/test/gen.js
I just updated all the dependencies and devDependencies in package.json and ran the gen.js test at 5 million and it passed without running out of memory.
@nickpoorman would you mind making a pull request with the updated dependencies? thanks
GOOD NEWS.
I have a fix to the memory leak on the memory-leak branch. To test, download all npm docs (currently an 800 MB JSON object):

```sh
curl 'http://isaacs.iriscouch.com/registry/_all_docs?include_docs=true' > npm.json
```

then, if you pipe that to JSONStream and filter out a deep path (this part is important):

```sh
cd JSONStream
./index.js rows.*.doc.versions.*._npmUser < npm.json
```

and then use `top` to monitor memory usage...

on the master branch, this used 30% of memory, eventually became very sluggish (GC), and then crashed.
on the memory-leak branch this sits at 1% of memory and is fast!
The only problem is that the tests for @PaulMougel 's https://github.com/dominictarr/JSONStream/pull/33 break, so we need to get that working before I publish this.
Is this still an issue with JSONStream?
there is still a little bit of work to do here, as described in my last comment.
Sorry about that, let me look into it.
It's been a long time since I looked at this chunk of code; I don't quite remember why `j` would be greater than the stack's length.
This dumb diff

```diff
diff --git a/index.js b/index.js
index 6ecd87b..9d1bd5c 100755
--- a/index.js
+++ b/index.js
@@ -68,7 +68,8 @@ exports.parse = function (path, map) {
       if (!c) return
       if (check(nextKey, c.key)) {
         i++;
-        this.stack[j].value = null
+        if (j < this.stack.length)
+          this.stack[j].value = null
         break
       }
       j++
```

makes the tests `doubledot1.js` and `doubledot2.js` pass, but `multiple_objects.js` fails anyway:

```
*** test/multiple_objects.js ***
END
equal: 0 == 3
```
@dominictarr any idea?
I just had this issue, any workarounds more than welcome!
Since this issue is quite old, still not fixed, and caused by a badly-documented and buggy dependency (https://github.com/creationix/jsonparse/issues/8), I would recommend using a more thoroughly tested library, like https://www.npmjs.com/package/clarinet or https://www.npmjs.com/package/stream-json
Okay, so I have decided that the memory leak is more important than some of the little use-cases (multiple objects and the descent operator) people have contributed. In the interest of parsing large objects I have merged my memory-leak-fixing changes, but disabled the tests for the features #33 and #19, and bumped the major version. Of course I would be happy to merge a pull request that brings these features back.
Oh never mind... I can't publish a new npm version because of https://github.com/npm/npm/issues/7260 Apparently this should be fixed in the next version of npm though.
@dominictarr I've been investigating why the `multiple_objects.js` test was failing in the `memory-leak` branch, and the problem is not with the functionality itself, which still works, but with the `root` event. After the changes, the `root` event is no longer emitted in most cases. For example, if we add a `root` listener to the parser in `test/test.js`, the event is emitted when running on `master` but not on the `memory-leak` branch.

For the moment I can provide a patch that removes the check on the number of times the `root` event has been called, so the `multiple_objects.js` test passes again.
@santigimeno great work. I think we should drop features (like root object) if we need to solve the memory problem. People who need those features can just use JSONStream@0
@dominictarr agreed
I tried to start using JSONStream in an application where I'm passing ~100 MB of JSON around, but I would always run into out-of-memory crashes. I've switched away from JSONStream by changing the JSON to be newline-separated, using readline and JSON.parse on each line, and haven't had any more issues.
Hey @dominictarr any news on this? I would love to be able to keep using JSONStream but it's unfortunately not feasible because of the memory leak. It seems like everything is almost in place to be able to have a better memory footprint? Or should we all move to https://github.com/uhop/stream-json?
@vvo this is fixed as well as it can be. it depends on the path being used; it's not really a memory leak if you ask it to parse literally everything, so there are still certain ways to run out of memory with JSONStream, but it's not JSONStream's fault.
JSONStream doesn't even parse anything itself, it just wraps a nicer api around jsonparse. There have been other streaming parsers implemented since then, as linked above, but I don't actually know whether they have better memory/cpu performance.
What I would love to see is to make JSONStream take pluggable back ends (json-parse, clarinet, stream-json would be options) then we could actually do a side-by-side comparison, and choose the best one.
Would very gladly take a PR for this!
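A pluggable back end could be as simple as agreeing on a minimal interface. None of these method names exist in JSONStream today; this is just what such a shape could look like, with a trivial buffering back end standing in for a real wrapper around jsonparse, clarinet, or stream-json:

```javascript
// Hypothetical back-end interface: accept chunks, emit values.
// A real back end would emit values incrementally as it parses;
// this demo one just buffers and uses JSON.parse at the end.
function bufferBackend () {
  var chunks = []
  var listeners = []
  return {
    write: function (chunk) { chunks.push(chunk) },
    end: function () {
      var value = JSON.parse(chunks.join(''))
      listeners.forEach(function (fn) { fn(value) })
    },
    onValue: function (fn) { listeners.push(fn) }
  }
}
```

With every candidate parser hidden behind the same `write`/`end`/`onValue` shape, a side-by-side memory/cpu comparison becomes a matter of swapping one factory function.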
@dominictarr I was completely doing the "wrong" thing on my end. Basically I have an http request that I pipe into JSONStream, then for every object I would queue it to be synced to a backend. But the backend is slower than the http request so I quickly run out of memory.
I guess I would have to wrap my mind around how to handle this difference of processing power using streams.
Anyway, thanks a lot for this module, very efficient!
@vvo sounds like you need some backpressure! this is what streams are all about! you just need to pause the stream when there are too many items in the processing queue. I usually use https://github.com/pull-stream/pull-paramap for something like that (although sorry to drop a completely different stream api on you, I've switched to using pull-streams because they make things like back pressure somewhat easier to reason about). I'm sure you can find a node-stream version if you dig around a bit.
@vvo https://github.com/mafintosh/parallel-transform should work for your use case. You just have to be wary of what the `highWaterMark` is set to, but I believe the default for object streams is sane these days.
This may be relevant here; it's not the same leak, but there does seem to be an excessive memory use bug in jsonparse that might be affecting some people reading this thread. see https://github.com/creationix/jsonparse/pull/31
I have not been able to figure out where exactly in the chain I get this memory leak, but whenever I pipe request's response through JSONStream from CouchDB's `_all_docs` (about 85k docs), it eats up my memory like crazy. Anyone else experience this? From my limited experience creating custom stream objects, I think the memory should stay fairly consistent as it is parsing?
Example code: