dominictarr / JSONStream


memory leak #32

Open lancecarlson opened 11 years ago

lancecarlson commented 11 years ago

I have not been able to figure out where exactly in the chain the memory leak happens, but whenever I pipe request's stream from couchdb's _all_docs (about 85k docs) through JSONStream, it eats up my memory like crazy. Anyone else experienced this? From my limited experience creating custom stream objects, I'd expect the memory to stay fairly consistent while it parses.

Example code:

var JSONStream = require('JSONStream')
var down = JSONStream.parse(["rows", true, "doc"])
request.pipe(down) // `request` here is the streaming response from couchdb's _all_docs
dominictarr commented 11 years ago

cool, I was doing this with the npm data set, which is only 30k docs; at first it didn't work, then I fixed it so that it did.

https://github.com/dominictarr/JSONStream/blob/master/index.js#L76

made it work, but to be honest - I was only guessing. maybe you can make a similar hack that solves your case and still passes the tests?

lancecarlson commented 11 years ago

I'm currently running 0.6.4 which includes that line, but I'm still having serious memory leak issues.

I did run an interesting experiment though. I think the biggest culprit is the underlying jsonparse library. When running the same data set against a custom streamer plus the parser, I observed similar memory leak behavior:

var request = require('request'),
    Stream = require('stream').Stream,
    Parser = require('jsonparse');

var p = new Parser();
p.onValue = function(value) {
  console.log(value)
}

var down = new Stream()
down.writable = true
down.write = function(data) {
  p.write(data)
  return true
}
down.end = function() {
  console.log('end')
}

var host = process.env.DB_HOST
var dbName = process.env.DB_NAME // dbName was not defined in the original snippet
var path = '_all_docs?include_docs=true'
var url = host + '/' + dbName + '/' + path
request(url).pipe(down)

Memory usage steadily increases from 50MB to 260MB, then midway through it jumps to 500-600MB. It feels like the JSON parser queues up data for something and then does something else with the queued data afterwards.

I'm going to post a ticket with that project, but do you have any ideas why this might be happening?

dominictarr commented 11 years ago

well, that is by design actually! jsonparse is trying to collect the whole object!

we usually want to throw out the other stuff... I don't really have time to work on this, perhaps we can run a https://www.bountysource.com/ bounty on this problem?
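(For illustration only, not code from this thread: a tiny standalone run of jsonparse shows that behaviour. onValue fires for every completed value, but the parser also keeps inserting each value into its parent, so the entire document has been assembled in memory by the time the outer brace closes.)

var Parser = require('jsonparse')
var p = new Parser()

p.onValue = function (value) {
  // fires for every completed value: leaves first, then the containers
  // that hold them, and finally the entire document
  console.log('depth', this.stack.length, JSON.stringify(value))
}

p.write('{"rows":[{"doc":{"a":1}},{"doc":{"a":2}}]}')
// the final onValue call receives the whole object -- that accumulation is
// what grows without bound on an 85k-row _all_docs response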

lancecarlson commented 11 years ago

Do you know why it was implemented this way?

Bountysource.com looks really awesome! Browsing it now.. :)

dominictarr commented 11 years ago

hmm, I guess that just seemed like the right idea at the time...

lancecarlson commented 11 years ago

BTW, I'm looking at the sax parser gist referenced in the jsonparse project README. From my initial testing, it has much better performance characteristics. Do you think it would be possible to port JSONStream to use the sax parser instead?

https://gist.github.com/creationix/1821394

dominictarr commented 11 years ago

hmm, you'd just have to match the paths... noting when you go into or come out of an object, and whether that is into a path you are interested in... (and only collecting a whole object when it matches the end of your path!)

seems reasonable - If that had been exposed at the time I probably would have done it like that...

Will be happy to merge if you were to undertake this rewrite!
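(A rough sketch of the bookkeeping being described, written against a made-up sax-style event API -- openContainer / key / closeContainer, which are not the real clarinet or jsonparse event names -- with array-index tracking and the actual assembly of matched objects elided; treat it as the shape of the algorithm, not working integration code.)

var target = ['rows', true, 'doc']   // as in JSONStream paths, true matches any key or index

function Matcher (target, onMatch) {
  this.keys = []        // key under which each currently open container sits
  this.depth = 0        // > 0 while inside a matched subtree
  this.target = target
  this.onMatch = onMatch
}

Matcher.prototype.atTarget = function () {
  return this.target.length === this.keys.length &&
    this.target.every(function (seg, i) {
      return seg === true || seg === this.keys[i]
    }, this)
}

// a key token: the next value parsed will sit under this key
Matcher.prototype.key = function (name) {
  this.keys[this.keys.length - 1] = name
}

// '{' or '[': only start collecting if this container sits exactly at the target path
Matcher.prototype.openContainer = function () {
  if (this.depth || this.atTarget()) this.depth++
  this.keys.push(null)
}

// '}' or ']': when the matched container closes, emit it and keep nothing else
Matcher.prototype.closeContainer = function (assembled) {
  this.keys.pop()
  if (this.depth && --this.depth === 0) this.onMatch(assembled)
  // outside a matched subtree nothing is ever collected, so memory stays flat
}

// wire key/openContainer/closeContainer to the parser's corresponding events
var m = new Matcher(target, function (doc) { console.log('row doc:', JSON.stringify(doc)) })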

lancecarlson commented 11 years ago

Looking into this right now. Will keep you posted!

lancecarlson commented 11 years ago

BTW, do you think it would be useful to have another fixture that has maybe 20k objects or more to more easily expose big data memory leaks?

dominictarr commented 11 years ago

absolutely, but it's best to generate it rather than check it in. also check out this module: https://npm.im/memwatch

this script generates a large test object:

https://github.com/dominictarr/JSONStream/blob/master/test/gen.js
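(For anyone wanting something similar locally: a throwaway generator along these lines -- file path and row shape invented here, this is not the repo's gen.js -- produces an _all_docs-style fixture of arbitrary size without checking a large file into git.)

var fs = require('fs')

var N = 20000                                  // raise this to stress memory harder
var out = fs.createWriteStream('/tmp/big-fixture.json')

out.write('{"total_rows":' + N + ',"rows":[\n')
for (var i = 0; i < N; i++) {
  var row = { id: 'doc-' + i, doc: { _id: 'doc-' + i, padding: new Array(200).join('x') } }
  out.write(JSON.stringify(row) + (i < N - 1 ? ',\n' : '\n'))
}
out.write(']}\n')
out.end()
// note: this ignores write()'s return value, which is fine for a test helper
// but would itself buffer a lot for a very large N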

nickpoorman commented 11 years ago

I just updated all the dependencies and devDependencies in package.json and ran the gen.js test at 5 million and it passed without running out of memory.

alessioalex commented 10 years ago

@nickpoorman would you mind making a pull request with the updated dependencies? thanks

dominictarr commented 10 years ago

GOOD NEWS.

I have a fix for the memory leak on the memory-leak branch. To test, download all npm docs (currently an 800MB JSON object):

curl 'http://isaacs.iriscouch.com/registry/_all_docs?include_docs=true' > npm.json

then, if you pipe that through JSONStream and filter on a deep path (this part is important):

cd JSONStream
./index.js rows.*.doc.versions.*._npmUser < npm.json

and then use top to monitor memory usage... on the master branch this used 30% of memory, eventually became very sluggish (GC), and then crashed. on the memory-leak branch it sits at 1% of memory and is fast!

The only problem is that the tests for @PaulMougel 's https://github.com/dominictarr/JSONStream/pull/33 break, so we need to get that working before I publish this.

felipesabino commented 9 years ago

Is this still an issue with JSONStream?

dominictarr commented 9 years ago

there is still a little bit of work to do here, as described in my last comment.

PaulMougel commented 9 years ago

Sorry about that, let me look into it.

PaulMougel commented 9 years ago

It's been a long time since I looked at this chunk of code; I don't quite remember why j would be greater than the stack's length.

This dumb diff

diff --git a/index.js b/index.js
index 6ecd87b..9d1bd5c 100755
--- a/index.js
+++ b/index.js
@@ -68,7 +68,8 @@ exports.parse = function (path, map) {
           if (!c) return
           if (check(nextKey, c.key)) {
             i++;
-            this.stack[j].value = null
+            if (j < this.stack.length)
+              this.stack[j].value = null
             break
           }
           j++

makes the tests doubledot1.js and doubledot2.js pass, but multiple_objects.js still fails anyway:

*** test/multiple_objects.js ***
END

equal: 0 == 3

@dominictarr any idea?

davidrapin commented 9 years ago

I just hit this issue; any workarounds are more than welcome!

davidrapin commented 9 years ago

Since this issue is quite old, still not fixed, and caused by a poorly documented and buggy dependency (https://github.com/creationix/jsonparse/issues/8), I would recommend using a better-tested library, like https://www.npmjs.com/package/clarinet or https://www.npmjs.com/package/stream-json

dominictarr commented 9 years ago

Okay, so I have decided that the memory leak is more important than some of the little use-cases (multiple objects and the descent operator) people have contributed. In the interest of parsing large objects I have merged my memory-leak-fixing changes, but disabled the tests for the features from #33 and #19, and bumped the major version. Of course I would be happy to merge a pull request that brings these features back.

dominictarr commented 9 years ago

Oh never mind... I can't publish a new npm version because of https://github.com/npm/npm/issues/7260. Apparently this should be fixed in the next version of npm though.

santigimeno commented 9 years ago

@dominictarr I've been investigating why the multiple_objects.js test was failing on the memory-leak branch, and the problem is not with the functionality itself, which still works, but with the root event: after the changes, the root event is no longer emitted in most cases. For example, if we add a root listener to the parser in test/test.js, the event is emitted when running on master but not on the memory-leak branch. For the moment I can provide a patch that removes the check on the number of times the root event has been called, so the multiple_objects.js test passes again.

dominictarr commented 9 years ago

@santigimeno great work. I think we should drop features (like root object) if we need to solve the memory problem. People who need those features can just use JSONStream@0

santigimeno commented 9 years ago

@dominictarr agreed

Macil commented 8 years ago

I tried to start using JSONStream in an application where I'm passing ~100MB of JSON around, but I would always run into out-of-memory crashes. I've switched away from JSONStream by changing the JSON to be newline-separated and using readline plus JSON.parse on each line, and haven't had any more issues.
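(A minimal sketch of that newline-delimited approach, with the file name invented; only one document is ever held in memory at a time.)

var fs = require('fs')
var readline = require('readline')

var rl = readline.createInterface({
  input: fs.createReadStream('docs.ndjson')   // one JSON document per line
})

rl.on('line', function (line) {
  if (!line.trim()) return                    // skip blank lines
  var doc = JSON.parse(line)
  // handle `doc` here; nothing is retained between lines
})

rl.on('close', function () {
  console.log('done')
})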

vvo commented 7 years ago

Hey @dominictarr any news on this? I would love to be able to keep using JSONStream but it's unfortunately not feasible because of the memory leak. It seems like everything is almost in place to be able to have a better memory footprint? Or should we all move to https://github.com/uhop/stream-json?

dominictarr commented 7 years ago

@vvo this is fixed as well as it can be. it depends on the path being used; it's not really a memory leak if you ask it to parse literally everything, so there are still certain ways to run out of memory with JSONStream, but it's not JSONStream's fault.

JSONStream doesn't even parse anything itself, it just wraps a nicer api around jsonparse. There have been other streaming parsers implemented since then, as linked above, but I don't actually know whether they have better memory/cpu performance.

What I would love to see is to make JSONStream take pluggable back ends (jsonparse, clarinet, stream-json would all be options); then we could actually do a side-by-side comparison and choose the best one.

Would very gladly take a PR for this!
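(One entirely hypothetical shape such an adapter contract could take -- none of this exists in JSONStream today: each parser library gets a thin wrapper exposing write/end plus an onValue hook that carries the ancestor keys, and JSONStream keeps doing the path matching on top.)

// hypothetical back-end wrapper around jsonparse; clarinet or stream-json
// would get equivalent wrappers implementing the same four members
var Parser = require('jsonparse')

function jsonparseBackend () {
  var parser = new Parser()
  var backend = {
    onValue: null,                 // set by JSONStream: fn(value, ancestorKeys)
    onError: null,                 // set by JSONStream
    write: function (chunk) { parser.write(chunk) },
    end: function () {}            // jsonparse has no explicit end; other parsers might
  }
  parser.onValue = function (value) {
    if (backend.onValue) backend.onValue(value, this.stack.map(function (s) { return s.key }))
  }
  parser.onError = function (err) {
    if (backend.onError) backend.onError(err)
  }
  return backend
}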

vvo commented 7 years ago

@dominictarr I was completely doing the "wrong" thing on my end. Basically I have an http request that I pipe into JSONStream, then for every object I would queue it to be synced to a backend. But the backend is slower than the http request so I quickly run out of memory.

I guess I would have to wrap my mind around how to handle this difference of processing power using streams.

Anyway, thanks a lot for this module, very efficient!

dominictarr commented 7 years ago

@vvo sounds like you need some backpressure! this is what streams are all about! you just need to pause the stream when there are too many items in the processing queue. I usually use https://github.com/pull-stream/pull-paramap for something like that (although sorry to drop a completely different stream api on you, I've switched to using pull-streams because they make things like back pressure somewhat easier to reason about). I'm sure you can find a node-stream version if you dig around a bit.
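(A node-streams flavour of the same idea, with the URL, database name, and saveToBackend below all invented: an object-mode Transform that only calls its callback once the slow write finishes, so pipe() stops pulling from the HTTP response whenever the buffer fills up.)

var request = require('request')
var JSONStream = require('JSONStream')
var Transform = require('stream').Transform

// processes one doc at a time; buffers at most ~16 parsed docs ahead of the backend
var sync = new Transform({
  objectMode: true,
  highWaterMark: 16,
  transform: function (doc, enc, callback) {
    saveToBackend(doc, callback)               // invented stand-in for the slow backend write
  }
})

request('http://localhost:5984/mydb/_all_docs?include_docs=true')
  .pipe(JSONStream.parse(['rows', true, 'doc']))
  .pipe(sync)

function saveToBackend (doc, callback) {
  setTimeout(callback, 10)                     // pretend the backend takes 10ms per doc
}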

jcrugzz commented 7 years ago

@vvo https://github.com/mafintosh/parallel-transform should work for your use case. You just have to be wary of what the highWaterMark is set to but I believe the default for object streams is sane these days.
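(Roughly, with the concurrency, URL, and backend call invented, the parallel-transform version looks like this.)

var request = require('request')
var JSONStream = require('JSONStream')
var parallel = require('parallel-transform')

// sync up to 10 docs at a time; backpressure applies once the internal buffer is full
var sync = parallel(10, function (doc, callback) {
  saveToBackend(doc, function (err) { callback(err, doc) })
})

request('http://localhost:5984/mydb/_all_docs?include_docs=true')
  .pipe(JSONStream.parse(['rows', true, 'doc']))
  .pipe(sync)

function saveToBackend (doc, callback) {
  setTimeout(callback, 10)                     // invented stand-in for the real backend call
}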

hayes commented 7 years ago

This may be relevant here: it's not the same leak, but there does seem to be an excessive-memory-use bug in jsonparse that might be affecting some people reading this thread. See https://github.com/creationix/jsonparse/pull/31