Level / levelup

A wrapper for abstract-leveldown compliant stores, for Node.js and browsers.
MIT License

Memory leak in db.batch #171

Closed fergiemcdowall closed 10 years ago

fergiemcdowall commented 11 years ago

Hi

Maintainer of search-index and Norch here :)

There seems to be a memory leak in db.batch. When inserting thousands of batch files with db.batch, the memory allocation of the parent app shoots up rather alarmingly and does not come down again.

This issue has appeared in the last few weeks

Any ways to manually free up memory or otherwise force garbage collection?

F

dominictarr commented 11 years ago

Do you have a test example that reproduces this? I'm seeing similar stuff running npmd, which is also doing indexing. I think this can happen if you queue up loads of writes at once?

dominictarr commented 11 years ago

What I meant to say is that I've encountered a similar problem, but I think it was just a flow control issue (just trying to do too many things at once); it's easy to queue up many more operations than leveldb can take. Balancing this is a problem that has been encountered in other areas:

https://github.com/rvagg/node-levelup/pull/153
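
The failure mode described here is easy to produce; roughly something like this (a hypothetical sketch, not code from the thread; db is assumed to be an open levelup instance):

// anti-pattern: the loop never waits for a callback, so a million
// pending operations (and their values) are all held in memory at once
for (var i = 0; i < 1000000; i++) {
  db.put('key' + i, 'value ' + i, function (err) {
    if (err) console.error(err);
  });
}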

fergiemcdowall commented 11 years ago

Test example: reading in the test dataset in the Norch package. It generates about 400000 entries (I think) from 21000 batch files of about 150-200 entries each.

Yes, in the past I have also managed to queue up way too many operations, leading to huge server hangs. Now, however, search-index reads in batches in serial (one after the other), as sketched below. This seems to be more stable, but leads to a big memory leak in db.batch.
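
For illustration, a minimal sketch of that serial pattern (hypothetical code, not the actual search-index internals):

var levelup = require('levelup');
var db = levelup('./index-db');

// each element of batches is an array of operations like
// { type: 'put', key: 'k', value: 'v' }
function insertSerially(batches, done) {
  if (batches.length === 0) return done();
  // only issue the next batch once the previous one has called back,
  // so a single batch is in flight at a time
  db.batch(batches.shift(), function (err) {
    if (err) return done(err);
    insertSerially(batches, done);
  });
}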

dominictarr commented 11 years ago

I've found that large batches work pretty well - how many items do you have in your batches?

rvagg commented 11 years ago

We need a simple test case for this so we can replicate the behavior and track it down, and we also need to know whether this is new behavior or has been present for a while. There was a LevelDOWN release this week that contains lots of new C++ which could possibly be leaky, but it'd be nice to know if this is recent behavior.

Also note that LevelDB does have some behavior that can look like a leak when you're doing heavy writes. Initially it'll just climb right up and not look like it's going down, but it eventually comes right back down and settles there. I don't think I have a graph of this, but that'd probably be helpful to weed this kind of thing out.

fergiemcdowall commented 11 years ago

@dominictarr it's just over 21000 batch files of maybe 1-200 entries each

@rvagg OK- I will throw together a gist

fergiemcdowall commented 11 years ago

@rvagg See https://gist.github.com/fergiemcdowall/6239924

NOTE: this gist demonstrates the memory jump (about 3GB on my machine) when inserting many batches into levelUP, although I can't replicate the out-of-memory error I am experiencing when inserting into levelUP from Express.js; there the cleanup doesn't happen unless there is a start-stop of the application.

rvagg commented 11 years ago

Added an adjusted version to the gist that gives some breathing space to V8 to let the callbacks come in.

Increasing the total number of batches triggers the problem you're talking about:

FATAL ERROR: JS Allocation failed - process out of memory

so there's something there but I'm not sure where and whether it's a real memory leak on our end or not (could plausibly be up the stack or down the stack!).
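
The "breathing space" adjustment amounts to yielding to the event loop between batches rather than recursing straight from the callback. A rough sketch of the idea (the actual gist code may differ; db and makeBatch() are assumed here):

var total = 21000; // number of batches to write
var count = 0;

function writeNext() {
  if (count++ >= total) return console.log('done');
  db.batch(makeBatch(), function (err) {
    if (err) throw err;
    // defer the next batch so V8 gets a chance to run pending
    // callbacks (and the GC) before more work is queued
    setImmediate(writeNext);
  });
}
writeNext();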

fergiemcdowall commented 11 years ago

OK, thanks for the tips; that breathing space was a good idea.

fergiemcdowall commented 11 years ago

@rvagg could you push the adjusted version to the gist? :)

rvagg commented 11 years ago

see the comment in the gist

fergiemcdowall commented 11 years ago

Aha- got it, cheers!

rvagg commented 11 years ago

if I run the same stuff against leveldown directly I get exactly the same behaviour so it's not in levelup.

I'm still not entirely convinced this isn't a leveldb thing. If you run leak-tester.js in the tests directory of leveldown and watch it over time you'll see memory usage balloon and then settle back down again; it's odd behaviour, but I imagine if you push hard enough you could make that ballooning push you over Node's limit. Perhaps that's what's happening here?

I've also opened a leakage branch of leveldown with some minor things I've found so far.

No9 commented 11 years ago

The GC plot makes for scary reading: https://gist.github.com/No9/fa818a9d63d22551a837 (see plot at bottom of page). We could have hit a known issue, but I need to definitively map the node.js group thread to our situation.

No9 commented 11 years ago

Here is a flame graph of the execution over 20 mins.

The leveldb core library is out on the left, which would suggest to me that our issue is in node or our JS as opposed to the leveldb code.

(flame graph image)

rvagg commented 11 years ago

that's an odd flamegraph.. for 20 mins worth of work there's a lot of string and module stuff going on in there and I can't see much related to levelup/leveldown. Can you skip the use of lorem-ipsum since it seems to be getting in the way and see what it does with plain Buffers with crypto.randomBytes(x) or Strings if you want crypto.randomBytes(x).toString('hex') (both cases are interesting since they have to do different things).
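
For reference, the two suggested payload generators would look something like this (both use Node's standard crypto module):

var crypto = require('crypto');

// Buffer case: raw binary data passed straight through
function bufferValue(n) {
  return crypto.randomBytes(n);
}

// String case: the same bytes hex-encoded, which exercises a
// different encoding path in levelup/leveldown
function stringValue(n) {
  return crypto.randomBytes(n).toString('hex');
}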

No9 commented 11 years ago

Yup and I will also provide one that filters on doBatch so we can see if that sheds any light.

No9 commented 11 years ago

@rvagg here is the flamegraph you were looking for; this one appears to be more readable. It is a ten-minute sample and the app seg-faulted, so I have a coredump I can run ::findjsobjects on if I can get hold of the right v8.so. (flame graph image: CryptoOutput)

No9 commented 11 years ago

@dominictarr suggested compaction, and now we have a clearer flamegraph that could be worth some investigation. I'll run a trace on:

leveldb::DBImpl::BackgroundCompaction()
leveldb::DBImpl::MaybeScheduleCompaction()
leveldb::DBImpl::FinishCompactionOutputFile()
leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)

and try to get a correlation by timestamping the above and the GC, so we can see possible cause and effect.

(Edit) See this link for more detail on the current GC metrics. https://gist.github.com/No9/4f979544861588945788

Maybe compaction is causing demands on system resources that are forcing aggressive GC by V8?

heapwolf commented 11 years ago

As I said to @dominictarr, it could be compaction kicking in, but this might not be a problem at all. I'm not sure, but I think that leveldb::DBImpl::BackgroundCompaction() is designed to run when it can, as much as it can.

No9 commented 11 years ago

OK, so actual compactions over a 10-minute period look like the following (milliseconds):

121.787329
993.087684
1223.09732
2774.197186
988.637749

The full range of data is available here in the funky github tsv format: https://github.com/No9/datadump/blob/master/gc-vs-comp.tsv (you can filter for 'Do Compaction' at the top; all the other compaction calls are in there too).

Nearly 3 seconds is not optimal, but it isn't clear how this would be, or is, impacting things. (I was hoping for longer compactions throttled by GC.)

I think I am going to look at memory paging next, but I will also keep chasing the v8.so, as analysis of the core dump could be handy.

rvagg commented 11 years ago

@maxogden is reporting a leak trying to import via batch: https://github.com/maxogden/dat/blob/master/test/insert-junk-data.js

rvagg commented 11 years ago

outputs this: https://gist.github.com/maxogden/0ddccdd28263391a2251

ouch

juliangruber commented 11 years ago

Did you try using the chained batch, e.g. db.batch().put()...write()?
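
The chained form being referred to looks roughly like this (levelup's chained batch API; exact availability depends on the versions in play):

db.batch()
  .put('key1', 'value1')
  .put('key2', 'value2')
  .del('some-old-key')
  .write(function (err) {
    if (err) throw err;
    console.log('batch written');
  });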

rickbergfalk commented 11 years ago

I'm not so sure this is just a batch issue anymore. I tried loading data via looping single puts, and I eventually hit the same out-of-memory error. Puts with {sync: true} still leak; it just takes a lot longer to get there.

var levelup = require('level');
var db = levelup('./leakydb', {valueEncoding: 'json'}); // modifying cache size here doesn't help

var putOne = function (i) {
    i = i + 1; // i is already the function parameter; no redeclaration needed
    // log a little message every 10,000 puts
    if (i % 10000 == 0) {
        console.log(i + " ops   " + (new Date()).toISOString());
    }
    // if we're under 9,000,001 ops, do another put, otherwise stop
    if (i < 9000001) {
        var keyFriendlyI = i + 1000000; // add a million so it's sort-friendly
        var key = "aKeyLabel~" + keyFriendlyI;
        var value = {
            i: i,
            createdDate: new Date(),
            someText: "I thought that maybe single puts would make it further than the batch approach",
            moreText: "but that isn't necessarily true.",
            evenMoreText: "This hits a FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory."
        };
        // tried setting {sync: true} here and we still run out of memory 
        // it just takes a lot longer to get there
        db.put(key, value, function (err) {
            if (err) {
                console.log(err);
            } else {
                putOne(i);
            }
        });
    } else {
        console.log("Finished!");
    }
};
putOne(0);

rvagg commented 11 years ago

I concur; my initial testing suggests it's deeper than batch.

dominictarr commented 11 years ago

hmm, so, what happens if you run leveldown directly, without levelup?

rvagg commented 11 years ago

@tjfontaine I think we need some SmartOS magic to hunt down some kind of memory leak. It's likely in the C++ of LevelDOWN; perhaps it even has something to do with the way Node interacts with LevelDB. How's that blog post coming along about finding leaks? Some pointers would be nice.

tjfontaine commented 11 years ago

I haven't finished the blog post, so here is a list of steps. Some are obvious; they are not meant to be patronizing, just there for those who might also wander into this thread.

The short version: gcore the leaking process once the leak is visible, open the core in mdb, and run ::findleaks, which reports native-side leaks like this:

CACHE     LEAKED   BUFCTL CALLER
08072c90       1 0807dd08 buf_create+0x12
08072c90       1 0807dca0 func_leak+0x12
08072c90       1 0807dbd0 main+0x12

Catch me in IRC if there are more details you need or want an extra pair of eyes

rvagg commented 11 years ago

@tjfontaine I just don't get anything interesting; it mostly looks like this:

BYTES             LEAKED VMEM_SEG CALLER
8192                   1 932d9000 MMAP
------------------------------------------------------------------------
           Total       1 oversized leak, 8192 bytes

Does that mean it can't find any C++ leaks, and that the problem could be something to do with V8, like not letting go of persistent handles properly?

tjfontaine commented 11 years ago

Right, this means that -- at least during the time of the core -- all memory allocated on the native side was accounted for. I should have asked: you did do the gcore after you were able to observe the memory leak happening?

The next step, since you've eliminated at least the obvious native leaks, works like this, in the same core:

> ::findjsobjects

This may take a while (depending on the amount of objects that are still observable). It will output into a pager a list of object addresses, the count of objects, the amount of properties on that specific object type, and finally a brief description of the "type" of object. This may be a primitive, or the property-based class of an object:

8355dce5      210        8 Array
a313320d     1140        2 Arguments: length, callee
835860f9      492        7 Error: message, path, type, code, arguments, ...
8358aec1      537       40 Array
> a313320d::jsprint
{
    length: undefined,
    callee: undefined,
}
> a313320d::findjsobjects | ::jsprint
{
    length: undefined,
    callee: undefined,
}
{
    length: 3,
    callee: function <anonymous> (as EventEmitter.emit),
}
{
    length: 3,
    callee: function <anonymous> (as EventEmitter.emit),
}
...

This should give the easiest way to see what objects are still being held around in the JS heap, and where your leak may be, or it may not even be a leak so much as a less than ideal design.

Other reading can be found at http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/ or catching me on IRC

rvagg commented 11 years ago

oh the irony!

*** mdb: received signal SEGV at:
    [1] v8.so`jsobj_properties+0x262()
    [2] v8.so`findjsobjects_range+0x14b()
    [3] v8.so`findjsobjects_mapping+0x4a()
    [4] libproc.so.1`i_Pmapping_iter+0x60()
    [5] libproc.so.1`Pmapping_iter+0x16()
    [6] v8.so`dcmd_findjsobjects+0x187()
    [7] mdb`dcmd_invoke+0x40()
    [8] mdb`mdb_call_idcmd+0x128()
    [9] mdb`mdb_call+0x325()
    [10] mdb`yyparse+0x3f7()
    [11] mdb`mdb_run+0x26d()
    [12] mdb`main+0x153a()
    [13] mdb`_start+0x83()
tjfontaine commented 11 years ago

@rvagg this can happen on older v8.so's. If you put that core in manta, however, and mlogin -s /rvagg/stor/core.1234 --memory=2048, you should be able to do mdb /assets/rvagg/stor/core.1234 and it will do the right thing.

rvagg commented 11 years ago

Thanks @tjfontaine, I've managed to figure out doing this on manta.

Everyone else: we seem to have a problem with persistent references not being freed. Running test/leak-tester.js for a while, I get a lot of objects; these are the ones with counts significant enough to bother looking at, along with what I know & can guess about them:

817d5185    30523        0 Array
# 40 'empty' elements in each of these
85008a2d     3423        1 Object: key
# I'm pretty sure this is from get() calls
8172f419     3579        1 Array
8500890d     3372        2 Object: key, value
# I'm pretty sure this is from put() calls
850088e9     6799        1 Object: callback
# anon
8172f44d     3580        2 Object: callback, domain
# `run`, which is the main function in leak-tester.js
8172ed89    40025        2 SlowBuffer: length, used
# length=0, cleaned up perhaps?
8171484d    27367        3 Error: arguments, type, message
# mostly empty, the occasional one has a large hex string as the `message`

/cc @kkoopa

rvagg commented 11 years ago

I should also say that the nature of leak-tester.js is this: it generates random keys in a space of 10000000 possible, performs a get() on each key (the Object with 'key' in the above data) to see if it exists (mostly they don't: the Error in the above data), then does a put() (the Object with 'key' and 'value' above) for that key with a random String from crypto.randomBytes(1024).toString('hex') (it can be made to do just Buffers, but in my case I'm running Strings).
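
In other words, something along these lines (a paraphrase of the behaviour described above, not the actual leak-tester.js source; db is assumed to be an already-open instance):

var crypto = require('crypto');

function run() {
  // random key in a space of 10000000 possible
  var key = String(Math.floor(Math.random() * 10000000));
  db.get(key, function (err, value) {
    // usually errors with NotFound: the key mostly doesn't exist yet
    var payload = crypto.randomBytes(1024).toString('hex');
    db.put(key, payload, function (err) {
      if (err) throw err;
      run();
    });
  });
}
run();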

kkoopa commented 11 years ago

I see SlowBuffer in the list there, which indicates Node 0.11.2 or less. Is the problem present in older versions of leveldown? What about on newer Node? Could some persistent handles require being made weak?

rvagg commented 11 years ago

sorry, I should have said, this is the latest 0.10. I'm just messing with 0.11 now but am finding some other odd problems just getting the leak tester going!

kkoopa commented 11 years ago

I see, latest node master produces a whole bunch of failed tests due to uv-stuff:

node: ../deps/uv/src/unix/loop.c:150: uv__loop_delete: Assertion `!((((*(&(loop)->active_reqs))[0]) == (&(loop)->active_reqs)) == 0)' failed.
not ok test/approximate-size-test.js .................... 1/2
    Command: "node" "approximate-size-test.js"
    TAP version 13
    ok 1 cleanup returned an error
    not ok 2 test/approximate-size-test.js
      ---
        exit:    ~
        signal:  SIGABRT
        stderr:  node: ../deps/uv/src/unix/loop.c:150: uv__loop_delete: Assertion `!((((*(&(loop)->active_reqs))[0]) == (&(loop)->active_reqs)) == 0)' failed.
        command: "node" "approximate-size-test.js"
      ...

    1..2
    # tests 2
    # pass  1
    # fail  1

I'll try having a look with 0.10 in the meantime.

kkoopa commented 11 years ago

I don't recall anyone mentioning testing with --expose-gc and forcing garbage collection. If all (non-persistent) objects are still in the active context, they might not get garbage collected despite having no references and so forth.
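
For anyone trying this: forcing a collection requires starting node with the --expose-gc flag, which makes a global gc() function available, e.g.:

// run as: node --expose-gc leak-tester.js
if (typeof global.gc === 'function') {
  global.gc(); // force a full collection, e.g. between write batches
}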

rvagg commented 11 years ago

I messed a bit with --trace_gc and even --gc_interval but not much beyond that. All I could see was a bunch of V8 objects not being collected.

rvagg commented 11 years ago

So I think I've solved at least part of this puzzle. leak-tester.js behaves a little more reasonably after this: https://github.com/rvagg/nan/commit/e4097ab87b74fff4a5a6da7ff54f27d522e9e1d8

I honestly don't understand the interaction of v8::HandleScope and Persistent references (or perhaps this isn't even about Persistent references?) but it seems that when you forget to use it you get subtle but nasty behaviour. Perhaps @tjfontaine or @kkoopa have more of a clue than I?

rvagg commented 11 years ago

Anyone who thinks they see a leak: try reinstalling leveldown to pick up nan@0.3.2 from npm and see if it makes a difference.

rvagg commented 11 years ago

getCount = 1561000 , putCount =  1445759 , rss = 1583% 144M ["1","44","208","1284","0","0","0"]

Much more like normal now! Memory goes up when the throughput is high and settles down as leveldb starts to slow down a little with higher volume. I'm not even sure whether it's V8 or leveldb that causes this memory behaviour, but I know it's normal!

The next problem is that I can't even get leak-tester.js to run in Node 0.11.4+. Node seems to completely lose the reference to the callback given to put(), and it doesn't get called at all, so it just stops. If I put a console.log() in there then it continues. This is the case even up to current Node master, so I'd be interested to hear from anyone who's using Node 0.11 for anything serious; @juliangruber perhaps?

kkoopa commented 11 years ago

Aha, that might very well be it. It's not about persistent references per se, but Locals (or Handles) on the stack. I am not sure, but I seem to remember something along the lines of: if a HandleScope is not specified in a non-inlined function, the scope of the caller is used. For some reason I don't quite remember now, any time a new Local or Handle is created you have to have a HandleScope in the surrounding scope (or is it context?). NanPersistentToLocal always creates a new Local<T> in 0.11+, as well as for weak handles in 0.10-; this case is 0.10- and the handles are not weak, so that does not apply here. However, #define NanSymbol(value) v8::String::NewSymbol(value) creates a new String, and therefore a HandleScope is required.

In general, NAN assumes scopes are available in several functions, e.g. NanReturnValue(), which assumes a HandleScope called scope exists in 0.10-, but not in 0.11+, and so will work in the latter, but not in the former. The easiest solution is recommending always doing a NanScope() whenever using NAN methods.

juliangruber commented 11 years ago

@rvagg using node 0.11, levelup 0.14 and leveldown 0.8 in a dev environment and so far it works fine

rvagg commented 11 years ago

@fergiemcdowall could you report back when you've given the latest a go and close this issue if you think your immediate problem has gone away.

You should run npm ls | grep nan and see nan@0.3.2 to know you have the fixed version.

fergiemcdowall commented 11 years ago

Testing on a Windows 8 machine. Will try OSX and Linux later

The issue as reported is still there, although the behaviour is slightly different. Memory allocation still climbs inexorably upwards and finally crashes with FATAL ERROR: JS Allocation failed - process out of memory, although now "batch" is a bit slower. Same end result, but it takes longer to get there (although it's around the same in terms of total .batch calls issued).

fergiemcdowall commented 11 years ago

My Level is as follows:

├── levelup@0.14.0 (concat-stream@0.1.1, simple-bufferstream@0.0.4, prr@0.0.0, errno@0.0.5, semver@1.1.4, bops@0.0.6, xtend@2.0.6)
└── leveldown@0.8.0 (bindings@1.1.1, nan@0.3.2)

And my node is:

$ node --version
v0.11.6

rvagg commented 11 years ago

A leveldown 0.8.1 was published a couple of days ago with some additional (minor) leak fixes that were present in my local testing environment; I'd forgotten to commit and publish them when I found the NAN leak. 0.8.2 is also out now, but that just adds FreeBSD support.

So, grab the latest version and try again and let's see if it's any different.

fergiemcdowall commented 11 years ago

Hmmm, now I get a slightly different error at more or less the same point:

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

My Level is now

search-index@0.2.17 node_modules\search-index
├── levelup@0.14.0 (concat-stream@0.1.1, prr@0.0.0, simple-bufferstream@0.0.4, errno@0.0.5, semver@1.1.4, bops@0.0.6, xtend@2.0.6)
└── leveldown@0.8.1 (bindings@1.1.1, nan@0.3.2)