MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Memory Leak Tracking and Remote Debugging #598

Closed CMCDragonkai closed 12 months ago

CMCDragonkai commented 1 year ago

Describe the bug

Testnet 6 has been deployed and we are getting some interesting results from the nodes: they no longer crash, but they are accumulating memory.

image

This is evidence of a memory leak. Once it reaches near 100% it will crash.

I recently saw this, and you can see it occurred during an automatic deployment, so it was not a crash.

image

But you can see that the memory usage gets reset and then starts growing again.

Now we could try to find the problem and solve it directly, but long term this kind of problem is likely to recur. We need a more robust way of detecting memory leaks in production over a long period of time.
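One low-cost option for this (a sketch, not something we have deployed) would be to periodically sample process.memoryUsage() from within the agent and ship the numbers to the logs/metrics, so long-term growth shows up without a debugger attached:

// Hypothetical periodic memory sampler; the interval and the console sink
// are placeholders, not Polykey code.
const intervalMs = 60_000;
setInterval(() => {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  console.log(JSON.stringify({
    time: new Date().toISOString(),
    rss,
    heapTotal,
    heapUsed,
    external,
    arrayBuffers,
  }));
}, intervalMs).unref(); // unref() so the sampler never keeps the process alive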

I think to achieve this we can add:

To Reproduce

  1. Go to https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#dashboards/dashboard/polykey-testnet and observe the memory utilization graph.

Expected behavior

It should just flatten out when idling.

Screenshots

See above.

Additional context

Notify maintainers

tegefaulkes commented 1 year ago

I can take memory dumps and do diffs between them. But there's too much going on in the code to know for sure where the leaks are coming from.

What I've done so far:

  1. I made a small test script in Polykey that just runs a PolykeyAgent. I can run this with profiling tools to take memory snapshots at any time.
  2. I took snapshots at irregular times; these were saved to files I could inspect later.
  3. I used my IDE's tool for looking at these snapshots. The Chrome dev tools can do this too and I think they are better. I may look into other tools, but the Chrome tools should suffice.
  4. Looking over the snapshots I found that at most 44MB were used, a far cry from the 200MB minimum the Docker image uses.
  5. Looking through the snapshots, it seems that a lot of memory is taken up by strings, buffers, closures, and abort signals; pretty much anything that should be ephemeral. This may just be normal build-up from usage.

I think moving forward it's better to isolate parts of the code and reduce the amount of noise in the snapshots. If there are any true leaks then they'll survive a garbage collection. We can't really know when a GC happens, but we can trigger one from within code. So moving forward I can approach the problem a few ways.

  1. Write test scripts that isolate some aspect of the code, e.g. how buffers, closures, and events could cause leaks.
  2. Isolate specific libraries (async-init, timers, timed cancellable, async-locks) for testing as well.
  3. Trigger GC manually to filter out memory that would be cleaned up anyway, and compare against the kind of memory that isn't a problem but just hangs around for a little while. We can also look into freeing this memory more proactively. A sketch of this snapshot-plus-GC approach follows below.
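For reference, a minimal sketch of how a snapshot-plus-forced-GC run could look. This is not the actual test script; it assumes Node is started with --expose-gc and uses the built-in v8.writeHeapSnapshot().

"use strict";
const v8 = require('node:v8');

function sleep(delay) {
  return new Promise((r) => setTimeout(r, delay));
}

async function main() {
  // Baseline snapshot before exercising anything.
  v8.writeHeapSnapshot('before.heapsnapshot');
  // ... exercise the code under test here ...
  await sleep(1000);
  // Force a GC (only available when run with `node --expose-gc`) so that
  // anything still retained afterwards is a real suspect.
  if (global.gc) global.gc();
  v8.writeHeapSnapshot('after.heapsnapshot');
}

main();
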
tegefaulkes commented 1 year ago

Moving forward I'm going to modify the source code slightly so that background tasks happen with a much shorter delay. I'll also run 4 nodes locally to simulate a network. Hopefully, with a lot more happening in the background, the memory leak will be more noticeable.

tegefaulkes commented 1 year ago

Hmm, there are new stability issues with the nodes now.

One is coming from js-quic. This may be a race condition in js-quic; I'll need to investigate.

WARN:task v0pa8b6o9rto0154c8lmt97rabk:Failed - Reason: ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message
/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:255
                    throw new errors.ErrorQUICStreamInternal('Failed to prime local stream state with a 0-length message', { cause: e });
                          ^

ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message
    at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:255:27)
    at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/events/dist/Evented.js:25:17)
    ... 7 lines matching cause stack trace ...
    at Proxy.<anonymous> (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:56:50) {
  data: {},
  cause: Error: StreamLimit
      at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:247:33)
      at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/events/dist/Evented.js:25:17)
      at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-init/dist/CreateDestroy.js:16:30)
      at QUICStream.createQUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:64:24)
      at constructor_.newStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICConnection.js:864:49)
      at constructor_.newStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-init/dist/StartStop.js:189:26)
      at RPCClient.streamFactory (/home/faulkes/matrixcode/polykey/Polykey/dist/nodes/NodeConnection.js:252:39)
      at RPCClient.duplexStreamCaller (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:240:40)
      at RPCClient.serverStreamCaller (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:140:44)
      at Proxy.<anonymous> (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:56:50) {
    code: 'GenericFailure'
  },
  timestamp: 2023-11-06T03:14:15.432Z
}

And one I need to fix in Polykey. This one is a simple fix, though it's weird that it's timing out at all.

INFO:Status:Status is DEAD
/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:250
            rpcStream.cancel(new errors.ErrorRPCTimedOut('RPC has timed out', {
                             ^

ErrorRPCTimedOut: RPC has timed out
    at refreshingTimer (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:250:30)
    at onFulfilled_ (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-cancellable/dist/PromiseCancellable.js:145:39) {
  data: {},
  cause: ErrorContextsTimedTimeOut
      at setupTimedCancellable (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/contexts/dist/functions/timedCancellable.js:72:19)
      at constructor_.getRemoteNodeClosestNodes (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/contexts/dist/decorators/timedCancellable.js:54:65)
      at constructor_.getRemoteNodeClosestNodes (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-init/dist/StartStop.js:189:26)
      at constructor_.syncNodeGraph (/home/faulkes/matrixcode/polykey/Polykey/dist/nodes/NodeManager.js:780:67) {
    data: {},
    cause: undefined,
    timestamp: 2023-11-06T03:13:16.219Z
  },
  timestamp: 2023-11-06T03:13:31.220Z,
  code: -32008
}
tegefaulkes commented 1 year ago

There are too many moving parts in Polykey. I can maybe gain some simpler insights by running small stress tests in js-RPC. I'm seeing a lot of live streams in the memory profiling, but I'll need to do some A/B testing to see how things affect memory.

tegefaulkes commented 1 year ago

The last snapshot I took showed a lot of retained memory from the streamMap in QUICConnection and the handler in Timer. The handler is a closure, so it will hold on to the scope where it was defined. From reading up on this, closures like this seem to be a common source of memory problems.

My next step is to take a closer look at js-timer to see how we can track down why it's happening and optimise it. In this case we may just need to delete the handler when it's not needed anymore.

CMCDragonkai commented 1 year ago

Current status update @tegefaulkes? ETA on fix?

tegefaulkes commented 1 year ago

It's hard to give a fixed ETA on this. There's too much unknown to give a number. The problem scope is that somewhere in the code we're doing something that holds on to memory for too long.

Right now I have an idea of what it is: the handlers in Timer, the streamMap in Connection, and possibly the web streams. I don't know exactly why yet.

Reading up on memory issues in JavaScript points out these common suspects (https://amplication.com/blog/understanding-and-preventing-memory-leaks-in-nodejs):

  1. Global variables
  2. Circular reference?
  3. Closures
  4. Timers
  5. Event handlers

So moving forward I'm going to focus on these.

But like I said, the problem scope is kind of unbounded, so I can't estimate an ETA. That's why I've been suggesting focusing on the other issues first, since their scopes are well known, and right now things still function despite the memory issue.
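As a tiny illustration of the closure case (a made-up example, not Polykey code): a long-lived callback that closes over a large buffer keeps that buffer reachable for as long as the callback itself is registered.

"use strict";

// The interval callback closes over `big`, so the 10MB buffer stays
// reachable for as long as the interval is running, even though only its
// length is ever used.
function startHeartbeat() {
  const big = Buffer.alloc(10 * 1024 * 1024);
  return setInterval(() => {
    console.log('heartbeat', big.byteLength);
  }, 1000);
}

const interval = startHeartbeat();
// Clearing the interval drops the callback, which in turn releases `big`.
setTimeout(() => clearInterval(interval), 5000);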

tegefaulkes commented 1 year ago

I have two tests doing the same thing:

  1. Create 10000 Timers with a handler that references a 1kb buffer. As these timers are created they are put into an array.
  2. Wait for all timers to finish.
  3. Either null the array to free the reference to all the timers, or leave it be.
  4. Trigger GC.

The code looks like this.

async function main1() {

  console.log('Take Snapshot1 BEFORE');
  await sleep(10000);

  // Example of taking up a lot of memory: create 10000 timers, each with a
  // handler that references a 1kb buffer, and keep them all in an array.
  let timers = [];
  for (let i = 0; i < 10000; i++) {
    timers.push(createTimer());
  }
  console.log('Take Snapshot2 ACTIVE TIMERS');
  await Promise.all(timers);
  console.log('Take Snapshot3 DONE TIMERS');
  await sleep(10000);

  // Drop the only reference to the timers, then force a GC
  // (requires running with `node --expose-gc`).
  timers = null;
  if (global.gc) global.gc();
  console.log('Take Snapshot4 GC');
  await sleep(10000);
}
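The createTimer and sleep helpers aren't shown above. A minimal sketch of what they could look like, assuming js-timer's Timer accepts a handler plus delay and is awaitable (the import path and constructor shape are assumptions here, not the exact test code):

"use strict";
const { Timer } = require('@matrixai/timer');

function sleep(delay) {
  return new Promise((r) => setTimeout(r, delay));
}

function createTimer() {
  // The handler closes over a 1kb buffer, so the buffer is retained for as
  // long as the Timer retains the handler.
  const buffer = Buffer.alloc(1024);
  return new Timer({
    delay: 1000,
    handler: () => buffer.byteLength,
  });
}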

The first run leaves the array alone to keep the reference to all created timers.

When the test is finished we can see that the snapshot shows 44MB used. I can see that 20MB are allocated when the timers are created, and a further 9MB when the timers settle.

image

  1. Shows the total un-freed memory.
  2. The first peak shows the creation of the timers.
  3. The 2nd peak shows when they settle.
  4. We can see most of the retained size is due to arrays, Timers and closures. That relates to the Timer holding a reference to the handler, which holds a reference to the buffer.

Since we're holding a reference to all timers, this is about what we'd expect. The timer array is the main source of the leak here. But we can see that the timers are still keeping references to things they don't need anymore; mainly, the handler closure is being kept for the full lifetime of the timer.

The 2nd test does the same but sets the timers array to null to remove the reference.

image

  1. Total memory is significantly reduced.
  2. First peak is timer creation.
  3. 2nd peak is where timers settle.

We can see that most of the memory has been freed, but not all of it.

So right now we can conclude a few things.

  1. Timers will hold a reference to the closure for as long as the timer exists. This can be pretty bad, since a badly made closure can hold a reference to the whole scope it was defined in, which could be a source of leaks. In any case, we can look into nulling any properties we don't need after settling, to free memory proactively.
  2. Even if we remove all references to the timers we've created, there is still some memory being held by them. Something small is leaking here, but it builds up.

Here we have an example of live memory. We can see what things are still live by the end of the tests.

image

Hmm, some things are still live, which seems odd. The createTimer function shouldn't be. I'll try re-running with 'use strict' to avoid hoisting.

tegefaulkes commented 1 year ago

Using 'use strict' does remove createTimer from the list, but it's mostly the same.

image

tegefaulkes commented 1 year ago

OK, so that's not really a leak. With 10x more timers created, the remaining memory stays the same.

image

So the only real problem with Timer is that it keeps the closure around for longer than it really needs to. Nulling the handlers when we're done with them can help, but we should be cleaning up timers properly anyway.

CMCDragonkai commented 1 year ago

If you need to do an explicit deletion, use delete this.handler. Probably a good idea, since it's held around in the function context. Make sure this is only done when the timer is no longer needed...
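A minimal sketch of what that could look like (hypothetical; the real js-timer Timer internals differ): once the timer has settled or is cancelled, the handler is dropped so that its closure, and anything it captured, becomes collectable even if the Timer object itself stays referenced.

// Hypothetical Timer-like class, only showing the handler lifecycle.
class TimerLike {
  constructor(handler, delay) {
    this.handler = handler;
    this.timeout = setTimeout(() => this.fulfill(), delay);
  }
  fulfill() {
    try {
      if (this.handler != null) this.handler();
    } finally {
      // Release the closure; the Timer may outlive its handler.
      delete this.handler;
    }
  }
  cancel() {
    clearTimeout(this.timeout);
    delete this.handler;
  }
}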

tegefaulkes commented 1 year ago

I pushed a patch to free the handler when the timer is done.

I'm going to move on to checking out js-quic and see if the streams/connections are cleaned up properly.

tegefaulkes commented 1 year ago

I found a leak in the js-logger Logger class. Each Logger can create a child logger with getChild(). We use this everywhere in our code; each new class gets a child logger. This is how we track where the logging is coming from.

However, each Logger tracks its parent and children. This means the whole tree of child loggers forms a strongly connected group of references to each other, referred to as an object group. The Node GC is smart enough to know that if this group detaches from any root reference then it can be deleted. HOWEVER, if a single reference is held to any object in this group, then the whole group is retained. In this case, we usually keep at least one logger around for long-running objects.

I created a test script to demonstrate this.

"use strict";
const { default: Logger } = require('./dist/Logger');

function sleep(delay) {
  return new Promise( r => setTimeout(r, delay));
}

async function main() {

  console.log('starting');
  await sleep(5000);

  // Build a chain of 10000 child loggers off a single root logger.
  const root = new Logger();
  let head = root;
  for (let i = 0; i < 10000; i++) {
    head = head.getChild(`Logger ${i}`);
  }
  console.log('created!');
  await sleep(1000);

  // Drop the reference to the deepest child; only `root` is still referenced,
  // yet the whole chain is retained because each Logger tracks its children.
  head = null;
  console.log('deleted!');
  await sleep(10000);
  console.log('done!');
}

main();

Running this script with the inspector shows the following.

  1. We see that with 10,000 Loggers we're holding 5MB of memory.
  2. We can see that all 10,000 Loggers still exist in memory even though we only keep a reference to the root logger.

image

So if I modify Logger to only track used keys and not the children, we should avoid the problem. We can see this realised by running the test again after making the change.

  1. We can see the memory is reduced
  2. Only the root logger remains since we only hold that reference by the end of the test.

image

I'll push a patch to the logger in a moment.
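A rough sketch of the kind of change (hypothetical; the actual js-logger internals and the patch may differ): the parent only remembers which child keys are in use, instead of holding references to the child Logger instances, so children can be collected independently of the parent.

// Hypothetical before/after of the parent-to-child tracking.
// Before: this.loggers.set(key, childLogger); // parent retains every child
// After: only the keys are remembered; children still reference the parent,
// but the parent no longer references the children.
class LoggerSketch {
  constructor(key = 'root', parent) {
    this.key = key;
    this.parent = parent;
    this.usedKeys = new Set();
  }
  getChild(key) {
    this.usedKeys.add(key);
    return new LoggerSketch(key, this);
  }
}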

tegefaulkes commented 1 year ago

Looking over js-quic again, I have two concerns.

  1. contexts is registering event handlers for abort signals but not removing them. I'm pretty sure this is leaking the timeout timers.
  2. Each stream creates a 1mb buffer by default. This is a lot of memory for a single stream. It's getting released properly, BUT this means a lot of concurrent streams will eat a lot of memory. To fix this we need to use BYOB streams sooner rather than later (see the sketch below). https://github.com/MatrixAI/js-quic/issues/65
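For reference, a minimal sketch of the BYOB pattern with WHATWG byte streams (generic web streams usage, not the js-quic implementation): the reader supplies its own buffer and the stream writes into it, so there is no need to allocate a large buffer per stream up front.

"use strict";
const { ReadableStream } = require('node:stream/web');

// A byte stream whose pull() writes directly into the reader-supplied buffer
// via controller.byobRequest, instead of allocating its own chunks.
const stream = new ReadableStream({
  type: 'bytes',
  pull(controller) {
    const request = controller.byobRequest;
    if (request != null) {
      request.view.fill(0xab);
      request.respond(request.view.byteLength);
    }
  },
});

async function main() {
  const reader = stream.getReader({ mode: 'byob' });
  // The caller owns this buffer and can size or reuse it as it sees fit.
  const { value, done } = await reader.read(new Uint8Array(4096));
  console.log(done, value ? value.byteLength : 0);
  reader.releaseLock();
}

main();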
tegefaulkes commented 1 year ago

I'm going to look over contexts now.

tegefaulkes commented 1 year ago

Looks like a timedCancellable function cleans up fully once the PromiseCancellable is unreferenced, so there is no real problem there. Note that the Timer, EventTarget, AbortController and the like are referenced by this PromiseCancellable. I feel that, as an optimisation, these could be released once the promise has resolved.

Some notes about EventTarget.

  1. An event target will hold a reference to all handlers provided to it. These will be held live for as long as the target is live.
  2. Handlers are closures, so anything referenced within them will be kept alive as well. Be careful with this; try not to reference the enclosing this within the closure.
  3. If the event target is unreferenced it is garbage collected; this includes all of the handlers that were still registered.

So it's not the end of the world to not remove event handlers, but since EventTargets tend to be long-lived it's not ideal.

With that said, if you pass a signal into the ctx and don't free it at some point, then there's a chance you'll leak the handlers, and the timeout timer if one was created.
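A small illustration of the retention and the cleanup pattern (generic AbortSignal/EventTarget usage, not the contexts implementation): handlers registered on a long-lived signal stay reachable, along with everything they close over, until they are removed.

"use strict";

const controller = new AbortController();

function doWork(signal) {
  const big = Buffer.alloc(1024 * 1024);
  const onAbort = () => console.log('aborted with', big.byteLength, 'bytes in scope');
  signal.addEventListener('abort', onAbort);
  try {
    // ... the actual work guarded by the signal would happen here ...
  } finally {
    // Without this, onAbort (and the 1mb buffer it closes over) stays
    // reachable for as long as the signal itself is alive.
    signal.removeEventListener('abort', onAbort);
  }
}

doWork(controller.signal);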

CMCDragonkai commented 12 months ago
  1. There hasn't been an empirical test of your js-logger changes on the PK CLI. The changes are already in PK. Propagate them to the PK CLI and deploy new seed nodes to check this.
  2. All the other things mentioned are optimisations but don't look like sources of memory leaks.
CMCDragonkai commented 12 months ago

@amydevs can you deploy a new one based on the current PK now?

CMCDragonkai commented 12 months ago

@tegefaulkes go to AWS and have a look.

CMCDragonkai commented 12 months ago

And post another picture of the memory leak, since we've now had 24 hrs.

CMCDragonkai commented 12 months ago

The PK CLI auto-deploys to the testnet on every commit to staging.

amydevs commented 12 months ago

image

CMCDragonkai commented 12 months ago

We can see that before we had 3% growth in 8 hrs; now it is 1% growth in 8 hrs.

If this stabilises over the weekend, then at the very least the memory leak is solved while idling, but we don't know whether leaks could still occur during other interactions.

tegefaulkes commented 12 months ago

Previously we saw

image

with the latest push we see

image

CMCDragonkai commented 12 months ago

I think this can be considered sufficiently solved in practice. Theoretically there are likely other memory leak issues to be detected; we need better instrumentation and observability systems to help us detect those. But I think we can close this issue.

image