I can take memory dumps and do diffs between them. But there's too much going on in the code to know for sure where the leaks are coming from.
What I've done so far: I have a `polykey` script that just runs a `PolykeyAgent`. I can run this with profiling tools to take memory snapshots at any time. Moving forward I think it's better to isolate parts of the code and reduce the amount of noise in the snapshots. Any true leaks will survive a garbage collection, but we can't really know when a GC happens; we can, however, trigger one from within the code. So I can approach the problem a few ways.
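For reference, a minimal sketch of how a GC can be forced and a snapshot captured from within code (the helper name and file naming are just illustrative, not anything in Polykey):

```js
// Illustrative helper: force a GC, then write a heap snapshot that can be
// loaded and diffed in Chrome DevTools.
const v8 = require('v8');

function takeSnapshot(label) {
  // `global.gc` is only defined when the process is started with `node --expose-gc`.
  if (global.gc) global.gc();
  const file = v8.writeHeapSnapshot(`${label}-${Date.now()}.heapsnapshot`);
  console.log(`wrote ${file}`);
}
```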
Moving forward I'm going to modify the source code slightly so that background tasks happen with a much shorter delay. I'll also run 4 nodes locally, simulating a network. Hopefully, with a lot more happening in the background, the memory leak will be more noticeable.
Hmm, there are new stability issues with the nodes now.
One is coming from `js-quic`. This may be a race condition in `js-quic`; I'll need to investigate.
```
WARN:task v0pa8b6o9rto0154c8lmt97rabk:Failed - Reason: ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message
/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:255
throw new errors.ErrorQUICStreamInternal('Failed to prime local stream state with a 0-length message', { cause: e });
^
ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message
at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:255:27)
at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/events/dist/Evented.js:25:17)
... 7 lines matching cause stack trace ...
at Proxy.<anonymous> (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:56:50) {
data: {},
cause: Error: StreamLimit
at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:247:33)
at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/events/dist/Evented.js:25:17)
at new QUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-init/dist/CreateDestroy.js:16:30)
at QUICStream.createQUICStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICStream.js:64:24)
at constructor_.newStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/quic/dist/QUICConnection.js:864:49)
at constructor_.newStream (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-init/dist/StartStop.js:189:26)
at RPCClient.streamFactory (/home/faulkes/matrixcode/polykey/Polykey/dist/nodes/NodeConnection.js:252:39)
at RPCClient.duplexStreamCaller (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:240:40)
at RPCClient.serverStreamCaller (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:140:44)
at Proxy.<anonymous> (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:56:50) {
code: 'GenericFailure'
},
timestamp: 2023-11-06T03:14:15.432Z
}
```
And one I need to fix in `polykey`. This one is a simple fix, though it's weird that it's timing out at all.
```
INFO:Status:Status is DEAD
/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:250
rpcStream.cancel(new errors.ErrorRPCTimedOut('RPC has timed out', {
^
ErrorRPCTimedOut: RPC has timed out
at refreshingTimer (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/rpc/dist/RPCClient.js:250:30)
at onFulfilled_ (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-cancellable/dist/PromiseCancellable.js:145:39) {
data: {},
cause: ErrorContextsTimedTimeOut
at setupTimedCancellable (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/contexts/dist/functions/timedCancellable.js:72:19)
at constructor_.getRemoteNodeClosestNodes (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/contexts/dist/decorators/timedCancellable.js:54:65)
at constructor_.getRemoteNodeClosestNodes (/home/faulkes/matrixcode/polykey/Polykey/node_modules/@matrixai/async-init/dist/StartStop.js:189:26)
at constructor_.syncNodeGraph (/home/faulkes/matrixcode/polykey/Polykey/dist/nodes/NodeManager.js:780:67) {
data: {},
cause: undefined,
timestamp: 2023-11-06T03:13:16.219Z
},
timestamp: 2023-11-06T03:13:31.220Z,
code: -32008
}
```
There are too many moving parts in Polykey. I can maybe gain some simpler insights by running small stress tests against `js-rpc`. I'm seeing a lot of live streams in the memory profiling, but I'll need to do some A/B testing to see how things affect memory.
The last snapshot I took showed a lot of retained memory from the `streamMap` in `QUICConnection` and the `handler` in `Timer`. The handler is a closure, so it will hold the context of where it was defined. From reading up on this, closures like this seem to be a common source of memory problems.
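To illustrate the closure behaviour (a generic example, not Polykey code): at least in V8, closures created in the same scope share a single context object, so holding any one of them can retain variables captured by the others.

```js
// Generic illustration, not Polykey code: both handlers share one closure
// context, so keeping only `small` alive still retains the 1 MiB buffer,
// because `logBig` captures it from the same scope.
function makeHandlers() {
  const big = Buffer.alloc(1024 * 1024);
  const logBig = () => console.log(big.length); // pulls `big` into the shared context
  const small = () => console.log('tick');
  return { small, logBig };
}

const { small } = makeHandlers();
// `big` remains reachable via `small`'s context until `small` itself is released.
```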
My next step is to take a closer look at `js-timer` to see how we can track down why it's happening and optimise it. In this case we may just need to delete the handler when it's not needed anymore.
Current status update @tegefaulkes? ETA on fix?
It's hard to give a fixed ETA on this; there are too many unknowns to give a number. The problem scope is that somewhere in the code we're doing something that holds on to memory for too long.
Right now I have an idea of what it is: the handlers in `Timer`, the `streamMap` in the connection, and possibly the webstreams. I don't know exactly why yet.
Reading up on memory issues in JavaScript points to these common suspects: https://amplication.com/blog/understanding-and-preventing-memory-leaks-in-nodejs
So moving forward I'm going to focus on these.
But like I said, the problem scope is kind of unbounded, so I can't estimate an ETA. That's why I've been suggesting focusing on the other issues first, since their scopes are well known, and right now things still function despite the memory issue.
I have two tests doing the same thing: `Timer`s with a handler that references a 1kb buffer. When these timers are created they are put into an array. The code looks like this.
```js
function sleep(delay) {
  return new Promise((r) => setTimeout(r, delay));
}

// `createTimer` (not shown here) creates a Timer whose handler closes over a 1kb buffer.
async function main1() {
  console.log('Take Snapshot1 BEFORE');
  await sleep(10000);
  // Example of taking up a lot of memory: hold every timer in an array.
  let timers = [];
  for (let i = 0; i < 10000; i++) {
    timers.push(createTimer());
  }
  console.log('Take Snapshot2 ACTIVE TIMERS');
  await Promise.all(timers);
  console.log('Take Snapshot3 DONE TIMERS');
  await sleep(10000);
  timers = null;
  if (global.gc) global.gc(); // only defined when run with --expose-gc
  console.log('Take Snapshot4 GC');
  await sleep(10000);
}

main1();
```
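I run these scripts with something along the lines of `node --expose-gc --inspect <script>` (invocation shown for illustration) so that `global.gc()` is available and DevTools can attach to take heap snapshots.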
The first run leaves the array alone to keep the reference to all created timers.
When the test is finished we can see that the snapshot shows 44MB used. I can see that 20MB is allocated when the timers are created, and a further 9MB when the timers settle.
Since we're holding a reference to all timers, this is about what we'd expect. The `timers` array is the main source of the leak here. But we can see that the timers are still keeping a reference to things they don't need anymore, mainly the handler closure, which is kept for the full lifetime of the timer.
The second test does the same but sets the `timers` array to `null` to remove the reference.
We can see that most of the memory has been freed, but not all of it.
So right now we can conclude a few things.
Here we have an example of the live memory: we can see which things are still live by the end of the tests.
Hmm, some things are still live, which seems odd. The `createTimer` function shouldn't be. I'll try re-running with `'use strict'` to avoid hoisting.
Using `'use strict'` does remove `createTimer` from the list, but it's mostly the same.
OK, so that's not really a leak: with 10x more timers created, the remaining memory stays the same.
So the only real problem with `Timer` is that it keeps the closure around longer than it really needs to. Nulling the handlers when we're done with them can help, but we should be cleaning up timers properly anyway.
If you need to do an explicit deletion, use `delete this.handler`. That's probably a good idea, since it's held around in the function context. Just make sure this is only done when the timer is no longer needed...
I pushed a patch to free the handler when the timer is done.
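For illustration, a minimal sketch of the idea behind the patch (a plain `setTimeout`-backed timer, not the actual `js-timer` code):

```js
// Illustrative only: drop the handler reference as soon as the timer settles
// or is cancelled, so the closure's captured context can be collected even if
// the Timer object itself stays referenced (e.g. from an array).
class Timer {
  constructor(handler, delay) {
    this.handler = handler;
    this.timeoutRef = setTimeout(() => {
      try {
        this.handler();
      } finally {
        delete this.handler; // free the closure once the timer is done
      }
    }, delay);
  }

  cancel() {
    clearTimeout(this.timeoutRef);
    delete this.handler; // also release on cancellation
  }
}
```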
I'm going to move on to checking out `js-quic` and see if the streams/connections are cleaned up properly.
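The kind of pattern I'm checking for looks roughly like this (hypothetical sketch, not js-quic's actual code):

```js
// Hypothetical sketch: a connection-level map has to drop its entry when a
// stream finishes, otherwise every stream ever opened stays reachable (and
// retained) through the connection.
class Connection {
  constructor() {
    this.streamMap = new Map();
  }

  addStream(id, stream) {
    this.streamMap.set(id, stream);
    stream.once('close', () => {
      // Without this, the map pins the stream and its buffers for the
      // lifetime of the connection.
      this.streamMap.delete(id);
    });
  }
}
```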
I found a leak in the `js-logger` `Logger` class. Each `Logger` can create a child logger with `getChild()`. We use this everywhere in our code; each new class gets a child logger. This is how we track where the logging is coming from.
However, each `Logger` tracks its parent and its children. This means the whole tree of child loggers forms a strongly connected set of references, referred to as an object group. The Node.js GC is smart enough to know that if this group detaches from a root reference then it can be deleted. HOWEVER, if a single reference is held to any object in this group then the whole group is retained. In this case, we usually keep at least one logger around for long-running objects.
I created a test script to demonstrate this.
"use strict";
const { default: Logger } = require('./dist/Logger');
function sleep(delay) {
return new Promise( r => setTimeout(r, delay));
}
async function main() {
console.log('starting');
await sleep(5000);
const root = new Logger();
let head = root;
for (let i = 0; i < 10000; i++) {
head = head.getChild(`Logger ${i}`);
}
console.log('created!');
await sleep(1000);
head = null;
console.log('deleted!')
await sleep(10000);
console.log('done!')
}
main();
Running this script with the inspector shows the following:
- The `Logger`s were holding 5MB of memory.
- The `Logger`s still exist in memory even though we only keep the root logger.

So if I modify `Logger` to only track the used `key`s and not the children, we should avoid the problem. We can see this realised by running the test after making the change.
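A minimal sketch of the structural change (the real `Logger` API has more to it; this only shows parents tracking keys instead of the child `Logger` instances themselves):

```js
// Sketch only, not the actual js-logger code: the parent records child keys,
// not child Logger objects, so an unreferenced child subtree can be collected
// even while the root logger is kept alive.
class Logger {
  constructor(key = 'root', parent) {
    this.key = key;
    this.parent = parent; // the child -> parent link on its own is fine
    this.childKeys = new Set(); // track keys rather than child Logger instances
  }

  getChild(key) {
    this.childKeys.add(key);
    return new Logger(key, this);
  }
}
```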
I'll push a patch to the logger in a moment.
Looking over `js-quic` again, I have two concerns.
I'm going to look over contexts now.
It looks like a `timedCancellable` function cleans up fully once the `PromiseCancellable` is unreferenced, so no real problem there. Note that the `Timer`, `EventTarget`, `AbortController` and the like are referenced by this `PromiseCancellable`. As an optimisation, I feel these could be released once the promise has resolved.
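A rough sketch of that optimisation (a stand-in class, not the real `PromiseCancellable` API):

```js
// Stand-in class to show the idea: drop the helper references once the
// promise settles, so a caller holding the wrapper no longer pins the
// AbortController (or any timers attached to it).
class CancellableOp {
  constructor(executor) {
    this.abortController = new AbortController();
    this.promise = new Promise((resolve, reject) => {
      executor(resolve, reject, this.abortController.signal);
    }).finally(() => {
      this.abortController = null; // release on settlement
    });
  }

  cancel() {
    this.abortController?.abort();
  }
}
```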
Some notes about `EventTarget`:
- If the `EventTarget` itself becomes unreachable, its registered handlers are collected along with it.
- While registered, a handler that is a closure retains `this` and the rest of its captured context within the closure.

So it's not the end of the world to not remove event handlers, but since `EventTarget`s tend to be long-lived, it's not ideal.
With that said, if you pass a signal into the `ctx` and don't free it at some point, then there's a chance you'll leak the handlers, and the timeout timer if one was created.
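A generic example of the failure mode and the fix (hypothetical helper, not the contexts code): if the abort listener isn't removed when the operation settles, a long-lived caller signal keeps it, and everything it closes over, alive.

```js
// Hypothetical helper: wire a caller-supplied signal and a timeout to an
// internal controller, and make sure both are released when the operation settles.
async function withTimeout(op, { signal, timeout = 1000 } = {}) {
  const controller = new AbortController();
  const onAbort = () => controller.abort();
  signal?.addEventListener('abort', onAbort, { once: true });
  const timer = setTimeout(() => controller.abort(), timeout);
  try {
    return await op(controller.signal);
  } finally {
    // Without these two lines a long-lived caller signal would keep retaining
    // `onAbort` (and everything it closes over), and the timeout would keep
    // running until its delay elapses.
    signal?.removeEventListener('abort', onAbort);
    clearTimeout(timer);
  }
}
```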
@amydevs can you deploy a new one based on the current PK now?
@tegefaulkes go to AWS and have a look.
And post another picture of the memory leak since we've had 24 hrs.
The PK CLI auto-deploys to the testnet on every commit to staging.
We can see that before we had 3% growth in 8 hrs; now it is 1% growth in 8 hrs.
If this stabilises over the weekend, then at the very least the memory leak is solved while idling, but we don't know if it could still occur in other interactions.
Previously we saw: (screenshot)
With the latest push we see: (screenshot)
I think this can be considered sufficiently solved in practice. Theoretically there are likely other memory leak issues to be detected; we need better instrumentation and observability systems to help us detect those. But I think we can close this issue.
Describe the bug
Testnet 6 has been deployed and we are getting some interesting results from the nodes: they no longer crash, but they are accumulating memory.
This is evidence of a memory leak. Once it reaches near 100% it will crash.
I recently saw this, and you can see it occurred during an automatic deployment, so it was not a crash.
But you can see the memory usage gets reset again and it will start growing.
Now we could try to track down the problem and directly solve it, but long-term this problem is likely to re-occur. We need a more robust way of detecting memory leaks in production over a long period of time.
I think to achieve this we can add some long-running instrumentation for memory usage.
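For instance, a hypothetical sketch of one such mechanism (my own illustration, not a committed design): periodically sample `process.memoryUsage()` so that slow growth shows up in logs or metrics long before a node crashes.

```js
// Hypothetical instrumentation sketch: sample process memory once a minute.
const interval = setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  console.log(
    `mem rss=${(rss / 1e6).toFixed(1)}MB ` +
    `heap=${(heapUsed / 1e6).toFixed(1)}/${(heapTotal / 1e6).toFixed(1)}MB ` +
    `external=${(external / 1e6).toFixed(1)}MB`,
  );
}, 60_000);
interval.unref(); // don't keep the process alive just for sampling
```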
To Reproduce
Expected behavior
It should just flatten out when idling.
Screenshots
See above.
Additional context
#412 - the idea of an HTTP status page is also useful as it can provide a more useful explanation too
#542
Notify maintainers