Open warner opened 2 years ago
I'm indeed observing a memory leak when observeIteration is following a generator that follows a Notifier. The following setup code:
const { updater, notifier } = makeNotifierKit(); // Y1

async function* getBalances() { // Y2
  let uprec = await notifier.getUpdateSince(); // read Y1
  while (uprec.updateCount) {
    yield uprec.value;
    uprec = await notifier.getUpdateSince(uprec.updateCount); // read Y1
  }
}

observeIteration(getBalances(), {}); // reads Y2
causes the Node.js process to grow by approximately 3.3 kB per loop (where each loop is an updater.updateState() followed by a setImmediate-based promise-queue flush).
The full test program is:
/* global setImmediate */
import '@endo/init';
import fs from 'fs';
import { makeNotifierKit, observeIteration } from '@agoric/notifier';

async function run() {
  const { updater, notifier } = makeNotifierKit(); // Y1

  async function* getBalances() { // Y2
    let uprec = await notifier.getUpdateSince(); // read Y1
    while (uprec.updateCount) {
      yield uprec.value;
      uprec = await notifier.getUpdateSince(uprec.updateCount); // read Y1
    }
  }

  observeIteration(getBalances(), {}); // reads Y2

  for (let amount = 0n; amount < 1_000_000n; amount += 1n) {
    if (amount % 10_000n === 0n) {
      const fn = `/proc/${process.pid}/status`;
      const status = fs.readFileSync(fn);
      for (const line of status.toString().split('\n')) {
        if (line.startsWith('VmRSS')) {
          const rss = /(\d+)/.exec(line)[0];
          console.log(`${amount.toString().padStart(9)} , ${rss.padStart(9)}`);
        }
      }
    }
    updater.updateState(amount); // write Y1
    await new Promise(res => setImmediate(res));
  }
}
run();
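One optional refinement, not in the original program: if sampling noise is a concern, a full collection can be forced before each RSS read. This assumes the harness is started with Node's --expose-gc flag, which defines global.gc.

```js
// Hypothetical refinement: force a GC before sampling so RSS reflects
// retained objects rather than garbage that simply hasn't been collected yet.
// Requires running the harness with `node --expose-gc`.
if (typeof global.gc === 'function') {
  global.gc();
}
```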
cc @erights
Here's the memory usage of 200k iterations of the above program with the current version of @endo/eventual-send:
iterations , memory(kiB)
0 , 42236
10000 , 121372
20000 , 148852
30000 , 185308
40000 , 231172
50000 , 234836
60000 , 311864
70000 , 299348
80000 , 340960
90000 , 385784
100000 , 396028
110000 , 406796
120000 , 508372
130000 , 514412
140000 , 524392
150000 , 534880
160000 , 572580
170000 , 664660
180000 , 765180
190000 , 709856
200000 , 719828
If I edit eventual-send to disable trackTurns() (by returning funcs unwrapped):
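A minimal sketch of what that edit might look like, assuming trackTurns receives the array of functions shown in the snippet further below (illustrative only, not the actual diff):

```js
// Hypothetical short-circuit for the experiment: skip the wrapping entirely,
// so no per-turn Error objects are created or chained.
export const trackTurns = funcs => funcs;
```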
then I get:
iterations , memory(kiB)
0 , 41816
10000 , 81920
20000 , 86572
30000 , 92832
40000 , 85632
50000 , 85244
60000 , 90000
70000 , 96016
80000 , 85668
90000 , 87012
100000 , 93060
110000 , 87524
120000 , 86664
130000 , 90056
140000 , 96296
150000 , 87336
160000 , 87944
170000 , 93220
180000 , 88416
190000 , 87904
200000 , 90060
which is much better. I don't trust that it's flat yet (I ran it for 1M iterations and the final value was 89644 KiB), but if there's still a leak, then it's not more than a few bytes per cycle.
Commenting out the line that retains sendingError is also sufficient to stop the leak:
return funcs.map(
  func =>
    func &&
    ((...args) => {
      hiddenPriorError = sendingError; // <--- comment out this
      hiddenCurrentTurn += 1;
      hiddenCurrentEvent = 0;
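To see why that one assignment matters, here is a small illustration (not endo code) of the retention pattern it creates: each turn's diagnostic Error keeps a reference to the previous turn's Error, so the whole history stays reachable.

```js
// Illustration only: an ever-growing chain of "prior error" references.
// Imagine this body running once per delivery; after N turns the oldest
// Error is still reachable from the newest one, so none of them can be
// garbage collected.
let priorError;
for (let turn = 0; turn < 3; turn += 1) {
  const err = Error(`Event: ${turn}`);
  err.priorError = priorError; // analogous to hiddenPriorError = sendingError
  priorError = err;
}
```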
I experimented with disabling other aspects without success: disabling the assert.note calls, and removing the catch added on line 81. None of those changes made the leak go away.
Thanks for tracking this down!
I don't think it is worth creating a version of trackTurns in which this one line is disabled. Rather:

> If I edit eventual-send to disable trackTurns() (by returning funcs unwrapped):
That is the correct solution for production and performance testing. It can even default to off, but we need to find a way to make it switchable, so we can turn it on during development when we need to. (Better would be streaming this info to a logger that either logs to something external or drops things, rather than keeping diagnostics in RAM.)
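A rough sketch of one possible switch, assuming an environment variable read at module load (names here are illustrative, not the eventual endo API):

```js
// Hypothetical shape for a switchable trackTurns. `wrapWithTurnTracking`
// stands in for the existing per-turn wrapping logic in track-turns.js;
// it is not a real endo export.
const TRACK_TURNS_ENABLED =
  typeof process !== 'undefined' && process.env.TRACK_TURNS === 'enabled';

export const trackTurns = funcs =>
  TRACK_TURNS_ENABLED ? wrapWithTurnTracking(funcs) : funcs;
```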
Lacking either mechanism, we can just turn it off for now. Then, when we find we need it to track down a bug, we can edit source code.
cc @kriskowal since this change is in eventual-send, I think we'll need an Endo release to resolve it
> cc @kriskowal since this change is in eventual-send, I think we'll need an Endo release to resolve it
Not really. Someone (you, @turadg, ??) reminded me we can use the patch mechanism in the meantime. I'll try to get two PRs to you shortly. One in agoric-sdk using patch for immediate repair, and one for endo.
The agoric-sdk patch PR will have no switching mechanism. I'll leave anything that complex for the endo PR. The agoric-sdk PR will just switch it off. It'll then stay off until a future endo release gives us a switch for turning it back on.
Sounds good and sounds like we should run, not walk, to writing traces to a ring buffer that an external aggregator can periodically flush, in order to turn these back on maybe someday.
> Sounds good and sounds like we should run, not walk, to writing traces to a ring buffer that an external aggregator can periodically flush, in order to turn these back on maybe someday.
Any idea how we could do this quickly? Since I have no such plan, the endo PR I'll have for you soon is much less ambitious. But, as always, I'd rather do the right thing than invest further in stopgap engineering until we do the right thing. But we do need this quickly and the stopgap will be small.
In order to break the chain of indefinite retention, it may be sufficient to just do the ring buffer and worry about collection later. For now, it may be sufficient to retain a fixed size array in which we store errors, and a parallel array tracking the sequence number of the parent error, such that we can construct a deep-but-not-that-deep stack trace on demand, showing elisions for ancestors that were overwritten. So, the index of an error would be the sequence number modulo the size of the ring buffer, unless the error in that position does not have a matching sequence number.
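A rough sketch of that ring buffer, under the assumptions above (fixed-size parallel arrays, sequence numbers as identities, elision when a slot has been overwritten); the names are illustrative, not the endo implementation:

```js
// Illustrative ring buffer for turn-boundary errors (not the endo code).
const RING_SIZE = 1000;
const errors = new Array(RING_SIZE); // error stored at seq % RING_SIZE
const seqOf = new Array(RING_SIZE); // which sequence number occupies each slot
const parentOf = new Array(RING_SIZE); // sequence number of that error's parent
let nextSeq = 0;

const rememberError = (err, parentSeq) => {
  const seq = nextSeq;
  nextSeq += 1;
  const slot = seq % RING_SIZE;
  errors[slot] = err; // overwriting releases the oldest entry for collection
  seqOf[slot] = seq;
  parentOf[slot] = parentSeq;
  return seq;
};

const lookupError = seq => {
  const slot = seq % RING_SIZE;
  // If a newer error has taken this slot, the ancestor is gone: elide it.
  return seqOf[slot] === seq ? errors[slot] : undefined;
};

// Build a deep-but-not-that-deep stack on demand, eliding overwritten ancestors.
const deepStack = seq => {
  const frames = [];
  for (let s = seq; s !== undefined; ) {
    const err = lookupError(s);
    if (err === undefined) {
      frames.push('... older turns elided ...');
      break;
    }
    frames.push(err.stack);
    s = parentOf[s % RING_SIZE];
  }
  return frames.join('\nCaused by:\n');
};
```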
I like that plan, but it is more than I wanted to do right now. After I do the agoric-sdk patch PR https://github.com/Agoric/agoric-sdk/pull/5892, may I assign the endo part to you?
I assigned both of us. We can coordinate from there.
@erights has since implemented a ring buffer for async stacks, possibly alleviating this concern. Has the fix already been integrated and verified? https://github.com/endojs/endo/pull/1513
What is the Problem Being Solved?
While investigating #5671, @erights identified something to look into. HandledPromise does some clever tricks to retain enough information for a "deep stack" to be displayed in case of an error: when one E() call triggers another, an Error object is created (to capture a stack trace) and held internally by the new Promise.

There's a portion of vat-bank which wraps a Notifier in a generator to provide an iterator of new balances:
https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/vats/src/vat-bank.js#L42-L52
This is wrapped by the virtual purse to eagerly update a local lastBalance variable each time the balance changes:
https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/vats/src/virtual-purse.js#L105-L116
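For orientation, the shape of that wrapper is roughly as follows (a paraphrase, not the actual virtual-purse.js source; the observer method name follows the notifier package's iteration-observer convention, and trackLastBalance is a hypothetical name):

```js
import { observeIteration } from '@agoric/notifier';

// Rough paraphrase: observe the balance iterable and eagerly cache the most
// recent value so it can be read synchronously later.
const trackLastBalance = balances => {
  let lastBalance;
  observeIteration(balances, {
    updateState: newBalance => {
      lastBalance = newBalance; // updated on every balance change
    },
  });
  return () => lastBalance;
};
```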
observeIteration is defined by our notifier package:
https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/notifier/src/asyncIterableAdaptor.js#L137-L139
and uses observeIterator:
https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/notifier/src/asyncIterableAdaptor.js#L101-L124
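The relevant shape of observeIterator, paraphrased (not the actual asyncIterableAdaptor.js source), is a self-recursive loop in which each next() request is issued from inside the settlement of the previous one:

```js
import { E } from '@endo/far';

// Paraphrase only: every E(iterator).next() happens inside the .then of the
// previous one, so each promise (and any diagnostic Error attached to it)
// looks like a causal descendant of the one before.
const observeIteratorSketch = (iterator, iterationObserver) => {
  const recur = () =>
    E(iterator)
      .next()
      .then(({ done, value }) => {
        if (done) {
          iterationObserver.finish && iterationObserver.finish(value);
        } else {
          iterationObserver.updateState && iterationObserver.updateState(value);
          recur();
        }
      });
  recur();
};
```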
This composition probably creates a series of promises over time (one per balance update), but might do so in a fashion that makes each promise look like a causal descendant of the previous one. If so, the deep-stack code might be retaining an unbounded series of Error objects, causing memory usage to grow linearly with the number of updates, when it ought to be flat.
The first step is to write a small test harness that emulates the vat-bank loop, and run it a few million times, while watching the memory usage of the process.
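If the harness needs to run somewhere without /proc, one portable alternative for the memory check is Node's process.memoryUsage(); the loop-counter name in the usage comment is a placeholder from the harness above.

```js
// Portable alternative to reading /proc/<pid>/status (which is Linux-only):
// sample resident set size directly from Node's own accounting.
const sampleRssKiB = () => Math.round(process.memoryUsage().rss / 1024);

// e.g. inside the harness loop, in place of the VmRSS parsing:
// console.log(`${amount.toString().padStart(9)} , ${String(sampleRssKiB()).padStart(9)}`);
```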