Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

does HandledPromise's deep-stack support cause memory problems when used in iterator/generators? #5886

Open warner opened 2 years ago

warner commented 2 years ago

What is the Problem Being Solved?

While investigating #5671, @erights identified something to look into. HandledPromise does some clever tricks to retain enough information for a "deep stack" to be displayed in case of an error: when one E() call triggers another, an Error object is created (to capture a stack trace) and held internally by the new Promise.

There's a portion of vat-bank which wraps a Notifier in a generator to provide an iterator of new balances:

https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/vats/src/vat-bank.js#L42-L52

This is wrapped by the virtual purse to eagerly update a local lastBalance variable each time the balance changes:

https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/vats/src/virtual-purse.js#L105-L116

observeIteration is defined by our notifier package:

https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/notifier/src/asyncIterableAdaptor.js#L137-L139

and uses observeIterator:

https://github.com/Agoric/agoric-sdk/blob/15234fa723c1e1a14843eff154b38439b1cdb1f9/packages/notifier/src/asyncIterableAdaptor.js#L101-L124

This composition probably creates a series of promises over time (one per balance update), but might do so in a fashion that makes each promise look like a causal descendant of the previous one. If so, the deep-stack code might be retaining an unbounded series of Error objects, causing memory usage to grow linearly with the number of updates, when it ought to be flat.
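
To make the suspected retention concrete, here is a minimal hypothetical sketch (not the actual eventual-send internals) of how a per-turn diagnostic Error chain can grow without bound when each update is treated as a causal descendant of the previous one:

// Hypothetical sketch only: each "turn" captures an Error for diagnostics
// and links it to the Error captured by the previous turn.
let priorError;

function recordTurn(label) {
  const err = Error(`turn: ${label}`); // creating an Error records a stack trace
  err.priorError = priorError; // link back to the previous turn's diagnostics
  priorError = err; // ...so no older Error is ever released
  return err;
}

// One retained Error (and its captured stack) per balance update:
for (let i = 0; i < 1_000_000; i += 1) {
  recordTurn(`balance update ${i}`);
}
// priorError now transitively retains a million Error objects.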

The first step is to write a small test harness that emulates the vat-bank loop, and run it a few million times, while watching the memory usage of the process.

warner commented 2 years ago

I'm indeed observing a memory leak when observeIteration is following a generator that follows a Notifier. The following setup code:

  const { updater, notifier } = makeNotifierKit(); // Y1

  async function *getBalances() { // Y2
    let uprec = await notifier.getUpdateSince(); // read Y1
    while (uprec.updateCount) {
      yield uprec.value;
      uprec = await notifier.getUpdateSince(uprec.updateCount); // read Y1
    }
  }

  observeIteration(getBalances(), {}); // reads Y2

causes the Node.js process to grow by approximately 3.3 kB per loop (where each loop is an updater.updateState() call followed by a setImmediate-based promise-queue flush).

The full test program is:

/* global setImmediate */
import '@endo/init';
import fs from 'fs';
import { makeNotifierKit, observeIteration } from '@agoric/notifier';

async function run() {
  const { updater, notifier } = makeNotifierKit(); // Y1

  async function *getBalances() { // Y2
    let uprec = await notifier.getUpdateSince(); // read Y1
    while (uprec.updateCount) {
      yield uprec.value;
      uprec = await notifier.getUpdateSince(uprec.updateCount); // read Y1
    }
  }

  observeIteration(getBalances(), {}); // reads Y2

  for (let amount = 0n; amount < 1_000_000n; amount += 1n) {
    if (amount % 10_000n === 0n) {
      const fn = `/proc/${process.pid}/status`;
      const status = fs.readFileSync(fn);
      for (let line of status.toString().split('\n')) {
        if (line.startsWith('VmRSS')) {
          const rss = /(\d+)/.exec(line)[0];
          console.log(`${amount.toString().padStart(9)} , ${rss.toString().padStart(9)}`);
        }
      }
    }
    updater.updateState(amount); // write Y1
    await new Promise(res => setImmediate(res));
  }
}
run();

cc @erights

warner commented 2 years ago

Here's the memory usage of 200k iterations of the above program with the current version of @endo/eventual-send:

iterations ,  memory(kiB)
        0 ,     42236
    10000 ,    121372
    20000 ,    148852
    30000 ,    185308
    40000 ,    231172
    50000 ,    234836
    60000 ,    311864
    70000 ,    299348
    80000 ,    340960
    90000 ,    385784
   100000 ,    396028
   110000 ,    406796
   120000 ,    508372
   130000 ,    514412
   140000 ,    524392
   150000 ,    534880
   160000 ,    572580
   170000 ,    664660
   180000 ,    765180
   190000 ,    709856
   200000 ,    719828

If I edit eventual-send to disable trackTurns() (by returning funcs unwrapped):

https://github.com/endojs/endo/blob/1895f5b0172cadacb4fc54b9f64c7e0c8314d041/packages/eventual-send/src/track-turns.js#L42
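
Concretely, the edit amounts to a pass-through, roughly like this (a sketch of the temporary change, not the exact diff):

// sketch of the temporary edit: return the functions unwrapped, so no
// per-turn Error is captured or chained
export const trackTurns = funcs => funcs;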

then I get:

iterations ,  memory(kiB)
        0 ,     41816
    10000 ,     81920
    20000 ,     86572
    30000 ,     92832
    40000 ,     85632
    50000 ,     85244
    60000 ,     90000
    70000 ,     96016
    80000 ,     85668
    90000 ,     87012
   100000 ,     93060
   110000 ,     87524
   120000 ,     86664
   130000 ,     90056
   140000 ,     96296
   150000 ,     87336
   160000 ,     87944
   170000 ,     93220
   180000 ,     88416
   190000 ,     87904
   200000 ,     90060

which is much better. I don't trust that it's flat yet (I ran it for 1M iterations and the final value was 89644 KiB), but if there's still a leak, then it's not more than a few bytes per cycle.

warner commented 2 years ago

Commenting out the line that retains sendingError is also sufficient to stop the leak:

https://github.com/endojs/endo/blob/1895f5b0172cadacb4fc54b9f64c7e0c8314d041/packages/eventual-send/src/track-turns.js#L60

  return funcs.map(
    func =>
      func &&
      ((...args) => {
        hiddenPriorError = sendingError; // <--- comment out this
        hiddenCurrentTurn += 1;
        hiddenCurrentEvent = 0;

I experimented with disabling other aspects without success: disabling the assert.note calls, and removing the catch added on line 81. None of those changes made the leak go away.

erights commented 2 years ago

Thanks for tracking this down!

I don't think it is worth creating a version of trackTurns in which this one line is disabled. Rather:

> If I edit eventual-send to disable trackTurns() (by returning funcs unwrapped):

That is the correct solution for production and performance testing. It can even default to off, but we need to find a way to make it switchable, so we can turn it on during development when we need to. (Better still would be streaming this info to a logger that either sends it somewhere external or drops it, rather than keeping diagnostics in RAM.)

Lacking either mechanism, we can just turn it off for now. Then, when we find we need it to track down a bug, we can edit source code.
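
One possible shape for such a switch, as a rough sketch (the environment-variable name and the wrapWithTurnTracking helper are placeholders here, not an existing endo API):

// Hypothetical sketch of an opt-in switch; DEBUG_TRACK_TURNS is an assumed
// name, and wrapWithTurnTracking stands in for the existing wrapping logic
// in track-turns.js.
const trackingEnabled =
  typeof process !== 'undefined' &&
  process.env.DEBUG_TRACK_TURNS === 'enabled';

export const trackTurns = funcs =>
  trackingEnabled ? wrapWithTurnTracking(funcs) : funcs; // default off: nothing retained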

warner commented 2 years ago

cc @kriskowal since this change is in eventual-send, I think we'll need an Endo release to resolve it

erights commented 2 years ago

> cc @kriskowal since this change is in eventual-send, I think we'll need an Endo release to resolve it

Not really. Someone (you, @turadg, ??) reminded me we can use the patch mechanism in the meantime. I'll try to get two PRs to you shortly: one in agoric-sdk using a patch for the immediate repair, and one for endo.

erights commented 2 years ago

The agoric-sdk patch PR will have no switching mechanism. I'll leave anything that complex for the endo PR. The agoric-sdk PR will just switch it off. It'll then stay off until a future endo release gives us a switch for turning it back on.

kriskowal commented 2 years ago

Sounds good and sounds like we should run, not walk, to writing traces to a ring buffer that an external aggregator can periodically flush, in order to turn these back on maybe someday.

erights commented 2 years ago

> Sounds good and sounds like we should run, not walk, to writing traces to a ring buffer that an external aggregator can periodically flush, in order to turn these back on maybe someday.

Any idea how we could do this quickly? Since I have no such plan, the endo PR I'll have for you soon is much less ambitious. But, as always, I'd rather do the right thing than keep investing in stopgap engineering. That said, we do need this quickly, and the stopgap will be small.

kriskowal commented 2 years ago

In order to break the chain of indefinite retention, it may be sufficient to just do the ring buffer and worry about collection later. For now, we could retain a fixed-size array in which we store errors, plus a parallel array tracking the sequence number of each error's parent, so that we can construct a deep-but-not-that-deep stack trace on demand, showing elisions for ancestors that were overwritten. The index of an error would be its sequence number modulo the size of the ring buffer, and a slot whose stored sequence number no longer matches indicates that the ancestor has been overwritten.
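
A rough sketch of that shape (names are placeholders, not a proposed endo API): errors land at index sequence % size, with a parallel array of sequence numbers so an overwritten ancestor shows up as an elision rather than a stale error:

const makeErrorRingBuffer = (size = 128) => {
  const errors = new Array(size); // most recent errors, indexed by seq % size
  const seqs = new Array(size); // sequence number actually stored in each slot
  const parents = new Array(size); // sequence number of each error's causal parent
  let nextSeq = 0;

  const record = (error, parentSeq = undefined) => {
    const seq = nextSeq;
    nextSeq += 1;
    const i = seq % size;
    errors[i] = error;
    seqs[i] = seq;
    parents[i] = parentSeq;
    return seq;
  };

  // Reconstruct a deep-but-not-that-deep stack on demand, eliding ancestors
  // whose slots have been overwritten by newer errors.
  const getDeepStack = seq => {
    const parts = [];
    let s = seq;
    while (s !== undefined) {
      const i = s % size;
      if (seqs[i] !== s) {
        parts.push('... (older turns overwritten)');
        break;
      }
      parts.push(errors[i].stack);
      s = parents[i];
    }
    return parts.join('\nCaused by:\n');
  };

  return { record, getDeepStack };
};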

erights commented 2 years ago

I like that plan, but it is more than I wanted to do right now. After I do the agoric-sdk patch PR https://github.com/Agoric/agoric-sdk/pull/5892, may I assign the endo part to you?

erights commented 2 years ago

I assigned both of us. We can coordinate from there.

kriskowal commented 1 year ago

@erights has since implemented a ring buffer for async stacks, possibly alleviating this concern. Has the fix already been integrated and verified? https://github.com/endojs/endo/pull/1513