improve devex for: during prepare in upgrade, calls to the Zoe from ZCF don't resolve

turadg commented 1 year ago

Describe the bug

If the contract's prepare function awaits a call to the Zoe service from its ZCF, the upgrade fails.

To Reproduce

Make coveredCallDurable-V3.js start as such,

const prepare = async (zcf, _privateArgs, instanceBaggage) => {
  const firstTime = !instanceBaggage.has('DidStart');
  if (firstTime) {
    instanceBaggage.init('DidStart', true);
  }
  const upgraded = firstTime ? 'V3 ' : 'V3 upgraded ';
  console.log('DEBUG coveredCall prepare', upgraded);
  console.log('DEBUG', await E(zcf.getZoeService()).getFeeIssuer());

Its test will fail with vat-upgrade failure and this console output,

error during vat dispatch() of startVat,[object Object] buildRootObject unresolved

Expected behavior

Calling a zoe service method in prepare doesn't break upgrade.

warner commented 1 year ago

All vats must be able to finish their upgrade without contacting other vats. There might be messages queued inbound to the vat being upgraded, and we can't safely deliver those messages until the upgrade is complete. The kernel can't tell which external messages are needed for upgrade, vs which are new work that need to be delayed until upgrade is finished, so our rule is that buildRootObject() must be standalone.

(The analogy is that a shop closes its doors to take inventory or update training manuals or fix equipment. There are customers lined up outside, waiting to enter, but the doors are closed so they can't. The shop does not open its doors again until they're fully ready to accept customers. If their equipment fixes need a new part, they can't let the new-part-delivery-person in through the same doors because that would let customers in too. The requirement is that the shop has all the parts that it needs before it starts the upgrade.)

Ok that analogy implies that upgrades need new parts, and that's not how we designed the Zoe/ZCF/contract upgrade dance: all the defineKind calls are supposed to happen pretty statically, and synchronously.

Also, if a prepare depends on something it retrieved from Zoe, it should only be fetching that object during the initial incarnation.. during subsequent upgrades, it should be using a value that it stashed away in the baggage. It's possible that the pattern causing this issue is actually suffering from another problem, where every upgrade might replace an existing object with a new one, when it ought to keep re-using the initial one.

Chris-Hibbert commented 1 year ago

@warner How does the kernel enforce that "All vats must be able to finish their upgrade without contacting other vats"? What we're seeing is that the upgrade simply hangs and the upgrade fails with an unresolved promise.

Should zoe/zcf have a state flag and reject calls made too early? I don't expect contract authors to be able to detect and diagnose this as a failure in their code.

erights commented 1 year ago

Expected behavior

Calling a zoe service method in prepare doesn't break upgrade.

Calling it shouldn't break upgrade. awaiting a response before all prepares are complete should break upgrade.

What we're seeing is that the upgrade simply hangs and the upgrade fails with an unresolved promise.

Hanging is bad. If the first crank completes without all prior exo behaviors being prepared, the upgrade should fail with a diagnostic that makes the problem clear. The vat being upgraded from should remain in the state of still being suspended and needing to be upgraded from.

Should zoe/zcf have a state flag and reject calls made too early?

I don't think zoe or zcf should have any additional mechanism. They should "simply" obey the constraint: that all prior behaviors are prepared during the first crank.

turadg commented 1 year ago

the upgrade should fail with a diagnostic that makes the problem clear

So far it doesn't. It just logs,

error during vat dispatch() of startVat,[object Object] buildRootObject unresolved
Logging sent error stack (RemoteError#1)
RemoteError#1: vat-upgrade failure

Is there any issue for devex around upgrades?

They should "simply" obey the constraint: that all prior behaviors are prepared during the first crank.

Could something observe whether that's true?

Meanwhile, the constraint is not obvious. I hope seeing https://github.com/Agoric/agoric-sdk/pull/7096/commits/a7f9e14e9f86b937f61fae43ab872d2dd32cd1eb used in contracts will help socialize the need.

Chris-Hibbert commented 1 year ago

if the ... DEBUG, await ... line above is replaced with

  E.when(E(zcf.getZoeService()).getFeeIssuer(), fi => {
    console.log(' DEBUG', firstTime, fi);
  });

the test passes and correctly prints DEBUG false Object [Alleged: ZDEFAULT issuer] {} and then DEBUG true Object [Alleged: ZDEFAULT issuer] {}.

I'm less unhappy with a resolution that prepare() can't await an external message at the top-level.

warner commented 1 year ago

@warner How does the kernel enforce that "All vats must be able to finish their upgrade without contacting other vats"? What we're seeing is that the upgrade simply hangs and the upgrade fails with an unresolved promise.

Liveslots has a function named startVat, which first builds a Set of all KindIDs from the previous incarnation (one per KindHandle), by reading them from the vatstore database. Then it invokes the user-provided buildRootObject() function (in this case, ZCF), which gets as many turns as it wants to perform initialization/upgrade work. While it does that, it should call defineKind, which has code to remove items from this set as userspace re-attaches behavior to each one. Liveslots does an await on the Promise returned by buildRootObject: after the await it checks to see if Set is empty and throws an error if not. Basically:

async function startVat() {
  const required = fetchKindIDs();
  await buildRootObject();
  if (required.size) {
    throw Error(`unfulfilled KindIDs: call defineKind() for all of them`);
  }

startVat() is called by a funky little wrapper which effectively does:

  let complete = false;
  p = Promise.resolve().then(() => startVat());
  p.finally(() => (complete = true);
  await waitUntilQuiescent();
  return complete ? p : Promise.reject('buildRootObject unresolved');

This means userspace has until its buildRootObject() promise resolves to fulfill its obligations. (This is a tighter constraint than I claimed earlier: I was misremembering an alternate way we could have implemented this, which would have given userspace a whole crank).

waitUntilQuiescent() is a special function which gets to use setImmediate to wait for the vat to become idle (a power withheld from userspace). If userspace returns a buildRootObject() promise which is blocked by a pending off-vat message, then it won't resolve by the time the vat becomes idle, which means startVat's promise won't either. We'll still reach the last line of the wrapper, but complete = false, and the wrapper will signal an error back to the kernel, which will declare vat startup to have failed (the parent vat might get the buildRootObject unresolved message, or something more generic, I forget how forthcoming we decided to be with the error details).

If userspace's promise rejects, then startVat's promise rejects, and the wrapper returns a rejected promise, and the kernel will get the same "vat startup failed" error signal (but with a different message).

If userspace's promise fulfills before the vat goes idle, then the epilogue of startVat gets to run, which checks the Set, and either rejects because of unfulfilled KindID obligations, or resolves. A rejection will cause the kernel to see the same "vat startup failed" signal, but with a message from liveslots instead of the userspace code.

As a side note, the "buildRootObject must finish promptly" rule was harder to violate in the early days, because the vat didn't initially have any Presences to send messages to. buildRootObject creates the initial receiver, but the vat didn't have any imports until after somebody invoked a method on the root object and decided to pass in the first reference. This changed twice, once when we allowed vatParameters to include object references (originally it could only hold flat/pure data), and again when we introduced vat upgrade and baggage (which allows the first incarnation to send Presences to the second incarnation that are available early, during buildRootObject).

These days, both pathways make it possible for code invoked from buildRootObject to send messages off-vat. However, both for clarity and to obey the "buildRootObject must be prompt" rule, it's a good idea to refrain from sending any messages during buildRootObject, even if the specific requirement is just that you dont await them.

warner commented 1 year ago

One simple/silly idea would be to change the buildRootObject unresolved error message to something like buildRootObject unresolved: are you waiting for an off-vat message? you can't.

Agoric / agoric-sdk