liveslots: reject too many imported promises

warner commented 2 years ago

What is the Problem Being Solved?

Part of the memory consumption that can be inflicted upon an innocent vat is when the inbound dispatch.deliver or dispatch.notify contains promise vrefs (vpids) in the arguments or resolution data. Liveslots creates actual Promise objects for each of these, passes them to userspace, and remembers their resolve/reject functions in the importedVPIDs table.

Each promise incurs memory consumption within the vat worker, which cannot be shed until it receives a dispatch.notify and the Promise can be settled. At that point, the resolved promise loses its identity, and liveslots is allowed to stop tracking it. If userspace does not retain a reference, the Promise can then be garbage-collected by the engine.

The problem is that an attacker might send messages with spurious extra arguments, which contain vpids, to provoke this memory consumption. The victim vat is not looking for extra arguments (or extra properties on normal arguments): it could, but it would be pretty awkward (walking the whole argument object graph). And even if it did that, the memory consumption is triggered when liveslots deserializes the data, so by the time userspace gets them, it is too late.

The same is true for promises that appear in resolution data (when syscall.notify includes capdata with new vpids in slots), but it's even more awkward for userspace to be on the lookout for these. It would have to subscribe to every promise it ever sees, just to learn about even more promises in the resolution.

As a general design principle, we prefer to get promises from a service over passing them in. And when we do pass a promise in, the intention is to use pipelining to save roundtrips, so we expect the promise to resolve soon, possibly even on the vat to which we're sending the message.

So in normal usage, we don't expect these argument-carried promises to remain unresolved for very long. This points to a potential defensive mechanism (suggested by @dtribble): allow liveslots to unilaterally reject() argument promises that have remained unresolved for too long.

Using a timeout is troublesome (unpredictable/nondeterministic, for one), but a similar effect can be had by declaring a maximum count of unresolved argument promises, building a length-limited FIFO, pushing promises onto the FIFO upon receipt, and rejecting them if they reach the far end. Dean suggested a limit of 10k promises.

Spurious Promises which were sent through extra (ignored) arguments will have been delivered to userspace by liveslots, but userspace never even assigned them to a name, so nothing will have called .then, so when liveslots rejects them, nobody will notice. At that time, liveslots can drop them from the importedPromises table, and the memory obligation goes away.

This allows a smaller attack against an innocent somewhat-long-lived argument promise: if the attacker can push through 10k promises before the innocent one can be resolved, it will be unfairly rejected, and some userspace operation will fail. For this reason, I don't really want this threshold to be a hard-coded property of liveslots. I want vats to opt-in to this mechanism. I'm thinking that the entry-point file, which normally only exports buildRootObject, could optionally export a const vatSettings = { unresolvedArgumentPromiseLimit : 10_000 } or something. Another option is for the createVat options to include the setting, and have Zoe choose the number. I'm not yet sure how to allow contracts to inform Zoe what value they want. And maybe the limit should be configurable after vat creation, either from within the vat (vatPowers.setLimit(..)) or from outside (E(adminFacet).setLimit(..));

Internally, when liveslots rejects the imported Promise, it needs to make sure the unilateral action doesn't cause confusion with the kernel later. Liveslots will basically synthesize a fake dispatch.notify() for the promise, which will delete the slotToVal/valToSlot entry, shedding the identity. The kernel doesn't know that, so the entry will remain in the vat's c-list, and the vat remains subscribed to hear about the resolution. If that ever actually happens, the kernel will deliver a (real / duplicate) dispatch.notify(). However liveslots will ignore this, because this could already happen in two other situations:

Vat version-1 receives a promise, the vat is upgraded, vat version-2 receives the dispatch.notify but doesn't have a slotToVal entry for it.
- note that the DurablePromiseWatcher feature may react to the vpid anyways
The vat does syscall.send and uses e.g. p+4 as the result= vpid, so it does syscall.subscribe to register an interest in the result. Then someone else sends it a dispatch.deliver with result='p+4', which returns decision making authority back into the vat. The kernel doesn't pay attention to that fact, so the vat remains subscribed. If/when the vat uses syscall.resolve to tell the kernel about its decision, the kernel will turn around and send a dispatch.notify to fulfill its notification duties.

An alternative to simply allowing the duplicate notify would be to add a new syscall, maybe syscall.dropImportedPromise. This would tell the kernel to drop the c-list entry, and to remove the vatID from the subscriptions list. (Note this is more than what a syscall.unsubscribe would do, which would leave the c-list in place, allowing the vat to talk about the promise in later outbound messages). This approach would save the c-list entry and a useless delivery later, but it would not allow the kernel DB promise table entry to be removed unless/until we add better refcounting for promises to the kernel.

Description of the Design

TBD, but basically change liveslots to:

build a FIFO structure with O(1) .length, push (append), shift (left-deletion), and interior-deletion (when a Promise is resolved, so we don't retain the Promise memory for a long time)
push Promises (mainly the reject function) on when deserializing argument capdata in dispatch.deliver (deserialize, then walk .slots looking for vpids, look up their reject controls in importedVPIDs)
if that causes the FIFO to exceed 10k, shift and reject until we're back down below the threshold. Do whatever we need to do (maybe nothing) to avoid confusion later if/when the real dispatch.notify arrives
if a Promise is resolved/rejected early, remove the item from the FIFO, or nullify it (maybe a circular buffer and a promise->index mapping table??)
add control APIs to set (and maybe change) the limit, some subset of:
- an extra vatSettings export from the vat, to keep the configuration close to the code
- vatPowers.setLimit() to make it adjustable from within
- DynamicVatOptions so vatAdminService~.createVat() can initialize it from outside
- adminFacet~.setLimit() so the creator can adjust it from outside

Security Considerations

This gives attackers a limited ability to surprisingly/arbitrarily reject other people's promises, in exchange for removing their ability to kill the entire vat by way of accumulated promises. If other rate-limiting mechanisms do not prevent it, they may be able to target specific promises for rejection, however their chances are much lower if the target promise is resolved promptly (i.e. the sender was sending an already-resolved promise, which will appear as a syscall.send and a syscall.resolve in the same delivery crank, and would only become separated later on if priority scheduling cause them to "get on different trains").

If the limit is configurable, anyone with that configuration control could dial it down to zero (or very low) any time they wanted, rejecting any promises in the FIFO. So this control should probably only be exposed to parties that already control the vat, i.e. the adminFacet used to create the vat originally.

Deciding what limit to use will be a tradeoff between the chances of accidentally triggering it, and the size of the memory-consumption attack left exposed.

Tracking promises in arguments, but not in message results or promise resolution data, leaves an attack vector open. However this can only be utilized by attackers to whom the target vat has decided to send a message. It isn't very common for our contracts to send message to user-provided objects (especially since many of those objects live off-chain in an ag-solo, and those are not considered to be as reliable or available as on-chain objects). So this threat can be avoided by userspace behavior, unlike the argument-borne promises.

Test Plan

Bunch of unit tests.

Alternatives

Another approach would be to have a "flush" call, exposed by liveslots (maybe on adminFacet, somehow), which told liveslots to reject all currently-unresolved inbound promises. The idea would be that we watch the memory usage of the vat, and if it gets too large (and our analysis shows that someone has been attacking it, and it holds a lot of spurious promises), then some kind of governance action is used to flush the junk.

This would be more predictable than a FIFO or time-based mechanism. It would be a temporary solution (the attacker could just keep doing whatever they were doing to re-fill the liveslots tables with more junk), but we could perhaps look carefully at the list of unresolved vpids for innocent victims and refrain from calling flush() if any existed. Or, the flush() call could take a list of vpids to preserve (i.e. not reject), and the governance proposal could avoid the collateral damage. OTOH, depending upon how fast the attack was consuming memory, the governance action might not arrive fast enough to save the vat from the host OS's OOM killer.

mhofman commented 2 years ago

Thinking more about this, it's only a problem because resolvers hold strongly onto their associated promise instance. This is arguably a JS engine implementation shortcoming that doesn't have be (the spec does not require this).

If the promise wasn't held strongly by the resolvers, a userland not interested in the promise could simply drop it, and we could unsubscribe when it gets collected (noticed through FinalizationRegistry.register(importedPromise)).

I have on my todo to write a userland promise which is not subject to this and other memory leaks, show what engines ought to be able to implement themselves.

mhofman commented 2 years ago

One thing I realized after talking to @michaelfig and @warner is that according to the definition of liveness in the spec, a promise with reactions (either explicit .then() call, or internal through await) is not technically live if the user drops all references after registering the reaction. For a FinalizationRegistry approach to work, we would need the implementation (possibly through a normative spec change) to keep the promise instance live while there are reactions pending.

warner commented 2 years ago

Not PSO: without mailbox access, new promises cannot be created by off-chain attackers in liveslots.

warner commented 2 years ago

Would be obviated by liveslots being able to track promises being .thened, and unsubscribing if not.

mhofman commented 2 years ago

Created an issue for the alternative of noticing promises being dropped by userland. We still need an issue for the last alternative of noticing then and await on imported promises. Both approaches are mentioned in the issue describing the root memory consumption problem.

Agoric / agoric-sdk