Cope with asynchrony in remote retirement of promise IDs

Background

When the kernel delivers a notify to a vat, informing the vat about the resolution of a promise it had previously imported, an implicit consequence is that the vat's identifier for that promise is retired. That is, the vat's c-list entry for the promise is deleted and the vat and kernel jointly agree never to make reference to this identifier in future interactions. In vats implemented with liveslots, the vat assures its compliance with this agreement by deleting entries in its internal tables that map between the (now retired) promise ID and the actual underlying JavaScript Promise object, so that any such future references become impossible.

Sometimes, when a promise is resolved, the value it is resolved to may contain references to other promises that are themselves now resolved. When the kernel delivers the notify for this, these subsidiary promises are included in the notification as additional promise resolutions. It's possible these promises had previously been known to the vat but retired when they were originally resolved, or they could be entirely new to the vat. In either case they are treated as new: the subsidiary promises are imported (or possibly reimported) under new identifiers. These new identifiers are then immediately retired as a consequence of the vat processing the notify. The new identifiers exist only for the duration of the notify delivery and processing.

The same reasoning applies to a vat resolving an exported promise via the resolve syscall. While there are substantial differences in the implementation details due to the asymmetry of the vat/kernel relationship, the logic is identical.

In both cases, the kernel, which actually holds and manages the c-lists themselves, benefits from the fact that the interactions across the vat/kernel boundary are synchronous. In particular, the kernel knows that when the vat returns from processing a notify, the vat has done everything necessary internally to ensure it will never mention the retired promise ID again.

The Problem

The above picture is more complicated in the case of a pair of comms vats linking two swingsets.

When a comms vat receives a notify from its local kernel, it forwards the notification (in the form of a resolve message) to its remote counterpart. However, because it cannot interact with that counterpart synchronously, it cannot immediately delete its local record of the promise identifier it shares with the remote end. Until the resolve has been received and processed, it remains possible for the remote end to send a message referencing the resolved promise. In other words, an inbound message referencing a promise and an outbound message announcing that promise's resolution can cross in transit.

Until it knows that the remote end has retired the promise identifier, in order to maintain its commitment to no longer use the identifier it had shared with its local kernel, the comms vat must maintain a record that will enable it to act appropriately if it receives a reference to that promise in a message from the other comms vat. (What "act appropriately" means depends on how the promise ID is used and what the promise resolved to, which we'll get to shortly.) Knowing that the remote end has received and processed the resolve message requires a network round trip.

How to deal with The Problem

As we considered the best way to deal with the above problem, we realized that essentially the same issue arises when a vat that imported a presence chooses to indicate that it's no longer using it or when a vat that exported a presence chooses to revoke the export. The issue occurs yet again in the context of garbage collection; even though in this case the discovery that a presence or promise has become unreferenced may be non-deterministic, propagating this knowledge to other remote swingsets raises exactly the same asynchrony concern.

Considering the reference dropping problem independently from the particulars of the promise retirement problem provided some clarifying focus. From consideration of this narrower problem we came up with what we're calling the "distributed drop protocol". I will describe it generically, then discuss how it can be applied in the case of promise retirement.

Distributed drop protocol

Promise retirement, presence revocation, and distributed garbage collection all share a common underlying need for one end of a remote connection to be able to reliably drop a reference imported or exported by the other. Although there are some distinctive wrinkles to each of these different use cases, all three require a way to cope with the asynchrony imposed by the network. In particular, they need to be able to deal with the case where the message that initiated the drop crosses in transit with one or more messages traveling in the opposite direction that mention the reference being dropped. The distributed drop protocol is our solution to that problem.

The first element of the solution is the addition of a sequence number to messages in the comms protocol. This sequence number is an ordinal that counts upward from 0 with each message sent over a particular connection. While the sequence number need not be made explicit in the protocol itself -- it is sufficient for opposite ends of a connection to simply count messages sent and received -- embedding the sequence number in the message can provide some measure of sanity checking. We chose to add a place in the message format for the sequence number but to make the number itself optional. If the sequence number is present in a message, the message receiver will compare it to the count of messages it has gotten from the remote host that sent it. If the two numbers don't match, something has gone wrong and the connection will be terminated in an error state with appropriate complaints to the log and so forth.
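A minimal sketch of the receiver-side check described above. The names `InboundMessage` and `InboundSequenceChecker` are hypothetical, and the sketch assumes the first message on a connection has ordinal 0, so an explicit sequence number must equal the count of messages received before it.

```typescript
// Hypothetical shape of an inbound comms message. `seqnum` is optional,
// matching the "place in the message format, number itself optional" choice.
interface InboundMessage {
  seqnum?: number;
  body: string;
}

class InboundSequenceChecker {
  // Count of messages accepted so far on this connection.
  private received = 0;
  // Once a mismatch is seen, the connection stays in an error state.
  private failed = false;

  // Returns the (implicit) sequence number of the accepted message, or
  // throws if an explicit number disagrees with the local count.
  accept(msg: InboundMessage): number {
    if (this.failed) {
      throw new Error('connection already terminated');
    }
    const expected = this.received;
    if (msg.seqnum !== undefined && msg.seqnum !== expected) {
      this.failed = true;
      throw new Error(`seqnum mismatch: got ${msg.seqnum}, expected ${expected}`);
    }
    this.received += 1;
    return expected;
  }
}
```

Messages that omit the number are simply counted, so both ends stay in step whether or not the sender chooses to include it.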

Each time a comms vat sends a message, the sender records the message sequence number in the corresponding c-list entries for the message target and each reference contained within the message itself.
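As a rough illustration of this bookkeeping, the sketch below stamps every c-list entry touched by an outbound message with that message's sequence number. `OutboundTracker`, `ClistEntry`, and the field name `lastSentSeqnum` are assumptions made for the example, as is creating an entry on first mention.

```typescript
// Hypothetical c-list entry holding only the field this example needs.
interface ClistEntry {
  lastSentSeqnum: number;
}

class OutboundTracker {
  // Sequence number the next outbound message will carry (starts at 0).
  private nextSeqnum = 0;
  readonly clist = new Map<string, ClistEntry>();

  // Bookkeeping only: nothing is actually transmitted here.
  // Records the seqnum against the target and every reference in the message.
  recordSend(target: string, refs: string[]): number {
    const seqnum = this.nextSeqnum;
    this.nextSeqnum += 1;
    const mentioned = [target].concat(refs);
    for (const ref of mentioned) {
      const entry = this.clist.get(ref);
      if (entry) {
        entry.lastSentSeqnum = seqnum;
      } else {
        // Assumption: a first mention creates the entry.
        this.clist.set(ref, { lastSentSeqnum: seqnum });
      }
    }
    return seqnum;
  }
}
```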

When a comms vat wishes to drop one of the references it holds, it sends a drop message containing the reference ID of the reference(s) being dropped, along with the sequence number of the last message received from the host to which the drop is being sent. After the drop message is sent, the sender must thereafter refrain from making use of the dropped reference in its communications with the remote vat.

Concurrent with sending the drop message, the sender must take whatever action is needed, according to use case, to render its end of the reference dead or resolved or revoked or whatever, though it must retain any information necessary to react in an appropriate way (e.g., by signalling an error) if a reference to the dropped entity is received from the remote end.

When a drop message is received, the receiver must take whatever actions are appropriate for dropping the reference on its end, and then respond with a drop message of its own. (In this case the "last received message" sequence number will likely be that of the drop message to which it is responding, though I don't think this is per se of any operational significance.)

Once the initiator of the original drop receives the reply drop, it is then free to release any resources it had retained in association with the reference and forget about it entirely.

From either end's perspective, the successful dropping of a reference requires each end to both send and receive a drop message for the reference. In the initiator's case, the send precedes the receive, and the sender must be prepared to act upon a mention of the reference in the interval between the two events. In the responder's case, the receive precedes the send and so may be executed within a single crank, with no record keeping required for the intermediate state.

A simple and obvious generalization is allowing a drop message to indicate multiple references to be dropped in a single message, rather than requiring the sending of multiple drop messages one after another.
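The handshake described in this section can be sketched as a small state machine. Everything named here (`DropEndpoint`, `DropMsg`, the `'dropSent'` tombstone state, and the use of -1 for "nothing received yet") is an assumption for illustration. Only drop messages are modeled, so the last-seen counter advances only when a drop arrives.

```typescript
// Hypothetical wire format: the dropped reference ID(s) plus the sequence
// number of the last message received from the peer.
interface DropMsg {
  refs: string[];
  lastSeenSeqnum: number;
}

type RefState = 'live' | 'dropSent';

class DropEndpoint {
  private refs = new Map<string, RefState>();
  // Seqnum of the last message received; -1 (an assumption) means none yet.
  private lastSeenSeqnum = -1;
  readonly outbox: DropMsg[] = [];

  addRef(ref: string): void {
    this.refs.set(ref, 'live');
  }

  stateOf(ref: string): RefState | undefined {
    return this.refs.get(ref);
  }

  // Initiator: send the drop, but keep a tombstone until the reply arrives.
  initiateDrop(refs: string[]): void {
    for (const ref of refs) {
      this.refs.set(ref, 'dropSent');
    }
    this.outbox.push({ refs: refs.slice(), lastSeenSeqnum: this.lastSeenSeqnum });
  }

  // Either side: handle an incoming drop message.
  receiveDrop(msg: DropMsg): void {
    this.lastSeenSeqnum += 1;
    const toReply: string[] = [];
    for (const ref of msg.refs) {
      const state = this.refs.get(ref);
      if (state === 'live') {
        // Responder: receive precedes send, with no intermediate state kept.
        this.refs.delete(ref);
        toReply.push(ref);
      } else if (state === 'dropSent') {
        // Initiator: this is the reply drop, so the tombstone can go.
        this.refs.delete(ref);
      }
    }
    if (toReply.length > 0) {
      this.outbox.push({ refs: toReply, lastSeenSeqnum: this.lastSeenSeqnum });
    }
  }

  // A mention arriving between our drop and the reply must not be honored.
  onMention(ref: string): 'ok' | 'error-dropped' | 'unknown' {
    const state = this.refs.get(ref);
    if (state === 'live') return 'ok';
    if (state === 'dropSent') return 'error-dropped';
    return 'unknown';
  }
}
```

One property of this arrangement is that if both ends happen to initiate a drop of the same reference at once, each sees the other's drop as the reply, and both end up having sent and received exactly one drop.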

Application of the distributed drop protocol to promise retirement

In promise retirement, reference dropping flows from the decider to the subscriber(s). The resolve message implies the dropping of all the promises being resolved, though for purposes of the drop protocol it only really matters for those which are not being introduced by the resolve message itself -- promise references that are directly introduced can immediately be retired, since there is no possibility of intervening message traffic referencing them. We have a choice whether to follow each resolve with a corresponding drop, or to treat the resolve itself as implying a drop for those promises for which it would be relevant. In the latter case we'd need to extend the syntax of the resolve message to incorporate the sequence numbers, but since this is an extremely common application of reference dropping -- perhaps the most common case, actually -- handling it as a special case would make the message log quite a bit less messy in general.

What to do with messages to or referring to resolved promises

Orthogonal to the question of how to keep track of a remote vat's dropping of a promise identifier is the question of how to appropriately react to the remote vat mentioning an as yet unacknowledged resolved promise.

This problem breaks in turn into two related sub-problems:

how to handle messages addressed to the resolved promise identifier
what to do when messages include the resolved promise identifier in their arguments

A key insight is to examine what the kernel does in such cases. Such cases do not arise directly in the kernel, since resolves and notifys immediately retire the pertinent promise identifiers. However, they do arise when the kernel delivers messages that had been queued on a promise awaiting its resolution. We believe that essentially the same logic can be replicated in the comms vat, though of course it will be expressed in terms of comms references rather than kernel references, comms c-lists instead of kernel c-lists, etc.

In particular, the kernel knows that when the vat returns from processing a notify, the vat has done everything necessary internally to ensure it will never mention the retired promise ID again.

It's even sooner than that: the vat won't mention the retired ID in any syscalls it emits during the notify crank.

When a drop message is received, the receiver must take whatever actions are appropriate for dropping the reference on its end, and then respond with a drop message of its own.

Once the initiator of the original drop receives the reply drop, it is then free to release any resources it had retained in association with the reference and forget about it entirely.

My hunch is that only the revoked-export case needs two distinct drop messages. Revoked-export is kinda funny (and sort of rude, in a way: one side is unilaterally abandoning its commitment to recognize an identifier), so I wouldn't be surprised if it's an exception.

Drop Import

For the most common "drop import" case, which is initiated by the importing side, I suspect that we only need a single drop message from the importer. When the exporter receives a drop(rref, lastSeenSeqnum), it should look up rref in its clist and find the lastSentSeqnum. If lastSeenSeqnum >= lastSentSeqnum, then it knows there are no additional references to rref still in flight, so it knows rref is not about to be resurrected on the importing side, and it knows the importer has just deleted rref from its own clists. Because of the latter, it also knows that there are no remaining inbound messages from the importer that might mention rref. Therefore the exporter can safely delete rref from its own clist.

If lastSeenSeqnum < lastSentSeqnum, then the exporter knows that the importer has just deleted rref, but that it is about to be resurrected by a message that is still in-flight. The importer doesn't remember rref, so the arrival of the in-flight message will allocate a new localref for it, and add a new clist entry. The exporter can just ignore the drop: the exporter should keep using the rref as usual (it doesn't care what the importer does with the rref, it just knows that the rref is still in use).
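The exporter's decision can be sketched as a single comparison. `handleDropImport`, `ExportEntry`, and the `'unknown'` outcome for an rref that is not in the c-list are assumptions for the example; `rref`, `lastSeenSeqnum`, and `lastSentSeqnum` are the names used above.

```typescript
// Hypothetical exporter-side c-list entry.
interface ExportEntry {
  lastSentSeqnum: number;
}

type DropOutcome = 'deleted' | 'ignored' | 'unknown';

// Decide what to do with an incoming drop(rref, lastSeenSeqnum).
function handleDropImport(
  clist: Map<string, ExportEntry>,
  rref: string,
  lastSeenSeqnum: number,
): DropOutcome {
  const entry = clist.get(rref);
  if (!entry) {
    // Assumption: a drop for an rref we do not know is reported, not fatal.
    return 'unknown';
  }
  if (lastSeenSeqnum >= entry.lastSentSeqnum) {
    // The importer has seen every mention we sent: nothing is in flight.
    clist.delete(rref);
    return 'deleted';
  }
  // A later mention is still in flight and will resurrect the rref on the
  // importing side, so the drop is ignored and the entry stays.
  return 'ignored';
}
```

For example, with `lastSentSeqnum` of 7, a drop carrying `lastSeenSeqnum` 7 deletes the entry, while one carrying 5 leaves it in place.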

Revoked Exports

For revoked exports (#2070), first off we need to figure how/whether identity is maintained for revoked objects. If revoked objects lose their identity, that will change a lot of the security model.

The easier part of revocation is the short-circuiting of message delivery: the exporter is telling all importers that messages aimed at the target do not actually need to be sent all the way through to the exporting vat. Instead, the importers are given enough information to handle those message deliveries locally, namely an Error with which all deliveries should be rejected.

So regardless of the identity-retention question, we'll need the revocation message to include an Error. That makes it different than a dropped import (which doesn't need to provide any additional vat-visible data).

Identity-Retaining Revocation

In the retained-identity case, I think the protocol should be for the exporter to send a revokeExport(rref, error) message. The exporter remembers the Error, and any lingering inbound messages that target the revoked object will have their result promise rejected with that Error. The importer updates their own tables with the Error, and if it receives anything (from other remote comms) aimed there, it handles the rejection locally. The importer will also notify its local kernel, which will notify any local vats which have imported the object. When the dust has settled, rejections are generated in the same vat that would have otherwise done a syscall.send, and the object ID will no longer be used by anyone as a target.

We might be able to build a protocol that senses this dust-settled case and allows everyone (except the actual importing vats) to forget about the Error object. But if we're retaining identity, then we still need to keep the c-lists around, because someone might introduce the revoked object to a new comms or a new vat, and EQ should continue to work. So if we use a matched ack/drop-type message, the goal would be to let us know when it's safe to delete that Error, not to delete the c-list entry itself.

I think this ack/drop-type message (I'm waffling on the name, I'm not sure drop is correct) only needs rref as its argument. Revocation is irreversible and initiated by the exporter (unlike dropped imports, which are reversible if a new mention arrives, and are initiated by the importer). The exporter decides to revoke, sends the revokeExport(rref, error), and rejects any inbound messages itself until it receives the ack. Once the ack is received, it becomes an error for that remote to send any messages which target the revoked object.

If we hope to forget the Error but still retain identity for EQ, we'll need some sort of c-list-adjacent table to keep track of which remotes have acked the revokeExport and which ones have not. Once all remotes have acked, the exporting comms can forget the Error.
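One possible shape for that c-list-adjacent table, as a sketch only: `RevocationRecord` and its method names are hypothetical, and the Error is represented as a plain string for brevity.

```typescript
// Tracks, for one revoked object, which remotes still owe an ack.
class RevocationRecord {
  private unacked = new Set<string>();
  // The Error data; forgotten once every remote has acked.
  private error: string | undefined;

  constructor(error: string, importingRemotes: string[]) {
    this.error = error;
    for (const remote of importingRemotes) {
      this.unacked.add(remote);
    }
  }

  // Rejection to use for an inbound message from `remote` that targets the
  // revoked object. After that remote's ack, such a message is an error.
  rejectionFor(remote: string): string {
    if (!this.unacked.has(remote) || this.error === undefined) {
      throw new Error(`protocol error: ${remote} targeted a revoked object after acking`);
    }
    return this.error;
  }

  // Record an ack; returns true once the Error has been forgotten.
  ack(remote: string): boolean {
    this.unacked.delete(remote);
    if (this.unacked.size === 0) {
      this.error = undefined;
    }
    return this.error === undefined;
  }

  holdsError(): boolean {
    return this.error !== undefined;
  }
}
```

The c-list entry itself is deliberately outside this record, since in the identity-retaining case it has to outlive the Error.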

Identity-Losing Revocation

For this case, I think we'd use the same ack message as for the identity-retained case, because revocation is irreversible. The big question is what the former exporter should do with inbound messages that reference the revoked rref. It would be the importing vat's fault to send those after the ack, but we can't blame them for sending them before it ever had a chance to learn about the revocation.

I don't know how to deal with those messages. We need to think more about what identity-losing revocation would look like. It's certainly useful to know when the importer has promised to never mention the rref again, but until we know what the former exporter should do with mentions in the interim, I don't think we can design a sensible protocol.

Identity-Retiring Promise Resolution

Automatically-retiring resolved promises is a lot like identity-losing revocation of objects, except perhaps more gentle. The decider announces the resolution, but must be prepared to accept messages which either target or mention the retired ID until it gets an ack from the subscriber. After that point, it can forget the identifier, knowing that the subscriber has enough information to handle any subsequent local references on its own.

The decider will stop emitting messages that reference the old ID right away (any such message will use a fresh ID), and resolution is irreversible, so it knows that the resolve it sent will be the last message ever sent in that direction which mentions the old ID. So the inbound stream of messages is partitioned into exactly two phases: before the resolve was received, and after.

I think the ack message only needs to cite the rref (old promise ID). I think we only need to use the seqnums for transitions that are reversible, and only dropImport is reversible.
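A sketch of the decider's side of this, under several assumptions: `RetiredPromiseTable`, `Resolution`, and `Disposition` are hypothetical names, and the three outcomes for a message that targets the retired ID (forward to a presence, reject with "cannot send to data", reject with the promise's rejection) are taken from the list of cases that appears later in this thread. Messages that merely mention the ID are not modeled.

```typescript
// What a promise resolved to, reduced to the cases that matter for routing.
type Resolution =
  | { kind: 'presence'; target: string }
  | { kind: 'data' }
  | { kind: 'rejected'; reason: string };

type Disposition =
  | { action: 'forward'; target: string }
  | { action: 'reject'; reason: string };

class RetiredPromiseTable {
  // Retired promise IDs whose resolve has not yet been acked.
  private awaitingAck = new Map<string, Resolution>();

  // Called when the decider sends resolve(rref, resolution).
  resolveAndRetire(rref: string, resolution: Resolution): void {
    this.awaitingAck.set(rref, resolution);
  }

  // How to handle an inbound message that targets a retired promise ID.
  dispositionFor(rref: string): Disposition {
    const resolution = this.awaitingAck.get(rref);
    if (!resolution) {
      throw new Error(`unknown or already-acked promise ID ${rref}`);
    }
    switch (resolution.kind) {
      case 'presence':
        return { action: 'forward', target: resolution.target };
      case 'data':
        return { action: 'reject', reason: 'cannot send to data' };
      case 'rejected':
        return { action: 'reject', reason: resolution.reason };
      default:
        throw new Error('unknown resolution kind');
    }
  }

  // The ack cites only the rref; after it, the ID is forgotten.
  ack(rref: string): void {
    this.awaitingAck.delete(rref);
  }

  isTracked(rref: string): boolean {
    return this.awaitingAck.has(rref);
  }
}
```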

Messages

So maybe this converges on the following messages:

dropImport(rref, lastSeenSeqnum)
revokeExport(rref, error)
ack something (rref)
resolve(rref, resolution) (but aggregated)

with the following handshakes:

dereferenced import: importer sends dropImport, exporter accepts or ignores based on seqnums
revoked export: exporter sends revokeExport, importer sends ack
resolved+retired promise: decider sends resolve, subscriber sends ack

And the new messages could maybe be aggregated too: dropImport would take a list of rrefs and a single lastSeenSeqnum, revokeExport could take a list of [rref, error] pairs, ack could take a list of rrefs.
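Written out as types, the aggregated forms might look something like the following. This is only a sketch: the union name `CommsDropMessage`, the field names, the helper `rrefsOf`, and the use of strings for errors and resolutions are all assumptions, and the generic `ack` keeps its placeholder name.

```typescript
type Rref = string;

// Aggregated variants of the four messages, one union member each.
type CommsDropMessage =
  | { type: 'dropImport'; rrefs: Rref[]; lastSeenSeqnum: number }
  | { type: 'revokeExport'; revocations: Array<[Rref, string]> }
  | { type: 'ack'; rrefs: Rref[] }
  | { type: 'resolve'; resolutions: Array<[Rref, string]> };

// List the rrefs a message talks about, whatever its type.
function rrefsOf(msg: CommsDropMessage): Rref[] {
  switch (msg.type) {
    case 'dropImport':
    case 'ack':
      return msg.rrefs.slice();
    case 'revokeExport':
      return msg.revocations.map(pair => pair[0]);
    case 'resolve':
      return msg.resolutions.map(pair => pair[0]);
    default:
      throw new Error('unknown message type');
  }
}
```

Note that only `dropImport` carries a sequence number, which matches the observation that seqnums are needed only for the reversible transition.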

I see the logic that resolve and the acknowledgement of a resolve don't need the sequence number. A notable win here is that if resolve is resolving, say, 5 promises, we don't need to send 5 sequence numbers and we don't need fancy logic to distinguish the promises that are "actually" being resolved from the ephemeral ones that are just being used instrumentally.

I think I tentatively buy the reasoning that the drop import case doesn't need the acknowledgement.

And also that the revoke export case does need the error. Revoke is interestingly parallel to resolve: both amount to "I disclaim this thing. Here is the value you should henceforth use for it instead."

I don't care for the proliferation of messages. I'm not sure there's anything to be done about it but it makes my spidey sense tingle a bit. Aesthetically, I'm slightly bothered by having a mixture of verb and verbObject, i.e., I feel like it should be drop/revoke/ack/resolve or dropImport/revokeExport/ackSomething/resolvePromise, but that might just be my OCD talking. The verbs with modifiers are clearer but longer, and in any case I don't know what to call the generic something that's being acknowledged in the ack.

A weird consequence of the reasoning above is that (I think) we would only need to track sequence numbers for presence mentions, but not for promise mentions. This is an asymmetry that kind of bugs me, and makes me worry that we're missing something.

Chip and I worked out some details yesterday, including some IMHO clever optimizations:

each outbound message carries an optionally-explicit sequence number (#2483)
each outbound message acks some span of the inbound messages
- the explicit approach would be to include a full copy of the sequence number of the last-received inbound message
- the compressed approach is to include the difference between that sequence number and the one cited by the previous outbound message
- so each time we receive a message, increment a counter
- when we send a message, include a copy of the counter, and then zero the counter
- each time we receive a message, increment our "they are acking message N" value by the counter carried in that message, and then pretend we've just received an ack for all messages from (excluding) the previous value up to (including) the new value
each time an outbound message references an exported object-id, set the c-list entry's lastMentioned value to that message's outbound seqnum. we use this to remember in-flight messages (which may or may not have been seen by the importing side yet) that could cause the importing side to resurrect the ID
when an importing side no longer needs the import, it sends a DROP with the last-seen inbound seqnum as a payload
- when the exporting side receives the DROP, if the last-seen seqnum is equal or greater than the dropped reference's lastMentioned value, delete the c-list entry (possibly causing the exporting comms vat to send a drop to wherever it got the reference from, either the kernel or one of the other vats)
- if the last-seen seqnum is less than lastMentioned, ignore the drop: the outbound mention that wasn't known to the importer when it sent the drop will cause the ID to be resurrected
when a decider resolves a promise, it implicitly retires the promise-ID, but it must continue to handle inbound messages which either target or mention the promise until that resolve+retire message has been acked
- inbound messages which target the ID are either forwarded to a presence, rejected with a "cannot send to data" error, or rejected with the rejection of the promise
- inbound messages which reference the ID are forwarded to the normal target, but new short-lived promise-IDs are created (and then promptly resolved) for the recipient
- the ack is just the first naturally-occurring message that gets sent after the receipt of the resolve+retire
- when the ack is received, the c-list entry for the promise is deleted
- once all c-list entries (remotes and kernel) for a resolved promise are deleted, and if the local promise ID is not referenced by any other resolved promises, the comms promise table entry can be deleted (which might trigger further dereferences and deletions)
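The compressed ack scheme from the list above can be sketched as a pair of counters per connection. `AckDeltaTracker` and its method names are hypothetical, and the sketch assumes outbound sequence numbers start at 0, so "nothing acked yet" is represented as -1.

```typescript
class AckDeltaTracker {
  // Messages received since we last sent something.
  private receivedSinceLastSend = 0;
  // Highest of our outbound seqnums the peer has acked; -1 means none.
  private peerAckedThrough = -1;

  // Called when sending: returns the ack-delta to embed, then zeroes it.
  onSend(): number {
    const delta = this.receivedSinceLastSend;
    this.receivedSinceLastSend = 0;
    return delta;
  }

  // Called when receiving a message carrying `ackDelta`.
  // Returns the outbound seqnums that this message newly acknowledges.
  onReceive(ackDelta: number): number[] {
    this.receivedSinceLastSend += 1;
    const newlyAcked: number[] = [];
    const previous = this.peerAckedThrough;
    for (let n = previous + 1; n <= previous + ackDelta; n += 1) {
      newlyAcked.push(n);
    }
    this.peerAckedThrough = previous + ackDelta;
    return newlyAcked;
  }
}
```

If one side sends three messages and the other then replies once, that reply carries a delta of 3 and acknowledges seqnums 0 through 2; a second reply sent with nothing new received carries a delta of 0 and acknowledges nothing.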

Revocations still need more thought, but:

a "soft" revocation is one that retains the identity of the object, but all messages sent to the object are rejected with a specific error
a "hard" revocation rejects messages but also drops the identity of the object, and possibly makes the mere mention of it "taboo"
- this will probably be easier if we can create the object from the very beginning as "lacking identity to importers": e.g. a Payment will arrive at any importer as a special kind of Presence that (like virtual-object Representatives) is not supposed to be compared for object identity, does not round trip from importer to other vat back to importer as the same object, but will arrive back at the exporter as the original object
both kinds of revocation move responsibility for message delivery (i.e. rejection) off to all importers, saying "you now know enough to handle messages sent to this object by yourself: just reject all those messages with the following error"
the "soft" form requires the revoking exporter to remember the object-id until it is dropped by the importer. They must also remember the error data until the REVOKE message is acked. The ack arrives piggybacked on some other naturally-occurring message just like the ack for a resolve+retire
the "hard" form requires the object-id and error data to be remembered until the REVOKE is acked. Once acked, they can forget both the object-id and error data. We may need a different kind of REVOKE message so the recipient/importer knows that they've been handed a "hard" revocation

So the resulting data structures are:

for each remote, there is a c-list that maps from local-id or remote-id to (local-id, remote-id, lastMentioned, needsErrorData)
- when a DROP arrives, we map from the dropped remote-id to lastMentioned, compare against the drop's lastSeen, and either execute the drop (delete the clist entry and trigger refcount checks) or ignore it
for each remote, there is a table of (seqnum, promise/object-id, type) kept sorted by seqnum
- for each inbound message, we process the table prefix subset whose seqnum is equal to or lower than the piggybacked ack number. for each table entry processed:
  - if the type indicates a promise-id retirement, we delete the c-list entry and trigger refcount checks
  - if it indicates a soft revocation, we retain the c-list entry but clear the needsErrorData flag, and trigger a check of the needsErrorData flags from other c-lists (i.e. trigger a refcount check on the error data, which is independent of the refcount for the object-id as a whole)
  - if it indicates a hard revocation, we delete the c-list entry and trigger refcount checks
- in the steady-state, when both sides are communicative, this table will become empty promptly
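Processing the per-remote table when an ack arrives can be sketched as follows. The type and function names (`PendingEntry`, `RemoteClistEntry`, `processAck`) are assumptions, the c-list is keyed by a single ID for simplicity, and the refcount checks mentioned above are left out.

```typescript
type PendingKind = 'retirePromise' | 'softRevoke' | 'hardRevoke';

// One row of the per-remote table, kept sorted by ascending seqnum.
interface PendingEntry {
  seqnum: number;
  id: string;
  kind: PendingKind;
}

// Hypothetical c-list entry with the fields named in the text.
interface RemoteClistEntry {
  localId: string;
  lastMentioned: number;
  needsErrorData: boolean;
}

// Apply every pending entry whose seqnum is covered by the piggybacked ack.
// Mutates `pending` (removing the processed prefix) and returns that prefix.
function processAck(
  pending: PendingEntry[],
  clist: Map<string, RemoteClistEntry>,
  ackedThrough: number,
): PendingEntry[] {
  let count = 0;
  while (count < pending.length && pending[count].seqnum <= ackedThrough) {
    const entry = pending[count];
    if (entry.kind === 'softRevoke') {
      // Identity is retained; only the error data is no longer needed here.
      const clistEntry = clist.get(entry.id);
      if (clistEntry) {
        clistEntry.needsErrorData = false;
      }
    } else {
      // Promise retirement and hard revocation both delete the entry.
      clist.delete(entry.id);
    }
    count += 1;
  }
  return pending.splice(0, count);
}
```

Because the table is sorted, the loop stops at the first entry the ack does not cover, so in the steady state each inbound message touches only a short prefix.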

The overall effect on message size and quantity is pretty modest:

each message adds an optional seqnum field; when omitted, this adds one byte to each message (just the : separator and the empty/omitted number field)
each message adds an ack-delta field; if the "message trade balance" is equal (messages flow in both directions at equal rates) then this adds maybe two bytes to each message (a separator and a one-digit count)
DROPs add one message per non-immortal object that crosses the wire, and the message only needs the object-id (in addition to the normal ack-delta field)
resolve+retire adds one RESOLVE message per resolved promise, with the resolution data (and ancillary resolutions, if any), but no additional overhead
revocation will add one REVOKE message per revoked object, with the error data, and no extra messages or overhead
