Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

Cope with asynchrony in remote retirement of promise IDs #2509

Closed FUDCo closed 3 years ago

FUDCo commented 3 years ago

Background

When the kernel delivers a notify to a vat, informing the vat about the resolution of a promise it had previously imported, an implicit consequence is that the vat's identifier for that promise is retired. That is, the vat's c-list entry for the promise is deleted and the vat and kernel jointly agree never to make reference to this identifier in future interactions. In vats implemented with liveslots, the vat assures its compliance with this agreement by deleting entries in its internal tables that map between the (now retired) promise ID and the actual underlying JavaScript Promise object, so that any such future references become impossible.

Sometimes, when a promise is resolved, the value it is resolved to may contain references to other promises that are themselves now resolved. When the kernel delivers the notify for this, these subsidiary promises are included in the notification as additional promise resolutions. It's possible these promises had previously been known to the vat but retired when they were originally resolved, or they could be entirely new to the vat. In either case they are treated as new: the subsidiary promises are imported (or possibly reimported) under new identifiers. These new identifiers are then immediately retired as a consequence of the vat processing the notify. The new identifiers exist only for the duration of the notify delivery and processing.

The same reasoning applies to a vat resolving an exported promise via the resolve syscall. While there are substantial differences in the implementation details due to the asymmetry of the vat/kernel relationship, the logic is identical.

In both cases, the kernel, which actually holds and manages the c-lists themselves, benefits from the fact that the interactions across the vat/kernel boundary are synchronous. In particular, the kernel knows that when the vat returns from processing a notify, the vat has done everything necessary internally to ensure it will never mention the retired promise ID again.

The Problem

The above picture is more complicated in the case of a pair of comms vats linking two swingsets.

When a comms vat receives a notify from its local kernel, it forwards the notification (in the form of a resolve message) to its remote counterpart. However, because it cannot interact with that counterpart synchronously, it cannot immediately delete its local record of the promise identifier it shares with the remote end. Until the resolve has been received and processed, it remains possible for the remote end to send a message referencing the resolved promise. In other words, an inbound message referencing a promise and an outbound message announcing that promise's resolution can cross in transit.

Until it knows that the remote end has retired the promise identifier, in order to maintain its commitment to no longer use the identifier it had shared with its local kernel, the comms vat must maintain a record that will enable it to act appropriately if it receives a reference to that promise in a message from the other comms vat. (What "act appropriately" means depends on how the promise ID is used and what the promise resolved to, which we'll get to shortly.) Knowing that the remote end has received and processed the resolve message requires a network round trip.

How to deal with The Problem

As we considered the best way to deal with the above problem, we realized that essentially the same issue arises when a vat that imported a presence chooses to indicate that it's no longer using it or when a vat that exported a presence chooses to revoke the export. The issue occurs yet again in the context of garbage collection; even though in this case the discovery that a presence or promise has become unreferenced may be non-deterministic, propagating this knowledge to other remote swingsets raises exactly the same asynchrony concern.

Considering the reference dropping problem independently from the particulars of the promise retirement problem provided some clarifying focus. From consideration of this narrower problem we came up with what we're calling the "distributed drop protocol". I will describe it generically, then discuss how it can be applied in the case of promise retirement.

Distributed drop protocol

Promise retirement, presence revocation, and distributed garbage collection all share a common underlying need for one end of a remote connection to be able to reliably drop a reference imported or exported by the other. Although there are some distinctive wrinkles to each of these different use cases, all three require a way to cope with the asynchrony imposed by the network. In particular, they need to be able to deal with the case where the message that initiated the drop crosses in transit with one or more messages traveling in the opposite direction that mention the reference being dropped. The distributed drop protocol is our solution to that problem.

The first element of the solution is the addition of a sequence number to messages in the comms protocol. This sequence number is an ordinal that counts upward from 0 with each message sent over a particular connection. While the sequence number need not be made explicit in the protocol itself -- it is sufficient for opposite ends of a connection to simply count messages sent and received -- embedding the sequence number in the message can provide some measure of sanity checking. We chose to add a place in the message format for the sequence number but to make the number itself optional. If the sequence number is present in a message, the message receiver will compare it to the count of messages it has gotten from the remote host that sent it. If the two numbers don't match, something has gone wrong and the connection will be terminated in an error state with appropriate complaints to the log and so forth.
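The receiver-side check described above might be sketched roughly as follows. This is only an illustration: `makeInboundCounter` and its methods are hypothetical names, and the sketch assumes the first message on a connection carries sequence number 0, so a present sequence number should equal the count of messages received before it.

```javascript
// Hypothetical receiver-side bookkeeping for a single remote connection.
// Assumption: the first message is number 0, so an explicit seqnum must
// equal the number of messages already accepted from this remote.
function makeInboundCounter() {
  let received = 0; // messages accepted so far from this remote

  return {
    // `seqnum` is optional; pass undefined when the message omits it.
    // Returns the sequence number assigned to the accepted message.
    accept(seqnum) {
      if (seqnum !== undefined && seqnum !== received) {
        // A real comms vat would log and terminate the connection here.
        throw new Error(`seqnum mismatch: got ${seqnum}, expected ${received}`);
      }
      const assigned = received;
      received += 1;
      return assigned;
    },
    // Sequence number of the most recent accepted message (-1 if none).
    lastReceived() {
      return received - 1;
    },
  };
}
```

A message without an explicit sequence number is simply counted, which matches the choice to make the field optional.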

Each time a comms vat sends a message, the sender records the message sequence number in the corresponding c-list entries for the message target and each reference contained within the message itself.

When a comms vat wishes to drop one of the references it holds, it sends a drop message containing the reference ID of the reference(s) being dropped, along with the sequence number of the last message received from the host to which the drop is being sent. After the drop message is sent, the sender must thereafter refrain from making use of the dropped reference in its communications with the remote vat.

Concurrent with sending the drop message, the sender must take whatever action is needed, according to use case, to render its end of the reference dead or resolved or revoked or whatever, though it must retain any information necessary to react in an appropriate way (e.g., by signalling an error) if a reference to the dropped entity is received from the remote end.

When a drop message is received, the receiver must take whatever actions are appropriate for dropping the reference on its end, and then respond with a drop message of its own. (In this case the "last received message" sequence number will likely be that of the drop message to which it is responding, though I don't think this is per se of any operational significance.)

Once the initiator of the original drop receives the reply drop, it is then free to release any resources it had retained in association with the reference and forget about it entirely.

From either end's perspective, the successful dropping of a reference requires each end to both send and receive a drop message for the reference. In the initiator's case, the send precedes the receive, and the sender must be prepared to act upon a mention of the reference in the interval between the two events. In the replier's [need a better word] case, the receive precedes the send and so may be executed within a single crank, with no record keeping required for the intermediate state.

A simple and obvious generalization is allowing a drop message to indicate multiple references to be dropped in a single message, rather than requiring the sending of multiple drop messages one after another.
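To make the handshake concrete, here is a minimal sketch of one end of a connection under the protocol as described so far. Everything in it is an assumption for illustration: `makeEndpoint`, the message shapes (`deliver`, `drop`), and the `{ type: 'error' }` signal are invented names, not the comms vat's actual data structures. A `lastSeenSeqnum` of -1 stands for "nothing received yet".

```javascript
// Hypothetical per-connection endpoint; all names are illustrative only.
function makeEndpoint() {
  const clist = new Map(); // ref -> { lastSentSeqnum, dropping }
  let sent = 0; // count of messages sent on this connection
  let received = 0; // count of messages received on this connection

  function nextOutbound(body) {
    const seqnum = sent;
    sent += 1;
    return { seqnum, ...body };
  }

  return {
    addRef(ref) {
      clist.set(ref, { lastSentSeqnum: undefined, dropping: false });
    },
    // Send an ordinary message, stamping the target and each argument ref
    // with the sequence number of this message.
    send(target, refs = []) {
      for (const ref of [target, ...refs]) {
        const entry = clist.get(ref);
        if (!entry || entry.dropping) throw new Error(`cannot use ${ref}`);
      }
      const msg = nextOutbound({ type: 'deliver', target, refs });
      for (const ref of [target, ...refs]) {
        clist.get(ref).lastSentSeqnum = msg.seqnum;
      }
      return msg;
    },
    // Initiator: announce the drop, but keep a tombstone until the reply.
    initiateDrop(refs) {
      for (const ref of refs) {
        const entry = clist.get(ref);
        if (!entry) throw new Error(`unknown ref ${ref}`);
        entry.dropping = true;
      }
      return nextOutbound({ type: 'drop', refs, lastSeenSeqnum: received - 1 });
    },
    // Handle an inbound message; returns a reply to send, a local error
    // signal, or undefined when there is nothing to do.
    receive(msg) {
      received += 1;
      if (msg.type === 'drop') {
        const toAck = [];
        for (const ref of msg.refs) {
          const entry = clist.get(ref);
          if (!entry) continue;
          if (!entry.dropping) toAck.push(ref); // we are the replier
          clist.delete(ref); // either way, nothing more to remember
        }
        return toAck.length
          ? nextOutbound({ type: 'drop', refs: toAck, lastSeenSeqnum: received - 1 })
          : undefined;
      }
      // Ordinary message: it may mention a ref whose drop crossed in transit.
      for (const ref of [msg.target, ...msg.refs]) {
        const entry = clist.get(ref);
        if (entry && entry.dropping) return { type: 'error', ref };
      }
      return undefined;
    },
    knows: ref => clist.has(ref),
  };
}
```

In this sketch the replier deletes its entry and answers within a single `receive` call, while the initiator keeps the entry (flagged `dropping`) until the reply drop arrives, which is the asymmetry the protocol relies on. Because `refs` is a list, the multiple-reference generalization comes for free.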

Application of the distributed drop protocol to promise retirement

In promise retirement, reference dropping flows from the decider to the subscriber(s). The resolve message implies the dropping of all the promises being resolved, though for purposes of the drop protocol it only really matters for those which are not being introduced by the resolve message itself -- promise references that are directly introduced can immediately be retired, since there is no possibility of intervening message traffic referencing them. We have a choice whether to follow each resolve with a corresponding drop, or to treat the resolve itself as implying a drop for those promises for which it would be relevant (in that case we'd need to extend the syntax of the resolve message to incorporate the sequence numbers, but since it is an extremely common application of reference dropping -- perhaps the most common case, actually -- handling this as a special case would make the message log quite a bit less messy in general).

What to do with messages to or referring to resolved promises

Orthogonal to the question of how to keep track of a remote vat's dropping of a promise identifier is the question of how to appropriately react to the remote vat mentioning an as yet unacknowledged resolved promise.

This problem breaks in turn into two related sub-problems:

  1. how to handle messages addressed to the resolved promise identifier

  2. what to do when messages include the resolved promise identifier in their arguments

A key insight is to examine what the kernel does in such cases. Such cases do not arise directly, since resolves and notifys immediately retire the pertinent promise identifiers. However, they do arise when the kernel delivers messages that had been queued on a promise awaiting its resolution. We believe that essentially the same logic can be replicated in the comms vat, though of course it will be expressed in terms of comms references rather than kernel references, comms c-lists instead of kernel c-lists, etc.

FUDCo commented 3 years ago

@warner

warner commented 3 years ago

> In particular, the kernel knows that when the vat returns from processing a notify, the vat has done everything necessary internally to ensure it will never mention the retired promise ID again.

It's even sooner than that: the vat won't mention the retired ID in any syscalls it emits during the notify crank.

> When a drop message is received, the receiver must take whatever actions are appropriate for dropping the reference on its end, and then respond with a drop message of its own.

> Once the initiator of the original drop receives the reply drop, it is then free to release any resources it had retained in association with the reference and forget about it entirely.

My hunch is that only the revoked-export case needs two distinct drop messages. Revoked-export is kinda funny (and sort of rude, in a way: one side is unilaterally abandoning its commitment to recognize an identifier), so I wouldn't be surprised if it's an exception.

Drop Import

For the most common "drop import" case, which is initiated by the importing side, I suspect that we only need a single drop message from the importer. When the exporter receives a drop(rref, lastSeenSeqnum), it should look up rref in its clist and find the lastSentSeqnum. If lastSeenSeqnum >= lastSentSeqnum, then it knows there are no additional references to rref still in flight, so it knows rref is not about to be resurrected on the importing side, and it knows the importer has just deleted rref from its own clists. Because of the latter, it also knows that there are no remaining inbound messages from the importer that might mention rref. Therefore the exporter can safely delete rref from its own clist.

If lastSeenSeqnum < lastSentSeqnum, then the exporter knows that the importer has just deleted rref, but that it is about to be resurrected by a message that is still in-flight. The importer doesn't remember rref, so the arrival of the in-flight message will allocate a new localref for it, and add a new clist entry. The exporter can just ignore the drop: the exporter should keep using the rref as usual (it doesn't care what the importer does with the rref, it just knows that the rref is still in use).
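The exporter's decision in the two cases above reduces to one comparison. A minimal sketch, assuming the exporter's clist is a `Map` from rref to an entry holding `lastSentSeqnum` (the function name and return values are hypothetical, and the sketch assumes every exported rref has been sent at least once, so `lastSentSeqnum` is always a number):

```javascript
// Hypothetical exporter-side handler for drop(rref, lastSeenSeqnum).
// `clist` maps rref -> { lastSentSeqnum }.
function handleDropImport(clist, rref, lastSeenSeqnum) {
  const entry = clist.get(rref);
  if (!entry) {
    return 'unknown'; // nothing to drop
  }
  if (lastSeenSeqnum >= entry.lastSentSeqnum) {
    // The importer has seen every message that mentioned rref, so nothing
    // in flight can resurrect it: safe to delete.
    clist.delete(rref);
    return 'deleted';
  }
  // A later mention is still in flight and will re-create the importer's
  // entry, so the drop is ignored and rref stays in use.
  return 'ignored';
}
```

For example, with `lastSentSeqnum` of 7, a drop citing `lastSeenSeqnum` 7 deletes the entry, while one citing 5 leaves it untouched.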

Revoked Exports

For revoked exports (#2070), first off we need to figure out how/whether identity is maintained for revoked objects. If revoked objects lose their identity, that will change a lot of the security model.

The easier part of revocation is the short-circuiting of message delivery: the exporter is telling all importers that messages aimed at the target do not actually need to be sent all the way through to the exporting vat. Instead, the importers are given enough information to handle those message deliveries locally, namely an Error with which all deliveries should be rejected.

So regardless of the identity-retention question, we'll need the revocation message to include an Error. That makes it different than a dropped import (which doesn't need to provide any additional vat-visible data).

Identity-Retaining Revocation

In the retained-identity case, I think the protocol should be for the exporter to send a revokeExport(rref, error) message. The exporter remembers the Error, and any lingering inbound messages that target the revoked object will have their result promise rejected with that Error. The importer updates their own tables with the Error, and if it receives anything (from other remote comms) aimed there, it handles the rejection locally. The importer will also notify its local kernel, which will notify any local vats which have imported the object. When the dust has settled, rejections are generated in the same vat that would have otherwise done a syscall.send, and the object ID will no longer be used by anyone as a target.

We might be able to build a protocol that senses this dust-settled case and allows everyone (except the actual importing vats) to forget about the Error object. But if we're retaining identity, then we still need to keep the c-lists around, because someone might introduce the revoked object to a new comms or a new vat, and EQ should continue to work. So if we use a matched ack/drop-type message, the goal would be to let us know when it's safe to delete that Error, not to delete the c-list entry itself.

I think this ack/drop-type message (I'm waffling on the name, I'm not sure drop is correct) only needs rref as its argument. Revocation is irreversible and initiated by the exporter (unlike dropped imports, which are reversible if a new mention arrives, and are initiated by the importer). The exporter decides to revoke, sends the revokeExport(rref, error), and rejects any inbound messages itself until it receives the ack. Once the ack is received, it becomes an error for that remote to send any messages which target the revoked object.

If we hope to forget the Error but still retain identity for EQ, we'll need some sort of c-list-adjacent table to keep track of which remotes have acked the revokeExport and which ones have not. Once all remotes have acked, the exporting comms can forget the Error.
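One possible shape for that c-list-adjacent table is sketched below. It is purely illustrative: `makeRevocationRecord` and its methods are hypothetical names, and the sketch assumes the set of remotes that imported the object is known at revocation time.

```javascript
// Hypothetical per-object record kept by the exporting comms vat after it
// sends revokeExport(rref, error) to every remote that imported the object.
function makeRevocationRecord(error, remotes) {
  const pending = new Set(remotes); // remotes that have not yet acked
  let retained = error; // forgotten once every remote has acked

  return {
    // Called when `remote` acknowledges the revokeExport.
    ack(remote) {
      pending.delete(remote);
      if (pending.size === 0) {
        retained = undefined; // dust has settled: drop the Error
      }
    },
    // Called when `remote` sends a message targeting the revoked object.
    // Before its ack, the message is rejected with the retained Error;
    // after its ack, sending at all is a protocol violation.
    rejectionFor(remote) {
      if (!pending.has(remote)) {
        throw new Error(`${remote} targeted a revoked object after acking`);
      }
      return retained;
    },
    errorRetained: () => retained !== undefined,
  };
}
```

Note that only the Error is released; in the identity-retaining design the c-list entry itself would stay so that EQ keeps working.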

Identity-Losing Revocation

For this case, I think we'd use the same ack message as for the identity-retained case, because revocation is irreversible. The big question is what the former exporter should do with inbound messages that reference the revoked rref. It would be the importing vat's fault to send those after the ack, but we can't blame them for sending them before it ever had a chance to learn about the revocation.

I don't know how to deal with those messages; we need to think more about what identity-losing revocation would look like. It's certainly useful to know when the importer has promised to never mention the rref again, but until we know what the former exporter should do with mentions in the interim, I don't think we can design a sensible protocol.

Identity-Retiring Promise Resolution

Automatically-retiring resolved promises is a lot like identity-losing revocation of objects, except perhaps more gentle. The decider announces the resolution, but must be prepared to accept messages which either target or mention the retired ID until it gets an ack from the subscriber. After that point, it can forget the identifier, knowing that the subscriber has enough information to handle any subsequent local references on its own.

The decider will stop emitting messages that reference the old ID right away (any such message will use a fresh ID), and resolution is irreversible, so it knows that the resolve it sent will be the last message ever sent in that direction which mentions the old ID. So the inbound stream of messages is partitioned into exactly two phases: before the resolve was received, and after.

I think the ack message only needs to cite the rref (old promise ID). I think we only need to use the seqnums for transitions that are reversible, and only dropImport is reversible.
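The decider's side of this could be sketched as a small table of resolved-but-unacknowledged promise IDs. The names below (`makeDeciderTable`, `resolve`, `lookup`, `ack`) and the message shape are assumptions made for illustration, not the actual comms vat code.

```javascript
// Hypothetical decider-side table of promise IDs that have been resolved
// but whose resolve message has not yet been acked by the subscriber.
function makeDeciderTable() {
  const awaitingAck = new Map(); // old promise rref -> resolution

  return {
    // Announce a resolution; remember it until the subscriber acks.
    resolve(rref, resolution) {
      awaitingAck.set(rref, resolution);
      return { type: 'resolve', rref, resolution };
    },
    // An inbound message targets or mentions the old ID before the ack:
    // the decider handles it using the remembered resolution.
    lookup(rref) {
      if (!awaitingAck.has(rref)) {
        throw new Error(`${rref} is retired and must not be mentioned`);
      }
      return awaitingAck.get(rref);
    },
    // The ack cites only the rref; no sequence number is needed because
    // resolution is irreversible.
    ack(rref) {
      awaitingAck.delete(rref);
    },
  };
}
```

After `ack`, any further mention of the old ID is treated as an error, reflecting the two-phase partition of the inbound stream.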

Messages

So maybe this converges on the following messages:

with the following handshakes:

And the new messages could maybe be aggregated too: dropImport would take a list of rrefs and a single lastSeenSeqnum, revokeExport could take a list of [rref, error] pairs, ack could take a list of rrefs.

FUDCo commented 3 years ago

I see the logic that resolve and the acknowledgement of a resolve don't need the sequence number. A notable win here is that if resolve is resolving, say, 5 promises, we don't need to send 5 sequence numbers and we don't need fancy logic to distinguish the promises that are "actually" being resolved from the ephemeral ones that are just being used instrumentally.

I think I tentatively buy the reasoning that the drop import case doesn't need the acknowledgement.

And also that the revoke export case does need the error. Revoke is interestingly parallel to resolve: both amount to "I disclaim this thing. Here is the value you should henceforth use for it instead."

I don't care for the proliferation of messages. I'm not sure there's anything to be done about it but it makes my spidey sense tingle a bit. Aesthetically, I'm slightly bothered by having a mixture of verb and verbObject, i.e., I feel like it should be drop/revoke/ack/resolve or dropImport/revokeExport/ackSomething/resolvePromise, but that might just be my OCD talking. The verbs with modifiers are clearer but longer, and in any case I don't know what to call the generic something that's being acknowledged in the ack.

FUDCo commented 3 years ago

A weird consequence of the reasoning above is that (I think) we would only need to track sequence numbers for presence mentions, but not for promise mentions. This is an asymmetry that kind of bugs me, and makes me worry that we're missing something.

warner commented 3 years ago

Chip and I worked out some details yesterday, including some IMHO clever optimizations:

Revocations still need more thought, but:

So the resulting data structures are:

The overall effect on message size and quantity is pretty modest:

FUDCo commented 3 years ago

Closed by #2752