Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
326 stars 206 forks source link

marshal: capdata 'slots' should be unique, call slotToVal exactly once per slot #1904

Open warner opened 3 years ago

warner commented 3 years ago

What is the Problem Being Solved?

We have two incoming features that may require slight changes to the way marshal works. The first is @FUDCo 's #1846 / #455 secondary-storage / "Virtual Objects + Representatives" feature. The second is my #1872 GC work.

The fundamental requirements are:

I think this translates into two new requirements on marshal:

The first requirement is currently honored, but only because obj1 !== obj2 currently implies that valToSlot(obj1) !== valToSlot(obj2). That will change when #1846 lands, because two different representatives for the same virtual object will be distinct objects (obj1 !== obj2), but both will convert into the same object ID (valToSlot(obj1) === valToSlot(obj2)).

The second requirement is currently honored because of the ibidTable, which will suffer from the same change.

Description of the Design

For serialization, I think marshal's serializeSlot needs to have slotMap map from "slot" (e.g. o+44) to slotIndex, rather than mapping val to slotIndex. This map should be consulted after calling convertValToSlot(), so that a second representative will get mapped to the same slotIndex even though the first representative was a different object. The code that currently does:

  function serializeSlot(val, slots, slotMap, iface = undefined) {
    let slotIndex;
    if (slotMap.has(val)) {
      slotIndex = slotMap.get(val);
    } else {
      const slot = convertValToSlot(val);

      slotIndex = slots.length;
      slots.push(slot);
      slotMap.set(val, slotIndex);
    }

should instead do something like:

  function serializeSlot(val, slots, slotMap, iface = undefined) {
    let slotIndex;
    const slot = convertValToSlot(val);
    if (slotMap.has(slot)) {
      slotIndex = slotMap.get(slot);
    } else {
      slotIndex = slots.length;
      slots.push(slot);
      slotMap.set(slot, slotIndex);
    }

Note that our ibidTable means that serializeSlot is never called twice for the same val, so the old slotMap was somewhat redundant. I remember either not having or not trusting the ibidTable when writing serializeSlot way back then.

For deserialization, I think we should call convertSlotToVal() on every entry of capdata.slots immediately upon entry, and stash the resulting values for later use by the rawTree['@qclass'] === 'slot' clause. The currently implementation defers calling convertSlotToVal() until it encounters a @qclass: 'slot' node, and the capdata.body object graph might or might not contain exactly one of these per capdata.slots element. (This will depend upon the serializer, and how the ibidTable handles matched Representatives, and I think it would take more work than described above to guarantee this behavior).

The code that currently does:

  function unserialize(data, cyclePolicy = 'forbidCycles') {
    const rawTree = harden(JSON.parse(data.body));
    const fullRevive = makeFullRevive(data.slots, cyclePolicy);
    return harden(fullRevive(rawTree));
  }

might instead do:

  function unserialize(data, cyclePolicy = 'forbidCycles') {
    const rawTree = harden(JSON.parse(data.body));
    const slots = data.slots.map(convertSlotToVal);
    const fullRevive = makeFullRevive(slots, cyclePolicy);
    return harden(fullRevive(rawTree));
  }

One significant complication is the handling of iface: currently these are stored in the data.body nodes, and are not visible from data.slots. This actually suggests a latent bug: if the capdata contains two references to the same slot but with different interface names, which interface name will the receiver see?

Motivation: GC refcounting

The GC work (#1872) I'm doing depends upon the kernel (who is sending object references into a vat) and the liveslots layer (which is receiving those object references within the vat) agreeing upon how many times each object ID has been seen. The vat's decref messages are arbitrarily delayed, since they are driven by FinalizationRegistry callbacks. We have a counter scheme to tolerate this delay: the kernel increments a counter for each object ID as it sends the message into the vat, and when the vat observes the import to be dropped, the vat sends the last counter value back up to the kernel. The kernel decrements its counter by the vat's delta, and only drops the object if the result is zero. This is tolerant of the kernel mentioning an object ID after the vat had dropped its last copy (such that the message will cause the vat to create a new Presence): the kernel must not drop the object in that case.

For this to work, the kernel needs a way to scan capdata for object references. The kernel should not have to parse the capdata.body string to find them: we've put a lot of work into making sure the kernel never needs to understand or modify capdata.body. Fortunately the kernel can get this information from capdata.slots instead, which it must enumerate anyways as it translates those slots through the c-list.

The kernel's approach must match the one used by liveslots. When liveslots deserializes a message, it will increment its counter once per call to convertSlotToVal (it uses this call to sense the existence of another refcount).

When a message contains multiple references to the same object (perhaps the sender called x!.foo(obj1, obj2, obj1), or (reprA1, reprA2)), the capdata might have multiple capdata.body nodes that all point to the same capdata.slots[index] entry. If the deserializer calls convertSlotToVal once per capdata.body node, but the kernel increments its counter once per capdata.slots entry, these could diverge, causing imports to be freed too early, or never freed at all.

Motivation: no communication channels

When a message references a virtual object, the receiver's arguments will contain a "Representative" that combines the configured behavior (methods) of the object type with the state of a particular object ID. The receiver may hold multiple Representatives that all refer to the same virtual object (we cannot avoid this without unbounded storage or user-level nondeterminism). The specific policy by which the inbound message delivery code creates new Representatives needs to ensure that it doesn't reveal anything about e.g. its internal LRU cache, because that would allow one object graph to communicate with an unrelated one: the speaker sends lots of messages (to flush the shared LRU cache) and the listener measures whether other representatives it receives are the same object or are new ones.

Whatever information is revealed in one vat, for safety's sake, we should make sure that information does not leak into other vats. No matter what collection of identical- or distinct- Representatives (for the same virtual object) we might put into the arguments, the receiver should not be able to tell anything beyond the virtual object IDs. Sending x~.foo(reprA1, reprA1) (same representative) or x~.foo(reprA1, reprA2) (different representatives of the same virtual object) should not be distinguishable by the receiver.

With our current ibidTable, I think these would produce distinct capdata bodies and slots. The first would have an ibid backref, and just one slot. The second would have two @qclass: slot nodes with distinct index values, and capdata.slots = [objA, objA]). The receiver (x.foo = (y, z) => { ... }) could compare y === z to distinguish the two cases.

My proposed change above would change this to producing distinct capdata bodies but the slots array would be identical (just [objA]). The receiver would call convertSlotToVal exactly once (because capdata.slots.length === 1), and foo() would observe y === z in both cases.

The strongest interpretation of this goal would suggest that both messages should result in identical capdata (both .both and .slots). I think that would require bypassing the ibidTable check when the value is recognized by convertValToSlot, which would need deeper surgery in the serializer than I want to apply. I might be content with knowing that user-level receiver code cannot distinguish between these cases, even if the liveslots-level receiver code could.

Security Considerations

Comparing representatives for object identity might allow some malicious code to learn something about the path those objects took on their way to the receiver (did they arrive in the same message, or different ones? Were they created in a vat that produced them during the same crank, or different ones?). This feels like a subtle issue that won't be a problem most of the time, but would be hard to audit for. (It might be nice to have a lint rule that complains about identity comparison between objects known to be representatives).

If the refcounts get out of sync (perhaps because malicious code is deliberately constructing object graphs to provoke different behavior on the two sides of the kernel-val boundary), objects could be GCed too early, which will probably cause a vat-fatal illegal syscall if/when the missing object is referenced.

warner commented 3 years ago

@erights let me know what you think, you're more familiar with the ibidTable than me, as well as the Interface considerations.

warner commented 3 years ago

Hm, now that I think about it, I bet I can address the refcounting issue by moving the increment from convertSlotToVal() out into whoever calls that (dispatch.deliver, etc). I have to do something like that anyways for the target of a notifyFulfillToPresence, where we don't use m.unserialize() on the resolution (since it's always a Presence), but we need to increment the refcount anyways.

I think the other considerations still apply.

warner commented 9 months ago

Note to assignees: this ticket is pretty old, and probably contains some no-longer-correct assumptions.. do not implement it as described without updating the priors first.

In particular, we decided against having multiple concurrent Representatives for a single vref (which would have exposed GC behavior to userspace, via rep1 === rep2). We found a way to track the existing ones without also having the syscalls become dependent upon GC.