Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
327 stars 206 forks source link

replace JSON-based marshal with dense binary encoding #2589

Open warner opened 3 years ago

warner commented 3 years ago

What is the Problem Being Solved?

(I thought we had a ticket for this already, but I couldn't find it.. feel free to close this as a duplicate if you find one)

Our marshal library currently uses a JSON-based encoding scheme to convert capability-bearing object graphs into pure data ("capdata", which is an object with a body string and slots array of object reference strings). It works by building a tree of objects from the original object graph, then using the built-in JSON.stringify to convert the tree into a single string. The deserialize function does the same thing in reverse.

The swingset comms vat has a related encoding scheme to convert messages to/from the "comms protocol", which adds additional control messages to the mix (deliver, notify, dropImport, and sequence numbers). It basically takes the two-part capdata object and encodes it into a single string.

The resulting output formats are pretty easy to specify, and nicely neutral (it would be pretty straightforward to port them to other languages than JavaScript), and convenient for humans to read the encoded output. But they're awfully fluffy. Common JS objects like undefined and BigInt(1) expand to 23- and 33- byte strings respectively.

1661 touches on shrinking the representation of undefined, but points out that it would be a distraction: the real task is to replace the encoding entirely, with some dense binary format.

Given our heritage and community, CapnProto is a likely choice. #1827 explores the idea (in a different context). We might also consider CBOR (RFC 8949), which I think is the encoding of preference in the IPFS world.

This touches on the question of how we want to represent binary data in our system: currently we can use the JS ByteArray in vat code, but they are mutable, and marshal can't handle them, so we're left in the awkward position of recommending base64-encoded strings, eww. #46 ("large string device") touches on this, as the data it manages might be arbitrary bytes.

Compatibility Issues

Every vat within a single kernel should probably use the same encoding scheme, because the messages created by liveslots in one vat are sent into the kernel, (sometimes) stored in kernel promise queues, and delivered to other vats. The kernel should not be in the business of interpreting the body of capdata, so I don't think I'd want it responsible for conversion. User-level vat code does not generally see the output of marshal, because liveslots and/or the virtual objects layer abstracts that away, although I believe userspace is given access to serialize (but not deserialize) for some reason I don't currently understand.

Converting an existing kernel to the new format could be tricky. We'd need to convert all vats' liveslots code at the same time, and simultaneously convert anything on the run-queue and the various unresolved-promise message queues (which assumes we could write a function that converts an old-format message to a new-format message, which at least sounds plausible).

Two different swingsets, connected by a comms connection, have a similar problem. We might lean more heavily on that hypothetical format-conversion function, and make the comms vat tolerant of the other version (upconverting upon receipt, downconverting on transmit).

dtribble commented 3 years ago

We should also switch before this to a more efficient text representation. That makes integration and debugging in browsers and other web infrastructure much easier. There's no reason not to have a format that is more readable and incidentally more efficient than the JSON encoding.

erights commented 3 years ago

There's no reason not to have a format that is more readable and incidentally more efficient than the JSON encoding.

There are plenty of reasons. Whether this is the right thing to do or not, it is a tradeoff. For one thing, any text format other than JSON will need a parser written in JS, so it will be hard to beat JSON.parse on speed.

dckc commented 3 years ago

I'm struggling to understand the problem statement. All I can find is "they're awfully fluffy". What are the symptoms of this problem? What suggests these symptoms are going to have sufficient impact that it's worth a change?

dckc commented 3 years ago

another candidate? https://github.com/Moddable-OpenSource/moddable/blob/public/documentation/xs/XS%20Marshalling.md

erights commented 3 years ago

Wow, that is much more different than I expected! Definitely something we should look at closely.

dckc commented 2 years ago

prompted by #6038 @erights suggests considering https://github.com/Agoric/agoric-sdk/blob/master/packages/store/src/patterns/encodePassable.js

dckc commented 2 years ago

@dtribble also had an idea:

'{"body":"{\"current\":{\"Electorate\":{\"type\":\"invitation\",\"value\":{\"brand\":\"$0.Alleged:
  Zoe Invitation brand\",\"value\":[{\"description\":\"questionPoser\",\"handle\":\"$1.Alleged:
  InvitationHandle\",\"installation\":\"$2.Alleged:
  BundleInstallation\",\"instance\":\"$3.Alleged:
  InstanceHandle\"}]}},\"GiveStableFee\":{\"type\":\"ratio\",\"value\":{\"denominator\":{\"brand\":\"$4.Alleged:
  IST brand\",\"value\":\"+10000\",\"numerator\":{\"brand\":\"$4\",\"value\":\"+3\"}}},\"MintLimit\":{\"type\":\"amount\",\"value\":{\"brand\":\"$5.Alleged:
  AUSD brand\",\"value\":\"+20000000000000\"}},\"WantStableFee\":{\"type\":\"ratio\",\"value\":{\"denominator\":{\"brand\":\"$4\",\"value\":\"+10000\"},\"numerator\":{\"brand\":\"$4\",\"value\":\"+1\"}}}}}}",
  "slots":["board03125","board03928","board05311","board00613","board0074","board05815"]}'
erights commented 2 years ago

Provided there is an outer novel qclass record to prevent misinterpretation across the version skew, I endorse @dtribble 's idea. Both in absolute terms and in comparison to the current state of encodePassable. @warner 's criticisms of encodePassable at https://github.com/endojs/endo/pull/1260#discussion_r960369826 are currently all correct, and we only know how to mitigate some of these.

erights commented 1 year ago

See https://github.com/endojs/endo/issues/1349 . Should be closed by whatever closes that.

Well, except the replacing part. It will co-exist with smallcaps, not replace it.

erights commented 1 year ago

Hopefully fixed by https://github.com/endojs/endo/pull/1594

Though still, except the replacing part. It will co-exist with smallcaps, not replace it.

mhofman commented 9 months ago

I am wondering more and more if direct CBOR serialization is not a better approach to avoid all the string manipulation logic, even with more efficient passable encodings, and approaches like https://github.com/endojs/endo/issues/1478 which effectively defers some serialization with marshal only taking care of encoding into a serializable object.

CBOR seems to support streaming and can easily nest pre-serialized data by wrapping it (which can be handled using a list of ArrayBuffers in the related APIs). The main problem is that the story for "frozen" ArrayBuffer in JS is still not good at all (and handling of binary data still very primitive).

mhofman commented 2 days ago

If we ever get around to using the compact-order encoding from https://github.com/endojs/endo/pull/1594, updating liveslots might be complex. Besides migrating existing data in legacy encoding, the legacy encoding had the property that all output sorted before | but all compact-order sorts after | because it's prefixed with ~. Some liveslots logic does rely on this ordering of data before metadata in collections.

gibson042 commented 2 days ago

Endo issue: https://github.com/endojs/endo/issues/2600