Encoding for on-chain structures

anorth commented 5 years ago

The spec is very light on details about the serialization/encoding of on-chain structures. At present, CBOR is not mentioned at all, although a few structs carry a "// representation tuple" comment.

I believe the intention at present is for all structures to be CBOR-tuple encoded (i.e. a CBOR array with items corresponding to struct fields in their order of declaration). This is efficient but has some potential problems. I'm filing this issue so that we have them written down somewhere.

@jbenet's most recent declaration is:

i'm OK with tuple encoding for testnet

we MAY ship mainnet w/ tuple encoding

we MAY have to change from tuple to int-keyed map for structs for mainnet

we will prioritize this along other changes that come out of security review during testnet

IF [int-keyed maps are already implemented] we can motivate realignment to that now.

IF NOT (gfc has string maps, but not int-keyed maps and those would be a lot of work), proceed w/ tuple. but keep it easy to change this

anorth commented 5 years ago

Problem: future-proofing.

Quoth @jbenet

in light of evolving protocols, security oriented protocols that serialize into non-self-describing formats take great care to ensure fields are appropriately tagged to ensure the right serialized field value is serialized/deserialized into the right in-memory field. protobuf, capnp, and more enforce this, and have for decades, for precisely protocol evolution and security. deserializing field A into field B is a class of bug trivially defeated and not worth exposing ourselves to.

this compounds as formats change and programs (which do not all update in lockstep) continue to read old and new versions of structures.

this is made especially worse in hash-linked data structures which cannot be upgraded by migrating data, but instead tend to force all programs in the future to read old structure versions. field tagging is key for secure schema evolution.
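The "field A into field B" bug described in the quote can be sketched as follows. This is an illustration under assumptions, not code from any implementation: `TransferV1`, `TransferV2`, and the tag numbers are hypothetical, and the encoded forms are modelled as a plain slice (tuple) and a plain map (int-keyed) rather than real CBOR bytes.

```go
package main

import "fmt"

// TransferV1 is a hypothetical original schema.
type TransferV1 struct{ To, Amount uint64 }

// TransferV2 is a hypothetical later schema that inserts Fee before Amount.
type TransferV2 struct{ To, Fee, Amount uint64 }

// asTuple models tuple encoding: values in declaration order, no tags.
func (t TransferV2) asTuple() []uint64 {
	return []uint64{t.To, t.Fee, t.Amount}
}

// asTagged models an int-keyed map. Existing fields keep their tags
// (0: To, 1: Amount) and the new field takes a fresh tag (2: Fee).
func (t TransferV2) asTagged() map[int]uint64 {
	return map[int]uint64{0: t.To, 1: t.Amount, 2: t.Fee}
}

// readV1Tuple is an old reader that decodes by position.
func readV1Tuple(items []uint64) TransferV1 {
	return TransferV1{To: items[0], Amount: items[1]}
}

// readV1Tagged is an old reader that decodes by tag and ignores unknown tags.
func readV1Tagged(fields map[int]uint64) TransferV1 {
	return TransferV1{To: fields[0], Amount: fields[1]}
}

func main() {
	v2 := TransferV2{To: 42, Fee: 1, Amount: 1000}

	// The positional reader silently puts Fee into Amount.
	fmt.Printf("tuple reader:  %+v\n", readV1Tuple(v2.asTuple())) // {To:42 Amount:1}

	// The tagged reader still finds the right value for Amount.
	fmt.Printf("tagged reader: %+v\n", readV1Tagged(v2.asTagged())) // {To:42 Amount:1000}
}
```

Appending new fields at the end of a tuple avoids this particular mix-up, but that is a convention the format does not enforce, whereas tags make the mapping explicit on the wire.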

anorth commented 5 years ago

Another annoyance I have just learned about is that tuple-encoding does not play nicely with graphsync. IPLD selectors operate over the encoded IPLD nodes, which in this case will be lists. So selectors for chain syncing need to be expressed with indices, rather than field names.

This is not the end of the world, of course. We can declare (or even reflect) a mapping of field name->index and use symbolic constants to construct queries.
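A rough sketch of both options, declared constants and a reflected mapping, might look like this in Go. The `BlockHeader` struct, its fields, and the `fieldIndex` helper are hypothetical names invented for the example; the selector path is modelled as a plain slice of indices and does not use any real graphsync or IPLD selector API.

```go
package main

import (
	"fmt"
	"reflect"
)

// BlockHeader is a hypothetical struct; its fields and their order are
// illustrative only and do not match the spec.
type BlockHeader struct {
	Miner   string
	Parents []string
	Height  uint64
}

// Option 1: declared symbolic constants for the tuple index of each field.
// They must be kept in sync with the declaration order by hand.
const (
	BlockHeaderMiner = iota
	BlockHeaderParents
	BlockHeaderHeight
)

// Option 2: fieldIndex reflects over a struct value and returns a map from
// field name to its position in declaration order (its tuple index).
func fieldIndex(v interface{}) map[string]int {
	t := reflect.TypeOf(v)
	idx := make(map[string]int, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		idx[t.Field(i).Name] = i
	}
	return idx
}

func main() {
	idx := fieldIndex(BlockHeader{})

	// A path to "the first parent of a block", written with names in the
	// source but expressed as list indices over the tuple-encoded node.
	pathFromConsts := []int{BlockHeaderParents, 0}
	pathFromReflect := []int{idx["Parents"], 0}

	fmt.Println(pathFromConsts, pathFromReflect) // [1 0] [1 0]
}
```

The reflected mapping cannot drift from the struct definition, while the constants are cheaper and usable at compile time; either way the query code reads in terms of field names even though the wire form is a list.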

anorth commented 5 years ago

cc @hannahhoward @icorderi @whyrusleeping

filecoin-project / specs

Encoding for on-chain structures #621