Concrete proposals for what data gets committed to in a crosslink

Currently, the AttestationData structure asks crosslink committee members to sign a shard_block_hash, the hash of a shard block at some recent slot (perhaps the epoch start slot of the epoch during which the crosslink is made, though this has not yet been specified). There is also a custody bit, which is the first bit of the hash of something, which has also not yet been specified. This post will attempt to come up with concrete proposals to tackle these issues.

First, some facts to keep in mind:

The AttestationData structure includes a latest_crosslink_root, so validators signing a crosslink do know what the last crosslink was, and therefore what slot it included blocks up to.
The shard chain does not know when the latest crosslink was, and it fundamentally can't. A crosslink of a shard chain up to slot N will take up to slot N + EPOCH_LENGTH before we know whether or not that crosslink will be included in the beacon chain, so the shard chain between slots N and N + EPOCH_LENGTH does not know whether it's part of the "new crosslink period" or the old one
If the last crosslink was up to slot N, the curent crosslink could be up to slot N2 > N for an arbitrarily high difference N2 - N; there is no upper bound on the amount of data that needs to be included
Some slots have blocks, other slots are skips
A shard block has a header (currently min 344 bytes, max 856 bytes) and a body (~16 kb); there is also a desire to add state roots, which could add another few hundred bytes to the "header".

The simplest approach would be to have shard_block_hash point to the shard block at the end of the epoch before the epoch in which the crosslink was included, and make a proof of custody of just the shard block data. Note that because all shard block data is perfectly 2^k sized (16 KB = 2^14 bytes, or 2^9 32-byte chunks), a Merkle tree of Merkle trees of shard data from N blocks is the same as a Merkle tree of the concatenated shard data, as long as we make sure to fill unused leaves in the tree with merkle_root([b'\x00' * 32] * 512) instead of the zero-hash.

However, this would require a client seeking to gain a guarantee on the chain's integrity to download not just the beacon chain but also the shard chains, which adds a significant amount of data: up to 1/16 the data of all shards in the entire chain! It would also make fraud proof enforcement harder, as the beacon chain would not have access to any state roots in between the crosslinks. A better approach would have shard_block_hash (and therefore the custody bit) include the block bodies and also the headers.

For the proposals, we rename shard_block_hash to custody_commitment_data.

A philosophical note

We can consider the crosslinks being included into the beacon chain as being the "real" shard blocks' headers. So all "real shard block headers" get included into the beacon chain. The shard blocks that appear in the intermediate stages are merely a coordination device to assist the proposal committee on coming together to agree what block to propose, and in such a way that transaction senders can get assurance within one slot that their block will (likely) get included. From this point of view, we want to be able to fully verify state transitions inside of these "real shard blocks" and fully verify that the coordination game was actually followed, so we should include shard headers in the data to be committed.

Proposal 1: two sub-trees

Let custody_commitment_data = hash(header_root, body_root), where body_root is a Merkle root of all block data, and header_root is a Merkle root of all header data, zero-padded to 16 KB (for skipped slots, block data is fully zero, and header data is some placeholder containing the most recent block header root and state_root). We can add a state transition validity fraud proof by asking for a Merkle branch for the header and a Merkle branch in the corresponding block root in the body data, and checking that the latter does not match the data_root in the former.

Proposal 2: interlacing

For each block, have 32 kilobytes of data, where the first 16 kilobytes are the header (zero-padded to 16 KB) and the second 16 kilobytes are the data (for skipped slots, block data is fully zero, and header data is some placeholder containing the most recent block header root and state_root). We can add a state transition validity fraud proof with a Merkle branch for the header and for the body data as in proposal 1, also checking that the latter does not match the data_root in the former. However, the fraud proof will be shorter, because most of the two Merkle branches are shared because the header and body are beside each other in the tree.

Proposal 3: optimizing interlacing

For each block, have 32 kilobytes of data, where the first 2 kilobytes are the header (zero-padded to 2 KB) and the remaining 30 kilobytes are the data (for skipped slots, block data is fully zero, and header data is some placeholder containing the most recent block header root and state_root). A block header now contains four data roots for the 2+4+8+16 kilobytes of the data respectively. We would add a fraud proof type for each of the four roots and parts of the block data.

Proposal 3 is more efficient than proposal 2 when we want to add data availability proofs because it does not add the ~80% overhead from hashing many zero bytes, but it also adds some extra complexity. We could mitigate the 80% overhead by simply using the space for other purposes. Possible ways to use the space include:

Putting in STARKs of validity
Putting in erasure coded values for the other data
Putting in random Merkle branches from the state

These ideas would be easier to implement if there was a large contiguous pool of data, which is an advantage of proposal 1 over proposal 2.

ethereum / consensus-specs