ethereum / consensus-specs

Ethereum Proof-of-Stake Consensus Specifications

Generating data availability roots from shard block roots #1438

Open vbuterin opened 5 years ago

vbuterin commented 5 years ago

In a beacon block, we have N data roots r[1] ... r[N] (in the worst case, N = 256), representing data d[1] ... d[N] of different sizes: on average 128 kB, in the worst case 512 kB. We want to combine these roots into a combined root R = f(r[1] ... r[N]) which is a Merkle root of some kind of concatenation of d[1] ... d[N].

The main desiderata are:

Possibilities I see:

  1. Allow r[i] to have different heights depending on its size, so there's on average ~30% (maximum ~49%) wasted space. Then create R by combining all r[i] values of each size, adding a bit of padding if needed, and then combining the different sizes. The main weakness is that variable heights are not SSZ-friendly.
  2. Expose a maximum of 4 data sub-roots, where r[i] = root(r[i][0], r[i][1], r[i][2], r[i][3]). This reduces wasted space even further and is cleaner (including being SSZ-compliant), but increases the number of block hashes from 1 to an average of 1.5 (max 4), increasing beacon chain load by ~1.2x in the average case and ~2.33x in the worst case.
  3. Expose 3 data sub-roots, where r[i][0] is the root of the first half of the data if the data is more than half the max size (and otherwise zero), r[i][1] is the root of the first quarter of the data not covered by r[i][0], and r[i][2] is the root of the first eighth of the data not covered by r[i][0] or r[i][1]. This reduces wasted space heavily (maybe ~10% wasted), but increases beacon chain load ~1.89x and is more complicated (including the non-SSZ-friendliness from (1)).

For (1), here is an example algorithm:

from hashlib import sha256
from typing import List

Hash = bytes
def hash(x: bytes) -> Hash:  # the spec's hash function (SHA-256)
    return sha256(x).digest()

def combine_roots(roots: List[Hash], sizes: List[int]) -> Hash:
    tree: List[Hash] = []  # working roots, all at the height matching cur_size
    cur_size = 1
    zero_hash = b'\x00' * 32
    while cur_size < max(sizes) * 2 or len(tree) > 1:
        # Pull in the input roots whose data size rounds up to cur_size
        tree.extend([roots[i] for i in range(len(roots)) if cur_size // 2 < sizes[i] <= cur_size])
        # Pad to an even number of nodes with a zero subtree of the same height
        if len(tree) % 2:
            tree.append(zero_hash)
        # Merge pairs, raising every node one level
        tree = [hash(tree[i] + tree[i + 1]) for i in range(0, len(tree), 2)]
        zero_hash = hash(zero_hash + zero_hash)
        cur_size *= 2
    return tree[0]
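
For (2), a minimal sketch of how the fixed-height sub-roots could be chosen and folded into R (the MAX_SIZE constant, the helper names, and the per-level zero-padding merkleization are my own illustrative assumptions, not part of the proposal):

from hashlib import sha256
from math import ceil
from typing import List

# Assumed sizing (not from this issue): 512 kB worst-case shard block, so each
# of the up-to-4 sub-roots under proposal (2) covers one 128 kB quarter.
MAX_SIZE = 512 * 1024
QUARTER = MAX_SIZE // 4

def num_subroots(data_size: int) -> int:
    # 1 sub-root for a <=128 kB (average-size) block, up to 4 for a full block;
    # a block just over 128 kB needs 2, which is where the ~1.5 average comes from.
    return max(1, min(4, ceil(data_size / QUARTER)))

def merkleize(subroots: List[bytes]) -> bytes:
    # Because every exposed sub-root has the same height, the combined root R is
    # just a balanced Merkle tree over the flat list of sub-roots (SSZ-friendly).
    nodes, zero = list(subroots), b'\x00' * 32
    if not nodes:
        return zero
    while len(nodes) > 1:
        if len(nodes) % 2:
            nodes.append(zero)
        nodes = [sha256(nodes[i] + nodes[i + 1]).digest() for i in range(0, len(nodes), 2)]
        zero = sha256(zero + zero).digest()
    return nodes[0]

In this sketch the beacon block only carries the sub-roots that are actually used, which is what drives the 1.5-average / 4-max hash counts above.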

For (2), here are concrete estimates of data overhead. With 64 shards, a MAX_CATCHUP_RATIO of 4, and 4x slack for different attestations, there are a maximum of 1024 data roots in a block, plus 1024 state roots and 1024 lengths (that's 32 + 32 + 8 = 72 kB). Proposal (2) would increase the former to 1536 data roots on average (4096 in the worst case), hence 48 + 32 + 8 = 88 kB in the average case and 128 + 32 + 8 = 168 kB in the worst case. Note that average-case figures are expected to be ~8x less than these worst-case figures.
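
As a sanity check on those figures, here is the arithmetic spelled out (the 1024-slot ceiling, 32-byte roots, and 8-byte lengths are taken from the paragraph above; the helper itself is just illustrative):

SHARD_COUNT = 64
MAX_CATCHUP_RATIO = 4
ATTESTATION_SLACK = 4
MAX_SLOTS = SHARD_COUNT * MAX_CATCHUP_RATIO * ATTESTATION_SLACK  # 1024

ROOT_SIZE, LEN_SIZE = 32, 8  # bytes

def block_overhead_kb(data_roots_per_slot: float) -> float:
    # data roots + one state root + one length per referenced shard block
    per_slot = data_roots_per_slot * ROOT_SIZE + ROOT_SIZE + LEN_SIZE
    return MAX_SLOTS * per_slot / 1024

print(block_overhead_kb(1))    # status quo:             72.0 kB
print(block_overhead_kb(1.5))  # proposal (2), average:  88.0 kB
print(block_overhead_kb(4))    # proposal (2), worst:   168.0 kB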

I am currently favoring (2) but open to other ideas.

dankrad commented 5 years ago

I think (2) sounds very reasonable. Could we save a lot of beacon chain load by making the target shard block size just a little smaller than 128 kB, say something in the range of 100-120 kB? If we expect that the distribution around the target does not have a huge variance, then far fewer blocks would need a second data root (not sure if there is an obvious way to get the expected distribution from EIP1559?)

vbuterin commented 5 years ago

I do expect the distribution to have considerable variance! Sure, if you assume it's a normal distribution, then on average a block has ~120 txs, so the standard deviation should be ~10; but that ignores uneven sizes of transactions, multi-transactions, etc. If I had to guess, I'd expect the 25-75 confidence interval to be ~0.6-1.5 times the average.
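
To put rough numbers on this exchange (the Poisson assumption, the 112 kB target, and the two standard deviations below are my own illustrative picks, not figures from either comment):

from math import erf, sqrt

# Naive model: ~120 equal-sized txs per block with Poisson-distributed counts.
MEAN_TXS = 120
print(f"Poisson std: {sqrt(MEAN_TXS):.1f} txs (~{sqrt(MEAN_TXS) / MEAN_TXS:.0%} of the mean)")

def frac_above(threshold_kb: float, mean_kb: float, std_kb: float) -> float:
    # P(block size > threshold) under a normal approximation of the size distribution
    z = (threshold_kb - mean_kb) / std_kb
    return 0.5 * (1 - erf(z / sqrt(2)))

# With a 112 kB target (inside the 100-120 kB range suggested above), how often
# would a block spill past the 128 kB covered by the first sub-root?
for std_kb in (10, 50):  # ~naive equal-tx model vs. a much wider illustrative spread
    print(f"std {std_kb} kB -> {frac_above(128, 112, std_kb):.1%} of blocks need a 2nd sub-root")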