ethereum / consensus-specs

Ethereum Proof-of-Stake Consensus Specifications

Inefficiencies in data availability proofs #1194

Closed · vbuterin closed this issue 4 years ago

vbuterin commented 5 years ago

There are three major sources of inefficiency in the current construction for data availability proofs:

  1. The structure demands a 2^k * 2^k square, so data needs to be padded to the next power of 4. This may on average double the amount of data that needs to be erasure-coded and stored by the network.
  2. Due to the EIP 1559 fee mechanism, a block will on average be ~half full, requiring further zero-padding within the space allocated for each block.
  3. The space allocated for each block is 2x the size of the block data itself; the remaining half is allocated for the block header and witnesses for state reads.

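As a rough back-of-the-envelope, these three factors compound multiplicatively; a minimal sketch (treating each factor as roughly 2x, per the points above) of how they produce the theoretical ~8x figure mentioned at the end of this comment:

```python
# Back-of-the-envelope for the combined inefficiency, using the ~2x estimates above.
square_padding  = 2.0  # (1) padding the data out to the next 2^k * 2^k square
half_full_block = 2.0  # (2) blocks ~half full on average under EIP 1559
non_block_space = 2.0  # (3) half of each block's space reserved for header + witnesses

print(square_padding * half_full_block * non_block_space)  # ~8x
```
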
Possible ideas to improve efficiency:

  1. Adjust EIP 1559 so that blocks are on average eg. 80% full. This would, however, nullify a large part of the benefit of EIP 1559, as blocks would be full much more frequently, and so the first-price auction mode would more frequently dominate.
  2. Adjust the crosslink structure so that, while there is a static per-epoch limit, individual blocks can take up different amounts of space. This allows EIP 1559 to keep functioning as long as the gasprice does not need more than ~half an epoch to adjust.
  3. Adjust the crosslink structure so that it stores the concatenation of all block structures in SSZ format as dynamic-sized lists (so any extra zero bytes are not included, positions of individual blocks are variable, and the indices of each block appear at the left end of the tree structure). The main disadvantage is the need to do two Merkle tree accesses instead of one to access a shard block.
  4. Be liberal with the witnesses for state reads allowed in the block (eg. allow an unlimited number of "top-level transactions") so more of the non-block-data space is used.
  5. Somehow modify the erasure coding mechanism to be able to process large runs of contiguous zero bytes trivially, or at least more efficiently. This is ideal if possible, but not sure how to actually do this.
  6. Remove the 2^k * 2^k requirement and replace it with an arbitrary n * n (not sure this is actually an efficiency improvement; FFTs are easiest to use when handling exact powers of two).
  7. Allow 2^k * 2^(k-1) rectangles.

This is still in "think mode"; better ideas may come up. So far (3) seems least bad, plus possibly (7). The issues are far from fatal, but hopefully there are easy ways to remove at least some of the current theoretical ~8x inefficiency.
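For ideas (6)/(7), here is a minimal sketch (helper names are illustrative; data is measured in chunks) comparing the padding cost of requiring a 2^k * 2^k square with also allowing 2^k * 2^(k-1) rectangles:

```python
import math

def padded_size_square(n_chunks: int) -> int:
    # Smallest 2^k * 2^k square that holds n_chunks, i.e. the next power of 4.
    k = max(0, math.ceil(math.log2(max(n_chunks, 1)) / 2))
    return (2 ** k) ** 2

def padded_size_rectangle(n_chunks: int) -> int:
    # Smallest 2^k * 2^k square or 2^k * 2^(k-1) rectangle that holds n_chunks.
    square = padded_size_square(n_chunks)
    return square // 2 if square // 2 >= n_chunks else square

for n in (1025, 2049, 4097):
    print(n, padded_size_square(n), padded_size_rectangle(n))
# Worst case (just over a power of 4, eg. 4097 chunks): ~4x padding with squares,
# ~2x if rectangles are also allowed.
```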

dankrad commented 5 years ago

Very interesting problem. I would love to think about these. How final are these ideas? Could there be other schemes out there and would we be willing to consider them at this point?

One question about https://arxiv.org/pdf/1809.09044.pdf (is this a good reference?), section 7.1: I am currently thinking about polynomial and functional commitment schemes. Is it possible that such a scheme could be used, rather than a full zk-SNARK (which is currently prohibitively expensive)?

I am currently looking at some polynomial commitment schemes that could potentially replace the proof of custody with a non-interactive version. They aren't quite ready for prime time yet (still some problems with efficiency and outsourceability), but I'm planning to post my thoughts to get some outside feedback.

vbuterin commented 5 years ago

In terms of how much is final and how much can be changed, I'd say there are good arguments for erasure coding, and a commitment to the erasure-coded data in some form, being optimal:

As far as the choice of scheme goes, I think the biggest pragmatic reason to stick to hash-based constructions in this case, aside from simplicity of implementation, future-proofing and all the other usual reasons, is that erasure codes and Merkle trees are really fast to generate (eg. I can do it within an epoch for 512 MB of data in python), and I don't see any more complicated scheme having the same property (as precedent, think of how STARK proving is ~10x faster than SNARK proving).

Additionally, purely hash-based schemes can work over binary fields. One reason binary fields are especially nice is that they can represent 32-byte chunks with no overhead, whereas any prime field runs the risk that even if a block is entirely composed of values < 2**256, the extended chunks will not be (the same goes for any chunk size below 32 bytes).
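To make the binary-field point concrete, here is a small packing illustration (the specific ~255-bit prime is only an example, and none of this is from the spec): with a prime modulus below 2**256 you cannot treat every 32-byte chunk as a field element, so data is typically packed at 31 bytes per element, whereas with a binary field of size 2**256 each 32-byte chunk maps to exactly one element:

```python
# Illustration only: packing raw bytes into field elements before erasure coding.
P = 2**255 - 19  # an example ~255-bit prime; roughly half of all 32-byte values are >= P

def pack_prime_field(data: bytes) -> list:
    # Use 31-byte chunks so every chunk is guaranteed to be a valid element (< P).
    return [int.from_bytes(data[i:i + 31], 'little') for i in range(0, len(data), 31)]

def pack_binary_field(data: bytes) -> list:
    # In a binary field of size 2**256, every 32-byte string is an element: no overhead.
    return [int.from_bytes(data[i:i + 32], 'little') for i in range(0, len(data), 32)]

data = bytes(32 * 1024)  # 32 KiB stand-in for shard block data
print(len(pack_prime_field(data)), len(pack_binary_field(data)))  # 1058 vs 1024 elements
```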

Though it does seem to me like polynomial commitment schemes are a superior alternative to SNARKs over Merkle trees; they're more "direct", rather than one construction wrapping over a completely foreign one and hence incurring arithmetization inefficiencies. And the assumptions in polynomial commitment schemes are definitely milder than the assumptions in SNARKs.

Block layout in the commitment data is still very flexible, and my intuition is that whatever inefficiencies remain would be the same regardless of the scheme used.

vbuterin commented 5 years ago

Idea from a conversation with Justin yesterday: if there are large segments of contiguous zero bytes, fill them with any of:

All of these objects are larger than 64 kB, so we can either randomly choose the indices from which to include data each time, or store indices and cycle through the data.

I actually like this. There's definitely something unclean about "grab as many bytes of this thing as possible", but it allows us to preserve the existing (quite clean!) structure of the crosslink data.
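A minimal sketch of the "randomly choose indices" variant of this filling rule (the function name and the hash-derived offset are illustrative choices, not part of the proposal):

```python
import hashlib

def fill_zero_segment(segment_len: int, source: bytes, seed: bytes) -> bytes:
    # Derive a pseudo-random starting offset into `source` from `seed`, then take
    # segment_len bytes, wrapping around the end of `source` if necessary.
    # The "store indices and cycle" variant would keep a persistent cursor instead.
    offset = int.from_bytes(hashlib.sha256(seed).digest(), 'little') % len(source)
    doubled = source + source
    return doubled[offset:offset + segment_len]

# Example: fill a 4096-byte zero region with a slice of some larger object.
filler = fill_zero_segment(4096, bytes(range(256)) * 1024, b'some_seed')
assert len(filler) == 4096
```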

JustinDrake commented 4 years ago

@vbuterin @dankrad Can we close this issue for staleness? (Trying to address old Github issues, and data availability research has progressed significantly since June 2019.)

dankrad commented 4 years ago

I'd say it can be closed. The data availability proof will now mostly follow from whatever we use for shard data commitments.
