I would like to make an alternative proposal: replace `list` by a type `List[A, max_length]`, which is still a dynamic list type but with maximum length `max_length`.
The reason why I would propose this is that I think it is preferable that the type is aware of its current dynamic length. The code for this is much easier to write, reason about, and check statically and dynamically. It is also consistent with the other changes we are making to SSZ (like unions and bitlists) which go into the direction of externalising typing work into SSZ and making it work "naturally" as you would expect from a modern programming language.
Merkleization would be like a Vector but with the length mixed in (in other words: it is the hash tree root where on the left there is a static vector padded with zeroes and on the right there is the length as a uint64).
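For concreteness, here is a minimal Python sketch of that merkleization, assuming SHA-256 as the hash, a power-of-two `max_length`, and input already packed into 32-byte chunks; names like `list_root` are illustrative, not from the spec:

```python
from hashlib import sha256

ZERO_CHUNK = b'\x00' * 32

def H(a: bytes, b: bytes) -> bytes:
    return sha256(a + b).digest()

def merkleize(chunks: list, limit: int) -> bytes:
    # Pad on the right with zero chunks up to the fixed limit (a power of 2),
    # then fold pairwise into a single binary Merkle root.
    assert len(chunks) <= limit and limit & (limit - 1) == 0
    nodes = chunks + [ZERO_CHUNK] * (limit - len(chunks))
    while len(nodes) > 1:
        nodes = [H(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def list_root(chunks: list, length: int, max_length: int) -> bytes:
    # Static-vector root of the zero-padded data on the left, the length
    # (a uint64, zero-extended to a full 32-byte chunk) mixed in on the right.
    return H(merkleize(chunks, max_length), length.to_bytes(32, 'little'))
```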
I can see how this can be valuable. But there is one question: do we want length mixed in as an item at the top, or as an element of the list? Huffman theory would say that mixing length in at the top is only optimal if half of all queries to the list are queries of its length, but that seems implausibly high. It would also say that reserving the first element of the array to represent length is only optimal if a length query is only as common as a query to a randomly selected element, which seems implausibly low. Which of these two extremes is less bad? The former is better from the PoV of changing things as little as possible, the latter is better from the PoV of simplicity.
Huffman theory would say that mixing length in at the top is only optimal if half of all queries to the list are queries of its length, but that seems implausibly high.
I'm not sure, but if we assume that Merkle proofs typically access one element, and every proof has to come with a proof of the length of the list, then actually half the accesses are to the length. (We could skip this if we use zero-checking logic instead of checking the length, but this might need nasty special cases for lists where zero is a valid value. Not what I would prefer from a simplicity point of view.)
and every proof has to come with a proof of the length of the list
Ah, I suppose if we want our list-get proofs to return what `list[i]` would actually return, which is `IndexError` if the index is out of bounds, then you're right. OK, I can see the rationale for mixing in at the top as in the status quo.
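To spell out the behavior being matched, a trivial sketch with hypothetical names: the verifier needs the proven length in hand before it can decide between returning a value and raising.

```python
def list_get_from_proof(i: int, proven_length: int, proven_value: bytes) -> bytes:
    # `proven_length` and `proven_value` are assumed to already be
    # authenticated against the list's root (length mixed in at the top).
    # Only with the length can we reproduce list[i]'s IndexError behavior.
    if i >= proven_length:
        raise IndexError(f'index {i} out of range for length {proven_length}')
    return proven_value
```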
Let's see if we can describe the new extended SSZ better / more minimally (or at least my idea of it; all of this is just an idea):
Any collection is defined as `<header><body>`.

- `<body>` is always defined as the concatenation of the serialized elements in the collection.
  - Bitfields require the `<body>` to be exactly bit-length bits, padded to the next multiple of 8, with the right-most byte left-padded to a full byte.
  - BigInts are the same as bitfields, but interpreted as a little-endian integer.

There are 4 different types, specialized in `<header>` and hash-tree-root definition:

- Static: header: empty (zero bytes), root: `merkle(pad_right(elements, M))`. Body size and element locations are known at compile time.
- Dynamic: header: `<offset>*`, root: `H(merkle(pad_right(elements, M)), len)`.
- Sparse: header: `<index>*` (indirectly: header: `<<index><offset>>*`), root: `merkle(pad_sparse(elements, indices, M))`.
- Union: header: `<index>`, root: `H(H(element), index)`.
Notes:

- `*` indicates zero or more elements, concatenated.
- `<index>`, `<offset>` and `<length>` are always 4-byte little-endian integers (i.e. `uint32`).
- Offsets always count the bytes up to (but not including) the element.
- `M` is defined as the power of 2 that serves as the collection limit.
- `merkle` chunkifies the input (partitions the elements into 32-byte chunks) if it is of a basic type or a bitfield, and computes the binary Merkle root of the given elements.
- `pad_right(elements, M)` pads the elements with zeroes to the given power of 2. **
- `pad_sparse` pads the gaps and the end to match the given power of 2. **
- **: optimized away within merkleization in real-world implementations.
The indirect sparse header could be `<index>*<offset>*` too, if we want to merkle just part of the header (`<index>*`) efficiently for basic presence proofs (still not quite hard inclusion proofs) constructed in limited contexts (contracts?).

`<body>`: encoders for bodies can be remixed with different header types to implement all the types easily.
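A rough Python sketch of how one of these headers could be built, say for the Dynamic case with variable-size elements; this is only an illustration of the `<header><body>` layout above, not normative, and it assumes offsets are measured from the start of the body:

```python
def serialize_dynamic_collection(serialized_elements: list) -> bytes:
    # <header> = one uint32 offset per element, counting the bytes up to
    #            (but not including) that element within the body.
    # <body>   = the concatenation of the serialized elements.
    header, body = b'', b''
    for element in serialized_elements:
        header += len(body).to_bytes(4, 'little')
        body += element
    return header + body
```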
I'm very confused by your post... at least it does not feel like a simplification to me ;) Have we defined sparse objects yet? Also, the bitfields part is not how it works (we pad them on the left).
Dynamic, root: H(len, merkle(pad_right(elements, M)))
Mixing in the length is always on the right
Have we defined sparse objects yet?
No, but I posted this serialization earlier in the chat, as it is consistent with offsets, and simple. But improvements are very welcome. (also see note on ordering of indices)
Also the bitfields is not how it works (we pad them on the left)
See edits, that is what I did previously. But that's not consistent with the current verify-bitfield behavior, which is like a little-endian int. Also, a big little-endian integer sounds more consistent to me. But I considered big-endian for formatting/reading reasons (I prefer it too, as there's no "gap" in the bits when you align a 9-bit integer in bytes).
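To make the 9-bit example concrete, here is a sketch of the little-endian interpretation (my own helper, not spec code): bit `i` lands in byte `i // 8`, so the padding sits in the high bits of the final byte rather than leaving a gap inside the value.

```python
def pack_bits_le(bits: list) -> bytes:
    # Little-endian bit order: bit i goes into byte i // 8 at position i % 8.
    # A 9-bit field takes 2 bytes; the 7 high bits of the last byte are
    # padding, so the "gap" sits at the very end of the encoding.
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

# pack_bits_le([1] * 9) == b'\xff\x01'
```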
Mixing in the length is always on the right
Whoops, wrote it a bit quick, but you get the idea
But that's not consistent with the current verify-bitfield behavior, which is like a little endian int.
Oh yes, forgot it's little endian. Then it should be on the right.
Let me add some information here to help others who are just arriving.
An SSZ partial is an object that can stand in for an SSZ object in any function, but which only contains some of the elements in the SSZ object. It also contains Merkle proofs that prove that the values included in the SSZ partial actually are the values from the original SSZ object; this can be verified by computing the Merkle root of the SSZ partial and verifying that it matches the root of the original SSZ object.
Source: https://github.com/ethereum/research/tree/master/ssz_research/partials
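For readers new to this: verifying such a proof is just walking a Merkle branch back up to the root. A minimal sketch, assuming SHA-256 and the generalized-index convention where the root is 1 and node `i` has children `2i` and `2i+1`:

```python
from hashlib import sha256

def verify_branch(leaf: bytes, branch: list, gindex: int, root: bytes) -> bool:
    # Walk from the leaf up to the root. At each level, the lowest bit of
    # the generalized index says whether the current node is a right child
    # (1) or a left child (0), i.e. on which side the sibling hashes in.
    node = leaf
    for sibling in branch:
        if gindex & 1:
            node = sha256(sibling + node).digest()
        else:
            node = sha256(node + sibling).digest()
        gindex >>= 1
    return node == root
```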
Addressed in #1180
Background reading: https://github.com/ethereum/eth2.0-specs/issues/1115
From my experience implementing SSZ partials (https://github.com/ethereum/eth2.0-specs/blob/ssz-impl-rework/test_libs/pyspec/eth2spec/utils/ssz/ssz_partials.py), I've come to the conclusion that the fact that paths do not correspond to specific generalized indices (as generalized indices depend on the depth of a tree which for dynamic-sized lists is dynamic) leads to a large amount of complexity in the SSZ partials implementation.
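To make the depth problem concrete before getting to the fix, a sketch of the two cases (helper names are mine): for a fixed-capacity vector the path-to-generalized-index mapping is a compile-time constant, while for a dynamic list it shifts whenever the length crosses a power of 2.

```python
def next_power_of_two(n: int) -> int:
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def vector_leaf_gindex(capacity: int, i: int) -> int:
    # Fixed-size vector of power-of-two capacity: the depth is static,
    # so leaf i always sits at generalized index capacity + i.
    return capacity + i

def list_leaf_gindex(length: int, i: int) -> int:
    # Dynamic list with the length mixed in at the top: the data subtree is
    # rooted at generalized index 2 and its depth grows with the length, so
    # the same path maps to different generalized indices as the list grows.
    capacity = next_power_of_two(length)
    return 2 * capacity + i

# list_leaf_gindex(4, 0) == 8, but after two appends
# list_leaf_gindex(6, 0) == 16: the proof for element 0 changes shape.
```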
`append` and `pop` and the complexities around rebalancing lists around powers of 2 are particularly problematic.

My proposed solution to this is as follows. From the perspective of SSZ hashing, we remove lists, so the only data types are (i) base types, (ii) fixed-size vectors, (iii) containers (we could also add unions by hashing a `Union[A, B ... Z]` identically to `Container(a=Vector[A, 1], b=Vector[B, 1] ... z=Vector[Z, 1])`).

All existing lists in the spec (the validator list in the state, transaction lists in blocks...) get converted to fixed-size lists of some size (for lists in blocks, we know the maximums already; for the validator set we can pick a very generous limit, e.g. `2**40`). Hence, from the perspective of hashing, all data types become constant-sized types, and so the generalized index that a given path (e.g. `state -> state.validator_registry[123514].pubkey`) corresponds to is always the same value.

However, in the SSZ vector data type, we now add a flag, "serialization type". To start off, the serialization types are FULL and DYNAMIC; a third one to be added later is SPARSE. A FULL vector is serialized like vectors are today; a DYNAMIC vector is serialized by serializing the items in the vector up until the highest nonzero item, the same way that a list is serialized today. This makes DYNAMIC vectors essentially equivalent to current lists in terms of serialization, except that a maximum length must be specified. A SPARSE vector would be serialized by prepending the list of items with an (item, position) table of all of the nonzero items. Current vectors would become vectors with a FULL serialization type; current lists would become vectors with a DYNAMIC or possibly SPARSE serialization type (if DYNAMIC is used, then it may make sense to add a validation rule in those cases that verifies that there is no non-empty object that appears after any empty object in the list).
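A sketch of what the FULL vs DYNAMIC distinction might look like for, say, a vector of uint64s (illustrative only; `values` is assumed to already hold the vector's full fixed capacity, with 0 as the default value):

```python
def serialize_full_vector(values: list) -> bytes:
    # FULL: serialize every item, exactly like today's fixed-size vectors.
    return b''.join(v.to_bytes(8, 'little') for v in values)

def serialize_dynamic_vector(values: list) -> bytes:
    # DYNAMIC: serialize items only up to the highest nonzero one, like
    # today's lists; hashing still pads to the full capacity, so the hash
    # tree root is identical under either serialization type.
    last = max((i + 1 for i, v in enumerate(values) if v != 0), default=0)
    return b''.join(v.to_bytes(8, 'little') for v in values[:last])
```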