ethereum / py-ssz

Python implementation of the Simple Serialize encoding and decoding
MIT License
32 stars 22 forks source link

Hash tree memoization #39

Open jannikluhn opened 5 years ago

jannikluhn commented 5 years ago

Problem

The tree hashing algorithm describes a hash function for SSZ data structures. For lists and containers, it is defined recursively: First, hash each element and, second, hash the results together. Thus, updating only a single element of a list or a field in a container does not require a full recalculation of the top level tree hash. We should take advantage of this to save unnecessary computation that can take quite long for big values (in particular, the beacon chain state). At the same time, we'd like our data structures to stay immutable.

Vitalik came up with a (mutable) example implementation here: https://github.com/ethereum/research/blob/2d3ed6e42087d5b14cdf107c897e8d3e5db3ee7a/ssz_hashable_list/hashable_list.py

Proposed Implementation

Container

ssz.Serializable should memoize its tree hash. ssz.copy should copy the tree hash as well.

List

We need a new class (maybe named, as Vitalik proposed it, HashableList) that similar to ssz.Serializable memoizes its tree hash, but in addition to that the full hash tree. It should be an immutable sequence (i.e. define __getitem__, __len__, __reversed__, __contains__, and __iter__), and have a copy method with the following interface:

def copy(
    replace: Dict[int: TElement] = None,
    insert: Dict[int: TElement] = None,
    remove: Dict[int: TElement] = None,
    prepend: Sequence[TElement] = None,
    append: Sequence[TElement] = None,
) -> HashableList[TElement]:
    ...

The new list will copy the elements of the old one, but replace/insert/remove/prepend/append some elements depending on the given parameters. The old hash tree should be copied as well, updating the necessary branches only. For safety, we should probably either only allow a single argument or enforce a certain order of the arguments so that its clear to what the indices refer to.

Usage example


state.copy(
    x=4,
    ys=state.ys.copy(
        replace(4, b"foo"),
    ),
)
hwwhww commented 5 years ago

Thanks for the write-up! I think it's the right way to rescue us from the immutable copy for huge data. /cc @pipermerriam, basically we will have trouble when serialize and tree hash the BeaconState with 4M validators. The benefit of tree hashing has been discussed here and here, and we have to find a way to build the tree structure inside SSZ.

pipermerriam commented 5 years ago

I'm aware of this problem space and I think @jannikluhn is correct that we likely need a formal API to be able to take advantage of this. I don't have opinions about specific implementation but will be happy to review one once it's ready or to weigh in on concepts for the API.