celestiaorg / celestia-specs

Celestia Specifications
Creative Commons Zero v1.0 Universal
51 stars 14 forks source link

Merkle trees misc cleanup and fixes #41

Closed liamsi closed 4 years ago

liamsi commented 4 years ago

https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#sparse-merkle-tree

For leaf node of leaf message m with key k, its value v is:

Better to use a different word than message here (otherwise this might be conflated with the other use of the word messages)

Also, it would help getting a quick understanding, if there was 2-3 simple to follow examples how an smt and its will look like (including trivial edge-case like all-empty tree, only one entry, and one realistic example with a few entries).

adlerjohn commented 4 years ago

Things to add:

liamsi commented 4 years ago

The annotated Merkle tree abstraction isn't suitable for prefixing to differentiate between internal and leaf nodes. Need to make that an explicit feature.

Independent of the prefixing, I'm not sure the annotated and the namespaced merkle tree hashes the correct thing.

Looking at how @musalbas defined the hashing in his paper:

We define a wrapper function nsHash(x) that produces hashes prefixed with namespace identifiers. A namespaced hash has the format minNs||maxNs||hash(x), where minNs is the lowest namespace identifier found in all the children of the node that the hash represents, and maxNs is the highest.

OK, and:

If x is a leaf, then minNs = maxNs = ns(x), as the hash contains only one leaf with a single namespace

Using the notation in the spec, my understanding is that this would result in:

n_min = m_1_l(m) = m.namespace_id
n_max = m_2_l(m) = m.namespace_id
v = n_min || n_max || h(serialize(m))

but we have:

n_min = m_1_l(m) = m.namespace_id
n_max = m_2_l(m) = m.namespace_id
v = h(serialize(m))

https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#namespace-merkle-tree

Similarly, for the inner nodes, I think the hash of should be something like: v = nsh(l, r) = n_min || n_max || h( l.v, r.v ) instead of currently: v = h(l, r) = h(l.n_min, l.n_max, l.v, r.n_min, r.n_max, r.v)

I know the spec's version is just a logical abstraction but at least it isn't obvious how both are compatible.

liamsi commented 4 years ago

Ah, while writing this up, I think I see how they can be seen as equivalent.

adlerjohn commented 4 years ago

The two ways of computing node values aren't identical but should provide the same guarantees maybe? I need to double-check, but if not the annotated Merkle tree abstraction can be changed to reflect the way NMTs are described in the paper.

liamsi commented 4 years ago

I think it's actually closer to the paper as it seem on 1st sight, with the difference that leaf nodes are "namespace-hashed" as n_min || n_max || h(m) in the paper and as h(serialize(m)) in the spec.

Something else I've noted: the spec makes the assumption that each leaf message m, somehow contains the namespace id via: m.namespace_id. Later, we speak of shares that are committed to the NMT, which do not explicitly have such a field: https://github.com/lazyledger/lazyledger-specs/blob/master/specs/data_structures.md#share

My understanding is that a share implicitly belongs to a namespace but reading through it again, the connection between share and namespace doesn't seem clear enough to me.

liamsi commented 4 years ago

I would suggest to remove the "Annotated Merkle Tree" section as it is not really needed as part of the LL specification and directly introduce the concrete and simpler special case we need, namely the NMT. Unless we foresee any use for the generalisation (the annotated tree), which I currently don't see.

musalbas commented 4 years ago

The min and max values of a node should be outside of the hash of the children, so that a client that wants to verify that they have received all the messages for their namespace should not need to download more data than they need to (specifically, the values of nodes that they aren't interested in).