AljoschaMeyer / bamboo

Creative Commons Attribution Share Alike 4.0 International
141 stars 4 forks source link

Causal guarantees #2

Closed mafintosh closed 5 years ago

mafintosh commented 5 years ago

Do you mind going into a bit deeper detail on why you consider the causal guarantees stronger with bamboo vs hypercore?

AljoschaMeyer commented 5 years ago

@mafintosh

The merkle tree construction is not inherently preserving causal ordering, I can freely choose the order in which I concatenate things before hashing. The trivial special case: If I have two messages A and B, I can create either the log A | B (| denoting concatenation) or B | A. Worse, this applies to larger feeds as well: Suppose I have a hypercore of four data items (D0, D2, D4, D6), the corresponding root is the merkle node M3. Now I could create four additional data items (D0', D2', D4', D6'), and create the corresponding merkle nodes, calling the root M3'. I can now create the feed D0' D2' D4' D6' D0 D2 D4 D6 with the root hash(M3' | M3), thus prepending to the log rather than appending.

I can't do this permutation at arbitrary points, since the size of the trees that make up the log must be (non-strictly) decreasing. But whenever two trees of equal size are merged, I get to freely choose the ordering, violating causality if I choose to do so.

To prevent this, you could use something like threaded authentication trees (section 5.3 of the paper), but as far as I'm aware hypercore doesn't do any of this. I'd be happy to be wrong about this.

(The threaded authentication trees as presented in the paper have O(n log(n)) space overhead, but there's a fairly simple optimization to get it back to O(n) by adding additional edges to inner nodes and thus covering more leaves per edge. I can go into more detail if you are interested in this, but I haven't done a writeup I could link you to and I haven't seen this optimization in the literature either.)

mafintosh commented 5 years ago

No, you cannot prepend to the Hypercore or reorder it, even in the even sized sub tree case It's all guaranteed by the tree hash of the Merkle tree that is constructed like so:

# a root is the top of a each "factor of two" subtrees the merkle tree consists of
treeHash = hash(root1.hash + root1.index + root1.size + root2.hash + ...)

The treeHash is the hash the secret key signs (ie the entire ordered history)

The important part here is peak.index as that encodes the position of the Merkle peak, making sure to lock that tree down in terms of causal time.

(See the actual impl here https://github.com/mafintosh/hypercore-crypto/blob/master/index.js#L62-L76)

For example let's say you have your example

      M3
   /      \
 /  \    /  \
D0  D1  D2  D3

The tree hash (ie the hash the keypair signs) is

treeHash = hash(M3 + 3 (the flat tree index of M3) + byteSizeOfTree)

If you append D0' the tree looks like this

              M''
       /             \
      M3             M3'
   /      \       /      \
 /  \    /  \   /  \    /  \
D0  D1  D2  D3 D0' D1' D2' D3'

With the tree hash looking like this:

treeHash = hash(M'' + 7 + byteSize)

Let me know if that makes sense or if I'm missing something. Perhaps an example with actual data makes it easier to grok.

The protocol paper def needs an update I can see, as that does not reflect the tree hashing stuff that has always been there (ETOOMANYDOCS :))

AljoschaMeyer commented 5 years ago

It's all guaranteed by the tree hash of the Merkle tree that is constructed like so: [...]

My concern are malicious authors. Are you kindly asking the author to use your code and to not lie about the index, or is there some verification logic to enforce it? I'll be satisfied when you can show me the code that rejects a hypercore where the indexes are inconsistent, i.e. the code that rejects going from

treeHash = hash(M3 + 3 + firstSize + Mfoo + 9 + secondSize + D2' + 12 + D2Size)
// up to me not knowing whether the trivial leaf is hashed or not

      M3    
   /      \     Mfoo   
 /  \    /  \   /  \
D0  D1  D2  D3 D0' D1' D2'

to

treeHash = hash(M'' + 7 + byteSize)

              M''
         /         \
       M3'          M3
   /      \       /    \
 /  \     / \   /  \   / \
D0' D1' D2' D3' D0 D1 D2 D3

This is not the only reordering a malicious author could attempt in this step, there is one for each tree merge, i.e. log_2(n) for a hypercore of n entries in the worst case (whenever the log length reaches a power of two). I'm not sure whether this puts the time complexity for validating (and thus receiving) a hypercore append-by-append in O(n * log(n)) or whether it is amortized O(n), but it's definitely not the plain O(n) that bamboo provides. And I'm getting a headache trying to sketch the verification logic in a way that correctly handles partially replicated logs.

(Probably obvious but still worth stating explicitly (the verification section of DEP-0002 doesn't): The verification code also has to check the sizes that the treeHash claims.)

mafintosh commented 5 years ago

The replicating peer would reject your above example as the Merkle roots don't add up anymore. It already has M3, so when trying to build M'' it will compute a different hash rejecting the incoming replication.

# On the replicating peer
# The peer asks for seq=4 - the author tries to cheat and sends the
# reordered tree you show

              M''
         /         \
       M3'          M3
   /      \       /    \*
 /  \     / \   /  \*  / \
D0' D1' D2' D3' D0 D1 D2 D3

# It receives data=D0 seq=4 plus the hashes and sizes I've marked * above
# Using those it reproduces M3 in the above tree.

# Using this M3, it produces the root M'' using the locally stored sibling hash
# Note that the M'' it produces is different than the author's as
# the author is trying to cheat!

# When the peer then produces the treeHash the signature wont match and the replication
# is rejected. It now also has proof that the author is malicious as they tried to sign
# bad data

The tree verification is run here https://github.com/mafintosh/hypercore/blob/master/index.js#L935

The treeHash is re-produced by the peer that is trying to replicate the data, no data is implicitly trusted from the author.

AljoschaMeyer commented 5 years ago

That's looking reassuring @mafintosh. One last (hopefully) question before I'm happy: Suppose a malicious author Mal decides on eight data items D1, ..., D8 that should make up their log. Mal decides that the first item should be D1, the second D2, the fifth D5, the sixth D6, the seventh D7 and the eights D8. Mal's aim is to not commit to whether D3 and D4 are the third and fourth or the fourth and third item respectively, until it is absolutely necessary.

A peer who has no prior information about Mal's log requests the fifth entry, so Mal supplies D5 and the paths through the merkle tree of eight leaf nodes. Does the additional data that Mal has to send to prove that claim already imply the ordering between D3 and D4? Or could Mal, if at a later point the same peer requested the fourth entry still decide whether to place D3 or D4 there?

mafintosh commented 5 years ago

Let me try and sketch out the example you mention

So the replicating peer, P will receive the following merkle tree from Mal on their initial request:

            # 
      /           \
     *            #
               /     \
              #      *
             / \
D1 D2 D3 D4 D5 D6* D7 D8

(* is the hashes it receives from Mal, and # are the ones it locally produces using the received ones and hash(D5 + size) - it also receives the size of each of the trees but i'll leave that out for simplicity)

It uses the top # to produce the treeHash and then check that the signature is correct. Since it's does not have any other data the signature matches and P stores the data PLUS all the # and * hashes.

Now in an evil act, Mal reorders D3 and D4 as you mention.

Let's say P asks for D3. Mal will send back the following tree

            # <-- root hash ($)
     /            \
     #            $
  /     \            
  *     # 
       / \
D1 D2 D3 D4* D5 D6 D7 D8

($ above means that it is a hash P already has, so it does not use the one Mal sends but rather the one it has locally).

Now in this case when it produces the top root hash it will compare this hash against the one it already produced from the previous request that it has stored. It won't match as D3 and D4 has been reordered changing the hash of that tree.

Hence P rejects the data and drops the connection to Mal.

Makes sense? :)

AljoschaMeyer commented 5 years ago

Yup, I'm satisfied now. Not really happy - proving that hypercore's scheme provides these guarantees seems to be far more finicky than for antimonotone graphs - but I'll remove the claim about hypercore's guarantees being weaker from the readme :)

Thanks for taking the time explaining this.

Bamboo still has the better time complexity of O(1) per append and verify operation though :P Then again, hypercore achieves better constants for the O(log(n)) certificate size.

mafintosh commented 5 years ago

Yep, pros and cons rule the world :) A final note is that hypercore uses the above design for maximum deduplication through the merkle trees.

I'll make sure the docs are updated.