centrifuge / pod

Go implementation of Centrifuge POD (Private Off-chain Data) node
https://docs.centrifuge.io/build/p2p-node/
MIT License

Merkle root calculation & proof verification #32

Closed lucasvo closed 6 years ago

lucasvo commented 6 years ago

Motivation

How we create merkle roots is core to our product. Building it in a way that encourages others to re-use it for their business documents could be a driver for adoption of Centrifuge beyond just our own documents.

Building this as an open source component & getting people to adopt it/play around with it outside of centrifuge could be a big help in making the code more stable and secure. This is still an empty space that we have the chance to fill.

Shippable Version 1

For a shippable version 1, I think the goal should be to support single level (no nested) trees. The proof format, the salt document and dynamic generation of the root however should already be implemented. There is a prototype for creating a tree out of a struct with golang's reflection library here: https://github.com/CentrifugeInc/go-centrifuge/pull/30

Specification

I suggest we create a separate library for merkle root calculation and proof verification that is reusable outside of the cent node, with the goal of open sourcing it and establishing a standard around it. It is essential that this library is extremely stable and doesn't allow any sort of exploit (such as creating a fake proof).

There are a few considerations when building this component:

Salts

For every value, we must be able to store a salt. There are two ways of doing this: creating custom types that store both the value and the salt:

message SaltedInt {
    int64 value = 1;
    bytes salt = 2;
}

message SaltedInvoice {
    SaltedInt amount = 1;
    ...
}

// JSON representation:
{
    "amount": {
        "value": 1000,
        "salt": "base64string"
    }
}

Alternatively there could be a second protobuf message that mirrors the structure of the message containing the data:

message Invoice {
    int64 amount = 1;
}

message InvoiceSalts {
    bytes amount = 1;
}

Advantages/Disadvantages

Embedding the salts has one big advantage: the code that calculates the tree is a lot easier to write when the salts and values are "next to each other". With reflection in Go especially, dynamically getting a value from a different object is a bit harder. Once this is implemented, however, there are no further drawbacks.

Separating the salts has one big advantage: the data object stays a clean object and could be directly exposed in a public API. In most scenarios I would expect the cent node to abstract the salts away and only expose values on the user API.

With embedded salts, there would need to be a conversion process in which a SaltedDocument is converted to a ValuesOnlyDocument for the public, so we would always have to maintain both a ValuesOnlyDocument and a SaltedDocument plus the conversion between them. It seems cleaner to keep the salts and values in separate documents the entire time rather than converting between the two; the extra overhead in the merkle root calculation should be manageable.

Generating the tree/root

When generating the tree, it is important that the tree is always generated in the same way and leaves no room for interpretation.

Example of a complicated merkle tree:

(image: complex_tree_example)

Therefore the following should be taken into account:

Proof format

There are a number of ways that a proof could be transmitted. I will discuss two potential ways to provide a proof:

A proof in Format B would look like this:

{
    "field": "invoice.customer.country",
    "salt": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "value": "USA",
    "nodes": ["hash1 (sibling)", "hash2 (uncle)", "hash3 (great uncle)", "root hash (great grandfather)"]
}

Advantages/Disadvantages

The first structure is significantly more complicated to validate and transmit. It also potentially reveals how many items an embedded field contains. If only a few values in a document need to be shared, the second format is more efficient; in situations where many fields are shared, transmitting the entire tree is more efficient.

The one big advantage of Format B is that it is pretty universally readable and easier to implement. In particular, validating a merkle proof in this format on chain is much simpler than having to construct and validate an entire tree. It is not just easier: it also allows someone to validate that a field matches without any knowledge of the overall structure of the document, which is nice to have when you want to validate something outside of a cent node.

pstehlik commented 6 years ago

Great write up! A couple thoughts:

lucasvo commented 6 years ago

> The ordering based on the protobuf field order: Would it make sense to keep ordering separate from the actual message? E.g. having some other standard that defines the object structure independently from the transmitted object? One could argue that the protobuf specification of the object already is this "object structure". However, if we want to make this library more open and standardized, one could argue for having the field order description laid out in a more accessible format: (I hate to say it) XML or whatever else describes objects in a way that any/most languages and tools will understand. Why did you ditch the alphabetical ordering by field name? Especially with the line items being just one leaf this could work again, no?

Yes it could, you are right. I don't think XML is a good idea, and not just because I don't like it ;) If we add an XML schema, then we are adding yet another place to put data; my goal was to have as few as possible.

> For the proofs: I haven't thought it all the way through, but where is proof validation more likely to happen, on-chain or off-chain? In the long run probably on-chain. The list approach for the proofs is pretty standard I think, based on what I remember from using the Tierion lib for example. We would have to include left/right concatenation as part of the proof layout though. Having multiple proofs (one per field) seems like a fair tradeoff compared to revealing more about the document structure.

Yes, and really it's only a matter of what format we implement first. There's nothing keeping us from supporting both at some point if we see that the proofs we create for a large number of fields are just too big.

> Data types: I would probably stick to standard types and have the separate representation of the document and salts. The implementation might be a bit trickier with reflection, but we shouldn't let implementation details dictate what the format looks like IMO. In this case wouldn't we have three representations? Invoice (original) + InvoiceSalts = SaltedInvoice?

I completely agree. It's cleaner and nicer to have a separate object for salts and the extra work is not that big.

pstehlik commented 6 years ago

@lucasvo
regarding the naming of "complex-merkle": what we are actually doing is providing proof of the authenticity of single fields within a semi-structured document (e.g. an invoice).

Currently, proofs of a whole document (e.g. a hash of a file) as well as merkle proofs of a set of flat-hierarchy values (a simple merkle tree) already exist and are implemented.

What we do that hasn't been done before is handle a semi-structured format: you don't know how many line items there are, or even whether all fields of a document are provided. On top of that we do proofs of the whole document as well as dynamic proofs of single values within that semi-structured document. I would look for a name that doesn't have "merkle" or other implementation details in it, and instead come up with a name that describes the outcome of our proofs. (Blockchain is also not called a linked merkle proof generator 😄)

I am wondering which area we want to focus on in terms of naming and features. Is it the timestamping of the whole doc (boring... already done...) or more about proofs that selectively reveal information while keeping document integrity intact (more interesting IMO)?

Maybe something along the lines of "granular document proofs" or "comprehensive document proofs". I like prefixes like "comprehensive", "detailed", or "precise" because they would also distinguish this sort of proof from the more simplistic ways of generating proofs.

So maybe "precise document proofs (of nested documents)" or "extensive proofs". Some ideas!

mr-optimax commented 6 years ago

on order: I like an independent merkle document structure. I think it needs to be a binary tree and would in any case only follow, not repeat, the protobuf/JSON/XML structure. I would definitely try to group logically connected fields like header, items, currency/amount, unit/quantity (here maybe even make it one field?)

lucasvo commented 6 years ago

@pstehlik Thomas names all of his open source libraries with two words that start with the same letter (task tiger, clean cat, timezone trout). Your "precise merkle proofs" reminds me of this. I think actually calling it "precise proofs" is a good idea.

precise proofs - create & validate field-level merkle proofs

@mr-optimax nested fields (e.g. a list of line items) will be grouped, top level fields will be sorted alphabetically.

pstehlik commented 6 years ago

@lucasvo regarding your comments

lucasvo commented 6 years ago

Closing this. Everything except for nested structures is implemented in precise-proofs.