centrifuge / pod

Go implementation of Centrifuge POD (Private Off-chain Data) node
https://docs.centrifuge.io/build/p2p-node/
MIT License

Merkle root calculation & proof verification #32

Closed lucasvo closed 6 years ago

lucasvo commented 6 years ago

Motivation

How we create merkle roots is core to our product. Building it in a way that encourages others to re-use it for their business documents could be a driver for adoption of Centrifuge beyond just our own documents.

Building this as an open source component & getting people to adopt it/play around with it outside of centrifuge could be a big help in making the code more stable and secure. This is still an empty space that we have the chance to fill.

Shippable Version 1

For a shippable version 1, I think the goal should be to support single level (no nested) trees. The proof format, the salt document and dynamic generation of the root however should already be implemented. There is a prototype for creating a tree out of a struct with golang's reflection library here: https://github.com/CentrifugeInc/go-centrifuge/pull/30

Specification

I suggest we create a separate library for merkle root calculation and proof verification that is reusable outside of the cent node, with the goal of open sourcing it and establishing a standard around it. It is essential that this library is extremely stable and doesn't allow any sort of exploit (such as creating a fake proof).

There are a few considerations when building this component:

Salts

For every value, we must be able to store a salt. There are two ways of doing this: creating custom types that store both the value and the salt:

message SaltedInt {
    int64 value = 1;
    bytes salt = 2;
}

message SaltedInvoice {
    SaltedInt amount = 1;
    ...
}

// JSON representation:
{
    "amount": {
        "value": 1000,
        "salt": "base64string"
    }
}

Alternatively there could be a second protobuf message that mirrors the structure of the message containing the data:

message Invoice {
    int64 amount = 1;
}

message InvoiceSalts {
    bytes amount = 1;
}

Advantages/Disadvantages

Embedding the salts has one big advantage: the code that calculates the tree is a lot easier to write when the salts and values are "next to each other". With reflection in Go especially, dynamically getting a value from a different object is a bit harder. Once this is implemented, however, there are no further drawbacks.

Separating the salts has one big advantage: the data object stays a clean object and could be directly exposed in a public API. In most scenarios I would expect the cent node to abstract the salts away and only expose values on the user API.

With embedded salts, there would need to be a conversion process in which a SaltedDocument is converted to a ValuesOnlyDocument for the public, so we would always have to maintain both a ValuesOnlyDocument and a SaltedDocument plus the conversion between them. It seems cleaner to keep the salts and values in separate documents the entire time rather than converting between the two; the extra overhead in the merkle root calculation should be manageable.

Generating the tree/root

When generating the tree, it is important that the tree is always generated in the same way and leaves no room for interpretation.

Example of a complicated merkle tree:

(image: complex_tree_example)

Therefore the following should be taken into account:

Proof format

There are a number of ways that a proof could be transmitted. I will discuss two potential ways to provide a proof:

A proof in Format B would look like this:

{
    "field": "invoice.customer.country",
    "salt": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "value": "USA",
    "nodes": ["hash1 (sibling)", "hash2 (uncle)", "hash3 (great uncle)", "root hash (great grandfather)"]
}

Advantages/Disadvantages

The first structure is significantly more complicated to validate and transmit. It also potentially reveals how many items an embedded field contains. If only a few values in a document need to be shared, the second format is more efficient; in situations where many fields are shared, transmitting the entire tree is more efficient.

The one big advantage of Format B is that it is pretty universally readable and easier to implement. In particular, validating a merkle proof in this format on chain is much simpler than having to construct and validate an entire tree. It is not just easier: it also allows someone to validate that a field matches without any knowledge of the overall structure of the document, which is nice to have when you want to validate something outside of a cent node.

pstehlik commented 6 years ago

Great write up! A couple thoughts:

lucasvo commented 6 years ago

> The ordering based on the protobuf field order: Would it make sense to keep ordering separate from the actual message? E.g. having some other standard that defines the object structure independently from the transmitted object? One could argue that the protobuf specification of the object already is this "object structure". However, if we want to make this library more open and standardized, one could argue for having the field order description laid out in a more accessible format: (I hate to say it) XML or whatever else describes objects in a way that any/most languages and tools will understand. Why did you ditch the alphabetical ordering by field name? Especially with the line items being just one leaf this could work again, no?

Yes it could, you are right. I don't think XML is a good idea, and not just because I don't like it ;) If we add an XML schema, then we are adding yet another place to put data; my goal was to have as few as possible.

> For the proofs: I haven't thought it all the way through, but where is proof validation more likely to happen, on-chain or off-chain? In the long run probably on-chain. The list approach for the proofs is pretty standard I think, based on what I remember from using the Tierion lib for example. We would have to include left/right concatenation as part of the proof layout though. Having multiple proofs (one per field) seems like a fair tradeoff compared to revealing more about the document structure.

Yes, and really it's only a matter of what format we implement first. There's nothing keeping us from supporting both at some point if we see that the proofs we create for a large number of fields are just too big.

> Data types: I would probably stick to standard types and have the separate representation of the document and salts. The implementation might be a bit trickier with reflection, but we shouldn't let implementation details dictate what the format looks like IMO. In this case wouldn't we have three representations? Invoice (original) + InvoiceSalts = SaltedInvoice?

I completely agree. It's cleaner and nicer to have a separate object for salts and the extra work is not that big.

pstehlik commented 6 years ago

@lucasvo
regarding the naming of "complex-merkle": what we are actually doing is providing proof of the authenticity of single fields within a semi-structured document (e.g. an invoice).

Currently, proofs of a whole document (e.g. a hash of a file) as well as merkle proofs of a set of flat-hierarchy values (a simple merkle tree) already exist and are implemented.

What we do that hasn't been done before is handle a semi-structured format: you don't know how many line items there are, or even whether all fields of a document are provided. On top of that we do proofs of the whole document as well as dynamic proofs of single values within that semi-structured document. I would look for a name that doesn't have "merkle" or other implementation details in it, and instead come up with a name that describes the outcome of our proofs. (Blockchain is also not called a linked merkle proof generator 😄)

I am wondering which area we want to focus on in terms of naming and features. Is it the timestamping of the whole doc (boring... already done...) or more about proofs that selectively reveal information while keeping document integrity intact (more interesting IMO)?

Maybe something along the lines of "granular document proofs" or "comprehensive document proofs". I like prefixes like "comprehensive", "detailed", or "precise" because they would also distinguish this sort of proof from the more simplistic ways of generating proofs.

So maybe "precise document proofs (of nested documents)" or "extensive proofs". Some ideas!

mr-optimax commented 6 years ago

on order: I like an independent merkle document structure. I think it needs to be a binary tree and would in any case only follow, not repeat, the protobuf/JSON/XML structure. I would definitely try to group logically connected fields like header, items, currency/amount, unit/quantity (here maybe even make it one field?)

lucasvo commented 6 years ago

@pstehlik Thomas names all of his open source libraries with two words that start with the same letter (task tiger, clean cat, timezone trout). Your "precise merkle proofs" reminds me of this. I think actually calling it "precise proofs" is a good idea.

precise proofs - create & validate field-level merkle proofs

@mr-optimax nested fields (e.g. a list of line items) will be grouped, top level fields will be sorted alphabetically.

pstehlik commented 6 years ago

@lucasvo regarding your comments

lucasvo commented 6 years ago

Closing this. Everything except for nested structures is implemented in precise-proofs.