bitcoin-sv / BRCs

Bitcoin Request for Comments

BSV Unified Merkle Path (BUMP) Format #56

Closed sirdeggen closed 1 year ago

sirdeggen commented 1 year ago

This pull request:

Proposes a new standard by creating a new markdown file in the appropriate directory and requests discussion and assignment of a BRC number

Summary

This is a proposal for a new unified compound merkle path format that supersedes all historic formats, capturing the best of each proposal and unifying them into one standard. It allows compound proofs, defines both a binary and a JSON encoding, and carries all the data needed for optimal storage and transport, without compromise.

tonesnotes commented 1 year ago

2023-10-15, Tone Engel, BRC_0074 Feedback

Meta feedback:

  1. Both formats (JSON and binary) should be optimized for generation by transaction processors which also happens to be convenient for proof validating clients. As specified here, the binary format is root to leaf while the JSON is leaf to root. Both should be leaf to root. Especially the binary.
  2. I really like that this standard's focus is all the data a client needs per block instead of per transaction.

Specific feedback:

  1. Block height. A height requiring 8 bytes is more than 80 thousand years in the future. Consider just a four-byte int?
  2. Tree height. "one below the root" is confusing? Tree Levels perhaps? BUMP tree levels === tree height - 1 as root is always computed.
  3. Clarify early that first level hashes are always the client's txids in this block.
  4. "The CMP does not contain the txid" => "The BUMP does not contain the txid"?
  5. Size of nLeaves is 1-9 bytes, not nLeaves x 1-9 bytes.
  6. Size of leaves is sum of leaf sizes.
  7. Size of leaf is 32 or 0 bytes. (A sketch of this accounting follows below.)
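
Taken together, items 5 through 7 imply per-level size accounting like the following sketch (my illustration of the corrected sizes; `varIntSize` is the standard CompactSize rule):

```javascript
// Size of one BUMP tree level: a VarInt leaf count, then per leaf a VarInt
// offset, a one-byte data flag, and a 32-byte hash unless the flag says
// nothing follows.
const varIntSize = (n) => n < 0xfd ? 1 : n <= 0xffff ? 3 : n <= 0xffffffff ? 5 : 9

function levelSize (leaves) {
  let size = varIntSize(leaves.length) // nLeaves: 1-9 bytes, once per level
  for (const { offset, hashFollows } of leaves) {
    size += varIntSize(offset) + 1 + (hashFollows ? 32 : 0)
  }
  return size
}

console.log(levelSize([{ offset: 3048, hashFollows: true }, { offset: 3051, hashFollows: false }])) // 41
```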
tonesnotes commented 1 year ago

"Size of leaf is 32 or 9 bytes" => A leaf is { offset, data flag, leaf hash }, Size of leaf hash is 32 or 0 bytes.

tonesnotes commented 1 year ago

Lol.... Finally realized what was bothering me. You aren't computing merkle roots from the txids for which this data is part of the proof, but rather from their tree-level siblings. If I actually wanted to start with a txid, I wouldn't know where to start. The first level must include BOTH the client transaction txids and their siblings (or *).

tonesnotes commented 1 year ago

Which you effectively allow for... Just call out explicitly that at level 0, some of the leaves are client txids, some are proof completing siblings, and some are both.

tonesnotes commented 1 year ago

"so we start with level 0 which is the txids themselves" => "so we start with level 0 which includes all of the txid's of interest to a client in this block and the txid's of additional transactions required to begin the merkle root computation".

dorzepowski commented 1 year ago

As @tonesnotes noted, there is a lack of information that at level 0 there would be both the txids for which the BUMP is generated and the siblings needed to calculate the Merkle root.

tonesnotes commented 1 year ago

Should we add a bit to the "data flag" byte for level zero to indicate whether the txid is a client txid? Pros: Adds certainty to client processing. Cons: Complicates spec and removes accidental obfuscation.

dorzepowski commented 1 year ago

Both formats (JSON and binary) should be optimized for generation by transaction processors which also happens to be convenient for proof validating clients. As specified here, the binary format is root to leaf while the JSON is leaf to root. Both should be leaf to root. Especially the binary.

I would agree with that statement. It's counterintuitive to have a different order in the binary and JSON formats, and I don't see any benefit to the reversed order in the binary format.

dorzepowski commented 1 year ago

Should we add a bit to the "data flag" byte for level zero to indicate whether the txid is a client txid?

If so, then we should have something similar in JSON, maybe a separate field at the root level of the object specifying the target txid and its offset.

But actually I'm not sure this is needed?

sirdeggen commented 1 year ago

2023-10-15, Tone Engel, BRC_0074 Feedback

Meta feedback:

  1. Both formats (JSON and binary) should be optimized for generation by transaction processors which also happens to be convenient for proof validating clients. As specified here, the binary format is root to leaf while the JSON is leaf to root. Both should be leaf to root. Especially the binary.

Agreed. The BUX team were already thrown off by the binary root-first order; I will make both leaf-first. My original concern was unfounded in hindsight.

Specific feedback:

  1. Block height. A height requiring 8 bytes is over 80+ thousand years in the future. Consider just a four byte int?

At the moment it's a VarInt, not 8 bytes.
A Uint32 would only be a one-byte reduction for current heights, and a few bytes' addition for the early blocks. On balance I prefer we stick with VarInt: it provides a good balance, and it is used across so many Bitcoin formats already, including elsewhere in this specification, that it will be well understood.
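
For reference, a minimal sketch of the VarInt (CompactSize) size rule in question; `varIntSize` is an illustrative helper, not spec text:

```javascript
// Bitcoin VarInt (CompactSize): n < 0xfd -> 1 byte; n <= 0xffff -> 0xfd + 2 bytes;
// n <= 0xffffffff -> 0xfe + 4 bytes; otherwise 0xff + 8 bytes.
const varIntSize = (n) => n < 0xfd ? 1 : n <= 0xffff ? 3 : n <= 0xffffffff ? 5 : 9

console.log(varIntSize(813706)) // 5 bytes for a current height, one more than a Uint32
console.log(varIntSize(100))    // 1 byte for an early block, three fewer than a Uint32
```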

  2. Tree height. "one below the root" is confusing? Tree Levels perhaps? BUMP tree levels === tree height - 1 as root is always computed.

Agreed. Let's just use the standard, which is to call it merkle tree height. That is a number which includes the root hash; we only use this to determine when to stop anyway, so as long as the parser knows that we are not including the root it should be ok.
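
For concreteness, the standard merkle arithmetic this relies on (an illustrative sketch, not spec text): a block of n transactions yields a tree of height ceil(log2(n)), which is the number of leaf sets a BUMP carries, the root itself always being computed:

```javascript
// Merkle tree height over n transactions; a BUMP stores this many levels of
// leaves and the verifier derives the root as the final working hash.
const treeHeight = (nTx) => Math.ceil(Math.log2(nTx))

console.log(treeHeight(3052)) // 12, consistent with the example later in this thread
```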

  3. Clarify early that first level hashes are always the client's txids in this block.

Good thinking, will update the initial paragraphs to include.

  1. "The CMP does not contain the txid" => "The BUMP does not contain the txid"?

Done by Ty.

  5. Size of nLeaves is 1-9 bytes, not nLeaves x 1-9 bytes.

Agreed. I think this was updated but if not will check.

  6. Size of leaves is sum of leaf sizes.
  7. Size of leaf is 32 or 0 bytes.

Updated, thanks

sirdeggen commented 1 year ago

Should we add a bit to the "data flag" byte for level zero to indicate whether the txid is a client txid?

If so, then we should have something similar to that in json maybe a separated field at the root level of object that is specifying target txid and it's offset.

But actually I'm not sure if this is actually needed?

I'll think about this feedback and make sure to highlight the conclusion when I update this PR. Thanks all very helpful so far.

tonesnotes commented 1 year ago

On flagging level zero txids as "client" or not:

  1. Without an indicator of "client" vs "for-proof" txids, it isn't possible to determine whether the encoding is valid.

  2. For the JSON encoding, the dual use of the string to encode either a hash or no-hash is a hack. When efficiency matters, the binary format should be used. The JSON format supports clarity and simplicity when efficiency doesn't matter. To be clear, the leaves could be encoded as:

    {
      "3048": {"hash": "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711"},
      "3049": {"hash": "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00"},
      "3050": {"hash": "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e"},
      "3051": {"flags": 1}
    },

if the sense of data flag bit 0 is flipped and the default flags value is zero.

This can then be extended to explicitly call out client txids with flags bit 1.

    {
      "3048": {"hash": "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711"},
      "3049": {"flags": 2, "hash": "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00"},
      "3050": {"flags": 2, "hash": "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e"},
      "3051": {"flags": 1}
    },

to indicate that only the txids at block offsets 3049 and 3050 are client txids.

tonesnotes commented 1 year ago

Alright, punt the validity point. That doesn't hold water. The only "invalidity" would be padding the tree with more data than required.

tonesnotes commented 1 year ago

Is it valid to include hashes not required by client txids? It seems it would be.

Transaction processors might prefer to always include the entire set of offsets for the higher tree levels for example.

ty-everett commented 1 year ago

I don't see a reason to distinguish between client and non-client TXIDs within this data structure.

It adds significant complexity, and if the TXID a particular verifier is interested in cannot be proven, the operation should always fail.

ty-everett commented 1 year ago

Is it valid to include hashes not required by client txids? It seems it would be.

Yes. Because this is a composite proof standard, there are a few reasons why the tree might contain additional data. Perhaps it is in the process of being constructed (during block creation), or perhaps subtrees are later going to merge into it.

Simplicity is our friend here. Previous specifications have become too complicated due to these things. Later on, robust vectors and implementations across languages will help solidify things.
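
To illustrate the composite idea, a hypothetical sketch (the `mergeBumps` helper is my own, not from the spec) of combining two BUMPs for the same block by taking the union of leaves at each level:

```javascript
// Merge two BUMP JSON objects for the same block by unioning the leaves at
// each level; any offset present in both must carry identical data.
function mergeBumps (a, b) {
  if (a.blockHeight !== b.blockHeight) throw new Error('BUMPs are for different blocks')
  const path = a.path.map((leaves, level) => ({ ...leaves, ...b.path[level] }))
  return { blockHeight: a.blockHeight, path }
}
```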

tonesnotes commented 1 year ago

Re client txids.

The workflow I foresee is signing up for automatic merkle proof delivery service from one or more transaction processors.

A block is found, and the processor reviews my outstanding txids not yet confirmed by them that are in the new block, constructs this data structure, and forwards it to me.

At this point, roughly half the txids at level zero will be siblings, not txids I care about, but I'll have to do a lookup on all of them to figure out which to update. Half my lookups will fail, which I'll expect. This may obscure some lookup failures I shouldn't ignore and doubles the work to be done.

If there was a transaction processor that didn't force me to do unnecessary work I'd be inclined to choose them, which creates pressure on this standard.
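
To make the extra work concrete, a hypothetical sketch of the wallet-side matching step without a client-txid flag (`myTxids` and the leaf shape are assumptions for illustration):

```javascript
// Without a client flag, every level-0 leaf must be probed against the
// wallet's own txid set, and roughly half of the probes are expected to miss.
function findMyTxids (bump, myTxids /* Set of txid hex strings */) {
  const mine = []
  for (const offset in bump.path[0]) {
    const leaf = bump.path[0][offset]
    if (leaf.hash !== undefined && myTxids.has(leaf.hash)) mine.push(leaf.hash)
  }
  return mine
}
```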

sirdeggen commented 1 year ago

@tonesnotes

With clarity rather than efficiency in mind, my recommendation would be to encode flags as booleans, only for level 0:

```json
{
    "3048": { "requested": true, "txid": "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711" },
    "3049": { "requested": false, "txid": "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00" },
    "3050": { "requested": true, "txid": "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e" },
    "3051": { "requested": false, "duplicate": true }
},
{
    "1524": "811ae75c80fecd27efff5ef272c2adf7edb6e535447f27a4087d23724f397106",
    "1525": "82520a4501a06061dd2386fb92fa5e9ceaed14747acc00edf34a6cecabcc2b26"
},
// ...etc.
```

ty-everett commented 1 year ago

I understand the need for indicating which ones are requested and which not. I agree with Deggen that in JSON, we should focus on expressing things clearly, and I like the approach of using booleans.

sirdeggen commented 1 year ago

Maybe the first set of leaves ought to be keyed by txid, to avoid the initial lookup, and the rest of the sets left as is, but add the boolean.

sirdeggen commented 1 year ago

```json
{
    "3048": { "requested": true, "txid": "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711" },
    "3049": { "txid": "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00" },
    "3050": { "requested": true, "txid": "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e" },
    "3051": { "duplicate": true }
},
{
    "1524": { "hash": "811ae75c80fecd27efff5ef272c2adf7edb6e535447f27a4087d23724f397106" },
    "1525": { "duplicate": true }
},
// ...etc.
```

sirdeggen commented 1 year ago

The problem I have with this now is that if it's a duplicate then it cannot be a requested or client txid.

sirdeggen commented 1 year ago

An alternative here would be to keep duplicate as a boolean and indicate whether something is a hash or a txid via the item label:

```json
{
    "3048": { "hash": "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711" },
    "3049": { "txid": "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00" },
    "3050": { "txid": "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e" },
    "3051": { "duplicate": true }
},
{
    "1524": { "hash": "811ae75c80fecd27efff5ef272c2adf7edb6e535447f27a4087d23724f397106" },
    "1525": { "duplicate": true }
},
// ...etc.
```

sirdeggen commented 1 year ago

New idea: send the index numbers of the ones which are "client txids".

```json
{
  "blockHeight": 813706,
  "requested": ["3049", "3050"],
  "path": [
    {
      "3048": "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711",
      "3049": "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00",
      "3050": "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e",
      "3051": "duplicate"
    },
    {
      "1524": "811ae75c80fecd27efff5ef272c2adf7edb6e535447f27a4087d23724f397106",
      "1525": "82520a4501a06061dd2386fb92fa5e9ceaed14747acc00edf34a6cecabcc2b26"
    }
    // ...etc.
  ]
}
```

sirdeggen commented 1 year ago

Another attempt

```json
[
  {
    "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711": {
      "index": "3048",
      "requested": false
    },
    "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00": {
      "index": "3049",
      "requested": true
    },
    "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e": {
      "index": "3050",
      "requested": true
    },
    "duplicate": "3051"
  },
  {
    "1524": "811ae75c80fecd27efff5ef272c2adf7edb6e535447f27a4087d23724f397106",
    "1525": "82520a4501a06061dd2386fb92fa5e9ceaed14747acc00edf34a6cecabcc2b26"
  },
  {
    "763": "duplicate"
  }
  // ...etc.
]
```

sirdeggen commented 1 year ago

@tonesnotes going with the hearted one above unless I hear otherwise. I like your binary encoding suggestion of using it as another flag.

UPDATED to reflect a flag for each case, requested and duplicate, as Tone suggested:

| bits | byte | meaning |
|-----------|------|------------------------------------------|
| 0000 0000 | 00 | data follows, not a client txid |
| 0000 0001 | 01 | nothing follows, duplicate working hash |
| 0000 0010 | 02 | data follows, and is a client txid |
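
A minimal sketch of how a parser might interpret that byte (the constant names are mine, not from the spec):

```javascript
const DUPLICATE_WORKING_HASH = 0x01 // bit 0: no hash follows, duplicate the working hash
const CLIENT_TXID = 0x02            // bit 1: the hash that follows is a client txid

function decodeLeafFlags (flags) {
  return {
    duplicate: (flags & DUPLICATE_WORKING_HASH) !== 0,
    clientTxid: (flags & CLIENT_TXID) !== 0,
    hashFollows: (flags & DUPLICATE_WORKING_HASH) === 0
  }
}

console.log(decodeLeafFlags(0x02)) // { duplicate: false, clientTxid: true, hashFollows: true }
```
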
tonesnotes commented 1 year ago

I can get behind either the flags expanded in JSON as named booleans, or as "flags" with a numeric value.

Perhaps flip the sense of the "data follows" bit? Make it explicitly "duplicate working hash - no hash follows". Then the flag values become just zero, one, or two. There is no three, as a client TXID is never also a duplicated hash.

tonesnotes commented 1 year ago

Here are the changes I made to the sample code to experiment with these changes:

```javascript
const { createHash } = require('crypto')
const { Br, Bw } = require('bsv')

// Displaying hashes as hex strings in reverse byte order is a matter of convention with respect to txids. 
// The functions below handle the conversions such that when we "hash()" something, we are running sha256 - 
// digesting the reverse bytes of a hex string, and returning the reverse bytes encoded as a hex string.
const hexRevToBuf = (str) => Buffer.from(str, 'hex').reverse()
const bufRevToHex = (buf) => Buffer.from(buf.toString('hex'), 'hex').reverse().toString('hex')
const hash = (str) => bufRevToHex(createHash('sha256').update(createHash('sha256').update(hexRevToBuf(str)).digest()).digest())

// Parse a binary BUMP from a hex string into its JSON structure.
function bumpHexToJSON(str) {
  const reader = new Br()
  reader.buf = Buffer.from(str, 'hex')
  let blockHeight = reader.readVarIntNum()
  let treeHeight = parseInt(reader.read(1).toString('hex'), 16)
  let path = Array(treeHeight).fill(0).map(() => ({}))
  let flags, offset, hash, nLeavesAtThisHeight, leaf
  for (let level = 0; level < treeHeight; level++) {
    nLeavesAtThisHeight = reader.readVarIntNum()
    while (nLeavesAtThisHeight) {
      offset = reader.readVarIntNum()
      flags = parseInt(reader.read(1).toString('hex'), 16)
      leaf = flags === 0 ? {} : { flags }
      if ((flags & 1) === 0) {
        leaf.hash = reader.read(32).reverse().toString('hex')
      }
      path[level][offset] = leaf
      nLeavesAtThisHeight--
    }
  }
  return { blockHeight, path }
}

// Serialize the JSON structure into the binary BUMP hex format.
function bumpJSONtoHex({ blockHeight, path }) {
  const bw = new Bw()
  bw.writeVarIntNum(blockHeight)
  let treeHeight = path.length
  bw.writeUInt8(treeHeight)
  for (let level = 0; level < treeHeight; level++) {
    let nLeaves = Object.keys(path[level]).length
    bw.writeVarIntNum(nLeaves)
    for (const offset in path[level]) {
      bw.writeVarIntNum(Number(offset))
      const leaf = path[level][offset]
      const flags = Number(leaf["flags"] || 0)
      bw.writeUInt8(flags)
      if ((flags & 1) === 0)
        bw.write(hexRevToBuf(leaf["hash"]))
    }
  }
  return bw.toBuffer().toString('hex')
}

function calculateMerkleRootFromBUMP(bump, txid) {
  // Find the index of the txid at the lowest level of the Merkle tree
  let index
  for (let i in bump.path[0]) {
    if (txid === bump.path[0][i].hash) index = Number(i)
  }
  if (index === undefined) throw Error(`The BUMP does not contain the txid: ${txid}`)
  // Calculate the root using the index as a way to determine which direction to concatenate.
  let workingHash = txid
  bump.path.forEach((leaves, height) => {
    const offset = index >> height ^ 1
    const leaf = leaves[offset]
    if (!leaf) throw new Error(`We do not have a hash for this index at height: ${height}`)
    if (((leaf['flags'] || 0) & 1) === 1) {
      workingHash = hash(workingHash + workingHash)
    } else if (offset % 2) {
      workingHash = hash(leaf['hash'] + workingHash)
    } else {
      workingHash = hash(workingHash + leaf['hash'])
    }
  })
  return workingHash
}

const json = {
  "blockHeight": 813706,
  "path": [
    {
      "3048": {hash: "304e737fdfcb017a1a322e78b067ecebb5e07b44f0a36ed1f01264d2014f7711"},
      "3049": {flags: 2, hash: "d888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00"},
      "3050": {flags: 2, hash: "98c9c5dd79a18f40837061d5e0395ffb52e700a2689e641d19f053fc9619445e"},
      "3051": {flags: 1}
    },
    {
      "1524": {hash: "811ae75c80fecd27efff5ef272c2adf7edb6e535447f27a4087d23724f397106"},
      "1525": {hash: "82520a4501a06061dd2386fb92fa5e9ceaed14747acc00edf34a6cecabcc2b26"}
    },
    { "763": {flags: 1} },
    { "380": {hash: "858e41febe934b4cbc1cb80a1dc8e254cb1e69acff8e4f91ecdd779bcaefb393"} },
    { "191": {flags: 1} },
    { "94": {hash: "f80263e813c644cd71bcc88126d0463df070e28f11023a00543c97b66e828158"} },
    { "46": {hash: "f36f792fa2b42acfadfa043a946d4d7b6e5e1e2e0266f2cface575bbb82b7ae0"} },
    { "22": {hash: "7d5051f0d4ceb7d2e27a49e448aedca2b3865283ceffe0b00b9c3017faca2081"} },
    { "10": {hash: "43aeeb9b6a9e94a5a787fbf04380645e6fd955f8bf0630c24365f492ac592e50"} },
    { "4": {hash: "45be5d16ac41430e3589a579ad780e5e42cf515381cc309b48d0f4648f9fcd1c"} },
    { "3": {flags: 1} },
    { "0": {hash: "d40cb31af3ef53dd910f5ce15e9a1c20875c009a22d25eab32c11c7ece6487af"} }
  ]
}

const bump0 = bumpJSONtoHex(json)
console.log(bump0)
```
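
And a quick round-trip check, reusing the functions above (a usage sketch; the expected root would be compared against block 813706's header):

```javascript
// hex -> JSON should reproduce the original structure, and the root computed
// from a client txid should match the block's merkle root.
const recovered = bumpHexToJSON(bump0)
const root = calculateMerkleRootFromBUMP(recovered,
  'd888711d588021e588984e8278a2decf927298173a06737066e43f3e75534e00')
console.log(root) // compare against the merkle root in block 813706's header
```
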
tonesnotes commented 1 year ago

And the corresponding binary expansion:

### Bytewise Breakdown
```javascript
fe8a6a0c00 // blockHeight (813706), VarInt
0c // treeHeight (12), byte
// Level 0, client TXIDs and sibling TXIDs (TXIDs required only to compute internal tree hash).
04 // nLeaves, VarInt
fde80b // offset, VarInt
00 // flags
11774f01d26412f0d16ea3f0447be0b5ebec67b0782e321a7a01cbdf7f734e30 // hash
fde90b // offset VarInt
02 // flags = CLIENT_TXID
004e53753e3fe4667073063a17987292cfdea278824e9888e52180581d7188d8 // hash
fdea0b // offset VarInt
02 // flags = CLIENT_TXID
5e441996fc53f0191d649e68a200e752fb5f39e0d5617083408fa179ddc5c998 // hash
fdeb0b // offset VarInt
01 // flags = DUPLICATE_WORKING_HASH
// Level 1, internal merkle tree hashes
02 // nLeaves, VarInt
fdf405 // offset, VarInt
00 // flags
0671394f72237d08a4277f4435e5b6edf7adc272f25effef27cdfe805ce71a81 // hash
fdf505 // offset VarInt
00 // flags
262bccabec6c4af3ed00cc7a7414edea9c5efa92fb8623dd6160a001450a5282 // hash
// Level 2, internal merkle tree hashes
01 // nLeaves VarInt at level 2
fdfb02 // offset VarInt
01 // flags = DUPLICATE_WORKING_HASH
// Level 3, internal merkle tree hashes
01 // nLeaves VarInt at level 3
fd7c01 // offset VarInt (three hundred and eighty)
00 // flags
93b3efca9b77ddec914f8effac691ecb54e2c81d0ab81cbc4c4b93befe418e85 // hash
// Level 4, internal merkle tree hashes
01 // nLeaves VarInt at level 4
bf // offset VarInt
01 // flags = DUPLICATE_WORKING_HASH
// Level 5, internal merkle tree hashes
01 // nLeaves VarInt at level 5
5e // offset VarInt
00 // flags
5881826eb6973c54003a02118fe270f03d46d02681c8bc71cd44c613e86302f8 // hash
// Level 6, internal merkle tree hashes
01 // nLeaves VarInt at level 6
2e // offset VarInt
00 // flags
e07a2bb8bb75e5accff266022e1e5e6e7b4d6d943a04faadcf2ab4a22f796ff3 // hash
// Level 7, internal merkle tree hashes
01 // nLeaves VarInt at level 7
16 // offset VarInt
00 // flags
8120cafa17309c0bb0e0ffce835286b3a2dcae48e4497ae2d2b7ced4f051507d // hash
// Level 8, internal merkle tree hashes
01 // nLeaves VarInt at level 8
0a // offset VarInt
00 // flags
502e59ac92f46543c23006bff855d96f5e648043f0fb87a7a5949e6a9bebae43 // hash
// Level 9, internal merkle tree hashes
01 // nLeaves VarInt at level 9
04 // offset VarInt
00 // flags
1ccd9f8f64f4d0489b30cc815351cf425e0e78ad79a589350e4341ac165dbe45 // hash
// Level 10, internal merkle tree hashes
01 // nLeaves VarInt at level 10
03 // offset VarInt
01 // flags = DUPLICATE_WORKING_HASH
// Level 11, internal merkle tree hashes
01 // nLeaves VarInt at level 11
00 // offset VarInt
00 // flags
af8764ce7e1cc132ab5ed2229a005c87201c9a5ee15c0f91dd53eff31ab30cd4 // hash
```

sirdeggen commented 1 year ago

I like that, adding to PR.