bitcoin-sv / BRCs

Bitcoin Request for Comments

[DISCUSSION] - BUMP format JSON format consistency and easier for statically typed languages #58

Closed sirdeggen closed 1 year ago

sirdeggen commented 1 year ago

BRC ID

74

Discussion

In short: update the BUMP format so that the path is an array of arrays rather than an array of objects, and use an explicit "offset" key rather than using the offset value itself as a key.

Update this:

path: [
{
   "20": { hash: '0dc75b4efeeddb95d8ee98ded75d781fcf95d35f9d88f7f1ce54a77a0c7c50fe' },
   "21": { txid: true, hash: '3ecead27a44d013ad1aae40038acbb1883ac9242406808bb4667c15b4f164eac' }
},
 //  ...etc.
]

To this:

path: [
    [
      {
        offset: 20,
        hash: '0dc75b4efeeddb95d8ee98ded75d781fcf95d35f9d88f7f1ce54a77a0c7c50fe'
      },
      {
        offset: 21,
        txid: true,
        hash: '3ecead27a44d013ad1aae40038acbb1883ac9242406808bb4667c15b4f164eac'
      }
    ]
    //  ...etc.
]

Thanks to @shruggr for the feedback and suggestion.

Motivation fleshed out: Using values as keys in JSON can make serialization and deserialization difficult in statically typed programming languages because these languages rely on compile-time type checking. Statically typed languages require that the data types of variables, including keys in a JSON object, be known at compile-time.

When values are used as keys in a JSON object, the set of keys is data-dependent rather than fixed, so a statically typed language cannot map the object onto a single known type at compile time. This ambiguity can result in errors during serialization or deserialization.

For example, if a statically typed language expects a specific data type for a key but encounters a value that doesn't match that expected type during deserialization, it may throw an error or fail to properly deserialize the JSON object. Similarly, during serialization, a statically typed language may have trouble determining the type of a key's value, leading to issues in generating the correct JSON output.

To overcome this difficulty, statically typed programming languages typically require keys to be of a specific known data type, such as a string or a numeric value. This ensures that the type of the key can be properly inferred and validated during compilation, making serialization and deserialization more straightforward and reliable.

sirdeggen commented 1 year ago

Issues arose when attempting to create and then parse a JSON BUMP in golang. Updating in this way would let us order each inner array by offset as a convention, keeping the BUMP format in a consistent order; that in turn means we could, for example, use a checksum to verify that nothing has changed.

sirdeggen commented 1 year ago

@wregulski pushing back on this - open to the possibility that we should revert back to offset as key. Thoughts here please so that people can see the arguments in favor of each?

sirdeggen commented 1 year ago

I have both options working here now: https://github.com/libsv/go-bc/pull/73

Whatever concern was originally raised may not actually be a problem.

sirdeggen commented 1 year ago

Comparing the two approaches, where option 1 = slice of maps and option 2 = slice of slices:

goos: darwin
goarch: amd64
pkg: github.com/libsv/go-bc
cpu: Intel(R) Xeon(R) W-2140B CPU @ 3.20GHz
BenchmarkCalculateRootOption1-16          101724         11349 ns/op        6846 B/op         96 allocs/op
BenchmarkCalculateRootOption2-16          121094          9840 ns/op        4328 B/op         93 allocs/op
PASS
ok      github.com/libsv/go-bc  2.686s
dorzepowski commented 1 year ago

Benchmark comparison

I made more detailed benchmarks and created a comparison: benchmark_comparison.txt:

                                                               │ BenchmarkBUMPasArrayOfArrays.txt │    BenchmarkBUMPwithOffsetKey.txt    │
                                                               │              sec/op              │    sec/op     vs base                │
BUMP/from_string-12                                                                   1.504µ ± 0%    1.943µ ± 0%  +29.19% (p=0.000 n=20)
BUMP/from_json-12                                                                     4.215µ ± 0%    4.569µ ± 0%   +8.39% (p=0.000 n=20)
BUMP/to_string-12                                                                     1.083µ ± 0%    1.846µ ± 0%  +70.45% (p=0.000 n=20)
BUMP/to_bytes-12                                                                      766.7n ± 0%   1522.0n ± 0%  +98.51% (p=0.000 n=20)
BUMP/calculate_merkle_root-12                                                         1.643µ ± 0%    1.917µ ± 0%  +16.71% (p=0.000 n=20)
BUMP/receive_hex_bump_(decode_&_merkle_root_calculation)-12                           3.225µ ± 0%    3.933µ ± 0%  +21.95% (p=0.000 n=20)
BUMP/use_hex_bump_(encode_&_decode_&_calculate_merkle_root)-12                        4.360µ ± 0%    5.906µ ± 0%  +35.46% (p=0.000 n=20)
geomean                                                                               1.985µ         2.723µ       +37.17%

If I didn't make any mistakes, this comparison shows that the array-of-arrays BUMP is faster in all cases.

Disclaimer: order

The array-of-arrays version assumes the BUMP is already sorted: code that builds the structure for later serialisation is responsible for ensuring the leaves are in the correct order.

Conclusion

Although the array-of-arrays version seems much faster, it makes a specific assumption about the order of the leaves, and nothing enforces that order, neither in the BRC nor in the code itself. If a client builds the struct in the wrong order, the serialised version will also be in the wrong order, whereas the offset-key version always sorts before serialisation. Because of that, in my opinion we can't rely on the current benchmark results until the behaviour of both implementations is unified.

Source code

Sources can be found on my branch in the 4chain fork, in the bumpbench directory.

Besides that, I did a small refactoring and moved the files for the BUMP implementations into separate packages. This made it easier to create the benchmarks and to make sure I didn't make a mistake in them.

sirdeggen commented 1 year ago

The conclusion from today's discussion is:

  1. Performant systems will use the binary format, so JSON performance is not that important.
  2. Array of arrays is easier to read and understand at a glance.
  3. Let's add explicit ordering to all formats, such that lower offsets come first within each level. This feels natural and allows matching data by checksum, since there would be only one correct encoding.