dcSpark / cardano-multiplatform-lib

Rust implementation of Cardano
MIT License

Non-canonical CBOR Serialization (Optional) #356

Closed solidsnakedev closed 2 months ago

solidsnakedev commented 2 months ago

Tools like Aiken and Plutarch add an extra underscore _ to items during CBOR serialization.

This inconsistency becomes problematic when serialized data is hashed, as the resulting hash differs between tools.

In Aiken, the serialization below will produce a CBOR value like [_ <hashed_value>], whereas CML's serialization would produce [<hashed_value>] without the underscore. Example:

let datum_tag =
  out_ref
    |> serialise_data
    |> blake2b_256

Even though CML follows RFC 7049 Section 3.9, the issue has resulted in some developers discarding CML in favour of other tools that maintain the non-canonical serialization.
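For illustration, a minimal sketch (assuming the blakejs npm package) of how the same Constr 0 [h'DEADBEEF'] datum hashes differently depending on its encoding:

import { blake2bHex } from "blakejs";

const hexToBytes = (hex: string): Uint8Array =>
  new Uint8Array(hex.match(/.{2}/g)!.map((b) => parseInt(b, 16)));

// Constr 0 [h'DEADBEEF'] with an indefinite-length field list (Aiken/Plutus style)
const indefinite = hexToBytes("d8799f44deadbeefff");
// the same datum with a definite-length field list (canonical)
const definite = hexToBytes("d8798144deadbeef");

console.log(blake2bHex(indefinite, null, 32)); // one 32-byte datum hash...
console.log(blake2bHex(definite, null, 32));   // ...and a different one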

CML should offer an option to choose between canonical and non-canonical serialization to allow compatibility across different tools.

rooooooooob commented 2 months ago

We already offer exact round-trip encoding with to_cbor_bytes() (this will give the same bytes you fed into from_cbor_bytes() to create it, if you haven't changed anything, and will try its best to maintain the same kind of format when possible even if changes are made), and also canonical CBOR bytes using to_canonical_cbor_bytes().
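A minimal sketch of that difference, assuming the CML node.js WASM package and the from_cbor_hex / to_cbor_hex / to_canonical_cbor_hex names that cddl-codegen generates:

import * as CML from "@dcspark/cardano-multiplatform-lib-nodejs";

// an indefinite-length encoding of Constr 0 [h'DEADBEEF']
const original = "d8799f44deadbeefff";
const datum = CML.PlutusData.from_cbor_hex(original);

console.log(datum.to_cbor_hex());           // round-trips the exact input: d8799f44deadbeefff
console.log(datum.to_canonical_cbor_hex()); // canonical form: d8798144deadbeef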

> This inconsistency becomes problematic when serialized data is hashed, as the resulting hash differs between tools.

This is exactly why we overhauled cddl-codegen and CML to support round-trip deserialization (the old behavior was like CSL's). Is it breaking anywhere? I haven't done a full sweep of all networks in half a year, I think (could be wrong), but when I did, every block round-tripped fine.

I'm a little confused about what the underscore here is. Why is it added, and what does that have to do with canonical vs non-canonical CBOR? That only impacts things like integer/length widths, definite vs indefinite lengths, and map key ordering.

Maybe I'm just not understanding what the underscore is meant to be.

What tools are they using instead, and how do they handle it? I know CSL just disregards any encoding details from the deserialization and serializes using an arbitrary format (mostly canonical with a couple changes).

rooooooooob commented 2 months ago

Is this about when you're creating something from scratch (not deserialized)? Generally that's not great for hashing and you should always be working with direct deserialization or raw bytes instead of hoping two tools line up when creating both separately. I think that's what all the safe hash/original bytes stuff in cardano node is for.

You can change the FooEncoding struct for each Foo to tweak the serialization to use the exact CBOR format you want (note: this is not exposed to the WASM API - only rust). If there is a common format for something that many users would want we could also provide a method that automatically sets that like maybe PlutusDatum::encode_like_cardano_node() or something along those lines.

solidsnakedev commented 2 months ago

AFAIK the underscore is an optional indicator, https://datatracker.ietf.org/doc/html/rfc7049#section-6.1. I have reports that some dapps like vify, muesliswap and wingriders are having issues.

The original lucid, which uses an internal CML fork, works with the non-canonical format and outputs CBOR like in the attached image.

solidsnakedev commented 2 months ago

> Is this about when you're creating something from scratch (not deserialized)? Generally that's not great for hashing and you should always be working with direct deserialization or raw bytes instead of hoping two tools line up when creating both separately. I think that's what all the safe hash/original bytes stuff in cardano node is for.
>
> You can change the FooEncoding struct for each Foo to tweak the serialization to use the exact CBOR format you want (note: this is not exposed to the WASM API - only rust). If there is a common format for something that many users would want we could also provide a method that automatically sets that like maybe PlutusDatum::encode_like_cardano_node() or something along those lines.

Ideally it would be great to have something like to_cbor_hex_non_canonical(), maybe.

rooooooooob commented 2 months ago

It says those are for the diagnostic notation, not for the CBOR bytes itself, if I'm understanding this correctly. The diagnostic notation is a human-readable way to write what the underlying bytes are, since something written as 1 in the notation could be encoded as 1 byte (inlined in the type byte), 2 bytes (type byte + 1-byte int), 3 bytes, 5 bytes, or 9 bytes (type byte + 8-byte int).
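For a concrete example, all of these standard CBOR byte sequences decode to the integer 1; only the first is canonical:

01                    (1 byte: value inlined in the type byte)
18 01                 (2 bytes: type byte + uint8)
19 0001               (3 bytes: type byte + uint16)
1a 00000001           (5 bytes: type byte + uint32)
1b 0000000000000001   (9 bytes: type byte + uint64)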

Are you ending up with different bytes? Could you post them along with how you got them (e.g. CML -> other tool, or other tool -> CML)?

rooooooooob commented 2 months ago

> Ideally it would be great to have something like to_cbor_hex_non_canonical(), maybe.

I think there must be some confusion here. There are a combinatorial number of ways to have non-canonical encodings. Something complex like a whole transaction could have millions of ways (maybe billions when you consider the strings/bytes in them! indefinite string/bytes serialization options can get pretty wild).

The default to_cbor_hex() is already potentially non-canonical - it will (or should! if there is a bug here please post the CBOR) use whatever encoding you tell it to. If you create the types from scratch, that will be canonical; if you deserialized from bytes, it will be precisely the encoding that was used for deserialization.

solidsnakedev commented 2 months ago

This is the output from lucid-cardano using the internal CML fork:

import * as LucidCardano from "lucid-cardano";
console.log(LucidCardano.Data.to(new LucidCardano.Constr(0, ["deadbeef"])));

d8799f44deadbeefff

This is the output from lucid-evolution using dcSpark CML:

import * as LucidEvolution from "@lucid-evolution/lucid";
console.log(LucidEvolution.Data.to(new LucidEvolution.Constr(0, ["deadbeef"])));

d8798144deadbeef

You can paste the above into https://cbor.nemo157.com to get the diagnostic notation.
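For reference, a decoder should show roughly this diagnostic notation for the two outputs (the _ marks the indefinite-length array):

d8799f44deadbeefff  ->  121([_ h'DEADBEEF'])
d8798144deadbeef    ->  121([h'DEADBEEF'])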

rooooooooob commented 2 months ago

You should not be creating the two separately in two different tools and expecting the hash to line up, as there are myriad ways the two could differ for any non-trivial CBOR structure. Hashing CBOR data should always be done on the original bytes. If you need to use CML, you can create it in lucid, then use PlutusData.from_cbor_bytes(serializedBytesFromLucid) to create it using CML, which will remember the encoding information that was passed in; round-tripping it will then give you those original d8799f44deadbeefff bytes.

try:

import * as LucidCardano from "lucid-cardano";
let datum1 = LucidCardano.Data.to(new LucidCardano.Constr(0, ["deadbeef"]));
let bytes = datum1.to_cbor_bytes(); // not sure what it's called in lucid

import * as LucidEvolution from "@lucid-evolution/lucid";
console.log(LucidEvolution.Data.from_cbor_bytes(bytes)); // or however lucid-evolution exposes CML's from_cbor_bytes

This looks exactly like a case for what I mentioned earlier with PlutusDatum::encode_like_cardano_node().

I can implement this for plutus datums if it is a common issue. It can also be resolved with better general developer understanding of CBOR/hashing, but I guess it can't hurt to have it too. I put a CBOR section in the new docs (with 6.0.0) on this, but to be honest CBOR + hashing is a bit of a mess. Everything would be so much simpler if IOHK enforced canonical-only.

The way it works is that the cardano node defaults to definite-length encoding for empty containers and indefinite-length for non-empty ones in plutus datums. In CSL/old CML we had this hard-coded to imitate that, since we didn't have proper round-tripping like we do now.
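Concretely, with Constr 0 encoded as tag 121, that convention looks like:

Constr 0 []             ->  d87980              (empty fields: definite-length list)
Constr 0 [h'DEADBEEF']  ->  d8799f44deadbeefff  (non-empty fields: indefinite-length list)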

rooooooooob commented 2 months ago

tl;dr: as long as you are creating it using from_cbor_bytes() then everything will work regardless of whatever you put in there, doesn't matter what tool, etc.

Relying on a specific CBOR encoding without that being documented is not good, and it is precisely why we spent a lot of time ensuring that cddl-codegen/CML remember every little encoding option. Just on that simple datum there are (5 + 5*5) * (10) * (5 + (5) + (5*5) + (5) + (5*5*5) + (5)) = 51000 ways to encode it, and that's before you do weird stuff like 0-length chunks in indefinite bytes, in which case there are an infinite number of ways. CML will gladly take in any of those 51000 ways when using from_cbor_bytes(), and then when you hash it it will match, as will to_cbor_bytes().
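For illustration, a few of the many equivalent CBOR encodings of that same datum - they all decode to the same value, and only the first is canonical:

d8798144deadbeef    (definite-length list, shortest tag encoding: canonical)
d8799f44deadbeefff  (indefinite-length list)
d879980144deadbeef  (definite-length list with the length in an extra byte)
d900798144deadbeef  (the tag 121 encoded with a 2-byte argument)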

solidsnakedev commented 2 months ago

I also don't like the fact that there are different standards; that's the main reason we decided to build lucid-evolution with CML, which follows the right one. But the dev community is having issues transitioning to our library just because of the serialization discrepancy. It'd be nice if you could enable that option.

Btw, we don't want to use lucid-cardano for this data transformation.

rooooooooob commented 2 months ago

There is no "right one" - that's the problem. You have to handle ANY CBOR variation. You saw how there are over 50000 ways to encode a simple datum with a 4-byte field. The only time there is a "right way" to encode CBOR is when the protocol you're working with explicitly sets one (usually this is canonical CBOR, which is why it was invented). If that hasn't happened then all bets are off - any of those 50000+ encodings is just as "right" as any other one.

There is no such specified encoding for the cardano protocol, there are just some tools that happen to match in some spots (likely just datums too, I'd bet that Lucid doesn't match the node in most other spots). How the cardano node serializes CBOR can and has changed before and is entirely an implementation detail.

You can submit a tx/datum to the network using ANY weird encoding and the cardano-node will accept it, that is the reason why the node itself has to remember the original bytes of everything everywhere so hashes still work.

When working with datums from another tool or on-chain you simply cannot try re-creating it using the construction API - you need to be working on the bytes created from the other tool or on-chain.

> It'd be nice if you could enable that option.

That said, as mentioned earlier, if you need PlutusData to follow a specific format (e.g. CSL/lucid), we can absolutely provide an API to set the encoding format to that. I'll put up a PR later tonight, as this is actually something I had thought about before as a potential way to exploit our exact control over CBOR in the new CBOR-format-respecting CML.

> Btw, we don't want to use lucid-cardano for this data transformation.

What exactly is the data transformation that needs to be done? I assume it's not on the datums, as changing them in any way would change the hash, so you should only care about the hash/CBOR format of datums that are already finalized and ready for the chain.

rooooooooob commented 2 months ago

I hope I didn't rant too much earlier; it's just a really, really bad idea to rely on any two tools 100% agreeing on all possible CBOR encodings. Always work with raw bytes (and things in CML deserialized from those bytes) whenever possible to avoid sneaky hash mismatches on some specific datums. They might match for most datums while you're testing, but that doesn't mean they will always match, unless both tool makers seriously try to match each other, never change, and never accidentally have one small difference that causes the hash to mismatch in rare cases even though it usually agrees.

I just put up a PR, #357 - does this work for you?

solidsnakedev commented 2 months ago

Yes, this is perfect, thank you!