IntersectMBO / cardano-ledger

The ledger implementation and specifications of the Cardano blockchain.
Apache License 2.0

Serialization round-trips considered dangerous #2943

Closed Quantumplation closed 2 years ago

Quantumplation commented 2 years ago

Hello!

While testing our changes for the upgrade to Vasil, we produced this transaction:

https://testnet.cexplorer.io/tx/c8b91a27976836f5ba349275bba3f0b81eb1d51aaf31a4340035ce450fdf83a7

This has the seemingly scary property that the datum in the witnesses doesn't hash to the datum on the output. In particular:

https://testnet.cexplorer.io/datum/06269f665176dc65b334d685b2f64826c092875ca77e15007b5f690ecc9db1af

The bytes seen at this datum produce a different hash than 06269.... This suggested EITHER that the datum hashing logic had changed, or that the ledger accepted a transaction where previously it would have reported a Non-Supplementary Datum error.

With the help of the wonderful @KtorZ, we were able to identify that the issue is a lot more mundane: the ledger code (which tools like ogmios and dbsync rely on) serializes datums differently from the bytes it originally received, so the data can't be round-tripped safely. If any downstream consumer forgets to use the same memoization strategy, they will end up presenting different bytes. This can be quite risky, for example, since dApps may rely on those bytes to spend the datum at a later date.
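A minimal sketch of the trap, not taken from the ledger codebase (the `cryptonite` and `base16-bytestring` packages are assumed here): two perfectly valid CBOR encodings of the same logical value have different bytes and therefore different hashes, which is exactly why a consumer that re-serializes instead of keeping the original bytes can end up with a hash that no longer matches.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Base16 as B16
import           Crypto.Hash (hashWith, Blake2b_256 (..))

main :: IO ()
main = do
  let definite   = BS.pack [0x82, 0x01, 0x02]        -- [1, 2] as a definite-length array
      indefinite = BS.pack [0x9f, 0x01, 0x02, 0xff]  -- [1, 2] as an indefinite-length array
  -- Both are valid CBOR for the same value, but the bytes (and hashes) differ.
  print (B16.encode definite,   hashWith Blake2b_256 definite)
  print (B16.encode indefinite, hashWith Blake2b_256 indefinite)
```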

A more detailed writeup can be found here: https://github.com/CardanoSolutions/ogmios/commit/3f614c3cf15b6e418fd461a34853b750a3408c4f

I know this has been discussed before and the decision was made not to offer any roundtrip guarantees, but it still seemed worthwhile to open a ticket to document the findings for anyone else, and highlight again that this is a risky / non-intuitive behavior that people should likely be aware of when depending on this as a library.

Perhaps there's some piece of documentation that can be improved around this?

JaredCorduan commented 2 years ago

thanks for bringing this up @Quantumplation , this does indeed look like an unfortunate trap for anyone to fall into.

This has the seemingly scary property that the datum in the witnesses doesn't hash to the datum on the output.

If that were true at the level of the wire, that would be terrible. I suspect/hope that the transaction itself has the same datum hash as is inside the transaction output. In other words, the explorer is re-serializing and hashing the datum. I tried to figure this out myself, but I can't see how to get the serialized transaction from the explorer. I would also need to see the exact serialization of the transaction output (maybe it's still unspent and I can get it from a node?). Do you have this easily at hand?

Quantumplation commented 2 years ago

Yep, @KtorZ grabbed it with ogmios, and the datum on the wire is:

D8798441FFD87982D87982D87982D87981581CC279A3FB3B4E62BBC78E288783B58045D4AE82A18867D8352D02775AD87981D87981D87981581C121FD22E0B57AC206FEFC763F8BFA0771919F5218B40691EEA4514D0D87A80D87A801A002625A0D87983D879801A000F4240D879811A000FA92E

Which does indeed hash correctly (at least in our tests, would be good to have a third confirmation just in case lol)
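For that third confirmation, a hedged sketch of the check, assuming the datum hash is Blake2b-256 over the raw serialized bytes, and assuming `cryptonite` and `base16-bytestring` >= 1.0 are available (the hex decoder is assumed to accept the upper-case hex quoted above):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Base16 as B16
import           Crypto.Hash (hashWith, Blake2b_256 (..))

main :: IO ()
main =
  -- The raw datum bytes quoted above, hex-encoded.
  case B16.decode "D8798441FFD87982D87982D87982D87981581CC279A3FB3B4E62BBC78E288783B58045D4AE82A18867D8352D02775AD87981D87981D87981581C121FD22E0B57AC206FEFC763F8BFA0771919F5218B40691EEA4514D0D87A80D87A801A002625A0D87983D879801A000F4240D879811A000FA92E" of
    Left err  -> putStrLn ("bad hex: " ++ err)
    -- If the hashing scheme really is Blake2b-256 over the raw bytes, this should
    -- print 06269f665176dc65b334d685b2f64826c092875ca77e15007b5f690ecc9db1af.
    Right raw -> print (hashWith Blake2b_256 raw)
```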

Also worth noting that it's not the explorer that reserializes it, but cardano-db-sync, hence the other ticket. So anyone depending on that is also likely impacted.

Finally, the reason we have to use definite length arrays is in pursuit of hardware wallet support: hardware wallets sign a round-tripped transaction, which we believe would corrupt the datum hashes, and they make the opposite choice when serializing. This also messes with multi-party dApp signatures, since the signature is for a different set of bytes. Overall it's all rather painful heh 😅

JaredCorduan commented 2 years ago

ok, that's great news :relieved: So, just to confirm, the problem is that third parties might be using a combination of tools, some from IOG that chose one (cbor compliant) serialization, and others that chose a different (cbor compliant) serialization. right?

Quantumplation commented 2 years ago

Yes, especially dangerous when it changes what's on the wire, because it makes the utxo effectively unspendable until you correct the bad datums

JaredCorduan commented 2 years ago

open a ticket to document the findings for anyone else

@Quantumplation do you have a suggestion for where such documentation would have been most likely to have caught your attention? The formal spec? Haddocks? Other documentation that I am not aware of? Or did you just want an issue here that others can find (even if I close it)? I do not myself know where the best place would be to put out this public service announcement: "do not re-serialize data that you intend on hashing!".

Quantumplation commented 2 years ago

I think this issue (even if you close it) is a good first step; I'd imagine a warning in the haddock would help too.

@KtorZ do you have thoughts here? i.e. is there a specific method where a warning in the haddock string would have caught this?

@JaredCorduan as mentioned in the commit I linked, updating the arbitrary instance to generate alternate canonical forms might also be helpful; even if the ledger itself prefers to work in one serialization format, obv there are other formats that are valid, so we probably want those to get captured in the property based tests, either for the ledger, or for projects that import your implementations as a library.

JaredCorduan commented 2 years ago

updating the arbitrary instance to generate alternate canonical forms might also be helpful; even if the ledger itself prefers to work in one serialization format, obv there are other formats that are valid, so we probably want those to get captured in the property based tests, either for the ledger, or for projects that import your implementations as a library.

This is sage advice, I completely agree. I made #2944 to capture it. thanks!

KtorZ commented 2 years ago

Thanks for opening this ticket @Quantumplation :pray:

To answer your question @JaredCorduan about how to best address that, my best recommendation would be to stick to one canonical encoding (likely the definite one since that's also the one used by hardware devices) and not only rely on the memoization. I know this was raised in the past but as far as I remember the only rebuttal I heard was that "it's too much effort / complicated to enforce a single canonical serialization" but I still can't quite picture why.

If this isn't an option, then I think having good arbitrary generators, or at least blessed test vectors, can help implementers pay attention to those traps. The worst part is that I was perfectly aware of the memoization strategy (which is why I was able to pinpoint the problem fairly rapidly), but I still got it wrong because the memoization was only applied to the outer-most newtype and not the inner type (which comes from the Plutus codebase). And even property tests didn't help here because the generators are skewed towards one preferred format, which made it even harder to catch.

In Ogmios, and tools in general, I've now gotten into the habit of always relying on raw binary data and also carrying that raw data along in the API, because there's otherwise no sane way for a client to really know how to serialize something. I think a canonical serialization as suggested here would be truly beneficial (yet sadly, we can't modify the past, and there would be plenty of non-canonical examples in the chain history whatever choice we make).
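As a rough illustration of that habit (purely a sketch; `WithRawBytes` and the field names are made up for this example, not the ledger's actual memoization types):

```haskell
import qualified Data.ByteString as BS
import           Crypto.Hash (hashWith, Blake2b_256 (..), Digest)

-- The decoded value and the exact bytes it was decoded from travel together.
data WithRawBytes a = WithRawBytes
  { decodedValue  :: a              -- the in-memory representation
  , originalBytes :: BS.ByteString  -- exactly what came off the wire
  }

-- Hashing only ever looks at the bytes that were actually received;
-- nothing here ever re-serializes the decoded value.
hashOriginal :: WithRawBytes a -> Digest Blake2b_256
hashOriginal = hashWith Blake2b_256 . originalBytes
```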

Another example of this non-canonical serialization issue regards the serialization of values. It is useful for a client to know how to serialize values, if only to estimate the minimum ada value it should assign to an output. Yet if you look at the value serialization, it uses definite-length encoding when there are fewer than 23 assets and indefinite-length encoding beyond that. Why 23, one would ask? Because this is the point where, in CBOR, the length starts being encoded over 2 bytes instead of 1, which makes declaring a list (or a map) of elements a 2-byte sequence instead of a single byte (CBOR major type + length combined in one byte in this case). Beyond 256 assets, it would require 3 bytes (instead of only 2 with the indefinite structure). While I get that this was probably done as an optimization, this is really saving 1 byte on values that have more than 256 assets... at the cost of some extra complexity for third-party clients who would want to re-serialize things down the line :/
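To make the byte accounting concrete, here is a back-of-the-envelope sketch (no ledger code involved; the function names are made up for this example) of the CBOR framing overhead in question:

```haskell
-- Rough framing overhead for a CBOR array (or map) of n elements.
definiteHeaderBytes :: Int -> Int
definiteHeaderBytes n
  | n <= 23     = 1  -- length fits in the initial major-type byte
  | n <= 0xff   = 2  -- initial byte + 1 length byte
  | n <= 0xffff = 3  -- initial byte + 2 length bytes
  | otherwise   = 5  -- (and so on for larger lengths)

indefiniteFramingBytes :: Int -> Int
indefiniteFramingBytes _ = 2  -- the 0x9f/0xbf opener plus the 0xff break

main :: IO ()
main = mapM_ report [10, 23, 24, 255, 256]
  where
    report n =
      putStrLn $ show n ++ " elements: definite " ++ show (definiteHeaderBytes n)
        ++ " framing byte(s), indefinite " ++ show (indefiniteFramingBytes n) ++ " byte(s)"
```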

I feel that at this stage, most of the choices made regarding serializations are pretty much arbitrary and partly due to that choice of not enforcing a canonical format. Yet, because it's ultimately CBOR down the line, and because CBOR has various downsides (multiple choices of encoding for list / map structures, no enforced ordering of keys in maps, etc.), and because of the chain history, there will always be some ambiguity in the format. Yet, it would be nice if every core component enforced canonical CBOR serialization by default (see https://datatracker.ietf.org/doc/html/rfc7049#section-3.9) and stuck to it. That includes Plutus data serialization as well. Sadly, at this stage this would break everything because the choices made from the start were different from those listed in section 3.9 of the RFC :|

JaredCorduan commented 2 years ago

As always, thank you for all your help @KtorZ .

my best recommendation would be to stick to one canonical encoding (likely the definite one since that's also the one used by hardware devices) and not only rely on the memoization. I know this was raised in the past but as far as I remember the only rebuttal I heard was that "it's too much effort / complicated to enforce a single canonical serialization" but I still can't quite picture why.

The decision to abandon canonical CBOR happened just before I joined IOG, but I can try to relay the history that I've pieced together. This topic comes up a lot; perhaps we could write down this history somewhere for posterity. @dcoutts probably remembers it best.

Canonical CBOR was definitely not abandoned because it was too much effort or too complicated for IOG to implement. I have a vague memory that it was a tough imposition on the hardware wallets, but that might not be true either. I believe the reason was that a lot of folks were re-serializing things (i.e. depending on canonicity), which led to some nasty bugs. By abandoning canonical CBOR, we made it much more apparent that folks should not be re-serializing.

23

I remember the day @redxaxder and I paired up on that. :) It doesn't just save one byte, it saves one byte every time anyone uses a list or a map above length 23.

at the cost of some extra complexity for third-party clients who would want to re-serialize things down the line :/

one of our goals is to stop folks from re-serializing :)

having good arbitrary generators or at least

yep! I think we are all on the same page here (see the issue I made above).

KtorZ commented 2 years ago

*it saves one byte every time anyone uses a list or a map above ~length 23~ length 256 !

Encoding a list of 23-256 elements only demands 2 bytes in the definite version. So the benefit is only for lists above 256 elements, which I am not even sure are a thing because of the current maximum size of a value :sweat_smile: ... I haven't looked at the mainnet but I bet you 14 ADA that there's not a single UTxO carrying more than 256 assets :grimacing: !


Anyway, thanks for the reply and consideration :pray:; I hope we can at least improve the QuickCheck generators to get better test coverage on properties. As for the canonical CBOR... it really sounds like supporting it now would have little benefit given that clients would need to support the non-canonical version forever to cope with the past 5 years of chain history. So perhaps if we do Cardano V2 one day :sweat_smile: ...

JaredCorduan commented 2 years ago

*it saves one byte every time anyone uses a list or a map above ~length 23~ length 256 !

Encoding a list of 23-256 elements only demands 2 bytes in the definite version. So the benefit is only for lists above 256 elements, which I am not even sure are a thing because of the current maximum size of a value :sweat_smile: ... I haven't looked at the mainnet but I bet you 14 ADA that there's not a single UTxO carrying more than 256 assets :grimacing: !

good catch on the 23. but note that this optimization wasn't just made for Value. it was made for every list and every map!

https://github.com/input-output-hk/cardano-ledger/blob/93b5d5421a5b5c3a3b6ad89a74e8bdb2ccc9258c/libs/cardano-data/src/Data/Coders.hs#L1118-L1119

https://github.com/input-output-hk/cardano-ledger/blob/93b5d5421a5b5c3a3b6ad89a74e8bdb2ccc9258c/libs/cardano-data/src/Data/Coders.hs#L1147-L1148
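For readers who don't want to chase the links, the shape of the choice those lines make is roughly the following (an illustrative sketch using cborg primitives, not the ledger's actual code): definite-length framing while the length still fits in the initial byte, indefinite-length framing beyond that.

```haskell
import Codec.CBOR.Encoding (Encoding, encodeBreak, encodeListLen, encodeListLenIndef)

-- Choose the list framing based on the number of elements.
encodeListLengthDependent :: (a -> Encoding) -> [a] -> Encoding
encodeListLengthDependent encodeElem xs
  | n <= 23   = encodeListLen (fromIntegral n) <> foldMap encodeElem xs
  | otherwise = encodeListLenIndef <> foldMap encodeElem xs <> encodeBreak
  where
    n = length xs
```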


it really sounds like supporting it now would have little benefit given that clients would need to support the non-canonical version forever to cope with the past 5 years of chain history. So perhaps if we do Cardano V2 one day :sweat_smile: ...

:smile: yea, that ship :ship: has sailed.

dcoutts commented 2 years ago

It's a very bad security practice to check signatures or hashes on re-serialised data. It must only be done on the original bytes. Otherwise it leads to all sorts of nasty security problems (think of txs with different bytes but the same hash).

Given that one must use the original bytes and not a re-serialisation, there is no need for a canonical representation. Indeed it would be quite misleading for there to be a canonical representation since it would make people think they can get away with re-serialising! So in a way it's better to have some variation just to stop people being lulled into a bad practice.

That's why the ledger does not enforce a canonical representation.

dcoutts commented 2 years ago

In Ogmios, and tools in general, I've now gotten into the habit of always relying on raw binary data and also carrying that raw data along in the API, because there's otherwise no sane way for a client to really know how to serialize something.

Yes. You're doing it correctly! :-)

I think a canonical serialization as suggested here would be truly beneficial (yet sadly, we can't modify the past and there would be plenty of non-canonical examples in the chain history whatever choice we make).

Mwahahah! My evil plan to push people in the right direction (never re-serialise) worked!

It's entirely deliberate that it's not canonical, and indeed it was my hope and intention that there would be actual variation so that tool authors are not led down the wrong path (of re-serialising).

Yet, it would be nice if every core component enforced canonical CBOR serialization by default (see https://datatracker.ietf.org/doc/html/rfc7049#section-3.9) and stuck to it. That includes Plutus data serialization as well. Sadly, at this stage this would break everything because the choices made from the start were different from those listed in section 3.9 of the RFC :|

You see what would go wrong though, right? Suppose we had said the ledger will do things right (carry the original bytes around and check sigs and hashes on the original bytes), but also enforce a canonical form. Then what would happen in practice? Somewhere, either in the ledger or in third-party tools, people will make a mistake and rely on re-serialising when calculating hashes etc. Then any bug in the 100% perfect enforcement of canonical form will lead to confusion that could lead to exploits: tools could get confused by transactions (or datums or whatever) with different representations but the same hash! Exploits ahoy!

So it's much better to not enforce a canonical form, and to have some actual variation, so that such design mistakes (re-serialising) in tooling will be caught. This wasn't an accident of history, but a deliberate design decision for Shelley, learning the lessons of the mistakes in Byron (which does enforce canonical representation and used to re-serialise too).

I'm not claiming there are never any use cases where a canonical form would be convenient and also safe, but the above danger I think outweighs them all.

Quantumplation commented 2 years ago

@dcoutts you might want to start a conversation with the ledger team; so far as I know, the reason we're being squeezed into this bottle is because the ledger hardware wallets will deserialize the tx to show data to the user, and then when it comes to sign things, it re-serializes to their choice of encoding. This invalidates datums, multiparty signatures, etc.

I tried to do some convincing but was unsuccessful; perhaps coming from you directly it would have more weight 😅

Quantumplation commented 2 years ago

(and by ledger team, I mean the hardware wallet Ledger, and potentially trezor, not the cardano ledger team, which would be... Yourself lol 😅)

dcoutts commented 2 years ago

I know the hardware ledger people want to serialise the txs themselves, rather than checking a serialised tx that is passed in to them. For that reason they have created a CIP for a canonical encoding that hardware ledger devices can use, so that they can be interoperable with each other.

dcoutts commented 2 years ago

the ledger hardware wallets will deserialize the tx to show data to the user, and then when it comes to sign things, it re-serializes to their choice of encoding

It's a bit more subtle than that. They do not deserialize the tx at all. They get passed the info needed to construct a tx. So they actually construct, serialise and sign the tx on the hardware device. This avoids them having to fit a decoder into their limited resources.

So similarly, there's no re-serialisation. They just serialise once.

But if you want the same tx to be signed by different hardware devices then you can see you have a bit of a problem. Use cases would be multi-sig. Those different hardware devices each want to be the one that serialises the tx. That can work, but only if all of them agree on exactly the same serialisation. Then each one can be passed the various bits of info, and then serialise and sign the same representation.

That's why the hardware device people have a CIP on a single consistent serialisation style, so that they can follow this strategy and still interoperate. Tools that don't need to interoperate with hardware ledgers don't need to follow that standard.

I am a little grumpy that the hardware devices want to do it this way, since it does rather push other tooling to follow that same exact serialisation, which loses some of the diversity that helps to keep tooling robust. But that's the limitation of the tiny hardware devices.

Quantumplation commented 2 years ago

@dcoutts I'm not sure that aligns with my experience. If that is the case, it makes a decent amount of sense, if still a bit grumbly. But if that were the case, I don't see why Ledger transactions on Sundae would be failing right now. When I was speaking to someone about it, they told me it was because the map keys on the txs we were building were out of order and the collateral was serialized as an empty list rather than omitted. I'll follow up again to try to get to the bottom of this.

waalge commented 2 years ago

It's a very bad security practice to check signatures or hashes on re-serialised data.

I think this is reasonable, but I don't think this is where we are. Web wallets do deserialize (and some reserialize, while others... are black boxes?)... In any case, they only return the signature without the data they signed. CIP-30 is vague on what should actually happen here (and return a signed transaction).

If the serialized form should be treated as a (the) first-class citizen in Cardano software... I think that message has been lost.


apologies for thread jacking

JaredCorduan commented 2 years ago

@Quantumplation - I've added warnings to the Shelley formal spec about not re-serializing (in the appendix about all the cryptographic details). And we've added more noise to the serialization generators to account for the ambiguities (see #2944). Are you happy to close this issue?

Quantumplation commented 2 years ago

@JaredCorduan Excellent, more than happy :) thanks for going above and beyond for a seemingly small thing even when so much else is going on ❤️

KtorZ commented 2 years ago

❤️