IntersectMBO / cardano-ledger

The ledger implementation and specifications of the Cardano blockchain.
Apache License 2.0
260 stars 157 forks source link

Versioned serialization schemes for types #3014

Closed JaredCorduan closed 1 year ago

JaredCorduan commented 2 years ago

The problem

When we encounter a problem with our deserializers, such as in #3003, #2965 and #2444, it can be very difficult to implement a fix. It is difficult because we can only fix such issues during a hard fork, and leading up to the hard fork we must maintain two serializations for the same type in order to not cause unintended network splitting (the problematic version must be used before the hard fork, and the fixed version is used afterwards). This can be especially tricky with the FromCBOR typeclass, since it is not always easy to search for where all the problematic uses are located.

A solution

Create a new typeclass to replace ToCBOR and FromCBOR which depends not only on the type but also on the protocol version. The first instances will all be implemented identically to our current ToCBOR and FromCBOR instances for all protocol versions up to the current one.

Note that this change that will effect the downstream components. We will need to coordinate with consensus and node, and many other teams.

michaelpj commented 2 years ago

My $0.02.

We have a similar problem in plutus with our serialization for scripts. I think the considerations are the same although t they're slightly different libraries. What we do at the moment is we don't use the typeclass that the library provides, but rather we just use explicit decoder values. For CBOR this would be something like:

-- don't do this
instance Serialise Foo where
  decode :: Decoder s Foo

-- do this
decodeFoo :: ProtocolVersion -> Decoder s Foo

It's a bit clunky, but it works okay. You can also then take a mixed approach: if you need to decode something that isn't version-sensitive (an integer, say), you can just call decode and use the typeclass version.

I guess we could define a typeclass for this, but that does force you to pick an interface for all the decoders to share. And in fact I lied above: we don't just pass the protocol version, in some cases we also pass a predicate that indicates whether particular builtins are allowed. So we do make use of the variability. Something to consider, anyway.

HeinrichApfelmus commented 2 years ago

I second @michaelpj 's suggestion — for a slightly different reason.

In cardano-wallet, we always have trouble finding the piece of code that is responsible for serializing/deserializing a specific type, because it's hard to figure out which instance of which type class is selected (this information is inferred from the code, but not written explicitly). So, usually we end up with "Hm, I think it's ok to use toCBOR here, but I'm not quite sure". The "not quite sure part" makes me feel uneasy — it's acceptable for the Show type, but not if I want to explicitly control the format.

Put differently: I think that specifying the format explicitly on the value-level as opposed to implicitly on the type-level would help with whatever issues we have. 🤔

JaredCorduan commented 2 years ago

In cardano-wallet, we always have trouble finding the piece of code that is responsible for serializing/deserializing a specific type, because it's hard to figure out which instance of which type class is selected (this information is inferred from the code, but not written explicitly). So, usually we end up with "Hm, I think it's ok to use toCBOR here, but I'm not quite sure". The "not quite sure part" makes me feel uneasy — it's acceptable for the Show type, but not if I want to explicitly control the format.

The reason that you probably cannot always trust the inferred serialization is because when we do find bugs, we cannot easily fix the bug, and have to work around it at an upstream type. My hope for a versionedToCBOR would be that you could always trust it.

HeinrichApfelmus commented 2 years ago

The reason that you probably cannot always trust the inferred serialization is because when we do find bugs, we cannot easily fix the bug, and have to work around it at an upstream type. My hope for a versionedToCBOR would be that you could always trust it.

Yes and no. The trouble is that I'm often not sure which type I call toCBOR on — it might not even be the type Tx that I'm interested in. In that sense, serializeTx :: Tx -> ByteString would give me some additional type safety / visibility.

JaredCorduan commented 2 years ago

Yes and no. The trouble is that I'm often not sure which type I call toCBOR on — it might not even be the type Tx that I'm interested in. In that sense, serializeTx :: Tx -> ByteString would give me some additional type safety / visibility.

Our types are often too polymorphic for that. Here's a good example that I stole from @lehins :

data AlonzoTx era = AlonzoTx
  { body :: !(Core.TxBody era),
    wits :: !(Core.TxWits era),
    isValid :: !IsValid,
    auxiliaryData :: !(StrictMaybe (AuxiliaryData era))
  }

We need to rely on type classes (or something equivalent) in order to decode body and wits.

lehins commented 2 years ago

@HeinrichApfelmus This a classic argument type classes vs monomorphic function. Some people say you should embrace them, some say they are evil. You can make the same argument for almost any prominent type class in Haskell community. Be it To/FromJSON, Vector and MVector type class, any serialization library (binary, serialize, persist, etc), mtl vs transformers, etc.

My argument is, type classes are strictly superior in most cases, because you can always specialize the type class function, but you can't make a specialized function more polymorphic.

The trouble is that I'm often not sure which type I call toCBOR on — it might not even be the type Tx that I'm interested in.

So, to your trouble I can say, restrict the type and you'll know what you are interested in:

lehins commented 2 years ago

To expand on the example @JaredCorduan posted above, Core.TxBody era and Core.TxWits era are type families parameterized on era, which means if we were to go the monomorphic route we'd have to write a separate deserializer for AlonzoTx for every era that it is used in (i.e. Alonzo, Babbage, Conway and possibly later ones as well) and for every crypto the era is parameterized on. So all of the sudden instead of one function that does deserialization we end up with six separate, albeit very similar deserializing functions. We have many types like this in ledger codebase, so it will simply not work for ledger. The only reason why monomorphic deserializing functions work well in plutus codebase, is because types in Plutus aren't parametrized on type arguments.

@michaelpj and @HeinrichApfelmus Thank you for your suggestions though.

michaelpj commented 2 years ago

I'm not saying that the deserialisers need to be monomorphic! In fact, the Plutus ones are highly polymorphic. Just that having them be functions rather than typeclass methods avoids tying you to an upstream interface, which can be helpful.

lehins commented 2 years ago

@michaelpj Do you mind sharing a link to a few modules in a Plutus repo where there examples of highly polymorphic deserializers. I'd like see some concrete examples of what you mean.

HeinrichApfelmus commented 2 years ago

So, to your trouble I can say, restrict the type and you'll know what you are interested in:

toCBOR @Tx or toCBOR :: Tx -> Encoding

Fair enough. 🤔 My trouble is about imposing some type constraint, whether that is done through @ or a name suffix Tx does not matter.

I guess that the specific function from the ledger that made me nervous is toCBORForMempoolSubmission — I don't know if that is equal to, or supposed to be equal to the toCBOR instance unless I look at the source code. The alternative is toCBORForSizeComputation, and I felt that I would be better off using one explicitly rather than hoping that the toCBOR implementation does the right thing for me.

lehins commented 2 years ago

I guess that the specific function from the ledger that made me nervous is toCBORForMempoolSubmission — I don't know if that is equal to, or supposed to be equal to the toCBOR instance unless I look at the source code. The alternative is toCBORForSizeComputation, and I felt that I would be better off using one explicitly rather than hoping that the toCBOR implementation does the right thing for me.

I have no idea why toCBORForMempoolSubmission is even exported, it is an internal detail and toCBOR is always the right choice.

michaelpj commented 2 years ago

Do you mind sharing a link to a few modules in a Plutus repo where there examples of highly polymorphic deserializers. I'd like see some concrete examples of what you mean.

Ah, I see the issue. We don't have any examples with type families, that does make things tricky because you need to pass around deserializers for the parts, either as explicit values or as typeclass dictionaries, and the typeclass dictionaries sure are more convenient.

In plutus we do indeed use typeclasses for values of the actual types that we are polymorphic over, but usually that's just a few bits deep down. Anyway, here's the file for reference: https://github.com/input-output-hk/plutus/blob/master/plutus-core/untyped-plutus-core/src/UntypedPlutusCore/Core/Instance/Flat.hs