Sharing Data Types is Tight Coupling – Brandon Chinn

Brandon Chinn's personal website

https://brandonchinn178.github.io/posts/2023/04/15/sharing-data-types

Have you thought about squeezing the type ErrorDB from the data base schema instead?

I think this problem should in the ideal be solved by generating a set of types for the back end and a schema for the data base from the same single source of truth programmatically. I do not think there is a ready made package in Haskell that does this but it should not be too hard to write. This thought has been bothering me for a long time and I made some stabs at the problem — it seems Template Haskell can do what is needed, though only barely.

While it makes sense for Error and ErrorDB to be different, it never makes sense for ErrorDB and the data base schema to be different, does it?

@kindaro IIUC you're asking about making an errors table in the database, and using the generated Persistent entity? Sure, that could work too. I think we went this route to keep errors in the relevant table, instead of having a centralized errors table + foreign references to it in all the tables that could store errors. If the error can appear nested in a JSON blob, it'd also be less nice to store a number in the JSON blob without a FOREIGN KEY constraint.

Yep, in the ideal I should have liked to have a type for storing errors that fits nicely in the relational conceptualization of data. If you need to store JSON with errors in the midst, then of course my idea would not work. But then, perhaps a relational data base is not the best way to persist the data of this kind?

If your persistence layer only needs to encode stuff to JSON and write it no matter where, as opposed to adapting the structure of data to the relational conceptualization, then maybe you could take your Error and make it parametric in a type constructor of kind ★ → ★ — higher kinded, as it is called? Something like this:

data Error (switch ∷ ★ → ★) = Error
  { …
  , sensitiveData ∷ switch SensitiveData
  , …
  }

data REDACTED = REDACTED

type DangerousError = Error Identity
type SafeError = Error (Const REDACTED)

Now, SafeError could not ever hold any dangerous information — it will all be redacted. But everything else will be the same — for example, error code, location, time. Most of the logic will be shared. The two types will be exactly as different as you need them to be.

Could something like this have worked?

You're saying this implementation would allow sharing the same data type between the database (a redacted version) and the rest of the system? Sure, this would allow sharing, although in our system, we allowed configuring whether errors in the database were redacted or not, so instead of Const Redacted, it would be more like a Maybe.

But that's still missing the point. We didn't duplicate the data type because we needed to guarantee redacting before storing. We duplicated the data type first, then as a bonus, realized that we can take advantage of the new type to enforce redacting before storing. The rest of the article talks about the benefits of duplicating over sharing, but as a recap:

Better builds
- If the module defining the error also defines the logic for storing the error in the database (e.g. derives the PersistField instance), then anything throwing an error is blocked on building the third party database libraries
Better separation of responsibilities
- Maybe an error is stored as json in the database and is also serialized as json when sending back to the user. those json instances dont have to (and possibly even shouldnt) be the same, e.g. if you want a compact JSON representation to store in the database but a more user friendly representation when returning
- Or maybe errors stored in the database should be forwards compatible, but when thrown, should always be the latest version of an error. with duplicate types, the database representation of an error should never remove a field, only ever add, while the normal error should always reflect the latest version. its possible to share the types for this still (e.g. trees that grow), but duplicating the types is much easier

I hear you but I am not ready to accept your conclusions. It would be easy to agree but I think there is more to this issue than has been said. I am going to try and spell out what bothers me.

time of compilation

The compilation problem can be easily solved with a few newtype wrappers, can it not? You can define Error in a lightweight module with no dependencies, then the serialization code will say newtype ErrorDB = ErrorDB Error deriving newtype … elsewhere and attach heavy machinery to this new type. The rest of the code does not need to know about ErrorDB and can be built in parallel.

independent change

I spot a recurring theme in your article and comments: independent change.

You want to change the way you show your types to the users.
You want to change the internal types used in complicated algorithms.
You must persist stuff in a data base such that it can be retrieved later.

You never say it but I think your article is about resilience of an architecture against changes in requirements.

one type — linear cost

At the beginning, you can easily have one way to serialize your types and one type for Error. Later on, requirements change in such a way that you are forced to change your types, and the type system makes these changes ripple throughout your whole code base. The cost of change in this «one type» architecture is at best linear in the size of the code base.

split concept — constant cost

split types

The solution you offer is to make out of every type A a handful of types, say API (A), Internal (A) and Persistence (A). In the example with Error, Error itself will be Internal (Error) and ErrorDB will be Persistence (Error). We shall call A itself «concept». It can be represented in Haskell as a data family, or we can understand it informally.

However, simply enumerating types is not architecture. We need to also say what we guarantee. What are the equations? What are the invariants? What can we say about these types?

invariants

If we have something in our data base that we cannot work with or do not care about, it is no point to have it in the data base to begin with. So, it seems fair to ask that Persistence (A) ⊂ Internal (A) (in common notation) or fetch: Persistence (A) ↣ Internal (A) (in categorial notation). We ensure this by checking the property store ∘ fetch = id: Persistence (A) → Persistence (A). Likewise, it would be sad if the API could send to us something we cannot understand, so we should ask that decode: API (A) ↣ Internal (A).

Now we can change our types at cost constant in the size of the code base, by changing the functions fetch and decode, while retaining their systematic meaning.

constant cost achieved

But this is only the synchronous picture. What is the diachronous picture? What happens over time?

versions

Let us mark our types with version, so for example we shall write API (n) (A) for the way the concept A can be communicated to our program from the outside world at version n. If we are to decouple our internal logic from our API and our persistence, we should allow that versions for these three components vary independently. API (n + 1) (A) should work with Internal (n) (A) and so on.

internal type

We can try to grow our internal type by writing an inlay (n): Internal (n) (A) ↣ Internal (n + 1) (A). But inlay must have a canonical inverse. It is not clear what this inverse should do with the values that we have in Internal (n + 1) (A) but not in Internal (n) (A). Maybe it should turn these values into errors.

external types

We ask that API be backwards compatible. This means that it can only grow, so API (n) (A) ↣ API (n + 1) (A). But it cannot grow bigger than Internal (n) (A). You cannot add stuff to your API that your internal logic does not understand.

I am not sure what to do with Persistence. If our data base is a typical SQL data base, then it is hard to change. Say Persistence (1) (A) is a type like data A = A {x, y ∷ Int}. Can we make Persistence (2) (A) be like data A = A {x, y ∷ Int, z ∷ Int}? We shall need to add a new column to some table in our data base, or else a whole new table. If we add a new column, then some kind of migration will need to be done. And this is only a simple case — it hurts me to think of more complicated cases, like sum types.

In short, I think there is much to be gained from formalizing the notion of separation and from spelling out the idea of versions and evolution of types over time.

The compilation problem can be easily solved with a few newtype wrappers, can it not?

No, not with the scenario I laid out where you might want different JSON serialization logic.

module Types.User where

-- don't derive JSON, let the external/persistent User types derive JSON
data User = User { name :: String, age :: Int }
  deriving (Show)

module DB.User where

-- You can't automatically derive FromJSON/ToJSON here:
--   * Newtype deriving won't work because underlying User
--     doesn't have JSON representation
--   * Generic deriving won't work because you can't derive
--     Generic on a newtype
newtype UserDB = UserDB User

You never say it but I think your article is about resilience of an architecture against changes in requirements.

In the article, I write:

But eventually, we found that reusing code as much as we did caused subsystems to be tightly coupled, preventing them from evolving independently. Specifically, this friction occurred because the subsystems reused code for different concepts. This meant that the two subsystems couldn’t independently iterate within their own domains.

So yes, the article is about allowing teams to iterate on subsystems independently. I guess there is an implicit "iterate independently as requirements change", but I feel like "iterate" already implies requirements changing (or something changing).

The cost of change in this «one type» architecture is at best linear in the size of the code base.

It's not just the cost to change being linear. It's also ongoing cost of understandability of the system or cost of having dead code in your internal logic:

If you want to name a field appropriate for external use and you reuse that type in all the internal logic, all your internal logic will have to refer to that field with the external-facing name, even if it has a better name within the domain of a subsystem.
If you want to add an external-facing constructor that gets normalized into another constructor, you now have to handle that case throughout your internal logic, even if it gets normalized at an earlier phase.
If you want to add an external-facing field that gets normalized into another field, and you have to create a new value of that type to pass internally (after the normalization phase), you have to come up with some fake value that won't get used

Yes, I think your discussion about invariants and versioning are all good. But one limitation is that your discussion only talks about the actual data types, nothing about things like JSON serialization. e.g. in the case where you want two JSON representations for API(A) and Persistence(A), we could represent that as API(A) -> _APIjson(A) and Persistence(A) -> _Persistencejson(A). And note that at this point, if you reuse types[^1], you force Persistence(A) = API(A) = Internal(A), and thus impossible to have separate API(A) -> _APIjson(A) + Persistence(A) -> _Persistencejson(A) functions.

Also, there's a bit of ambiguity about whether Persistence(A) means storing A as serialized bytes into the data or as a relational model. I've only been talking about storing A as serialized bytes; A as a relational model would just fall under normal SQL migration practices. If you generate types (e.g. with persistent), there wouldn't be a version; _Persistencerelational(A) would always be the latest version. Our strategy so far was that Persistence(A) would also always be the latest version, and any changes to A had to involve writing SQL to transform the JSON blobs in the database. This is why we wanted to think about splitting types, so that we could have a Persistence(n)(A) type that supported all Persistence(i)(A) (i <= n) types, and then just define a transformation Persistence(n)(A) -> Internal(A).

I'm not sure why you'd need Internal(n)(A) -> Internal(n+1)(A); this is the data type for use inside our internal logic, so we could just use the latest version of the data type and update all uses in the codebase. No need to support backwards compatibility with the runtime of a previously running system.

[^1]: Yes, you could use Trees That Grow to reuse types while allowing the different type-variants to iterate independently. It might make sense in a complex enough system, but I don't think it's ergonomic to use everywhere. Duplicating types is much simpler than trying to understand a data type using TTG.

to `newtype` wrappers

I do not follow why you think newtype wrappers cannot solve the issue of compilation efficiency. I think they can. Perhaps I do not understand the problem you are solving? Take this example:

x.cabal:

cabal-version: 3.0
name: x
version: 0

library
build-depends: base, aeson
exposed-modules:     Types DB
default-language:    GHC2021

Types.hs:

module Types where

import GHC.Generics

data User = User { name :: String, age :: Int }
deriving (Show, Generic)

DB.hs:

{-# LANGUAGE DerivingStrategies #-}
{-# LANGUAGE UndecidableInstances #-}

module DB where

import Data.Aeson
import GHC.Generics

import Types (User)

newtype UserDB = UserDB User deriving newtype Generic deriving stock Show

instance ToJSON UserDB

λ import DB
λ x = UserDB User {name = "x", age = 0}
λ import Data.Aeson
λ encode x
"{\"age\":0,\"name\":\"x\"}"

In short, we derive Generic on the root type with the stock strategy and then derive Generic on the newtype wrapper with the newtype strategy. Then we can ask for generic instances on the newtype wrapper from aeson, or write our own serialization logic.

to dead code

I see other ways of solving the issues of dead code. They are on the fancy side of the language but I think less fancy than Trees That Grow. No type families will be needed.

If you want to name a field appropriate for external use and you reuse that type in all the internal logic, all your internal logic will have to refer to that field with the external-facing name, even if it has a better name within the domain of a subsystem.

…

If you want to add an external-facing field that gets normalized into another field, and you have to create a new value of that type to pass internally (after the normalization phase), you have to come up with some fake value that won't get used We can write a pattern synonym that conflates these fields when building a value and picks the canonical field when pattern matching on it.

{-# language DuplicateRecordFields #-}
{-# language PatternSynonyms #-}
{-# language UnicodeSyntax #-}

module Types where

import GHC.Generics

data User = User { name, nickname ∷ String, age ∷ Int} deriving (Show, Generic)
pattern UserV2 {name, age} ← User {name = name, age = age}
  where UserV2 name age = User {name = name, nickname = name, age = age}

λ UserV2 {name = "x", age = 0}
User {name = "x", nickname = "x", age = 0}
λ let UserV2 {name, age
λ x = UserV2 {name = "x", age = 0}
λ x
User {name = "x", nickname = "x", age = 0}
λ let UserV2 {name = name', age = age'} = x
>
λ name'
"x"
λ age'
0

You can put any code into your patterns with the extension ViewPatterns, so this solution is general.

If you want to add an external-facing constructor that gets normalized into another constructor, you now have to handle that case throughout your internal logic, even if it gets normalized at an earlier phase.

You can have an algebraic data type tagged such that some constructors can be tagged with any tag and others can be tagged only with the special tag that indicates normalization having been done. Like so:

data Normalization = IndeedNormalized | NotNormalized
data PerhapsNormalized normalization where
  ThisConstructorShouldBeNormalizedAway ∷ PerhapsNormalized NotNormalized
  ThisConstructorIsNormal ∷ PerhapsNormalized anyNormalization

function ∷ PerhapsNormalized IndeedNormalized → Int
function ThisConstructorIsNormal = 0

Here, GHC knows that the pattern matching of function is complete.

what I am trying to say

Your blog post was not easy for me to understand because you put all the reasons to copy a type definition together. I had to make this effort of taking these reasons asunder and looking at every one of them in itself, together with you.

I think the problem of versioning is central to your blog post even though you only mention it in passing. The other problems you are mentioning can be solved in other ways, some of which I have offered. But the problem of versioning cannot be solved in ways other than copying a type and then letting the copies evolve independently, with some restrictions.

You also never reveal what restrictions should be imposed on the evolution of the copies. To identify these restrictions was a hard task for me and, as you point out, I have made a bunch of mistakes. I am also not aware of any other writings that talk about this problem systematically, even though it seems important to me. No one ever told me that I should copy types and write functions (at first identities) that convert between them and add property tests that make these functions into split idempotent pairs or something such.

I am deeply thankful for your taking the time to walk through your blog post with me.

Interesting!

I did not think that newtype deriving Generic would do the right thing. Good to see it does!
View patterns certainly can get more use, and perhaps that solves a lot of my problems. There's only one issue I can think of with ViewPatterns: you have to manually write a COMPLETE pragma to get GHC to recognize that it's exhaustive. For us, we had a huge AST with many sum types, and this limitation would probably have hit us.
- https://github.com/ghc-proposals/ghc-proposals/issues/542
I'm still unsure how this solves the problem of build parallelization (ignoring our previous discussion on database serialization). Let's say you have a bunch of packages in a monorepo that do stuff with users. One option is to have all of them depend on some user-subsystem package. This means all of these packages are bottlenecked on building all the code related to users, even if the other packages don't need all the functionality.
- See my other blog post on the Services design

Thanks for discussing this with me! This was really educational for me.

brandonchinn178 / brandonchinn178.github.io