Usability of smart constructors and large records with required, optional, and default parameters

Here lie vague notes on code generation taken from my current work on gogol, focused on the UX around constructing large records.

TL;DR: Given the issues with record sizes (# of fields), required vs optional vs default parameters, and parameter reordering/significance, I've documented some thoughts focused on UX improvement when constructing/manipulating the large records found in both amazonka and gogol. Feedback and alternative approaches/ideas welcome, add or links them in the comments below.

Problem
Improvements/Solutions

Problem

Breaking down the current products emitted by the code generator, the fields of the product (parameters) fall into three cases:

Required parameters are either part of the path, or have explicit metadata in the service description stating they must be present. They have no sensible default value.
Optional parameters are not required, typically part of the query string, headers, or request body where we have no further metadata whether the type is required or not, so the generator emits a Maybe wrapper.
Default parameters can either be required or optional, but have a default value specified by the service definition.

The code generator emits the following record with each of these cases, as:

data Foo = Foo
  { a :: Int       -- ^ Required.
  , b :: Maybe Int -- ^ Optional.
  , c :: Int       -- ^ The service definition explicitly states that field 'c' is required, but has a default of 123.
  }

Haskell's standard record syntax is still the gold standard in terms of ergonomics and error messages when constructing new records, assuming you use the field names from Foo above:

Foo
  { a = 1
  , b = Nothing
  , c = 123 -- You'd have to remember this, or float out defaultFooC in codegen, etc.
  }

Which doesn't scale given the size of records that we're often dealing with in libraries like amazonka-ec2. To improve this, we generate a "smart constructor" (not really, but internal terminology) of the form:

newFoo :: Int -> Foo
newFoo a = Foo { a, b = Nothing, c = 123 }

This helps, but introduces a new problem - as we ratchet this up to support larger and larger records with more and more Int fields, the order of parameters becomes significant:

data Bar = Bar
  { a :: Int       -- ^ Required.
  , b :: Maybe Int -- ^ Optional.
  , c :: Int       -- ^ Default to 123.
  , d :: Int       -- ^ Required.
  , e :: Int       -- ^ Required.
  }

newBar :: Int -> Int -> Int -> Bar
newBar a d e = Bar { a, b = Nothing, c = 123, d, e }

Which is more or less where amazonka and gogol are today.

Improvements/Solutions

(Some of these are sensible, some are (hopefully) clearly not.)

1. Named parameters

There are various implementations of named parameters available on Hackage, but pretty much all of them force the user to deviate away from plain record syntax.

One example, using the named package for Bar:

newBar ::
  "a" :! Int ->
  "d" :! Int ->
  "e" :! Int ->
  Bar
newBar (arg #a -> a) (arg #d -> d) (arg #e -> e) =
  Bar { a, b = Nothing, c = 123, d, e }

And usage via

newBar ! #a 1  ! #d 2 ! #e 3

I'm personally unconvinced of the ergonomics in this specific case - the use of OverloadedLabels in combination with custom operators just seems like a lot of noise.

2. Newtyped parameters

The generator could float out newtypes for every single field/Type, ie:

newtype A a = A a
newtype B a = C a
newtype C a = C a

data Baz = Baz
  { a :: A Int         -- ^ Required.
  , b :: Maybe (B Int) -- Optional.
  , c :: C Int         -- ^ Default to 123.
  }

newBaz :: A -> C -> Baz
newBaz a c = Baz { a, b = Nothing, c }

Note: you can't use only the field label since it's ambiguous across all possible {field, Type} permutations, so need the type, either in the datatype name newtype AInt .., or polymorphically, as in the example above.

I've attempted variations of this in the past and you get to choose either:

Horrible naming due to the sheer number of conflicting types (field names).
Imprecise types that look safe, but are in fact, not. ie. you union/merge all newtypes for fields matching a :: Int into AInt, but a :: Int from record Foo has no semantic equivalence to field a :: Int from record Bar, but you think they do, because .. types.

Generate required specifications

The generator could similarly float out records containing only the required parameters:

data BazRequired = BazRequired
  { a :: Int
  }

data Baz = Baz
  { a :: Int       -- ^ Required.
  , b :: Maybe Int -- ^ Optional.
  , c :: Int       -- ^ Default to 123.
  }

newBaz :: BazRequired -> Baz
newBaz BazRequired {a} -> Baz { a, b = Nothing, c = 123 }

This results in 2 * n record datatypes, though.

3. Record pattern synonyms

This is by far the nicest approach I've encountered so far as it allows using regular record syntax along with a sane way to specify default record values. But, it requires GHC >= 9.2 due to the use of PatternSynonyms having the following constraint:

We define field labels of P to be the set {f1, ..., fn}. No two record pattern synonyms can have duplicate field labels.

This appears to be lifted if NoFieldSelectors is enabled.

See: Record Pattern Synonyms

-- | Permanently deletes an empty bucket.
data StorageBucketsDelete = StorageBucketsDelete
  { bucket :: Text
    -- ^ Name of a bucket.
  , ifMetagenerationMatch :: Maybe Int64
    -- ^ If set, only deletes the bucket if its metageneration matches this value.
  , ifMetagenerationNotMatch :: Maybe Int64
    -- ^ If set, only deletes the bucket if its metageneration does not match this value.
  , provisionalUserProject :: Maybe Text
    -- ^ The project to be billed for this request if the target bucket is requester-pays bucket.
  , userProject :: Maybe Text
    -- ^ The project to be billed for this request. Required for Requester Pays buckets.
  , prettyPrint :: Bool
    -- ^ Whether the response should be pretty-printed. Defaults to 'True'.
  } deriving stock (Eq, Show, Generic)

-- | A replacement for the existing smart constructor idiom:
pattern StorageBucketsDeleteDefaults
  { bucket
  } = StorageBucketsDelete
      { bucket                   = bucket
      , prettyPrint              = True
      , ifMetagenerationMatch    = Nothing
      , ifMetagenerationNotMatch = Nothing
      , provisionalUserProject   = Nothing
      , userProject              = Nothing
      }

Usage:

StorageBucketsDeleteDefaults { bucket } :: StorageBucketsDelete

(StorageBucketsDeleteDefaults { bucket })
  { prettyPrint = False
  , userProject = Just "foo"
  }

4. HKD

This is probably the easiest method to support right now for both the supported GHC versions + state of the code generator, the UX also seems decent. Taken from here.

data Fields
  = Defaults -- ^ A partially-specified record.
  | Complete -- ^ A fully-specified record.

type family Required (f :: Fields) a where
  Required Defaults a = () -- When defining defaults, required fields are ().
  Required Complete a = a

-- | Permanently deletes an empty bucket.
data StorageBucketsDelete f = StorageBucketsDelete
  { bucket :: Required f Text
    -- ^ Name of a bucket.
  , ifMetagenerationMatch :: Maybe Int64
    -- ^ If set, only deletes the bucket if its metageneration matches this value.
  , ifMetagenerationNotMatch :: Maybe Int64
    -- ^ If set, only deletes the bucket if its metageneration does not match this value.
  , provisionalUserProject :: Maybe Text
    -- ^ The project to be billed for this request if the target bucket is requester-pays bucket.
  , userProject :: Maybe Text
    -- ^ The project to be billed for this request. Required for Requester Pays buckets.
  , prettyPrint :: Bool
    -- ^ Whether the response should be pretty-printed. Defaults to 'True'.
  }

newStorageBucketsDelete :: StorageBucketsDelete Defaults
newStorageBucketsDelete =
  StorageBucketsDelete
  { bucket                   = ()
  , ifMetagenerationMatch    = Nothing
  , ifMetagenerationNotMatch = Nothing
  , provisionalUserProject   = Nothing
  , userProject              = Nothing
  , prettyPrint              = True
  }

-- Due to the use of HKD in the record, you end up needing standalone deriving
-- for stock instances:

deriving stock instance Eq StorageBucketsDelete
deriving stock instance Show StorageBucketsDelete
deriving stock instance Generic StorageBucketsDelete

All functions/instances then take 'StorageBucketsDelete Complete', forcing the bucket field to be specified:

newStorageBucketsDelete { }                :: StorageBucketsDelete Defaults
newStorageBucketsDelete { bucket = "foo" } :: StorageBucketsDelete Complete

You can also transparently use the smart constructor as expected:

newStorageBucketsDelete
  { bucket      = "foo"
  , prettyPrint = False
  , userProject = Just "bar"
  }

5. Optional parameter-less records

If you were to move all optional and default parameters out of the records and into HasField or a custom typeclass, you get a minimum specification of the required fields you can wrangle via standard record syntax:

-- | Permanently deletes an empty bucket.
data StorageBucketsDelete = StorageBucketsDelete
  { bucket :: Text
    -- ^ Name of a bucket.
  }

-- Look, no smart constructor!

But then where do you actually store the missing parameters?

One approach taken by kubernetes-client-core is to have a variety of plumbing upon which to hang instances and insert parameters into a backing store. In their case, it's unfortunately uni-directional as the act of setting a parameter also serialises it. You also move all field documentation onto the instances, further breaking up parameter help + discovery. You also lose any instances like Generic over the total possible set of parameters, instead it'd be restricted to only required parameters.

Having used that library extensively in anger, I can't say I'm a fan of the usability - but there are ways to recover both getting, setting, and instances, which I'll outline in a half-baked manner here. Likely you'd use large-records and hasField with OverloadedRecordDot:

-- | A wrapping container for any optional parameters of @a@.
--
-- (You can also potentially do away with the 'AWSRequest' typeclass by extending this
-- to 'data Request a b = ...`, a story for another day.)
data Request a = Request
  { request :: a
  , params :: Vector Any
  }

-- | Using the minimal StorageBucketsDelete from above.
--
-- Note: while we get field access/update via `HasField` instances below, you'd still need `PatternSynonyms` 
-- and `NoFieldSelectors` to recover record syntax, again, unless you use large-records directly.
newStorageBucketsDelete :: Request StorageBucketsDelete
newStorageBucketsDelete =
  Request StorageBucketsDelete { bucket } $
    Vector.fromList
        [ unsafeCoerce (Nothing @Int64)
        , unsafeCoerce (Nothing @Int64)
        , unsafeCoerce (Nothing @Text)
        , unsafeCoerce (Nothing @Text)
        , unsafeCoerce True
        ]

-- | If set, only deletes the bucket if its metageneration matches this value.
instance HasField "ifGenerationMatch" (Request StorageBucketsDelete) (Maybe Int64) where
  hasField r = (set, get)
    where
      set a = r.params = Vector.modify (\v -> Vector.write v 0 (unsafeCoerce a)) v
      get   = unsafeCoerce (r.params Vector.! 0)

-- And so on for each field in the backing vector ...

It will improve compilation time significantly (due to large-record), but it completely blows all your Generic instances and similar out of the water - you need the generator to emit stock-equivalent instances for Show and the like, unless deemed unnecessary. It also relies on HasField to be able to poke around inside Request a to get/set any of the required or optional parameters.

You could alternatively use Map Text Dynamic for a safer alternative at the expense of suboptimal field access. It would be easier to recover the lost instances, though.)

I am against a solution which excludes GHC >=8.8 && <9, because I think the installed base is still really large, and I think there are still a bunch of lagging libraries that aren't GHC-9 ready. This makes me most in favour of option 4 right now., but I have a few quibbles:

The name of either the type family or its index type isn't quite right to me: it's indicating fields that are both required and lack a default, and so I'd suggest maybe data Completion = Missing | Complete. Then Required Missing Int reads to me like "required but missing Int".
You could avoid the type family completely by using f :: Type -> Type in your records, and substituting Proxy or Identity. This may be a bit confusing for readers unfamiliar with the pattern, and it admits silly things like f ~ IO. Possible upside: it allows the records to have lifted Functor etc instances like you find in rank2classes/barbies/conkin, but I don't think that's especially useful.
If you want to retain the type family, you could put a kind signature on the type argument to the records, for the reader's benefit.
You cannot go from a StorageBucketsDelete Missing to a StorageBucketsDelete Complete using common lens idioms, unless it has at most one required-but-missing field. Consider newStorageBucketsDelete & #bucket .~ "foo": If we tried to set a second required-but-missing field using (&) we'd run into trouble because the type between the two applications wouldn't fit either Missing or Complete. You might be able to get away with newFoo & (#bar .~ "baz") . (#baz .~ "quux"), but I haven't tested this. Anyway, this isn't functionality we're losing, it just feels a bit inconsistent.

TL;DR: If we want this in ASAP, I support the HKD approach, with tweaks (1) and (3) in this post. If we're willing to wait for more people to adopt GHC >=9.2, then I'd probably recommend record pattern synonyms, as part of another major release.

As mentioned in the opening - treat this as documentation of exploratory work. There is no intention to push a particular approach within any foreseeable release window(s).

The name of either the type family or its index type isn't quite right to me [...]

The naming is taken directly from Chris Done's original gist, I've no particular attachment to any of it.

You could avoid the type family completely [...]

A record which just contains defaults and no required parameters isn't intended to be some generally useful value, it's meaningless without being fully saturated, hence the lack Show, Eq, and the like.

Paramterising the record over f :: Type -> Type means you now have to supply every required value wrapped in Identity, which is what the closed type family is explicitly avoiding. I'm not arguing for/against, but pointing out why it's there - you either get naked required fields with a type family, or Identity wrappers without.

I agree that lifted instances aren't useful - the only fields making use of f within a record are the required fields.

You cannot go from a StorageBucketsDelete Missing to a StorageBucketsDelete Complete using common lens idioms

You're not supposed to - if that's ambiguous or a stumbling block then it's probably a bad design. Consider:

-- Old:
newStorageBucketsDelete "bucket" & #bucket .~ "foo"

-- New:
newStorageBucketsDelete { bucket = "bucket" }  & #bucket .~ "foo"

The only part that has been changed here is to produce a value of StorageBucketsDelete we've gone from a function :: Text -> StorageBucketsDelete with associated parameter ordering woes, to using record syntax where we get named/freely ordered required parameters and punning.

The potential for ambiguity/misuse here seems to be that newStorageBucketsDelete { ... } is an atomic statement and newStorageBucketsDelete is not meaningful/useful until the required parameters are fully saturated.

An approach to supporting synonyms could be to retain the existing smart constructors and CPP 🥴 everything up the wazoo:

flag NoFieldSelectors
  description: If the NoFieldSelectors extension should be enabled, when available.
  default:     True
  manual:      False

library
  if flag(NoFieldSelectors) && impl(ghc >= 9.2.1)
    cpp-options:      -DNOFIELDSELECTORS
    default-extensions: NoFieldSelectors

{-# LANGUAGE CPP #-}

newStorageBucketsDelete :: Text -> StorageBucketsDelete
newStorageBucketsDelete bucket =
  StorageBucketsDelete
  { bucket
  , ifMetagenerationMatch    = Nothing
  , ifMetagenerationNotMatch = Nothing
  , provisionalUserProject   = Nothing
  , userProject              = Nothing
  , prettyPrint              = True
  }
#ifdef NOFIELDSELECTORS
{-# DEPRECATED newStorageBucketsDelete "Use the MkStorageBucketsDelete synonym instead" #-}

pattern MkStorageBucketsDeleteDefaults
  { bucket
  } = StorageBucketsDelete
      { bucket                   = bucket
      , ifMetagenerationMatch    = Nothing
      , ifMetagenerationNotMatch = Nothing
      , provisionalUserProject   = Nothing
      , userProject              = Nothing
      , prettyPrint              = True
      }
#endif

As mentioned in the opening - treat this as documentation of exploratory work. There is no intention to push a particular approach within any foreseeable release window(s).

That's cool. I think if you're willing to explore and wait, then there's probably a good/ergonomic solution that becomes available once the 9.x record stuff matures.

You cannot go from a StorageBucketsDelete Missing to a StorageBucketsDelete Complete using common lens idioms

You're not supposed to - if that's ambiguous or a stumbling block then it's probably a bad design. Consider:
-- Old:
newStorageBucketsDelete "bucket" & #bucket .~ "foo"

-- New:
newStorageBucketsDelete { bucket = "bucket" }  & #bucket .~ "foo"
The only part that has been changed here is to produce a value of StorageBucketsDelete we've gone from a function :: Text -> StorageBucketsDelete with associated parameter ordering woes, to using record syntax where we get named/freely ordered required parameters and punning.

I understand this, but it feels a little weird if you try to do half of the setting with a record update (which you must), and then the rest with lens idioms. Would it be easier to do things HKD-style, but populate the record with a single update? That seems potentially fine - it's not often that you're going to want to incrementally build up a request using (<>~) or whatever.

The potential for ambiguity/misuse here seems to be that newStorageBucketsDelete { ... } is an atomic statement and newStorageBucketsDelete is not meaningful/useful until the required parameters are fully saturated.

Indeed. I'd try it out without lens, and with a single record update that fills all the parameters at once. I suspect that will look reasonable.

My two cents on this, though I'll admit up front I'm not a current Amazonka user, merely a past one.

I really like #3, from a performance point of view - it seems like the option most likely to allow unpacked fields in the records, which can help a lot with memory usage and fragmentation. I've worked on projects using Amazonka where we were passing through many GB of data through Amazonka types (possibly on the order of hundreds of GB per day), and anything that could help reduce GC time and memory usage would be very valuable. I know it's to something we currently generate, but IMO unpacking fields, particularly in response types, could be a decent win for many applications.

IMO the CPP version above isn't too bad, as it means that we have a nice way solution for the future, with an appropriate backwards compatible solution for anyone stuck using very old compilers.

I also quite like the BazRequired record version too (and honestly have no problem with the current function based one - as long as we never change the order of arguments between Get runs that should be fine... maybe that's not possible to guarantee though).

brendanhay / amazonka