Automatic data types fusion when deriving generator for constructor

buzden commented 1 year ago

This can be seen as generation-time optimisation by growing the size of derived code. @AlgebraicWolf tested fusion in an ad-hoc example and it gave better results. We can generalise this.

The idea is to generate a data type for each data constructor (which is of a (complex?) dependent type) which would contain all constraints of data types of all constructor parameters. When we generate arguments for this particular constructor, instead of a sequence of subgeneration calls and then assembling it to the constructor's call, we can simply generate that specialised data structure and use it to generate the constructor call.

For example, consider couple of GADTs:

data X : Nat -> Type where
  X1 : Double -> X1
  Xn : (n : Nat) -> X (S n)

data Y : Nat -> Type where
  Y0 : String -> Y 0
  Y1 : Char -> Y 1
  Yn : Y n

If we derive a generator for data type Z defined like

data Z : Type where
  Zx : Z
  Zn : (n : Nat) -> X n -> Y n -> Z

we now make generator for Z to be alternatives of generator for Zx and generator for Zn, where generator of Zn is an alternative between approximately

do (n ** xn) <- genAnyX
   yn <- genY n
   pure $ Zn n xn yn

and

do (n ** yn) <- genAnyY
   xn <- genX n
   pure $ Zn n xn yn

What is suggested here is to create a special data type

data ForZn : Type where
  ForZn_X1_Y1 : Double -> Char -> ForZn
  ForZn_X1_Yn : Double -> ForZn
  ForZn_Xn_Y1 : Char -> ForZn
  ForZn_Xn_Yn : (n : Nat) -> ForZn

which contains all possible data from all possible arguments of the constructor Zn in a fused way. So, there exists a function zn : ForZn -> Z which produces all possible calls to constructor Zn:

zn : ForZn -> Z
zn $ ForZn_X1_Y1 d k = Zn $ 1 (X1 d) (Y1 k)
zn $ ForZn_X1_Yn d   = Zn $ 1 (X1 d) Yn
zn $ ForZn_Xn_Y1 c   = Zn $ 1 (Xn 0) (Y1 k)
zn $ ForZn_Xn_Yn n   = Zn $ (S n) (Xn n) Yn

So, generator for Zn can simply look like zn <$> genForZn, and generator for ForZn can be simply derived as usual (it does not contain dependent types in its constructors, so this mechanism should not be applied to it recursively).

AlgebraicWolf commented 6 months ago

A simple example demonstrating the importance of fusion would be the following set of data types:

data X : Nat -> Nat -> Type where
  MkX : X 42 m

data Y : Nat -> Nat -> Type where
  MkY : Y n 1337

data Z : Type where
  MkZ : (n : Nat) -> (m : Nat) -> X n m -> Y n m -> Z

Currently, the derived generator for Z would turn out to be grossly inefficient. There are two possible orderings of argument generation for constructor MkZ: X[], Y[0, 1] and Y[], X[0, 1], and either one is impractical to use.

However, this specific case could be solved with even the simplest fusion algorithm that works only on non-recursive data types. It would result in the following type:

XY : Nat -> Nat -> Type where
  MkXMkY : XY 42 1337

Which can be efficiently generated.

buzden commented 6 months ago

Just for a record, when you have the following data type definitions (based on the comment above)

data X : Nat -> Nat -> Type where
  MkX : X 42 m
  MkX2 : X n m
  MkX3 : X 42 24

data Y : Nat -> Nat -> Type where
  MkY : Y n 1337

data Z : Type where
  MkZ : (n : Nat) -> (m : Nat) -> X n m -> Y n m -> Z

having available the following (effective) generators

genX : Fuel -> Gen MaybeEmpty (n ** m ** X n m)
genXm : Fuel -> (m : Nat) -> Gen MaybeEmpty (n ** X n m)
genXnm : Fuel -> (n, m : Nat) -> Gen MaybeEmpty (X n m)

genY : Fuel -> Gen MaybeEmpty (n ** m ** Y n m)
genYn : Fuel -> (n : Nat) -> Gen MaybeEmpty (m ** Y n m)
genYnm : Fuel -> (n, m : Nat) -> Gen MaybeEmpty (Y n m)

currently we derive generator for Z which is equivalent to the following variant (using simplicifaction hack):

genZ : Fuel -> Gen MaybeEmpty Z
genZ fl = do
  (n ** m ** x) <- genX fl
  y <- genYnm fl n m
  pure $ MkZ n m x y

or, alternatively, this one

genZ' : Fuel -> Gen MaybeEmpty Z
genZ' fl = do
  (n ** m ** y) <- genY fl
  x <- genXnm fl n m
  pure $ MkZ n m x y

or alternative of both is simplification-hack option is turned off. Notice, that technically derived code is not the same, but it is equivalent for this particular example.

If we tries to use most effective subgenerators for both X and Y, we would run into a chicken and egg problem:

genZ'' : Fuel -> Gen MaybeEmpty Z
genZ'' fl = do
  (m ** y) <- genYn fl ?generated_below
  (n ** x) <- genXm fl m
  pure $ MkZ n m x ?y's_n_does_not_unify

So, what we could do is to employ the fusion. For this particular case, we can see than in the MkZ when using X and Y types, only two free variables are used, n and m, thus we can derive a fused data type indexed by them:

data XY : (n : Nat) -> (m : Nat) -> Type where
  MkX_MkY : XY 42 1337
  MkX2_MkY : XY n 1337

It contains a cartesian product of all the constructors that match each other (i.e. where actual indices unify with each other).

Also, we can derive the splitting function

splitXY : XY n m -> (X n m, Y n m)
splitXY MkX_MkY        = (MkX {m=1337}     , MkY {n=42})
splitXY (MkX2_MkY {n}) = (MkX2 {n} {m=1337}, MkY {n})

Thus, we can call usual derivation for the fused type:

genXY : Fuel -> Gen MaybeEmpty (n ** m ** XY n m)

and derive the following generator for Z:

genZ_ultimate : Fuel -> Gen MaybeEmpty Z
genZ_ultimate fl = do
  (n ** m ** xy) <- genXY fl
  let (x, y) = splitXY xy
  pure $ MkZ n m x y

buzden / deptycheck

Automatic data types fusion when deriving generator for constructor #55