GHC backend: Automatically inline all projections?

After learning about {-# INLINE #-} from the ChangeLog and seeing at the top of my profile output:

COST CENTRE                                                     %time %alloc

Data.Product.Σ.proj₂                                             13.1    0.0
Data.Product.Σ.proj₁                                              7.5    0.0
Data.SUList.ListSetMap.RawLSM.RawLSM3.MapImage.mapImage₀          3.0    4.9
Data.SUList.ListSetMap.RawLSM.RawLSM3.RawLSM3-comp.comp₀          2.8    3.3

, I added INLINE pragmas in the standard library for Data.Product.Σ.proj₂ and Data.Product.Σ.proj₁, and obtained a 16% speed-up.

Close to the top of the list in my profiling output, there are quite a few more record fields ─ shouldn't they all be inlined automatically?

Summary of the attached file ─ the important lines are the runtime of the compiled program (without profiling):

RATH-Agda r2102, on schroeder  (Phenom II 3.20GHz 16GB) with Agda-2.4.3.27
(e2d0a78147b80c55a863de33ed91562719fd4da3 + Andrea Vezzosi's patch from
  https://github.com/agda/agda/issues/1625#issuecomment-132196576)
======================================================================================
Summary
                         Alloc        Max.Res.       AllocRate     real       Prod

Unchanged standard library:
---------------------------
  GHC time        679,323,659,696  2,917,647,736   1,016,353,892  14m36.243s  76.4%
  prof GHC time 1,192,444,325,288  3,524,617,584     927,567,192  90m08.111s  23.8%

  executable size:        99665432
  prof executable size:  451859936

* runtime           1,490,315,368        116,584   1,014,939,354   0m01.541s  98.7%
  prof runtime      2,718,895,360        216,144     444,212,882   0m06.323s  99.7%
   excluding prof:  1,818,898,912                                  5.84 secs

{-# INLINE Data.Product.Σ.proj₁ Data.Product.Σ.proj₂ #-}:
--------------------------------------------------------
  GHC time        681,727,821,616  3,083,103,272   1,029,968,925  14m25.511s  76.7%
  prof GHC time 1,199,772,348,296  3,543,110,528   1,070,403,733  62m00.834s  30.2%

  executable size:       99619088
  prof executable size: 452529760

* runtime           1,452,939,136        114,456   1,157,619,547   0m01.328s  98.4%
  prof runtime      2,628,923,680        219,136     561,416,590   0m04.844s  99.6%
   excluding prof:  1,764,051,360                                  4.48 secs

2016-03-05_profiling_log.txt

I thought that GHC automatically inlines small, simple functions. Perhaps it would be worthwhile investigating why inlining is not happening automatically in this case.

Some time ago, I did some experiments: After some exploration, I arrived at the following patch:

diff --git a/src/full/Agda/Compiler/ToTreeless.hs b/src/full/Agda/Compiler/ToTreeless.hs
index c68e227..2fcfbe2 100644
--- a/src/full/Agda/Compiler/ToTreeless.hs
+++ b/src/full/Agda/Compiler/ToTreeless.hs
@@ -328,8 +328,11 @@ maybeInlineDef q vs =
   ifM (lift $ alwaysInline q) doinline $ do
     def <- lift $ getConstInfo q
     case theDef def of
-      Function{ funInline = inline }
+      def'@Function{ funInline = inline }
         | inline    -> doinline
+        | isProperProjection def' -> do
+            lift $ reportSDoc "treeless.inline" 20 $ text "-- inlining projection" $$ prettyPure (defName def)
+            doinline
         | otherwise -> do
         _ <- lift $ toTreeless' q
         used <- lift $ getCompiledArgUse q

The result is that the generated Haskell code explodes, with code of the following pattern (simplified for readability):

...  = case v3 of
           Constr1 v10 v11 v12 v13 v14 v15 -> case vX of
               ... -> ...
                   ... -> case v3 of
                      Constr1 v50 v51 v52 v53 v54 v55 -> result

(With default cases and coe calls liberally sprinkled in-between.)

I thought about adding a “case folding” pass to ToTreeless, but that would still generate those huge intermediate TTerms first.

I currently think that the proper way would be something along the following lines:

Generate TTerms only in an additional ReaderT that keeps an environment of case arguments with matched patterns.
Every time you return a TTerm constructor, check whether it is a case; if yes, look the argument up in the environment, and if it is found, prune all patterns that don't match the pattern associated with the argument in the environment, don't generate a new case, but return the local body replacing the local pattern variables with those from the environment.

With informal substitution notation, the example above would then become:

...  = case v3 of
           Constr1 v10 v11 v12 v13 v14 v15 -> case vX of
               ... -> ...
                   ... -> result[ v50 := v10 , v51 := v11 , v52 := v12 , v53 := v13 , v54 := v14 , v55 := v15 ]

@UlfNorell , @phile314: Does this make sense?

I think this would introduce valuable sharing that is currently lost. Even if GHC might be able to reconstruct it with extremely aggressive options (I tried quite a few things in the past, with no significant success), it would still be very expensive on the GHC-compilation side.

Unfortunately this proposal involves a lot more Agda internals than I am currently able to handle; I am not even confident that I could arrive at a way to set up that ReaderT that stands a chance of getting accepted into the Agda code base...

I thought there was a pass in the treeless compiler that would get rid of repeated cases, but of course that still would have the problem with the huge intermediate terms. This might make a nice code sprint for next week.

Another option would be to not inline projection functions in ToTreless itself, but to add an additional pass inlining projection functions and making sure that we generate only one case expression per scrutinee (like wolfram outlined above).

@UlfNorell I also thought there is such a pass. Maybe it only works on consecutive cases?

For illustration, here is an example of what came out with that patch (I switched Haskell pretty-printing to PPInLine to avoid megabytes of spaces, and then manually indented):

  name446 = "Relation.Binary.Poset.HasMeetProps.HasMeetProps.≤-from-∧₂";;
  d446 v0 v1 v2 v3 v4 = case coe v0 of
    { MAlonzo.Code.Relation.Binary.C235 v5 v6 v7 v8 -> case coe v8 of
      { MAlonzo.Code.Relation.Binary.C217 v9 v10 -> case coe v9 of
        { MAlonzo.Code.Relation.Binary.C17 v11 v12 v13 -> coe v13 v3
          (let { v14 = coe v1 v2 v3;} in case coe v14 of
            { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> coe v15
            ; _ -> MAlonzo.RTE.mazUnreachableError
            }
          ) v2 (case coe v0 of
            { MAlonzo.Code.Relation.Binary.C235 v14 v15 v16 v17 -> case coe v17 of
              { MAlonzo.Code.Relation.Binary.C217 v18 v19 -> case coe v18 of
                { MAlonzo.Code.Relation.Binary.C17 v20 v21 v22 -> coe v21 v3
                  (let { v23 = coe v1 v2 v3;} in case coe v23 of
                    { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v24 v25 -> coe v24
                    ; _ -> MAlonzo.RTE.mazUnreachableError
                    }
                  ) (case coe v0 of
                    { MAlonzo.Code.Relation.Binary.C235 v23 v24 v25 v26 -> case coe v26 of
                      { MAlonzo.Code.Relation.Binary.C217 v27 v28 -> case coe v27 of
                        { MAlonzo.Code.Relation.Binary.C17 v29 v30 v31 -> case coe v29 of
                          { MAlonzo.Code.Relation.Binary.Core.C605 v32 v33 v34 -> coe v33
                            (let { v35 = coe v1 v2 v3;} in case coe v35 of
                              { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v36 v37 -> coe v36
                              ; _ -> MAlonzo.RTE.mazUnreachableError
                              }
                            ) v3 v4
                          ; _ -> coe MAlonzo.RTE.mazUnreachableError
                          }
                        ; _ -> coe MAlonzo.RTE.mazUnreachableError
                        }
                      ; _ -> coe MAlonzo.RTE.mazUnreachableError
                      }
                    ; _ -> MAlonzo.RTE.mazUnreachableError
                    }
                    )
               ; _ -> coe MAlonzo.RTE.mazUnreachableError
               }
             ; _ -> coe MAlonzo.RTE.mazUnreachableError
             }
           ; _ -> MAlonzo.RTE.mazUnreachableError
           }) (let { v14 = coe v1 v2 v3;} in case coe v14 of
             { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> case coe v16 of
               { MAlonzo.Code.Relation.Binary.Poset.Meet.C1 v17 v18 v19 -> coe v17
               ; _ -> coe MAlonzo.RTE.mazUnreachableError
               }
             ; _ -> MAlonzo.RTE.mazUnreachableError
             })
         ; _ -> coe MAlonzo.RTE.mazUnreachableError
         }
       ; _ -> coe MAlonzo.RTE.mazUnreachableError
       }
     ; _ -> coe MAlonzo.RTE.mazUnreachableError
     };;

This contains 15 case expressions. Applying the transformation outlined above (collapsing each of v0, v8, and v9 twice) gets this down to 9:


  name446 = "Relation.Binary.Poset.HasMeetProps.HasMeetProps.≤-from-∧₂";;
  d446 v0 v1 v2 v3 v4 = case coe v0 of
    { MAlonzo.Code.Relation.Binary.C235 v5 v6 v7 v8 -> case coe v8 of
      { MAlonzo.Code.Relation.Binary.C217 v9 v10 -> case coe v9 of
        { MAlonzo.Code.Relation.Binary.C17 v11 v12 v13 -> coe v13 v3
          (let { v14 = coe v1 v2 v3;} in case coe v14 of
            { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> coe v15
            ; _ -> MAlonzo.RTE.mazUnreachableError
            }
          ) v2 (coe v12 v3
                  (let { v23 = coe v1 v2 v3;} in case coe v23 of
                    { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v24 v25 -> coe v24
                    ; _ -> MAlonzo.RTE.mazUnreachableError
                    }
                  ) (case coe v11 of
                          { MAlonzo.Code.Relation.Binary.Core.C605 v32 v33 v34 -> coe v33
                            (let { v35 = coe v1 v2 v3;} in case coe v35 of
                              { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v36 v37 -> coe v36
                              ; _ -> MAlonzo.RTE.mazUnreachableError
                              }
                            ) v3 v4
                          ; _ -> coe MAlonzo.RTE.mazUnreachableError
                          }
                    )
           ) (let { v14 = coe v1 v2 v3;} in case coe v14 of
             { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> case coe v16 of
               { MAlonzo.Code.Relation.Binary.Poset.Meet.C1 v17 v18 v19 -> coe v17
               ; _ -> coe MAlonzo.RTE.mazUnreachableError
               }
             ; _ -> MAlonzo.RTE.mazUnreachableError
             })
         ; _ -> coe MAlonzo.RTE.mazUnreachableError
         }
       ; _ -> coe MAlonzo.RTE.mazUnreachableError
       }
     ; _ -> coe MAlonzo.RTE.mazUnreachableError
     };;

Notice that coe v1 v2 v3 occured already originally four times; if we lift this let just out one level, we can share it:


  name446 = "Relation.Binary.Poset.HasMeetProps.HasMeetProps.≤-from-∧₂";;
  d446 v0 v1 v2 v3 v4 = case coe v0 of
    { MAlonzo.Code.Relation.Binary.C235 v5 v6 v7 v8 -> case coe v8 of
      { MAlonzo.Code.Relation.Binary.C217 v9 v10 -> case coe v9 of
        { MAlonzo.Code.Relation.Binary.C17 v11 v12 v13 -> let { v14 = coe v1 v2 v3;} in coe v13 v3
          (case coe v14 of
            { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> coe v15
            ; _ -> MAlonzo.RTE.mazUnreachableError
            }
          ) v2 (coe v12 v3
                  (case coe v14 of
                    { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v24 v25 -> coe v24
                    ; _ -> MAlonzo.RTE.mazUnreachableError
                    }
                  ) (case coe v11 of
                          { MAlonzo.Code.Relation.Binary.Core.C605 v32 v33 v34 -> coe v33
                            (case coe v14 of
                              { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v36 v37 -> coe v36
                              ; _ -> MAlonzo.RTE.mazUnreachableError
                              }
                            ) v3 v4
                          ; _ -> coe MAlonzo.RTE.mazUnreachableError
                          }
                    )
           ) (case coe v14 of
             { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> case coe v16 of
               { MAlonzo.Code.Relation.Binary.Poset.Meet.C1 v17 v18 v19 -> coe v17
               ; _ -> coe MAlonzo.RTE.mazUnreachableError
               }
             ; _ -> MAlonzo.RTE.mazUnreachableError
             })
         ; _ -> coe MAlonzo.RTE.mazUnreachableError
         }
       ; _ -> coe MAlonzo.RTE.mazUnreachableError
       }
     ; _ -> coe MAlonzo.RTE.mazUnreachableError
     };;

Sharing also the associated case coe v14 changes the evaluation order, but produces something that looks like acceptable compiler output :wink: :

  name446 = "Relation.Binary.Poset.HasMeetProps.HasMeetProps.≤-from-∧₂";;
  d446 v0 v1 v2 v3 v4 = case coe v0 of
    { MAlonzo.Code.Relation.Binary.C235 v5 v6 v7 v8 -> case coe v8 of
      { MAlonzo.Code.Relation.Binary.C217 v9 v10 -> case coe v9 of
        { MAlonzo.Code.Relation.Binary.C17 v11 v12 v13 -> let { v14 = coe v1 v2 v3;} in case coe v14 of
          { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 -> coe v13 v3
            (coe v15) v2 (coe v12 v3
              (coe v15) (case coe v11 of
                          { MAlonzo.Code.Relation.Binary.Core.C605 v32 v33 v34 -> coe v33 (coe v15) v3 v4
                          ; _ -> coe MAlonzo.RTE.mazUnreachableError
                          }
                    )
           ) (case coe v16 of
               { MAlonzo.Code.Relation.Binary.Poset.Meet.C1 v17 v18 v19 -> coe v17
               ; _ -> coe MAlonzo.RTE.mazUnreachableError
               })
          ; _ -> MAlonzo.RTE.mazUnreachableError
          }
         ; _ -> coe MAlonzo.RTE.mazUnreachableError
         }
       ; _ -> coe MAlonzo.RTE.mazUnreachableError
       }
     ; _ -> coe MAlonzo.RTE.mazUnreachableError
     };;

To avoid changing the evaluation order, one could generate a pattern let, eliminating v14:

  name446 = "Relation.Binary.Poset.HasMeetProps.HasMeetProps.≤-from-∧₂";;
  d446 v0 v1 v2 v3 v4 = case coe v0 of
    { MAlonzo.Code.Relation.Binary.C235 v5 v6 v7 v8 -> case coe v8 of
      { MAlonzo.Code.Relation.Binary.C217 v9 v10 -> case coe v9 of
        { MAlonzo.Code.Relation.Binary.C17 v11 v12 v13 -> let
          { MAlonzo.Code.Relation.Binary.Poset.Meet.C13 v15 v16 = coe v1 v2 v3;
          } in coe v13 v3
            (coe v15) v2 (coe v12 v3
              (coe v15) (case coe v11 of
                          { MAlonzo.Code.Relation.Binary.Core.C605 v32 v33 v34 -> coe v33 (coe v15) v3 v4
                          ; _ -> coe MAlonzo.RTE.mazUnreachableError
                          }
                    )
           ) (case coe v16 of
               { MAlonzo.Code.Relation.Binary.Poset.Meet.C1 v17 v18 v19 -> coe v17
               ; _ -> coe MAlonzo.RTE.mazUnreachableError
               })
         ; _ -> coe MAlonzo.RTE.mazUnreachableError
         }
       ; _ -> coe MAlonzo.RTE.mazUnreachableError
       }
     ; _ -> coe MAlonzo.RTE.mazUnreachableError
     };;

Could this have any negative consequences, like for example causing space leaks? I haven't yet looked at this in detail, but it might be worth checking this before implementing. GHC's literature discusses this to some extent:

GHC's Common Subexpression Elimination (CSE) https://kar.kent.ac.uk/21455/1/Common_Subexpression_Elimination_in_a_Lazy.pdf , section 3.5
GHC's full laziness transform, http://research.microsoft.com/en-us/um/people/simonpj/papers/float.ps.gz , section 3.2

One factor coming in here to make a difference is that Agda has a module system. :wink:

Another important difference is that Haskell datatypes are by default coinductive. (All I am arguing about here is about constructors for (at most) inductive datatypes. One might even make a difference between non-recursive records and inductive recursive records. Constructors for coinductive records in Agda will definitely need to be treated differently.)

The pattern let of the last version above will evaluate coe v1 v2 v3 only once, and keep the fields v15 and v16 alive until their last use. If v13 ends up not using some of its arguments, this could mean that v15 and/or v16 are kept alive until the application of v13 is fully evaluated.

Without the let-floating, coe v1 v2 v3 will be evaluated (if needed by v13) four times.

The trimmed-down source for the example above is:

module Relation.Binary.Poset.LeqFromMeet2 where

open import Level
open import Relation.Binary using (Poset)
open import Relation.Binary.Poset.Meet using (module PosetMeet)

module HasMeetProps  {j k₁ k₂ : Level} (𝒫 : Poset j k₁ k₂)
                     (meet : (R S : Poset.Carrier 𝒫) → PosetMeet.Meet 𝒫 R S) where
  open Poset 𝒫
  open PosetMeet 𝒫
  infixr 7 _∧_

  _∧_ : Carrier → Carrier → Carrier
  R ∧ S = Meet.value (meet R S)

  ≤-from-∧₂ : {R S : Carrier} → R ∧ S ≈ S → S ≤ R
  ≤-from-∧₂ {R} {S} = ≤-from-Meet₂ (meet R S)

The situation is quite typical: v0 is a nested record value (here, 𝒫 : Poset), of which v8 and v9 are constituent records of which some fields are used (most likely via open public in the enclosing record).

v1 is the meet function; it is here applied to v2 (R) and v3 (S), producing the meet value v15 and the IsMeet proof for it in v16. Just the calculation of the value might already be quite expensive...

(The fact that this example is about “proofs” does actually not make a difference: These proofs here are, for me, in general not irrelevant, since I sometimes need them for totality proofs from which I then extract total functions.)

In many practical examples, the role of the v0 here is going to be taken by some module instantiation vMod vParam1 vParam2 ... vParam17 which might perform non-trivial computations at evaluation time, and definitely needs to be shared...

Natalie Perna and I are currently working on this in https://github.com/natalieperna/agda/tree/stable-2.5-inline-projections .

I only took a quick glance at your branch, but if I understood correctly you inline projections during the Internal -> Treeless translation. In a second step, you get rid of all the duplicated case expressions again and float them out. Wouldn't it be simpler to just not generate that many case expressions in the first place? E.g. following your suggestion in https://github.com/agda/agda/issues/1895#issuecomment-209996981 ?

agda / agda

GHC backend: Automatically inline all projections? #1895