dhall-lang / dhall-haskell

Maintainable configuration files
https://dhall-lang.org/
BSD 3-Clause "New" or "Revised" License

performance advice for dhall-kubernetes based package #1890

Open ggilmore opened 4 years ago

ggilmore commented 4 years ago

Forked from https://functionalprogramming.slack.com/archives/C82N1L0MB/p1592337474256100

My team is investigating Dhall to see if it'd be a good way to ship + customize our complex Kubernetes application on our internal clusters and for our customers to use. sourcegraph/deploy-sourcegraph-dhall-archived is my PoC implementation of our Kubernetes configuration. I like what I see so far, but I'm wondering how I can improve the render times that I'm seeing.

I've posted benchmarks in sourcegraph/deploy-sourcegraph-dhall-archived's README: https://github.com/sourcegraph/deploy-sourcegraph-dhall-archived/blob/a3f32062b10e08435c6ad73085a5df8c4c30c00d/README.md#benchmarks

Going off of https://discourse.dhall-lang.org/t/figuring-out-performance-bottlenecks/251/5, I'd like to call out that my normal form is apparently ~300x the size of the Prelude.

I know that dhall-kubernetes has a lot of types, but is this size expected?


Note: I've vendored https://github.com/dhall-lang/dhall-kubernetes for sourcegraph/deploy-sourcegraph-dhall-archived following @Gabriel439's advice here: https://functionalprogramming.slack.com/archives/C82N1L0MB/p1592340512256900?thread_ts=1592337474.256100&cid=C82N1L0MB

Gabriella439 commented 4 years ago

@ggilmore: I'm still investigating, but I'll report what I know so far and various avenues that I'm exploring.

First, make sure that you are using a newer version of dhall / dhall-to-yaml since they contain caching performance improvements.

For reference, here is the timing I get after running dhall freeze --all --inplace ./pipeline.dhall on my laptop:

$ cat pipeline.dhall 
let Sourcegraph =
      ./package.dhall sha256:e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7

let ToList = Sourcegraph.ToList

let c = Sourcegraph.Configuration::{=}

in  ToList c
$ time dhall-to-yaml --file ./pipeline.dhall
…
real    0m10.717s
user    0m10.043s
sys 0m0.659s

For future reference, the only expression that you need to freeze for benchmarking purposes is the top-most one since no transitive expressions are resolved in the event of a cache hit.

I suspect (but have not confirmed) that most of the time spent is on decoding the cached expression, which is 50 MB:

$ du -hs ~/.cache/dhall/1220e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7 
 50M    /Users/gabriel/.cache/dhall/1220e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7

... which means a decoding speed of ~4 MB/s.

There are two main avenues that I'm exploring:
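
  1. Reducing the size of the cached expressions
  2. Improving the speed of decoding cached expressions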

To answer your two initial questions:

ggilmore commented 4 years ago

Hey @Gabriel439 thanks for looking into this! A few responses:

PierreR commented 4 years ago

@ggilmore I am so happy not to be an isolated soul on this planet ;-)

I have been so confused about the topic of performance/freezing that I opened this discourse discussion.

IIRC, at the time I did realize it was pointless to freeze everything. It came as quite a surprise, mainly because all the generated dhall-packages repositories that I know about (dhall-kubernetes, dhall-openshift, dhall-argocd) seem to send the wrong message in that regard.

At the end of the day, when you need to integrate these 3 different generated packages, you realize the current state is not a straightforward journey (see https://github.com/EarnestResearch/dhall-packages/issues/62). While using a fork as a workaround, this kind of code will be in your way.

To contribute in a positive way, I guess a generic API generator is the way to go, but you might not have the time/opportunity to do so.

@Gabriel439 Thank you so much for looking at this.

Gabriella439 commented 4 years ago

I plan to document more about freezing idioms and when to use them in the Dhall manual but I need to finish fixing a few warts in the import UX that I ran into while documenting things. I can give a quick summary here, though:

gregziegan commented 4 years ago

A semi-advanced example

The following code takes ~2.5-3 seconds to evaluate with dhall --file. This snippet omits dhall-k8s imports and the actual data behind bindings like secretVolumeMounts (but I promise it's only two or three small records).

This function gets imported and used in a few other modules, massively increasing the time for those modules to evaluate.
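
-- dhall-k8s imports and the small records behind secretVolumeMounts,
-- secretVolumes, and secretsContainer are omitted (see the note above)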

let Container/withSecrets
    : kubernetes.Container.Type -> kubernetes.Container.Type
    = \(container : kubernetes.Container.Type) ->
            container
        //  { volumeMounts = Some
                ( merge
                    { Some =
                        \(volumeMounts : List kubernetes.VolumeMount.Type) ->
                          volumeMounts # secretVolumeMounts
                    , None = secretVolumeMounts
                    }
                    container.volumeMounts
                )
            }

let PodSpec/withSecrets
    : kubernetes.PodSpec.Type -> kubernetes.PodSpec.Type
    = \(podSpec : kubernetes.PodSpec.Type) ->
            podSpec
        //  { volumes = Some
                ( merge
                    { Some =
                        \(volumes : List kubernetes.Volume.Type) ->
                          volumes # secretVolumes
                    , None = secretVolumes
                    }
                    podSpec.volumes
                )
            , initContainers = Some
                ( merge
                    { Some =
                        \(containers : List kubernetes.Container.Type) ->
                          containers # [ secretsContainer ]
                    , None = [ secretsContainer ]
                    }
                    podSpec.initContainers
                )
            , containers =
                Prelude.List.map
                  kubernetes.Container.Type
                  kubernetes.Container.Type
                  Container/withSecrets
                  podSpec.containers
            }

let DeploymentSpec/withSecrets
    : kubernetes.DeploymentSpec.Type -> kubernetes.DeploymentSpec.Type
    = \(spec : kubernetes.DeploymentSpec.Type) ->
        spec
        with template.spec = Some
            ( merge
                { Some = PodSpec/withSecrets
                , None =
                    PodSpec/withSecrets
                      kubernetes.PodSpec::{
                      , containers = [] : List kubernetes.Container.Type
                      }
                }
                spec.template.spec
            )

let withSecrets
    : kubernetes.Deployment.Type -> kubernetes.Deployment.Type
    = \(deployment : kubernetes.Deployment.Type) ->
            deployment
        //  { spec = Some
                ( merge
                    { Some = DeploymentSpec/withSecrets
                    , None =
                        DeploymentSpec/withSecrets
                          kubernetes.DeploymentSpec::{
                          , selector = kubernetes.LabelSelector::{=}
                          , template = kubernetes.PodTemplateSpec::{
                            , metadata = kubernetes.ObjectMeta::{
                              , name = "deployment-spec-with-secrets"
                              }
                            }
                          }
                    }
                    deployment.spec
                )
            }

in { withSecrets }

We have about 4 or 5 similarly complex functions that do deeply nested appends in this module. Including those brings the module's evaluation time up to ~5 seconds. It does drop to 3 seconds when the dhall-k8s import is frozen, though.

Running dhall --file on another module that effectively calls each of these functions once takes around 30 seconds w/o local import freezes, 22 seconds with local import freezes.

Since dhall to-directory-tree would require an explicit mapping to Prelude.JSON types, generating k8s manifests has been the one spot where we're not yet using it. In the meantime, we adopted dhall-render to skip that bit of type coercion. The bash script that called dhall-to-yaml 11 times (for 11 different configuration environments) took several minutes to complete, while the dhall-render solution takes about 20 seconds. We believe this is due to these factors:

  1. using GNU parallel resulted in too many context switches for the dhall processes (probably due to lock contention when decoding the large, cached k8s files?)
  2. not using parallel meant each dhall-to-yaml process had to pay the full cost of decoding the k8s-related code serially

Summary

  1. importing dhall-k8s and running dhall-to-yaml multiple times is very fast
  2. defining let a = dhall-k8s.SomeResource::{=} with x = y in a module, importing that module, and then running dhall-to-yaml multiple times started getting very slow (see the sketch after this list): 5 seconds -> 1 minute after introducing one function like the snippet above. Adding more functions to a module that works with something common like dhall-k8s.Deployment grew that 1 minute to 5.
  3. If dhall to-directory-tree could be used here, we'd probably see performance like dhall-render (20 seconds).
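
A minimal sketch of the pattern from item 2 (file names, field values, and the hash placeholder are hypothetical):

-- ./base.dhall: a tiny customization of a dhall-k8s resource
let kubernetes = ./dhall-k8s/package.dhall sha256:…

let deployment =
      kubernetes.Deployment::{
      , metadata = kubernetes.ObjectMeta::{ name = Some "app" }
      }

in  { deployment }

-- ./consumer.dhall: importing ./base.dhall is what became slow, because the
-- cached normal form of { deployment } inlines the full Deployment type and
-- defaults instead of referencing dhall-k8s
let base = ./base.dhall

in  base.deployment
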
PierreR commented 4 years ago

Here is another measure that might help:

~ → time dhall hash <<< https://raw.githubusercontent.com/PierreR/dhall-packages/master/openshift/package.dhall
sha256:5c0bd30ac3c1d0914770754fe82256720a04471f38a26bbc6b63bd7ebd9eea94

real    1m42.402s
user    1m34.308s
sys 0m1.727s

Not sure why it takes almost 2 minutes to compute this hash.

Gabriella439 commented 4 years ago

I appreciate the additional examples, but they're no longer necessary. At this point I have enough information to reproduce the problem. The thing that will help move this issue forward is one of the two approaches I outlined previously:

PierreR commented 4 years ago

@Gabriel439 Thanks for your input. Just a little question: is dhall hash also affected by decoder performance, or is it just the size of the cached expressions? It is not quite clear (to me) why this command would decode anything.

Gabriella439 commented 4 years ago

@PierreR: Any dhall command that interprets a cached dhall expression will decode an expression from the cache. Essentially, the cache is a content-addressable store of binary-encoded Dhall expressions, so look-up from the cache entails binary decoding.
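
To make that concrete (reusing the cache entry from earlier; the leading 1220 is the multihash tag for a 32-byte SHA-256 digest), you can decode a cache product by hand to see what the interpreter has to process on every cache hit:

$ # prints the stored expression; ~50 MB of CBOR must be decoded first
$ dhall decode < ~/.cache/dhall/1220e47072cfa23584530258b5c950cbb7921486ff2fe36a49e8828eb256fca31ef7 | head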

Gabriella439 commented 4 years ago

I wanted to give an update on the approach that I believe will ultimately address this problem.

I'm primarily focusing on reducing the size of cached expressions because I believe that improving decoding performance is a dead end. My reasoning is that even if we could improve decoding performance by 10x (and I don't think we can any time soon) that would still not be the right order of magnitude for where Dhall needs to be in order to build out to the Dhall analog of Helm.

I believe the most promising approach to reducing the size of cached expressions is to preserve let bindings for types in normal forms and inferred types instead of inlining them. In other words, an expression like:

let Age = Natural

in  λ(x : Age) → x + 0

... would normalize to:

let Age = Natural

in  λ(x : Age) → x

... and have an inferred type of:

let Age = Natural

in  ∀(x : Age) → Age

... instead of the current behavior, which is to normalize to:

λ(x : Natural) → x

... and have an inferred type of:

∀(x : Natural) → Natural

There are a few reasons I prefer this solution:

I'm still thinking through the implications and details of this change and this would require proposing a change to the language standard which might not necessarily be accepted. However, I believe the improved readability of normal forms will persuade other implementations to agree to the change.

If this change were completed I'm fairly confident it would fix the performance issues that people are reporting when creating both Kubernetes-related utilities and lists of Kubernetes resources. However, I would verify this performance improvement first before proposing the change.

Gabriella439 commented 4 years ago

Just another update on this. The original tack of preserving let-bound names in types proved to be more difficult to standardize than I anticipated, mainly because:

I haven't given up on the first approach entirely, but I'm now exploring a second less invasive approach.

The next approach I'm taking is to try adding a new type of integrity check (say, cache:sha256:…) which hashes the import that has been retrieved and parsed but not further interpreted. In other words, it encodes, hashes, and caches the expression exactly the way the user wrote the expression (including unresolved imports), before the expression has been transitively resolved, type-checked, and normalized. The only restriction would be that all transitive imports of the expression would need to be protected by integrity checks of their own (so that the final result of interpreting the expression is still immutable). In other words, this would be effectively storing a Merkle tree in the Dhall binary cache (just like how the Nix store works).

This would give a performance improvement for the same reason as the first approach I suggested: human-authored code tends to be small, so there won't be as much for the interpreter to decode.
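
A sketch of what this might look like, using the tentative cache:sha256 spelling from above (file names hypothetical):

-- ./aVeryLargePackage.dhall: all of its own imports carry ordinary
-- integrity checks, so the cache stores a Merkle tree of sources
let a = ./a.dhall sha256:…

let b = ./b.dhall sha256:…

in  { a, b }

-- ./user.dhall: encoded, hashed, and cached exactly as written, before
-- import resolution, type-checking, or normalization
let pkg = ./aVeryLargePackage.dhall cache:sha256:…

in  pkg.a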

Gabriella439 commented 3 years ago

I made some progress on debugging the performance problems with sourcegraph-dhall and it turns out that the issue might not be what I thought it was. I originally thought that the problem was storing β-normalized expressions in our cache, but that was only partly true. When I performed an experiment to store and cache non-normalized expressions, that did lead to a speed-up (~2x faster), but it revealed a larger bottleneck.

The problem appears to be that the same expression (i.e. the entirety of dhall-kubernetes in this case) appears multiple times in the transitive closure and cache, even when the closure is not β-normalized. To see why, imagine that we have the following chain of imports:

-- ./largeImport.dhall

{-  Imagine that this file represents some large expression,
    like dhall-kubernetes or some other large package
-}
…
-- ./A.dhall
let largeImport = ./largeImport.dhall

in  largeImport.foo
-- ./B.dhall
let largeImport = ./largeImport.dhall

in  largeImport.bar
-- ./C.dhall
let A = ./A.dhall

let B = ./B.dhall

in  { A, B }

If we attempt to cache ./C.dhall then the largeImport expression will show up twice in the fully-resolved expression, even if we do not β-normalize anything. This duplication is the reason why we end up with abnormally large cache sizes.

If my diagnosis is correct, then the solution is to actually change how we cache things (at least for the Haskell implementation's semi-semantic cache, and possibly also for the standard semantic cache, too), so that we permit caching unresolved expressions, so long as the imports are protected by integrity checks. Moreover, we can add integrity checks on the fly to the cached representation if their transitive dependencies are "pure".

I'll use the above example to make things more concrete. Suppose that I want to cache ./C.dhall. In order to do that I'd need to first cache all of its transitive dependencies in depth-first order, so the algorithm would be:
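
  1. Cache ./largeImport.dhall and compute its integrity check
  2. Rewrite ./A.dhall's import of ./largeImport.dhall to carry that integrity check, then cache the rewritten ./A.dhall (and likewise for ./B.dhall)
  3. Rewrite ./C.dhall's imports of ./A.dhall and ./B.dhall to carry their integrity checks, then cache the rewritten ./C.dhall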

Then when we need to resolve ./C.dhall we would fetch it from the cache and interpret it like a normal Dhall expression (including import resolution). Storing the unresolved representation for each cached file would ensure that we only store large expressions at most once within the cache, which will make things much more compact, which will in turn greatly accelerate decoding speed (because there will be far less to decode).

PierreR commented 3 years ago

@Gabriel439 is there another implementation of Dhall (Rust or Go?) that scales better, to be used with dhall-kubernetes (dhall-to-yaml)?

Gabriella439 commented 3 years ago

@PierreR: I'm not sure. A different implementation might get a constant-factor improvement from faster CBOR decoding logic, but other than that the problem I'm describing would affect all implementations, because it's an issue with the language standard that needs to be fixed.

PierreR commented 3 years ago

@Gabriel439 is there any trick that we might apply to the cache itself?

We have been transitioning toward something more "monorepo"-like. Unfortunately, a full rebuild now takes up to 30 minutes.

Here are the workarounds we have applied so far:

Is there something else we could try before the issue gets resolved?

Gabriella439 commented 3 years ago

@PierreR: Not that I know of. The main issue (which is the one I'm trying to fix) is that dhall interpreters do not permit cached expressions to contain unresolved imports, so even if you were to do cache surgery to factor things in the way I described, the interpreters would reject the cached expression.

Gabriella439 commented 3 years ago

@PierreR: Sorry, I missed your other suggestions when responding. There definitely is the option of trimming dhall-kubernetes down to only the resources you need. This can be done by wrapping the dhall-kubernetes import in a file that only projects out what you need, like this:

-- ./package-small.dhall
let kubernetes =
      https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/master/package.dhall sha256:d541487f153cee9890ebe4145bae8899e91cd81e2f4a5b65b06dfc325fb1ae7e

in  kubernetes.{ DaemonSet, Deployment, … }

… and then if you freeze and import ./package-small.dhall it should make everything across the board smaller.

However, when creating ./package-small.dhall, take care NOT to include Resource, since that's where most of the size comes from. In fact, anything you can do to eliminate the use of the Resource union will greatly improve performance since that's what's bloating the cache.
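
For example, a consumer might look like this (file name and hash placeholder hypothetical):

-- ./deployment.dhall: only the trimmed record ever enters the cache, so
-- every downstream import pays for the projection, not for all of
-- dhall-kubernetes
let kubernetes = ./package-small.dhall sha256:…

in  kubernetes.Deployment.default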

arobertn commented 3 years ago

I've been reading a number of these performance-related issues/discussions and feel like one source of problems is not being mentioned, at least not explicitly. We are using dhall-kubernetes with 311K of source files (around 60 of them), plus the Prelude and k8s, which themselves amount to 11MB. When run through dhall resolve/normalize, the result is 300MB. We then encode this to CBOR so we can load it quickly in a tool which generates code using this library. Both generating the 300MB and consuming it (dhall encode) take minutes and gigabytes of memory. (The CBOR ends up around 30MB and loads in 15 seconds.) Looking through the 300MB file reveals what our problem is: dozens, hundreds, in some cases even thousands of repetitions of both dhall-kubernetes and our own custom definitions.

The reason is that we use these types in the signatures of internal functions, and then combine their outputs to create the final yaml.

Here's an example:

let exportVolumes =
      \(doExport : Bool) ->
      \(sharedVolumeMount : k8s.VolumeMount.Type) ->
        if    doExport
        then  Some [ sharedVolumeMount ]
        else  None (List k8s.VolumeMount.Type)

We use these functions nested and compositionally as we might when working with any typed functional programming language. Is this whole style of using dhall just a complete no-no given dhall's normalizing interpreter architecture? That is, we can leverage the "typed" aspect of dhall, but not its "compositional" aspect?

Gabriella439 commented 3 years ago

@arobertn: It is fixable, but it needs to be fixed at the import layer (specifically in hashing and caching expressions). My understanding is that if this didn't go through an import (i.e. if it were all hypothetically in one big Dhall file) then it wouldn't hit the inefficiencies that you're describing.

What I mean by fixing things at the import layer is that we need to extend the cache format so that imports protected with an integrity check don't need to be resolved when caching an expression. In other words, if you have an expression like:

let k8s = ./aVeryLargePackage.dhall sha256:…

in  …

Then the cache could store the k8s reference by its integrity check instead of resolving the package and caching the fully resolved normal form of the k8s expression. Among other things, this would greatly reduce the size of cached expressions since the k8s package would only be serialized once and decoded once if every reference were protected by an integrity check. Currently, that is not the case because (as you noted) if you have an expression that references k8s in the normal form then it has to encode it for every occurrence in the expression (sort of; I'm glossing over details).

arobertn commented 3 years ago

@Gabriel439 Thanks for your input. I've been trying to verify your suggestion that this relates to imports by constructing a set of small files that import each other, plus a single combined file, and running both through dhall or dhall resolve. But I observe no impact. I've attached the example in case you can spot a key aspect I'm missing that is needed to trigger this.

My belief (which I'm not expert enough to state with more confidence or precision) is that our problem comes from the fact that dhall resolution essentially inlines all definitions down to primitive types everywhere, regardless of whether they are internal or external to the top-level file. When composite types, especially large ones like k8s.Container and friends, are used at multiple levels in signatures that then get expanded at all use points, recursively, things get out of hand.

For instance, if I have a function (f: Bool -> k8s.Container.Type), and I call it (directly or indirectly) three times inside some function g(), rather than making the final representation consist of a definition of k8s.Container.Type, one of f referring to that, and three references to f, we get three repetitions of the full k8s.Container.Type. (You can see this happening in the example.) I don't have any compiler or parser knowledge though so don't know if this is a fundamental necessity, an architectural choice, or an implementation detail in dhall.
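
A minimal self-contained sketch of that repetition, with a small stand-in for the large imported type:

-- Stand-in for a large type like k8s.Container.Type (which really has
-- dozens of fields)
let Container = { name : Text, image : Optional Text }

let f
    : Bool -> Container
    = \(b : Bool) ->
        { name = if b then "one" else "two", image = None Text }

let g
    : List Container
    = [ f True, f False, f True ]

in  g

-- In the β-normal form, each element of g is a fully expanded record
-- literal (and each None carries a fully written-out type), so a type as
-- large as k8s.Container.Type would be repeated once per use site.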

dhall_local_imports.tar.gz

Gabriella439 commented 3 years ago

@arobertn: Yeah, I should have clarified that my proposed change would include not only not resolving imports in cached expressions, but also not αβ-normalizing cached expressions.

arobertn commented 3 years ago

OK, I see. Anything I can do to help? Is there a branch somewhere or a design outline?

Gabriella439 commented 3 years ago

@arobertn: I just wrote up a design outline here: https://github.com/dhall-lang/dhall-lang/issues/1185

arobertn commented 3 years ago

Thanks, this makes the plan clear, but it raises questions in my mind about whether it would help performance overall in our use case. The steps are:

  1. ~9000 lines of function definitions involving custom and k8s types across 60 files
  2. Compile-time (currently 6 minutes, 8GB): transform top-level functions to cbor files with `echo "(./xx.dhall).f" | dhall | dhall encode > xx_f.cbor`
  3. Run-time (15 seconds, 1 GB): load xx_f.cbor into a Haskell process.
  4. Run-time (1 second, 1 GB): execute f() to generate yaml.

To me it looks like https://github.com/dhall-lang/dhall-lang/issues/1185 will greatly reduce time and space used by (2), but at the cost of the work being pushed to (3)? In other words, does the interpreter still need to construct the fully-normalized expression internally in order to execute it?

If so, then https://github.com/dhall-lang/dhall-lang/issues/1185 would not save us. (Is there some other more efficient sequence that would support (4) within the same 1-second/1GB as now? Note, we moved to this after just loading the dhall into Haskell directly became prohibitively expensive.)

Gabriella439 commented 3 years ago

@arobertn: I can provide a bit more context that is necessary to understand the performance impact of these changes:

Haskell-specific details

The https://github.com/dhall-lang/dhall-lang/issues/1185 change to the language standard would also entail a second change that is specific to the Haskell implementation.

To provide some context, the Haskell implementation actually supports two caches:
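
  1. The standard semantic cache (~/.cache/dhall), shared with other implementations, which is keyed on the hash of the fully resolved and β-normalized import
  2. A Haskell-specific "semi-semantic" cache (~/.cache/dhall-haskell), which is keyed on a hash of the expression's source as written (plus the semi-semantic hashes of its imports)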

The latter cache also stores fully resolved and β-normalized cache products, so if https://github.com/dhall-lang/dhall-lang/issues/1185 were to change the behavior of the standard cache then the Haskell implementation of Dhall would also need to make a matching change to the Haskell-specific cache to not resolve or β-normalize cache products (or perhaps to disable that cache entirely) in order to fully realize the performance benefits.

Type-checking and normalization are "lazy"

In other words, type-checking and normalization (mostly) avoid duplicate work. To be totally pedantic, the only way to completely prevent any sort of duplicated work is to implement something like Lamping's algorithm for optimal lambda calculus reduction (the book The Optimal Implementation of Functional Programming Languages covers this well). However, what Dhall and most other implementations based on normalization-by-evaluation do is a close enough approximation.

Specifically, the Haskell implementation of Dhall avoids duplication of shared terms (particularly let bindings) so that they are not inlined multiple times where they are referenced. Instead, they are stored within a type-checking or normalization context, respectively, to avoid duplicated work. So, for example, if you write something like:

let k8s = aVeryLargeRecord

in  λ(x : k8s.A) → λ(y : k8s.B) → [ k8s.Resource.A x, k8s.Resource.B y ]

… then the type-checker and normalizer will only keep around one copy of aVeryLargeRecord.

However, this "laziness" and avoidance of duplicate work does not work across import boundaries, because …

The current language standard enforces unnecessary duplication for cached imports

Whenever you import an expression the language standard requires the cache product to be stored in a fully resolved and β-normalized form. This forces the interpreter to use much more memory (because it cannot easily deduplicate such imported expressions) and also much more CPU (because it now has to interpret each occurrence of the duplicated expression instead of sharing one occurrence in the type-checking/normalization context).
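
To make this concrete, here is the same sort of sketch as the earlier in-file example, but split across an import boundary (file names hypothetical):

-- ./user.dhall
let k8s = ./aVeryLargePackage.dhall sha256:…

in  λ(x : k8s.A) → λ(y : k8s.B) → [ k8s.Resource.A x, k8s.Resource.B y ]

-- Caching ./user.dhall today stores its fully resolved, β-normalized form,
-- so the expansions of k8s.A, k8s.B, and the Resource union are inlined at
-- every occurrence. Under the proposed change, the cache could instead keep
-- the two lines above as written, with ./aVeryLargePackage.dhall stored once
-- and referenced by its hash.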

In practice, this unnecessary duplication makes the dhall-kubernetes cache product at least three times as large as it needs to be, because most of the resource types are actually stored three separate times within the cache. For example, a hypothetical resource type named A is stored in:

… whereas if the cache products could store references to imports instead of inlining everything then the type A would only be stored once, ever, within the cache and, similarly, the interpreter would only keep one copy of the decoded type A in memory when interpreting your program.

Conceptually, implementing https://github.com/dhall-lang/dhall-lang/issues/1185 means that if you use sha256:none integrity checks pervasively then cache products are always proportional to the size of the Dhall source code and, similarly, the interpreter's memory usage is (basically) proportional to the maximum of (A) the size of all input Dhall source code and (B) the size of the requested result.


So, to address your last comment: