commercialhaskell / stack

The Haskell Tool Stack
http://haskellstack.org

Blessed recipe how to use stack on github actions, in particular caching? #5754

Open andreasabel opened 2 years ago

andreasabel commented 2 years ago

Is there blessed documentation on how to do caching with stack builds on GitHub Actions? If not, could we have some?

In particular, I wonder how to correctly cache stack builds from one CI run to the next. The resources I consulted recommend restoring the whole stack root (.stack/). I wonder whether this would overwrite parts that shouldn't be overwritten, in particular if stack was updated upstream in between. Note that for cabal, only the subdirectory .cabal/store is cached, not all of .cabal/.

For example, is the following workflow correct?

  1. stack update
  2. stack build --dry-run, generating lock file
  3. restore .stack/ if stack version and lock file have not changed in comparison with last run of CI
  4. stack build
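
Expressed as GitHub Actions steps, a minimal sketch of what I mean — the use of actions/cache and the key shown here are just placeholders for illustration, not a recommendation:

  # Sketch only: mirrors steps 1-4 above; conditioning on the stack version is omitted.
  - run: stack update
  - run: stack build --dry-run        # generates/refreshes stack.yaml.lock
  - uses: actions/cache@v3
    with:
      path: ~/.stack
      key: stack-root-${{ runner.os }}-${{ hashFiles('stack.yaml.lock') }}
  - run: stack build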

Context:

mpilgrem commented 2 years ago

I do not know that it is blessed, but v3 of actions/cache provides an example for Haskell using Stack here. That uses hashes of the stack.yaml and the package.yaml (it assumes the use of Hpack) in the keys. This repository, like many others, ignores stack.yaml.lock (there is a discussion here: https://github.com/commercialhaskell/stack/issues/4795).

However, that example - and the CI currently used in this repository - seem to me to assume that the operating system is Unix-like. The default STACK_ROOT is System.Directory.getAppUserDataDirectory stackProgName (see Stack.Config.determineStackRootAndOwnership). On Unix-like OSs, that is ~/.stack. On Windows, it is %APPDATA%/stack (usually C:\Users\<user>\AppData\Roaming\stack).

On Unix-like operating systems, Stack stores GHC and other tools in a programs directory in the STACK_ROOT. On Windows, Stack stores those tools and MSYS2 in %LOCALAPPDATA%\Programs\stack (usually C:\Users\<user>\AppData\Local\Programs\stack).

I will raise a separate issue for the lack of Windows caching in the current CI on this repository.

andreasabel commented 2 years ago

Thanks for the pointer!

v3 of actions/cache provides an example for Haskell using Stack here.

I think this example is insufficient, because it does not show how to correctly place these cache actions into a bigger workflow. E.g., how do they interact with stack update?

I am particularly interested in a blessed scheme that would avoid CI breakages coming from upstream (stack, virtual environment...) as experienced in:

mpilgrem commented 2 years ago

Can you elaborate on the stack update dimension? My understanding is that stack update will change the contents of <STACK_ROOT>/pantry (even if it is only to change <STACK_ROOT>/pantry/hackage/timestamp.json). However, changing the package index has no effect on the behaviour of Stack: see this FAQ.

andreasabel commented 2 years ago

Can you elaborate on the stack update dimension? My understanding is that stack update will change the contents of <STACK_ROOT>/pantry (even if it is only to change <STACK_ROOT>/pantry/hackage/timestamp.json).

Ok, but then restoring a cached <STACK_ROOT> after stack update (like in the OP) might be slightly problematic because it undoes the effect of stack update.

However, changing the package index has no effect on the behaviour of Stack:

This cannot be meant literally; otherwise there would be no purpose in running stack update.

mpilgrem commented 2 years ago

As I understand it, there is no substantive purpose in running stack update, see this FAQ. That is, if stack needs something that is not in the package index, it automatically updates the index and then tries again.

andreasabel commented 2 years ago

if stack needs something that is not in the package index, it automatically updates the index and then tries again.

Ah, this is very interesting to know. Thanks for the pointer!

ulidtko commented 1 year ago

Excellent question @andreasabel. I've done quite a few CI/CD pipelines for Haskell, on CircleCI, Bitbucket Pipelines, GitHub Actions... even Jenkins. I do have comments on this topic.

v3 of actions/cache provides an example for Haskell using Stack here.

I think this example is insufficient, [...]

Agreed :100: — these samples are "starters" at best.

Let me just criticize the first sample we're shown at the link — I can see four issues right away:

- uses: actions/cache@v3
  name: Cache ~/.stack
  with:
    path: ~/.stack
    key: ${{ runner.os }}-stack-global-${{ hashFiles('stack.yaml') }}-${{ hashFiles('package.yaml') }}
    restore-keys: |
      ${{ runner.os }}-stack-global-

Issue 1: pick apart ~/.stack

No, you don't want to cache the entire ~/.stack. Somewhat famously (#133), stack doesn't even try to clean up unused stuff from there; it doesn't even remove the .tar.xz archives of GHC after unpacking. "Moved to wishlist", the issue says.

Caching ~/.stack can cause super-weird issues with stale config.yaml. I had that.

Caching ~/.stack/pantry should be done, but with a different cache-invalidation-key than both ~/.stack/snapshots and ~/.stack/programs. Even though it is immutable and rebuilds on demand, rebuilding the Pantry index takes quite a while; so instead of burning pipeline time & CPU credits, I usually make a dedicated cache specifically for ~/.stack/pantry.

Caching ~/.stack/programs should be done (if you install GHC using Stack), but again with yet another cache-invalidation-key. See below.

Issue 2: hash the lockfile

hashFiles('stack.yaml') — no; this is never correct. Use hashFiles('stack.yaml.lock') instead.

Why would you want to invalidate any cache on insignificant changes in stack.yaml? Whitespace changes, comments, package-list regrouping or reordering — none of these invalidate any of the pre-compiled artifacts. Stack will happily reuse those, if you allow it to, saving pipeline time & CPU credits. hashFiles('stack.yaml') has no place in a cache-invalidation-key string.

Specifically for the cache of compiled dependency packages (see below), the hash of the lockfile is an invalidation trigger of the correct granularity. I want it to change exactly when the dependency forest changes. hashFiles('stack.yaml.lock') does exactly that.

Significant modifications to the project's stack.yaml (adding non-Stackage deps, switching forks, updating the resolver snapshot, etc.) will also generate changes in stack.yaml.lock. In this scenario you almost always want to restore a previous, already-invalidated cache copy, because the change in the dependency forest will often be small; you'll get "using precompiled package" for most of the deps instead of a full rebuild. Partial reuse of an invalid/outdated cache is very much a thing — that's why actions/cache has the restore-keys option.

Issue 3: don't hash the cabal-file

hashFiles('package.yaml') — again no, absolutely not, this is completely incorrect here. ~/.stack has very little to do with your project's package.yaml (which is a cabal-file in disguise).

Say you're compiling package acme-app, which depends on package text (within the Stackage snapshot) and package acme-missiles (not in Stackage, but on Hackage). The acme-app's package.yaml will declare that it needs these deps, perhaps with version bounds. But it's the stack.yaml (with its lockfile) that defines which specific source code will fulfill those deps. E.g. for text, it will pick the version implied by the resolver snapshot; for acme-missiles, a developer will be forced to specify an extra-deps entry... in stack.yaml.

Now. Compiled modules/bins/tests of acme-app will all go under its local .stack-work. Compiled dependencies will go under the global ~/.stack/snapshots. Assuming we already properly cache ~/.stack/snapshots by the lockfile, what are we buying with hashFiles('package.yaml')? The answer is nothing — except unnecessary full rebuilds from gratuitously invalidating the cache whenever package.yaml changes. That file doesn't matter for the validity of ~/.stack.

Issue 4: no manual override

There's a particularly gnarly type of issue in caching CI pipelines, once you start optimizing them. Cache bloat.

Variations and imperfections in the setup — e.g. caching too much, not invalidating correctly, re-caching no-longer-necessary parts of an outdated cache — will sometimes cause issues that are very difficult to pinpoint. There won't be any related "recent changes" in git. For all you know, the pipeline worked "perfectly" just last year — but gradually, developers have become increasingly stern in their complaints about slow CI. You go check — and voila, the store/restore steps dominate the pipeline duration, dwarfing the actual compile... because there are tens of gigs in the caches.

Trust me, debugging these isn't impossible. But hell it is tedious. Will easily consume days of work.

Against that background, I tend to always include a manual-override style of control in my pipelines, at the top level of most cache-invalidation-key structures. Examples below.

This is your "flush the cache now" button. Remember, CI runs in the cloud... Some day, you or your successor will love having it.

Bonus: runner.os?

One more: I've never used runner.os as a cache-invalidation-key component. (One web search later:) it seems unnecessary; on GHA you have to opt in to sharing caches across runner OSes — they aren't shared by default.

Bonus: missing id on the step

You'll practically always have steps that "warm up" / rebuild an invalid (outdated) or missing cache. Unless these steps are idempotent and fast, you'll often want something like this on them:

        if: steps.ghcup.outputs.cache-hit != 'true'

For this to work though, the preceding actions/cache step must say id: ghcup or similar.
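
For illustration, a minimal pairing might look like this (the path, key, and warm-up command here are placeholders of my own, not part of the example being discussed):

      - uses: actions/cache@v3
        id: ghcup
        with:
          path: ~/.ghcup
          key: CI-ghcup-placeholder-key   #-- placeholder; see the recipe below for a real key

      - name: Warm up the GHC cache
        if: steps.ghcup.outputs.cache-hit != 'true'
        run: ghcup install ghc ${{ env.GHC_VERSION }}   #-- hypothetical warm-up command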

TL;DR: my recipe

Opinions will differ, and I'm not trying to change anyone's mind or win prizes; just sharing experience. Hopefully this is illuminating or at least helpful.

Here's my recipe of optimal Haskell CI-pipeline caching, in GH Actions snippets.

For big projects, I'll create four caches: the compiler install, Pantry, the compiled dependencies, and the project's local build products (.stack-work) — see the subsections below.

Smaller projects (~a minute of .stack-work-only rebuild) may run well without the last one.

Near the top of the workflow YAML file, I'll set up my manual overrides (see above):

env:
  #-- increment this if you think cache of GHC installation needs cold rebuild
  MANUAL_CACHE_RESET_COMPILER: v0
  #-- increment this if you think cache of .stack-work needs cold rebuild
  MANUAL_CACHE_RESET_PRODUCTS: v0
  #-- increment this to force-rebuild the cache of dependency packages
  MANUAL_CACHE_RESET_TESTDEPS: v0
  #-- should never be needed, as stackage snapshots are immutable
  # MANUAL_CACHE_RESET_SNAPSHOT: v0

In this case, the pipeline will compile & run tests — so I'll be building with --test — hence the …_TESTDEPS.

Cache of compiler install

There're many ways to ~skin the cat~ install GHC :grin: E.g. with stack setup:

      - name: Cache GHC installation
        uses: actions/cache@v3
        id: ghc
        env:
          MANUAL_RESET: ${{ env.MANUAL_CACHE_RESET_COMPILER }}
        with:
          path: ~/.stack/programs/*/ghc-*
          key: CI-ghc-${{ env.MANUAL_RESET }}--${{ env.STACK_LTS }}

      - name: Install GHC using Stack
        if: steps.ghc.outputs.cache-hit != 'true'
        run: stack setup --install-ghc  

With ghcup:

      - name: Cache GHC installation
        uses: actions/cache@v3
        id: ghcup
        env:
          MANUAL_RESET: ${{ env.MANUAL_CACHE_RESET_COMPILER }}
        with:
          path: |
            ~/.ghcup/bin/*
            ~/.ghcup/cache/*
            ~/.ghcup/config.yaml
            ~/.ghcup/ghc/${{ env.GHC_VERSION }}
          key: CI-ghcup-${{ env.MANUAL_RESET }}--${{ env.GHC_VERSION }}

      - uses: haskell/actions/setup@v2
        if: steps.ghcup.outputs.cache-hit != 'true'
        with:
          ghc-version: ${{ env.GHC_VERSION }}
          enable-stack: true
          stack-version: "latest"

It might be difficult to specify a fixed GHC_VERSION in advance — STACK_LTS may substitute for it in the cache key, and it is easy to grab from the resolver field of stack.yaml.
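
For illustration, one way to populate STACK_LTS early in the job — a sketch of my own, assuming a stack.yaml in the repository root with a plain resolver: lts-XX.YY line and a bash shell on the runner:

      #-- Hypothetical helper step: derive STACK_LTS from the resolver field so that
      #-- later cache keys can reference env.STACK_LTS.
      - name: Read resolver from stack.yaml
        run: echo "STACK_LTS=$(sed -n 's/^resolver: *//p' stack.yaml)" >> "$GITHUB_ENV"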

Cache of Pantry

Pantry is perhaps the easiest:

      - name: Cache Pantry (Stackage package index)
        id: pantry
        uses: actions/cache@v3
        with:
          path: ~/.stack/pantry
          key: CI-pantry-${{ env.STACK_LTS }}

      - name: Recompute Stackage package index
        if: steps.pantry.outputs.cache-hit != 'true'
        run: stack update # populates ~/.stack/pantry

It's immutable; nothing really invalidates it but time. We hook the invalidation/rebuilding/recaching of the index onto acme-app updating its resolver tag; that's exactly the moment we'll want to "pull" updates there. Hence the key.

I've never happened to need a cache_reset bust on this one.

Cache of compiled dependencies

      - name: Cache Haskell dependencies
        uses: actions/cache@v3
        env:
          MANUAL_RESET: ${{ env.MANUAL_CACHE_RESET_TESTDEPS }}
        with:
          #-- NOTE no, shouldn't cache the entire ~/.stack -- that'd be bad. just these 2:
          path: |
            ~/.stack/stack.sqlite3
            ~/.stack/snapshots
          #-- NOTE the caching key structure:
          #--   * fixed ID string -- to indicate scope & purpose, descriptive;
          #--   * manual reset -- on top level, stupid simple manual override;
          #--   * resolver version -- helps maintain sleek size of the cache;
          #--   * lockfile hashsum -- as invalidation trigger of the correct granularity.
          #-- Since this cache only stores built *dependency packages* (not project code!),
          #-- we should invalidate/reupload it on each change to the dependency forest (≈lockfile).
          #--
          #-- All this decides when cache gets REBUILT (invalidated & recreated):
          key: CI-testdeps-${{ env.MANUAL_RESET }}--${{ env.STACK_LTS }}--${{ hashFiles('stack.yaml.lock') }}
          #-- All this adds fallbacks to UNPACK stale cache copies, prefix-matched:
          restore-keys: |
            CI-testdeps-${{ env.MANUAL_RESET }}--${{ env.STACK_LTS }}--
            CI-testdeps-${{ env.MANUAL_RESET }}--

The "warming up" step for this one is conceptually stack build ... --only-dependencies conditional on the cache-hit — but thanks to correct cache-invalidation-key plus Stack's consistency with reproducible builds, there's no need to have that explicitly. It works well as is, across years of project/Stackage/GHC upgrades.

Cache of local modules

      - name: Cache per-branch Haskell project buildstate
        uses: actions/cache@v3
        env:
          MANUAL_RESET: ${{ env.MANUAL_CACHE_RESET_PRODUCTS }}
        with:
          path: .stack-work
          key: CI-builddir-${{ env.MANUAL_RESET }}--${{ env.GHC_VERSION }}

As mentioned above, this is optional. For smallish packages (~tens of modules) it may not give any benefit, once the dependencies & compiler have been handled properly.

I didn't find a spot-on computation to nail the cache-invalidation-key for this one. The path structures under .stack-work won't let GHC reuse .hi files written by other versions of itself; thus at the very least GHC_VERSION should factor into the key, to avoid gradual cache bloat as your project goes through GHC version upgrades. Whether that's also the "upper bound" (and therefore the exact solution) — I don't know yet.


It works very well in practice though. HTH

andreasabel commented 1 year ago

Thanks for this detailed description, @ulidtko ! I am trying to put this into practice now.

Let me raise some doubts about the key for the dependencies (snapshots):

key: CI-testdeps-${{ env.MANUAL_RESET }}--${{ env.STACK_LTS }}--${{ hashFiles('stack.yaml.lock') }}

This key only accounts for changes in the resolver and other changes in stack.yaml (e.g. added extra-deps). However, if my code requires a new dependency (added in the .cabal or package.yaml file), this will not be reflected in a change of the stack.yaml.lock file. The latter only adds SHAs to the resolver and extra-deps, but does not specify the build plan. Consequently, the new dependency would be built but not saved to $STACK_ROOT/snapshots, because the key has not changed and no new cache is saved. This means the cache rots, "accumulating" missing packages. This can ultimately degrade build times, as those dependencies will always have to be rebuilt.

I think this key should have another component at the end that hashes the build plan. I found that the output of stack build --test --dry-run contains the build plan, listing all the dependencies (and their versions, though the versions are fixed anyway by the resolver and extra-deps). However, it is not complete either. E.g. if I specify a flag for a dependency in the stack.yaml file, it is represented neither in the output of the dry run nor in stack.yaml.lock.
So maybe taking the .cabal file and the stack.yaml file into the key, contrary to your advice, is at least sound in the sense that different plans will have different keys. It might not be perfect — e.g. if someone adds a comment to the .cabal file, the key changes while the plan stays the same. But that probably does not happen frequently, and the harm is small (a redundant cache save).
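
Concretely, the extended key I have in mind would look something like this — a sketch; the extra hashFiles component and its file patterns are my assumption:

          key: CI-testdeps-${{ env.MANUAL_RESET }}--${{ env.STACK_LTS }}--${{ hashFiles('stack.yaml.lock') }}--${{ hashFiles('stack.yaml', '**/*.cabal', '**/package.yaml') }}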

ulidtko commented 1 year ago

Hey @andreasabel, glad to get feedback. Yup, I see, good point!

Also I definitely remember seeing this happening in practice too. Good to finally realize why :sweat_smile:

Appending the cabal-file hash to the key of the deps cache is indeed a way to "solve" this (it feels conceptually wrong, but will work), and not without its own drawbacks... Same for the build plan from stack build --test --dry-run — it appears to be stateful, producing output which depends on what's already in ~/.stack/snapshots.

I simply didn't find anything better than stack's lockfile as "the perfect" value to hook cache invalidation onto. Perhaps an SQL query against stack.sqlite3 can provide that?..

Minutiae like this are the ultimate reason I always have that MANUAL_RESET field. In the absence of a perfectly cut path, the caching policy must necessarily be either too optimistic or too pessimistic. Too optimistic will exhibit rotting, but yield faster builds more often. Too pessimistic will waste CI time, but the cache will be correct and "maintenance-free". In this tradeoff, MANUAL_RESET allows leaning toward the "too optimistic" side, reaping the "fast more often" benefit¹ — while reducing the rotting aspect to a single-line "bump a number" commit every few months (if it gets bad enough in between resolver updates).

¹ rebuilding just the handful of missing deps is still much faster than a full recompile of the typical ~hundreds of deps