Refactor CI actions - Githubissues

cmcmA20 commented 1 year ago

split the job into three phases: cabal/ghc, agda, library
cache all the intermediate binaries
bump ghc and cabal version

Old action failing to populate caches on unsuccessful library tests was quite infuriating. Example usage: https://github.com/cmcmA20/cubical-mini/actions/runs/4527022375 Documentation deploy doesn't work as intended but I'm not sure if I haven't botched setting it up in my repo.

@andreasabel FYI

andreasabel commented 1 year ago

@cmcmA20 : Thanks for the PR (with the lucky number 1000!).

I looked at the code and it seems like you want to cache 3 things:

GHC and Cabal installation and Cabal packages.
Agda binary and dependencies of Agda.
The cubical library build.

1+2 share the same set of paths (~/.ghcup and ~/.cabal), 3 uses a different path (_build). Since 3 was already separate from 1+2, I only comment on the first 2 caches.

The key structure in terms of program versions you suggest is (leaving out os because it is fixed to Linux):

ghc-cabal
ghc-cabal-agda

Why do you separate steps 1 + 2? You want to save the GHC installation in terms Agda fails to build? Shouldn't you assume that all your external dependencies (GHC/Cabal/Agda) build fine and you only worry about building the cubical library in the CI?

Note that the cache use of your new workflow is huge: short of 2.5 GB. With the total cache limit of 10GB, you could only accommodate master and 3 PRs. Anything beyond this would bump caches and make new commits to PRs much slower. https://github.com/cmcmA20/cubical-mini/actions/caches

One improvement thus would be to fuse 1+2 into a single cache saving action.
Whether caching the GHC installation and the cabal package index makes sense is a bit of a question per se. Restoring the cache also amounts to downloading hundreds of MB and unpacking them. Ideally installing GHC with ghcup should not do much more than that. And downloading the cabal package index with cabal update shouldn't do much more than that. And these are fetched from external locations where you do not have to pay with cache space.
But caching GHC/Cabal was already part of the original CI. Was this working correctly or did your changes (like also caching ~/.ghcup) fix it?

Your original problem seemed to have been this one:

Old action failing to populate caches on unsuccessful library tests was quite infuriating.

I agree this problem needs addressing.
Maybe it can be fixed more conservatively by just adding a actions/cache/save step for the Agda installation before trying to build the cubical library?

My suggestion would be to make a new PR with minimal changes to the current CI that achieve your goal (caching Agda before trying building the cubical library) and see if this already does the job (the bumps to GHC and Cabal versions should also be included, of course).

cmcmA20 commented 1 year ago

Shouldn't you assume that all your external dependencies (GHC/Cabal/Agda) build fine and you only worry about building the cubical library in the CI?

You're absolutely right! Regarding cache size, I was not aware of such a strict quota (10GB), my fault.

But caching GHC/Cabal was already part of the original CI. Was this working correctly or did your changes (like also caching ~/.ghcup) fix it?

If you set a different cabal/ghc version (e.g. 3.6.2.0 + 9.2.5), you can observe non-idempotence of current workflow, it works correctly only if there was a cache miss. ghcup is the culprit, it fails to find its' config (because it's not cached) and defaults to a wrong ghc version. Seems like it's using cabal version to automatically infer ghc version.

My suggestion would be to make a new PR with minimal changes to the current CI that achieve your goal (caching Agda before trying building the cubical library) and see if this already does the job (the bumps to GHC and Cabal versions should also be included, of course).

Thank you for guidance, Andreas, will ping you when it's done!

andreasabel commented 1 year ago

If you set a different cabal/ghc version (e.g. 3.6.2.0 + 9.2.5), you can observe non-idempotence of current workflow, it works correctly only if there was a cache miss. ghcup is the culprit, it fails to find its' config (because it's not cached) and defaults to a wrong ghc version. Seems like it's using cabal version to automatically infer ghc version.

I was suspecting something like that. Thanks for fixing this!

cmcmA20 commented 1 year ago

@andreasabel : I've managed to shrink the cache, it's 1.2GB now. Notice that the actual library cache sizes are ~120MB (for agda/cubical) and base branch caches are shared with child branches, so you can have (10240 - 1218) MB / 120 MB = ~75 different branches cached (including master). Cache usage is optimal now and the action is idempotent for any cache state. Please let me know whether GHPages deploys correctly.

GHCup config was not An Impostor. runhaskell is used to generate Everything.agda files so we actually need to have it in the cache. Binary is linked dynamically so all the deps are in for a ride too, and there are 37 of them. Singling them out for caching could save about 350 MB (+3 PRs) and bloat the action config file. Guess I'll save the candle for another game.

cmcmA20 commented 1 year ago

Thank you! One last fix though: cache -> cache/restore in library section. I see it works fine but better be safe than sorry. Re-running the action should take 2 minutes.

could save about 350 MB

I was wrong, you can potentially trim the external deps cache down to ~300MB but the config would still become a mess.

andreasabel commented 1 year ago

About the further procedure: @ecavallo @felixwellen @mortberg : Who is in charge of CI here? If you feel in charge, please assign yourself to this PR and handle the merge. I'd say this PR is ready, but I don't want to overstep the owners... @cmcmA20 : If you do not think this PR is ready, please communicate this (e.g. by stating it here or converting the PR to a draft).

cmcmA20 commented 1 year ago

@andreasabel I think this is only marginally better than a previous one, so let's wait for maintainers to say their word about possible directions regarding CI architecture. Until then, I'll experiment with Wen's setup/agda.

jpoiret commented 1 year ago

Hi, I was also independently working on a similar fix on my end, so let me share my thoughts:

There's no reason we should cache GHC and Cabal, they can be installed quite fast using the haskell setup action. However, they need to always be set-up, and not just on cache miss as it's being done right now, since they aren't in the cache either. This would reduce cache use even more as well.
Given cubical's development model, I'm not sure the AGDA_BRANCH ref moving around should be an immediate problem (and it's not a regression with current behavior so it shouldn't be a blocker). But, if cubical wants to test against master at some point, the agda-setup action won't help either (from what I understand). This could be done in a future PR.

Regarding the documentation not building, you'd need to set "Read and write permissions" for workflows under Settings → Actions. One other solution that would work out of the box would be to add the write permission to the action in the yaml file, but I'm not a fan of this either since it grants the rights to the whole job. Some refactoring into multiple jobs could probably be done as well, or use the new Beta for deploying pages directly via actions.

My commits were a bit more conservative, see https://github.com/agda/cubical/compare/master...jpoiret:cubical:ci-clean-up The goal was to just fix the failing CI builds for now, the rest could come later.

cmcmA20 commented 1 year ago

@jpoiret: Hi! Big thanks for the doc tip.

There's no reason we should cache GHC and Cabal, they can be installed quite fast

You'r right, tests show that the difference is negligible so it's a total waste of cache space. However, the CI flow can be decomposed into these jobs: (ghc and cabal) <+> (Agda <|> fix-whitespace) <+> test library <+> docs. And then the third stage nontrivially depending on the binaries from the first one really bites. What would be a good way to handle this?

But, if cubical wants to test against master at some point, the agda-setup action won't help either

I hadn't much time to test this but skimming through the README gave me the same impression. Anyway I've just implemented this feature.

andreasabel commented 1 year ago

If you are thinking about more revolutionary changes of this CI, please also consider using binary installs of fix-whitespace (rather than building from source). Here is an action I wrote to this end: https://github.com/andreasabel/fix-whitespace-action

andreasabel commented 1 year ago

@cmcmA20 wrote:

Documentation deploy doesn't work as intended

According to my own experiments, deployment only works if the CI is run on master, because one usually does not want update documentation while WIP.

NB: Yesterday I set up a web page deployment using the new fashion (in beta stage), using actions/upload-pages-artifact@v1 and actions/deploy-pages@v2: https://github.com/teach-plt/www/blob/79b4094c422fd8d89c0bcdb83304d55947cfdf46/.github/workflows/deploy.yml#L75-L105 This requires to change Settings/Pages to "using Github Actions" in the repo, though.

andreasabel commented 1 year ago

@cmcmA20

@andreasabel I think this is only marginally better than a previous one, so let's wait for maintainers to say their word about possible directions regarding CI architecture.

I thought the cache improvement (status as of 3 days ago) would be an incremental improvement worth merging, until a more thorough overhaul is in place. (Atm, though, the workflow is failing on octokit/request-action.)

cmcmA20 commented 1 year ago

@andreasabel

workflow is failing

It's exactly what is says on the tin, "v.2.6.3" is not a valid branch name, though it is a valid tag name, sorry for overlooking that, I've pushed a fix. This raises a question: do we want to leave it as it is (the original action config was actually misleading about the nature of these things) or do we want to support both branches and tags?

For now I consider this change ready for merge but your fix-whitspace and doc deploy actions would be a great addition, so all the CI tweaks are grouped under the same PR. Sorry for dragging this, I was under the impression that doing more than was originally planned is fine. If you have specific preferences about PR decomposition, please let me know, so we can split it into a few fine-grained ones.

andreasabel commented 1 year ago

It's exactly what is says on the tin, "v.2.6.3" is not a valid branch name, though it is a valid tag name, sorry for overlooking that, I've pushed a fix. This raises a question: do we want to leave it as it is (the original action config was actually misleading about the nature of these things) or do we want to support both branches and tags?

I think so. Tags are fixed targets, branches are movable targets. There is no guarantee that release-2.6.3 will forever match v2.6.3. In fact it even does not so atm. (And they might deviate further in the future if we push to release-2.6.3 e.g. to prepare 2.6.3.1, who knows?)

For now I consider this change ready for merge but your fix-whitspace and doc deploy actions would be a great addition, so all the CI tweaks are grouped under the same PR. Sorry for dragging this, I was under the impression that doing more than was originally planned is fine. If you have specific preferences about PR decomposition, please let me know, so we can split it into a few fine-grained ones.

Well, in terms of PR strategy, I think one should separate concerns.

The first issue was about improving caching. A fixed point was reached when caching the Agda build before trying to compile the cubical library was working. At this point there is a meaningful commit which achieves one thing, which can be merged.
The next thing is about resolving AGDA_BRANCH to a SHA rather than using it verbally. You started to work on this (which is great), but now we are again in the middle of something and cannot merge. Also, this is somewhat orthogonal to the cache improvement, it is more about getting the right version of Agda, which is another improvement.
Then, there will be an improvement of fix-whitespace etc.

All these are separate concerns and should go into separate commits, so then in case on of these points actually wasn't a strict improvement, it can be reverted easily without affecting the others.

If you agree, then let's work like this in small steps, each of which solves one topic and can be reviewed easily. (Sorry, I didn't communicate this clearly before.)

cmcmA20 commented 1 year ago

Thank you for the explanation! I've reverted the changes to the first stage where we got the caching right, so we can already reap the benefits. Everything else will go into a new folow-up PR.

Current state of affairs:

Job remains unsplit
Job behaviour is correct for any cache state: empty, only Agda+fix-whitespace, only library, both
AGDA_BRANCH doesn't really track the branch
Agda+fix-whitespace cache size is unchanged, (~300MB)
Cabal version bumped to 3.6.2.0
GHC version bumped to 9.2.5

Also rebased on the current master

andreasabel commented 1 year ago

Thanks for this journey, where we all learned new things.

agda / cubical

Refactor CI actions #1000