NixOS / nixpkgs

Nix Packages collection & NixOS

GHAs sometimes fail to fetch the repo #360271

Open Atemu opened 2 days ago

Atemu commented 2 days ago

Issue description

I've regularly observed PRs where a bunch of GHA checks fail with:

Fetching the repository
  /usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +refs/pull/356919/merge:refs/remotes/pull/356919/merge
  Error: fatal: couldn't find remote ref refs/pull/356919/merge
  The process '/usr/bin/git' failed with exit code 128

Fetching the PR should basically never fail unless GH is having a moment, and in that case we shouldn't error out but rather retry (ideally using exponential back-off).

Our current fetch step actually does retry, but it seems to attempt the fetch only twice and to wait only about 12±2 seconds each time.

Timely CI completion is a lot less critical than not sending the PR author a bunch of confusing CI failure notifications IMHO, so we should retry for a lot longer than that.
I think a timeout of 30min would be appropriate because, after that long, an error that is more critical than a temporary hiccup is likely to have occurred and it's okay to fail loudly. It's also not an unreasonable amount of time on Nixpkgs PR lifecycle time scales IMHO; you wouldn't expect anything noteworthy to have happened 30min after opening a PR. (You would of course expect the basic checks to have completed under normal circumstances, but the actually interesting interactions, such as ofBorg and reviews, are likely to take much longer.)
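
To make the proposal concrete, here is a rough sketch of what retrying with exponential back-off under a total time budget could look like in plain shell. This is not the existing fetch step and not what get-merge-commit.sh does; the function name `retry_with_backoff`, the `SLEEP` override, and the 5 second initial delay are all made up for illustration.

```shell
# Hypothetical helper (illustrative only, not part of Nixpkgs CI):
# retry a command with exponential back-off until a total waiting
# budget is used up.
#
#   $1    total number of seconds we are willing to wait between attempts
#   $2    initial delay in seconds, doubled after every failed attempt
#   $3..  the command to run
retry_with_backoff() {
  local budget="$1"
  local delay="$2"
  shift 2
  local waited=0

  until "$@"; do
    if [ "$waited" -ge "$budget" ]; then
      echo "giving up after waiting ${waited}s" >&2
      return 1
    fi
    # Never sleep past the remaining budget.
    if [ "$delay" -gt $(( budget - waited )) ]; then
      delay=$(( budget - waited ))
    fi
    echo "command failed, retrying in ${delay}s" >&2
    # SLEEP is an assumed override hook so the helper can be exercised
    # without actually waiting; it defaults to the real sleep.
    "${SLEEP:-sleep}" "$delay"
    waited=$(( waited + delay ))
    delay=$(( delay * 2 ))
  done
  return 0
}

# Example with the 30min budget suggested above (PR is a placeholder variable):
# retry_with_backoff 1800 5 git -c protocol.version=2 fetch --no-tags --prune \
#   --no-recurse-submodules --depth=1 origin \
#   "+refs/pull/$PR/merge:refs/remotes/pull/$PR/merge"
```

With these numbers the waits would be 5s, 10s, 20s and so on, with the last wait shortened so the total never exceeds 1800s. After that, one final attempt is made and the step fails loudly.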

cc @infinisil

infinisil commented 2 days ago

Agreed, this is a problem. I recently introduced get-merge-commit.sh, which fixes this problem and is already used for some workflows. It still needs to get applied to all of them though (or rather, to the ones using pull_request_target and refs/pull/.../merge).