NixOS / nixpkgs

github-runner: workDir doesn't work #289422

Open colemickens opened 4 months ago

colemickens commented 4 months ago

Describe the bug

I'm not sure if I'm using it wrong, but it doesn't really seem like workDir is ... working.

I have to pre-create the directories, and then when the service runs, it doesn't have permissions:

----------------------------------------------
Feb 13 11:18:37 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: # Authentication
Feb 13 11:18:40 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: √ Connected to GitHub
Feb 13 11:18:40 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: # Runner Registration
Feb 13 11:18:40 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: A runner exists with the same name
Feb 13 11:18:40 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: √ Successfully replaced the runner
Feb 13 11:18:42 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: √ Runner connection is good
Feb 13 11:18:42 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: # Runner settings
Feb 13 11:18:42 raisin k60s1knv2a6qgraaas8vrqdp4dhvlwpi-github-runner-raisin-default-configure.sh[897175]: √ Settings Saved.
Feb 13 11:18:42 raisin w35vw1gpw7bafgfzhcf6wacp8vx89ypb-github-runner-raisin-default-setup-work-dirs.sh[897209]: ln: failed to create symbolic link '/var/lib/github-runners/raisin-default/_diag': Permission denied
Feb 13 11:18:42 raisin systemd[1]: github-runner-raisin-default.service: Control process exited, code=exited, status=1/FAILURE
Feb 13 11:18:42 raisin systemd[1]: github-runner-raisin-default.service: Failed with result 'exit-code'.
Feb 13 11:18:42 raisin systemd[1]: Failed to start GitHub Actions runner.
Feb 13 11:18:42 raisin systemd[1]: github-runner-raisin-default.service: Consumed 1.482s CPU time, received 333.9K IP traffic, sent 34.8K IP traffic.

This is on nixos-unstable after the recent github-runner cleanup PR was merged. I was using my custom hacked thing, so I don't know if this is a regression or not.

I'm also not sure how to manually fix this given the usage of DynamicUser. Maybe some ExecStartPre or tmpfiles.d magic is missing that should be ensuring the dir is there?

see: https://github.com/NixOS/nixpkgs/pull/284814

cc: @veehaitch

colemickens commented 4 months ago

This module is really not easy for me to follow. I wish this were parameterized systemd units and used systemd state dirs more normally. I really don't think I can fix this without resurrecting my re-write.

I really hope someone can look. The way workDir is set up, linked to stateDir, tied to the bind mounts, and the ordering of stuff makes this non-trivial. But also, as far as I can tell workDir is basically completely broken and unusable.

That, plus the default behavior of putting the state in temporary storage, means that using the github-runner for nixpkgs-related things is indeed quite painful. Every single time the service or computer restarts, I have to very slowly reclone nixpkgs.

veehaitch commented 4 months ago

Thanks for reporting this! I don't agree that "workDir is basically completely broken and unusable"; it works fine in our runner configurations. However, I understand that you may have a different idea of what the workDir option should be able to do for you. Please also note that the "service will clean this directory on each service start".

Could you please describe in more detail what you are trying to do and I'll try to help? I'm also happy to review a PR if you come up with a solution to your problem. We're always looking to improve the module 🙂

colemickens commented 4 months ago

> it works fine in our runner configurations

Ah! the feeling of both shame and hope! Thanks for the kind reply, hopefully my words weren't too pointed, I certainly appreciate the module!

> Please also note that the "service will clean this directory on each service start".

That might be why I ended up radically reworking the module before dropping it after the recent refactor. :/ Sigh.

> Could you please describe in more detail what you are trying to do and I'll try to help? I'm also happy to review a PR if you come up with a solution to your problem. We're always looking to improve the module 🙂

I did take a crack at it over the weekend, but I do fear it requires more attention, and then that will require more attention to integrate a non-hacky rewrite with what we have now. Of course I don't expect anyone to do this for me, but I don't know when I'll be able to engage with this. And it also sounds like I might be holding it wrong.


My apologies for also not making my use-case more clear to guide the conversation --

basically I want the runner's job's workspace directory to persist between runs. Normally I would be quite opposed to this, but re-cloning nixpkgs on each run is an inefficiency and slowness I can't justify.

As far as I know, this is the naive default for self-hosted runners. While again, I like that NixOS tends towards clean-slate, idempotency, I really am seeking a persistent dir.

The other issue is -- my remote CI builder is very memory starved and losing 4.5GB to git checkouts on tmpfs (/run) makes everything that much harder.

veehaitch commented 4 months ago

I can very much relate to the memory issue stemming from the tmpfs. Here's how we solve that:

services.github-runners."a-runner" = {
  # ...
  # Use an additional `StateDirectory=` as `workDir`
  serviceOverrides.StateDirectory = [
    "github-runner/a-runner" # module default
    "github-runner-work/a-runner"
  ];
  workDir = "/var/lib/github-runner-work/a-runner";
};

colemickens commented 4 months ago

Interesting, that sort of makes sense. You leverage the machinery that pre-creates the state dir, it must be sequenced differently with respect to the binds, and then your work dir just lands inside of it. This gives me some ideas, thank you @veehaitch.

(My gut first impression is that the module should maybe do something like that for you if you have workDir set? (handwaving) Because I think if you change workDir to not be inside stateDirectory, you'll find yourself hitting similar errors as me.)

colemickens commented 4 months ago

I'm still having issues, with the module as it appears in nixos-unstable:

    services = {
      github-runners = {
        "${runnerName}" = {
          enable = true;
          url = "https://github.com/colemickens/nixcfg";
          tokenFile = config.sops.secrets."github-runner-token".path;
          replace = true;
          name = runnerName;
          serviceOverrides.StateDirectory = [
            "github-runner/${runnerName}" # module default
          ];
          workDir = "/var/lib/github-runner/${runnerName}"; # TODO: make sure this works
          extraLabels = [ runnerName ];
        };
      };
    };

results in:

Feb 21 18:29:23 slynux systemd[1]: Starting GitHub Actions runner...
Feb 21 18:29:23 slynux z5gpr3smv6jfmphp4b5x3y679scqjhfy-github-runner-slynux-default-unconfigure.sh[82337]: Config has changed, removing old runner state.
Feb 21 18:29:23 slynux z5gpr3smv6jfmphp4b5x3y679scqjhfy-github-runner-slynux-default-unconfigure.sh[82337]: The old runner will still appear in the GitHub Actions UI. You have to remove it manually.
Feb 21 18:29:23 slynux (igure.sh)[82358]: github-runner-slynux-default.service: Failed to set up mount namespacing: /var/lib/private/github-runner/slynux-default/.current-token: No such file or directory
Feb 21 18:29:23 slynux systemd[1]: github-runner-slynux-default.service: Control process exited, code=exited, status=226/NAMESPACE
Feb 21 18:29:23 slynux systemd[1]: github-runner-slynux-default.service: Failed with result 'exit-code'.
Feb 21 18:29:23 slynux systemd[1]: Failed to start GitHub Actions runner.

if I make the inaccessible token path optional in the module, then I get:

Feb 21 18:26:28 slynux systemd[1]: Starting GitHub Actions runner...
Feb 21 18:26:28 slynux z5gpr3smv6jfmphp4b5x3y679scqjhfy-github-runner-slynux-default-unconfigure.sh[79165]: Config has changed, removing old runner state.
Feb 21 18:26:28 slynux z5gpr3smv6jfmphp4b5x3y679scqjhfy-github-runner-slynux-default-unconfigure.sh[79165]: The old runner will still appear in the GitHub Actions UI. You have to remove it manually.
Feb 21 18:26:29 slynux systemd[1]: Started GitHub Actions runner.
Feb 21 18:26:29 slynux Runner.Listener[79257]: Unhandled exception. System.IO.IOException: Too many levels of symbolic links : '/var/lib/github-runner/slynux-default/.credentials'
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at System.IO.Strategies.FileStreamHelpers.ChooseStrategy(FileStream fileStream, String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, Int64 preallocationSize)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at System.IO.StreamReader.ValidateArgsAndOpenPath(String path, Encoding encoding, Int32 bufferSize)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at System.IO.File.InternalReadAllText(String path, Encoding encoding)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at System.IO.File.ReadAllText(String path, Encoding encoding)
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at GitHub.Runner.Sdk.IOUtil.LoadObject[T](String path, Boolean required) in /build/src/src/Runner.Sdk/Util/IOUtil.cs:line 47
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at GitHub.Runner.Common.HostContext..ctor(String hostType, String logFile) in /build/src/src/Runner.Common/HostContext.cs:line 216
Feb 21 18:26:29 slynux Runner.Listener[79257]:    at GitHub.Runner.Listener.Program.Main(String[] args) in /build/src/src/Runner.Listener/Program.cs:line 20
Feb 21 18:26:29 slynux systemd[1]: github-runner-slynux-default.service: Main process exited, code=dumped, status=6/ABRT
Feb 21 18:26:29 slynux systemd[1]: github-runner-slynux-default.service: Failed with result 'core-dump'.

nixos-discourse commented 3 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/github-runner-seeking-advice/41719/1

colemickens commented 3 months ago

@veehaitch any other thoughts?

at this point I'm looking at abandoning GHA or writing my own simpler out-of-tree module for github-runners.

colemickens commented 3 months ago

I have spent an insane amount of time on this and absolutely nothing has worked.

I'm not convinced this actually works.

colemickens commented 3 months ago

Like, I plainly don't see how your example can possibly work given the symlinking that is done in setupWorkDir. The symlinks aren't right and the runner noticeably complains.

colemickens commented 3 months ago

lrwxrwxrwx 1 61239 61239 50 Mar 27 15:39 /var/lib/github-runner/slynux-default/.credentials -> /var/lib/github-runner/slynux-default/.credentials
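
In this listing, `.credentials` links to itself, which matches the "Too many levels of symbolic links" exception earlier in the thread. In the configuration posted above, workDir is the same path as the state directory, whereas veehaitch's example registers a second StateDirectory and uses that one as workDir. The following is a minimal sketch of the posted configuration adjusted that way. It is untested, nobody in this thread confirms that it resolves either error, and `runnerName = "slynux-default"` is an assumption taken from the log lines.

```nix
# Hypothetical sketch, not a confirmed fix: give workDir its own
# StateDirectory entry instead of pointing it at the runner's state dir.
{ config, ... }:
let
  runnerName = "slynux-default"; # assumption: name taken from the logs in this thread
in
{
  services.github-runners.${runnerName} = {
    enable = true;
    url = "https://github.com/colemickens/nixcfg";
    tokenFile = config.sops.secrets."github-runner-token".path;
    replace = true;
    name = runnerName;
    serviceOverrides.StateDirectory = [
      "github-runner/${runnerName}"      # module default, as in the earlier example
      "github-runner-work/${runnerName}" # separate directory used only as workDir
    ];
    # workDir differs from the state dir, so links created in workDir
    # no longer have the same path as their targets.
    workDir = "/var/lib/github-runner-work/${runnerName}";
    extraLabels = [ runnerName ];
  };
}
```

The directory name `github-runner-work` is copied from the earlier example. Per the note quoted earlier in the thread, the service cleans workDir on each start, so whether this gives a workspace that persists between runs is a separate question.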

colemickens commented 2 months ago

If anyone else ends up here, I've hacked together something that works for me: https://github.com/colemickens/nixos-github-actions/

colemickens commented 1 week ago

Golly I wish this module worked:

Failed to set up mount namespacing: /var/lib/private/github-runner/rock5b-default/.current-token: No such file or directory

NeverBehave commented 5 days ago

Leaving a quick note for anyone who bumps into this problem under nixos-24.05: checkout@v3 seems to have trouble with the directory, but upgrading to v4 seems to fix the problem.