antifuchs opened this issue 1 year ago · status: Open
Updating to mention that this also fails with the latest released version of flyctl, 0.1.12: https://github.com/antifuchs/nixpkgs-bug-repro-flyctl/commit/a3eb32abc5d6fb8c04c2d08c5abe7e09d8f5e581 - invoking the `#latest` flake app also hangs indefinitely, without output. My next suspicion would be that it's related to the version of the golang compiler used to build it (maybe in combination with the resource limits in place?), but I haven't been able to replicate this at all using a systemd-run invocation with the given service options in place.
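For reference, such a systemd-run replication attempt might look like this (a sketch - the property list here is illustrative, the real one should be lifted from the generated github-runner unit):

```console
$ sudo systemd-run --wait --pty \
    -p DynamicUser=yes \
    -p ProtectSystem=strict \
    -p ProtectHome=yes \
    -p NoNewPrivileges=yes \
    -p RestrictNamespaces=yes \
    ./result/bin/flyctl version
```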
Since plain systemd-run works, it's likely not the environment / app itself but some specific unit options that break it. Have you tried commenting out every single protect setting / restriction from your runner service and re-enabling them one by one?
Also, a basic question, but have you got anything relevant in the journal?
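One way to do that bisection without rebuilding the NixOS config each time (a sketch; the unit name depends on your runner's configured name):

```console
# see which hardening options the generated unit actually sets
$ systemctl cat github-runner-<name>.service

# open a drop-in editor; adding e.g. these lines under [Service]
# neutralizes two common suspects (an empty SystemCallFilter= resets the list):
$ sudo systemctl edit github-runner-<name>.service
# [Service]
# SystemCallFilter=
# MemoryDenyWriteExecute=false

$ sudo systemctl restart github-runner-<name>.service
```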
Nothing relevant in the journal for github-runner, it's generally super quiet. I'll try to binary-search the settings that cause the hangup now.
OK, so that is interesting. I have disabled all the restrictions on the github-runner unit, and it doesn't affect the behavior at all: the process still hangs. Since I'm running github-runner in an FHS user env, next I'm going to test whether flyctl survives being launched in one (which should also allow for much faster iteration, if it fails there).
OMG, so now I've got this reproduced quite a bit, and it's even more insidious. To trigger this, you need to build the binary with the Go toolchain from nixpkgs-unstable - the nixpkgs-22.11 branch works correctly!

I added a test program to the repro repo, under https://github.com/antifuchs/nixpkgs-bug-repro-flyctl/tree/main/testapp - when I run the `testapp-cgo` attribute (built to ensure cgo builds a dynamically-linked binary), it hangs exactly like flyctl does (which is also a cgo-built, dynamically-linked binary).

That binary runs correctly under the github-runner when built with `pkgs-stable.buildGoModule` (or `buildGo119Module`); built with `pkgs-unstable.buildGo*Module`, it fails to start up within 1 minute.
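For anyone wanting a stand-in for such a test binary outside of Nix, this is the general shape (a sketch - the actual testapp lives in the repro repo; `file` output wording varies by platform):

```console
$ cat > main.go <<'EOF'
package main

// The blank cgo import forces the build through cgo, which
// produces a dynamically-linked binary by default.
import "C"

import "fmt"

func main() {
	fmt.Println("started OK")
}
EOF
$ go mod init testapp
$ CGO_ENABLED=1 go build -o testapp-cgo .
$ file testapp-cgo   # expect "dynamically linked", not "statically linked"
```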
So that's great(?) news; I'll work on getting github-runner out of the way of figuring this out now.
So I'm curious if github-runner does anything with sandboxing / syscall filtering / resource limiting...
An idea for further debugging (sorry, don't have time to dig into it myself): instead of running fly itself, run a script which dumps the output of `export`, `ulimit -a`, and maybe `readlink /proc/$$/task/*/ns/* | sort -u`.

Then try to run fly directly (without the github-runner) in a similar environment - same variables, same limits... the namespaces may be a bit harder to reproduce, but at least you'll know if cgroups limit you in some way.
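A minimal sketch of such a dump script (the cgroup line goes beyond the list above, to cover the cgroup suspicion):

```console
$ cat > dump-env.sh <<'EOF'
#!/usr/bin/env bash
# Capture everything that might differ between the runner's
# environment and an interactive shell.
echo '--- environment ---'; export -p
echo '--- ulimits ---';     ulimit -a
echo '--- namespaces ---';  readlink /proc/$$/task/*/ns/* | sort -u
echo '--- cgroup ---';      cat /proc/self/cgroup
EOF
$ chmod +x dump-env.sh
```

Run it as the workflow step instead of fly, then run it again from a terminal and diff the two outputs.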
Your suggestion brought up an idea: since the github runner is running in an FHS user env (which includes a chroot-like environment) and the program loads dynamic libraries, could mismatched libraries cause issues? Concretely, this is an FHS user env built from nixpkgs release-22.11, used by a dynamically-linked program built with a nixpkgs-unstable toolchain. I believe the versions match up, but that's a factor I hadn't tested yet.
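A quick way to check for such a mismatch (a sketch; the binary path and how one enters the FHS env are assumptions):

```console
# outside the FHS env:
$ ldd ./testapp-cgo | sort > /tmp/libs-outside.txt
# from a shell inside the FHS user env (however the runner's env is entered):
$ ldd ./testapp-cgo | sort > /tmp/libs-inside.txt
# any diff here points at a mismatched loader or libc:
$ diff /tmp/libs-outside.txt /tmp/libs-inside.txt
```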
Update: nope - I tried that, and it didn't trigger the bug. Back to the drawing board!
Some updates: I used `nsenter` to run the programs in the github-runner's namespaces (validated with the diagnostics tool that they're all the same), and could not repro the hang. Next up would be the process environment and mayyyyybeee the file descriptor configuration.
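For the record, such an nsenter invocation looks roughly like this (a sketch; `Runner.Listener` as the runner's long-lived process name is an assumption about the runner's process tree):

```console
# join the runner's namespaces, then run the test binary inside them
$ sudo nsenter \
    --target "$(pgrep -f Runner.Listener | head -n1)" \
    --mount --uts --ipc --net --pid \
    ./testapp-cgo
```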
Describe the bug
I'm seeing the `flyctl` version that lives in nixpkgs-unstable (0.1.8) hang on startup in a really early FUTEX_WAIT_PRIVATE call - but only when invoked under a private github-runner installation. The version of `flyctl` that is in nixpkgs-22.11 works fine. I can run both versions of flyctl in a terminal on the machine where it hangs.

This is annoying to reproduce because it needs a self-hosted github runner, but here goes:
Steps To Reproduce
I've made a test repo https://github.com/antifuchs/nixpkgs-bug-repro-flyctl that pins the versions of nixpkgs (and thus flyctl) that I'm reporting. I have a little github workflow here that invokes it:
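Roughly what each workflow job boils down to (a sketch; the `#stable` / `#unstable` flake app names and the exact invocation are assumptions based on the job names below):

```console
# the "stable" job - flyctl from the pinned nixpkgs-22.11
$ nix run .#stable -- version

# the "unstable" job - flyctl from the pinned nixpkgs-unstable
$ nix run .#unstable -- version
```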
Steps to reproduce the behavior:

1. Trigger the workflow in the repro repo (it invokes flyctl with `version` as the commandline arg).
2. Observe that `stable` succeeds immediately and that `unstable` gets cancelled after 1 min, the workflow timeout.

Expected behavior
I would expect both versions of flyctl to succeed and run.
Additional context
I straced the flyctl startup (with a different commandline), and it looks like it's hanging at the very beginning of startup. Here's the redacted strace output:
Lots of strace output
```console
set_robust_list(0x7f370b4c8a20, 24)     = 0
getpid()                                = 175529
close(255)                              = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGTSTP, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, 8) = 0
rt_sigaction(SIGTTIN, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, 8) = 0
rt_sigaction(SIGTTOU, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f370b508bf0}, 8) = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f370b508bf0}, {sa_handler=0x44a480, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f370b508bf0}, 8) = 0
openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3</dev/null>
dup2(3</dev/null>, 0
```

The github-runner systemd service is running with pretty strict restrictions; here's that list:
Notify maintainers
@aaronjanse @jsierles @techknowlogick @viraptor
Metadata
Please run `nix-shell -p nix-info --run "nix-info -m"` and paste the result.