cross-platform-actions / action

Cross-platform GitHub action
MIT License

NetBSD - VM doesn't start after a 120s timeout #62

Closed kobalicek closed 11 months ago

kobalicek commented 11 months ago

I'm seeing the following issue occasionally when running the NetBSD runner:

  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Terminating VM
  /usr/bin/sudo kill -s TERM 1370
  kill: 1370: No such process
Error: Waiting for VM to become ready timed out after 120 seconds

I'm using QEMU to run it.

Basically the VM is not ready after 120 seconds, which causes the action to be terminated.

I'm not sure what the problem is in this case: whether the GHA runner is simply overloaded, or whether there is a race or some other issue in the action itself that prevents connecting to the SSH server inside the VM.

Is this something we have to live with, or do you think it can be fixed? It's very hard to diagnose because it doesn't happen every time, but it happens often enough to get my attention.
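
For reference, the behaviour in the log above (retry the connection until a deadline, then give up) can be sketched as a plain TCP poll. This is a minimal Python sketch, not the action's actual implementation; `wait_for_port` and its parameters are hypothetical names, and only the 120 second timeout and port 2847 come from the log.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening (e.g. sshd in the VM).
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # "Connection refused" and connect timeouts both land here; pause and retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Port 2847 is the forwarded SSH port seen in the log above.
    if not wait_for_port("localhost", 2847):
        print("Waiting for VM to become ready timed out after 120 seconds")
```

A poll like this only ever sees "connection refused", so it cannot tell a slow boot from a hypervisor process that never started. The `kill: 1370: No such process` line in the log hints that the process may already have been gone when the timeout fired.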

jacob-carlborg commented 11 months ago

Yeah, it's difficult to say. Could be both something inside the VM and something outside. Perhaps it's possible to run through DTrace to debug it. Not sure if that works on a GHA runner though. Perhaps it's possible to redirect the output of the VM to some file and print that, to see what's going on.
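
The idea of redirecting the VM's output to a file and printing it could look roughly like the following. This is a sketch under assumptions: `run_with_log` is a hypothetical helper, the command in the example is a placeholder rather than the action's real hypervisor invocation, and it only covers the case where the process exits early instead of staying up.

```python
import subprocess
import sys
from pathlib import Path


def run_with_log(cmd: list[str], log_path: Path) -> int:
    """Run cmd with stdout and stderr redirected to log_path; return the exit code."""
    with log_path.open("wb") as log:
        # stderr is merged into stdout so loader or startup errors end up in the same file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        # Surface the captured output so the failure reason shows up in the job log.
        sys.stderr.write(log_path.read_text(errors="replace"))
    return result.returncode


if __name__ == "__main__":
    # Placeholder command that fails immediately, standing in for a hypervisor that cannot start.
    failing = [sys.executable, "-c", "import sys; sys.stderr.write('startup failed\\n'); sys.exit(1)"]
    print("exit code:", run_with_log(failing, Path("vm.log")))
```

For a hypervisor that keeps running, `subprocess.Popen` with the same `stdout`/`stderr` arguments would be the closer fit, with the log printed when the readiness wait times out.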

jacob-carlborg commented 11 months ago

Do you have a link to a failing job?

kobalicek commented 11 months ago

I have - actually two failing jobs within 2 days:

I'm not sure that would help though: nothing interesting happens in these runs, they just stop at the beginning.

manxorist commented 11 months ago

I am seeing the same problem:

https://github.com/OpenMPT/openmpt/blob/ea8aafbdcbf07b1d2a96a0d213edb64e7872f6ae/.github/workflows/NetBSD-9.3-Makefile.yml

and a couple of failing jobs:

The last successful NetBSD VM run was on 2023-09-21, https://github.com/OpenMPT/openmpt/actions/runs/6264853726 .

kobalicek commented 11 months ago

And one more:

I think that this is the most unstable runner at the moment - it fails in like 50% of time like this

jacob-carlborg commented 11 months ago

it fails in like 50% of time like this

Oh, that's pretty bad. I'll see if I can debug the issue.

jacob-carlborg commented 11 months ago

Seems like GitHub made some breaking changes again. This happens when trying to run QEMU:

dyld[1372]: Library not loaded: '/usr/local/opt/capstone/lib/libcapstone.4.dylib'

But with this error it should fail every time, not just occasionally.

This makes it much easier to fix. I thought all the dependencies were statically linked to avoid this exact problem, but it looks like I missed one.

BTW, this is not specific to NetBSD; it applies to all platforms when QEMU is used as the hypervisor. But since xhyve is the default hypervisor for FreeBSD and OpenBSD on macOS runners, it doesn't affect those platforms unless you explicitly switch the hypervisor to QEMU.

If you're in a hurry you can switch to using Linux runners instead of macOS as a workaround, but macOS has better performance.

kobalicek commented 11 months ago

Yeah it always fails.

I'm removing NetBSD from my CI, as this just makes all builds fail.

I think it's just an unfortunate reality that it's not natively supported by GitHub.

jacob-carlborg commented 11 months ago

Fixed in https://github.com/cross-platform-actions/action/releases/tag/v0.19.1. I've added a test to make sure this doesn't happen again. While doing that I also found another non-system dependency, which is fixed as well.
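
How the test added in v0.19.1 works isn't shown in this thread. One way such a check could be done on macOS is to parse the output of `otool -L` and flag any library outside the system directories. The sketch below assumes that approach; `non_system_dependencies`, `SYSTEM_PREFIXES`, the binary name, and the version numbers in the sample are illustrative, and only the capstone path comes from the error above.

```python
SYSTEM_PREFIXES = ("/usr/lib/", "/System/Library/")


def non_system_dependencies(otool_output: str) -> list[str]:
    """Return library paths from `otool -L` output that lie outside the system directories."""
    deps = []
    # The first line names the inspected binary; each following line is one dependency.
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Lines look like: "<path> (compatibility version X, current version Y)"
        path = line.split(" (compatibility", 1)[0]
        if not path.startswith(SYSTEM_PREFIXES):
            deps.append(path)
    return deps


# Illustrative sample; version numbers are placeholders.
sample = """qemu-system-x86_64:
\t/usr/local/opt/capstone/lib/libcapstone.4.dylib (compatibility version 4.0.0, current version 4.0.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0)
"""

if __name__ == "__main__":
    print(non_system_dependencies(sample))
```

With the sample above, the only entry reported is the capstone library under `/usr/local/opt`, which is the kind of dependency that breaks when the runner image changes.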