cross-platform-actions / action

Cross-platform GitHub action
MIT License

NetBSD - VM doesn't start after a 120s timeout #62

Closed by kobalicek 1 year ago

kobalicek commented 1 year ago

I'm having the following occasional issue when running the NetBSD runner:

  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Waiting for VM to be ready...
  Executing command inside VM: true
  /usr/bin/ssh -t runner@localhost
  Pseudo-terminal will not be allocated because stdin is not a terminal.
  ssh: connect to host localhost port 2847: Connection refused
  Terminating VM
  /usr/bin/sudo kill -s TERM 1370
  kill: 1370: No such process
Error: Waiting for VM to become ready timed out after 120 seconds

I'm using QEMU to run it.

Basically the VM is not ready after 120 seconds, which causes the action to be terminated.
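For illustration, the readiness check visible in the log can be thought of as a poll loop with a deadline. The following is a minimal sketch of that pattern, not the action's actual implementation; the function name `wait_for_port` and the `interval` parameter are assumptions made for the example.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Poll a TCP port until it accepts a connection or the deadline passes.

    Returns True once a connection succeeds, False on timeout.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A refused connection raises ConnectionRefusedError (an OSError).
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait a bit and try again.
            time.sleep(interval)
    return False
```

A loop like this can only succeed if something eventually listens on the port. The `kill: 1370: No such process` line in the log hints that the VM process may already have exited by the time the timeout fired, in which case a longer timeout would not help.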

I'm not sure what the problem is in this case: whether the GHA runner is simply overloaded, or whether there is a race or something else caused by the action itself that makes it impossible to connect to the SSH server inside the VM.

I'm wondering - is this something we have to live with, or do you think it can be fixed somehow? It's very hard to diagnose as it doesn't happen every time, but it happens frequently enough to get my attention.

jacob-carlborg commented 1 year ago

Yeah, it's difficult to say. It could be something inside the VM or something outside it. Perhaps it's possible to run it through DTrace to debug it. Not sure if that works on a GHA runner though. Perhaps it's possible to redirect the output of the VM to a file and print that, to see what's going on.
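As a rough sketch of the "redirect the output to a file and print it" idea: the helper below runs an arbitrary command, captures stdout and stderr in a log file, and echoes the log only when the command fails. The name `run_with_log` and the log path are hypothetical, and the hypervisor command itself is left to the caller; this is not how the action launches the VM.

```python
import subprocess
import sys


def run_with_log(cmd, log_path):
    """Run cmd with stdout and stderr captured to log_path.

    Returns the exit code; on a non-zero exit the log is echoed to stderr.
    Hypothetical helper for illustration only.
    """
    with open(log_path, "wb") as log:
        # stderr is merged into stdout so loader errors end up in the same file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        with open(log_path, "r", errors="replace") as log:
            sys.stderr.write(log.read())
    return result.returncode
```

Capturing stderr matters here, because a process that fails before it really starts typically reports the reason on stderr rather than stdout.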

jacob-carlborg commented 1 year ago

Do you have a link to a failing job?

kobalicek commented 1 year ago

I have - actually two failing jobs within 2 days:

I'm not sure that would help though, as nothing interesting happens in these runs; it just stops at the beginning.

manxorist commented 1 year ago

I am seeing the same problem:

https://github.com/OpenMPT/openmpt/blob/ea8aafbdcbf07b1d2a96a0d213edb64e7872f6ae/.github/workflows/NetBSD-9.3-Makefile.yml

and a couple of failing jobs:

The last successful NetBSD VM run was on 2023-09-21, https://github.com/OpenMPT/openmpt/actions/runs/6264853726 .

kobalicek commented 1 year ago

And one more:

I think that this is the most unstable runner at the moment; it fails like this about 50% of the time.

jacob-carlborg commented 1 year ago

"it fails like this about 50% of the time"

Oh, that's pretty bad. I'll see if I can debug the issue.

jacob-carlborg commented 1 year ago

Seems like GitHub made some breaking changes again. This happens when trying to run QEMU:

dyld[1372]: Library not loaded: '/usr/local/opt/capstone/lib/libcapstone.4.dylib'

But in that case it should fail every time, not just occasionally.

This makes it much easier to fix. I thought all the dependencies were statically linked to avoid this exact problem, but it looks like I missed one.

BTW, this is not specific to NetBSD; it applies to all platforms when QEMU is used as the hypervisor. But since xhyve is the default hypervisor for FreeBSD and OpenBSD on macOS runners, it doesn't affect those platforms unless you explicitly switch the hypervisor to QEMU.

If you're in a hurry you can switch to using Linux runners instead of macOS as a workaround, but macOS has better performance.

kobalicek commented 1 year ago

Yeah it always fails.

I'm removing NetBSD from my CI as this just makes all builds fail.

I think it's just an unfortunate reality that NetBSD is not natively supported by GitHub.

jacob-carlborg commented 1 year ago

Fixed in https://github.com/cross-platform-actions/action/releases/tag/v0.19.1. I've added a test to make sure this doesn't happen again. In doing that I also found another non-system dependency, which is fixed as well.
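The test added in v0.19.1 is not shown in this thread. As a hedged sketch of what a check for non-system dependencies could look like on macOS, the function below scans the text printed by `otool -L <binary>` and reports any library that lives outside the system directories. The name `non_system_libraries`, the prefix list, and the sample text (including its version numbers) are assumptions made for the example.

```python
# Directories assumed to be present on every macOS installation.
SYSTEM_PREFIXES = ("/usr/lib/", "/System/Library/")


def non_system_libraries(otool_output):
    """Return library paths from `otool -L` output that are outside system dirs."""
    libs = []
    # The first line names the inspected binary; dependency lines follow.
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Each line looks like "<path> (compatibility version ..., current version ...)".
        path = line.split(" (", 1)[0]
        if not path.startswith(SYSTEM_PREFIXES):
            libs.append(path)
    return libs


# Illustrative sample only; the version numbers are invented.
sample = """qemu-system-x86_64:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1.0.0)
    /usr/local/opt/capstone/lib/libcapstone.4.dylib (compatibility version 4.0.0, current version 4.0.0)
"""

print(non_system_libraries(sample))
```

A CI test could fail whenever the returned list is non-empty, which would catch a dynamically linked Homebrew library such as the capstone one from the `dyld` error above before a runner image change removes it.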