Open davidchisnall opened 1 year ago
Do you have an example of a workflow run that shows it's failing?
I saw an example here https://github.com/microsoft/snmalloc/actions/runs/4124100375/jobs/7122943134 where the job is cancelled, but it looks like maybe the output just cut off for a while, prompting the cancellation request.
I don't think that no output would cause the workflow to be cancelled. It's hitting the timeout of 25 minutes. I've compared the above failing job with a successful job. I noticed that both Test 9 (func-first_operation-fast) and Test 10 (func-first_operation-check) timed out after 400 seconds (that's 6 minutes and 40 seconds). In the successful run they take slightly less than 5 seconds. I also noticed that there's no output for the last 15 minutes before the job times out.
It's difficult to say what the cause of this is. There are many layers involved. I would hope that if the SSH connection drops, an exception would be thrown and the action would be aborted. To me it seems like the VM just stops doing work.
I added some more aggressive timeouts because they were taking a very long time in the cases where they didn't make progress. This one hit the timeout (set to 25 minutes, a successful run takes <15): https://github.com/microsoft/snmalloc/actions/runs/4123711673/jobs/7122600447
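For context, the 25-minute limit is just the usual `timeout-minutes` setting on the job. A minimal sketch of what that looks like (job names, action versions, and the build commands are placeholders, not snmalloc's actual workflow):

```yaml
# Sketch only: names and versions are illustrative.
jobs:
  freebsd-test:
    runs-on: macos-latest
    timeout-minutes: 25        # fail the job if it runs longer than 25 minutes
    steps:
      - uses: actions/checkout@v3
      - name: Test on FreeBSD
        uses: cross-platform-actions/action@v0.10.0
        with:
          operating_system: freebsd
          version: '13.1'
          run: |
            # build and run the test suite here (placeholder)
            echo "tests go here"
```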
Hmm, this one also stops making progress. I'll see if I can fork the repository and debug the issue.
For reference, the perf-contention-fast test takes <20s on the macOS runner, yet we sit waiting for timeout after 20 minutes on the FreeBSD VM on the macOS runner. On FreeBSD on Hyper-V VM on a Xeon W-2155, that test takes a shade over 1s, so I can confirm that it doesn't hit any special weirdness with FreeBSD.
I'm getting a slightly similar issue with a couple of my actions where some random component either times out or stops on its own. This time, it was rsync: https://github.com/Slackadays/Clipboard/actions/runs/4154242313/jobs/7186806540
I forked snmalloc but I haven't been able to reproduce the issue yet.
As another data point, I noticed that my FreeBSD issues only happened when using macos-latest and not ubuntu-latest.
A more exciting failure today: the job succeeded, but the VM teardown failed, so the runner reported failure.
I've noticed similar behaviour with OpenBSD jobs. Unit tests in a couple of projects I contribute to run fine locally, but fail in weird ways in CI when run on macos-12 hosts (e.g., 'write after free' errors and unexplained segfaults). Switching the host to ubuntu-latest results in consistently clean runs. I would run everything on Ubuntu, except the performance tradeoff is significant.
OpenBSD and FreeBSD use the xhyve hypervisor on macOS runners. The xhyve hypervisor is probably not as battle tested as QEMU. I could add an option to allow selecting the hypervisor. Then QEMU could be selected on macOS runners, still with hardware-accelerated virtualization, which should give better performance than the Linux runners and hopefully be more stable.
I'd be happy to give that a whirl and test it on the projects where I had issues using the xhyve hypervisor.
@Slackadays @knightjoel I created a new release which supports selecting the hypervisor. Now it's possible to use QEMU (which is the default on Linux runners) on macOS runners for FreeBSD and OpenBSD. Previously they would only use the xhyve hypervisor.
https://github.com/cross-platform-actions/action/releases/tag/v0.11.0
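For anyone wanting to try it, the change should look roughly like this (assuming the new input is named `hypervisor` as in the v0.11.0 release; the OS version and run commands are placeholders):

```yaml
- name: Test on FreeBSD (QEMU instead of xhyve)
  uses: cross-platform-actions/action@v0.11.0
  with:
    operating_system: freebsd
    version: '13.1'
    hypervisor: qemu     # override the default xhyve hypervisor on macOS runners
    run: |
      # build and run tests here (placeholder)
      uname -a
```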
I'm trying to use this with microsoft/snmalloc#588, but each run seems to have a high chance of one of the jobs failing by hitting the timeout. The output looks as if it's just disconnecting. Is it possible that the SSH connection is dropped under high load? Would it be possible to run dtach in the VMs and reconnect if the session is dropped? A rough sketch of the idea is below.
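To illustrate what I mean, here is only a sketch of the dtach idea, not something the action does today; the package install, socket path, marker file, and test command are all made up:

```yaml
- uses: cross-platform-actions/action@v0.11.0
  with:
    operating_system: freebsd
    version: '13.1'
    run: |
      # Assumption: dtach is not preinstalled in the VM image, so install it from packages.
      sudo pkg install -y dtach
      # Run the long test suite in a detached session so it survives a dropped
      # SSH connection; capture output to a log and touch a marker file when done.
      dtach -n /tmp/tests.sock sh -c './run-tests.sh > /tmp/tests.log 2>&1; touch /tmp/tests.done'
      # Poll for completion; a reconnecting wrapper could re-run just this part
      # after re-establishing SSH, then dump the captured output.
      while [ ! -f /tmp/tests.done ]; do sleep 10; done
      cat /tmp/tests.log
```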