Open davidchisnall opened 1 year ago
Do you have an example of a workflow run that shows it's failing?
I saw an example here https://github.com/microsoft/snmalloc/actions/runs/4124100375/jobs/7122943134 where the job is cancelled, but it looks like maybe the output just cut off for a while, prompting the cancellation request.
I don't think that no output would cause the workflow to be cancelled. It's hitting the timeout of 25 minutes. I've compared the above failing job with a successful job. I noticed that both Test 9 (func-first_operation-fast) and Test 10 (func-first_operation-check) timed out after 400 seconds (that's 6 minutes and 40 seconds). In the successful run they take slightly less than 5 seconds. I also noticed that there's no output for the last 15 minutes before the job times out.
It's difficult to say what the cause of this is. There are many layers involved. I would hope that if the SSH connection drops, an exception would be thrown and the action would be aborted. To me it seems like the VM just stops doing work.
I added some more aggressive timeouts because they were taking a very long time in the cases where they didn't make progress. This one hit the timeout (set to 25 minutes, a successful run takes <15): https://github.com/microsoft/snmalloc/actions/runs/4123711673/jobs/7122600447
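For context, the 25-minute limit is just the usual `timeout-minutes` setting on the job. A minimal sketch of what that looks like (job names, action versions, and the build commands are placeholders, not snmalloc's actual workflow):

```yaml
# Sketch only: names and versions are illustrative.
jobs:
  freebsd-test:
    runs-on: macos-latest
    timeout-minutes: 25        # fail the job if it runs longer than 25 minutes
    steps:
      - uses: actions/checkout@v3
      - name: Test on FreeBSD
        uses: cross-platform-actions/action@v0.10.0
        with:
          operating_system: freebsd
          version: '13.1'
          run: |
            # build and run the test suite here (placeholder)
            echo "tests go here"
```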
Hmm, this one also stops making progress. I'll see if I can fork the repository and debug the issue.
For reference, the perf-contention-fast test takes <20s on the macOS runner, yet we sit waiting for timeout after 20 minutes on the FreeBSD VM on the macOS runner. On FreeBSD on Hyper-V VM on a Xeon W-2155, that test takes a shade over 1s, so I can confirm that it doesn't hit any special weirdness with FreeBSD.
I'm getting a slightly similar issue with a couple of my actions where some random component either times out or stops on its own. This time, it was rsync: https://github.com/Slackadays/Clipboard/actions/runs/4154242313/jobs/7186806540
I forked snmalloc but I haven't been able to reproduce the issue yet.
As another data point, I noticed that my FreeBSD issues only happened when using macos-latest and not ubuntu-latest.
A more exciting failure today: the job succeeded, but the VM teardown failed, so the runner reported failure.
I've noticed similar behaviour with OpenBSD jobs. Unit tests in a couple of projects I contribute to run fine locally, but fail in weird ways in CI when run on macos-12 hosts (e.g., 'write after free' errors and unexplained segfaults). Switching the host to ubuntu-latest results in consistently clean runs. I would run everything on Ubuntu, except the performance tradeoff is significant.
OpenBSD and FreeBSD use the xhyve hypervisor on macOS runners. The xhyve hypervisor is probably not as battle tested as QEMU. I could add an option to allow selecting the hypervisor. Then QEMU could be selected on macOS runners, still with hardware-accelerated virtualization, which should give better performance than the Linux runners and hopefully be more stable.
I'd be happy to give that a whirl and test it on the projects where I had issues using the xhyve hypervisor.
@Slackadays @knightjoel I created a new release which supports selecting the hypervisor. Now it's possible to use QEMU (which is the default on Linux runners) on macOS runners for FreeBSD and OpenBSD. Previously they would only use the xhyve hypervisor.
https://github.com/cross-platform-actions/action/releases/tag/v0.11.0
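For anyone wanting to try it, the change should look roughly like this (assuming the new input is named `hypervisor` as in the v0.11.0 release; the OS version and run commands are placeholders):

```yaml
- name: Test on FreeBSD (QEMU instead of xhyve)
  uses: cross-platform-actions/action@v0.11.0
  with:
    operating_system: freebsd
    version: '13.1'
    hypervisor: qemu     # override the default xhyve hypervisor on macOS runners
    run: |
      # build and run tests here (placeholder)
      uname -a
```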
I'm trying to use this with microsoft/snmalloc#588, but each run seems to have a high chance of one of the jobs failing by hitting the timeout. The output looks as if it's just disconnecting. Is it possible that the SSH connection is dropped under high load? Would it be possible to run dtach in the VMs and reconnect if the session is dropped? A rough sketch of the idea is below.
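To illustrate what I mean, here is only a sketch of the dtach idea, not something the action does today; the package install, socket path, marker file, and test command are all made up:

```yaml
- uses: cross-platform-actions/action@v0.11.0
  with:
    operating_system: freebsd
    version: '13.1'
    run: |
      # Assumption: dtach is not preinstalled in the VM image, so install it from packages.
      sudo pkg install -y dtach
      # Run the long test suite in a detached session so it survives a dropped
      # SSH connection; capture output to a log and touch a marker file when done.
      dtach -n /tmp/tests.sock sh -c './run-tests.sh > /tmp/tests.log 2>&1; touch /tmp/tests.done'
      # Poll for completion; a reconnecting wrapper could re-run just this part
      # after re-establishing SSH, then dump the captured output.
      while [ ! -f /tmp/tests.done ]; do sleep 10; done
      cat /tmp/tests.log
```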