cross-platform-actions / action

Cross-platform GitHub action

Very frequent freezing of FreeBSD VM during teardown #61

Closed kobalicek closed 9 months ago

kobalicek commented 1 year ago

I have been experiencing very frequent freezes during teardown lately with FreeBSD VMs.

I was using xhyve with FreeBSD 13.2.

For example, these two consecutive runs failed for the same reason:

I'm not sure what to do, because my builds are basically failing due to these issues. Temporarily I switched to QEMU virtualization and that seems to be more stable in my case.

jacob-carlborg commented 1 year ago

Does this seem to be the same issue as https://github.com/cross-platform-actions/action/issues/29?

kobalicek commented 1 year ago

Hard to say - but for me it was rather deterministic - basically it was a miracle if the build finished successfully.

I think the running script either doesn't know the VM got killed, or the kill is blocking it for some reason. I had to manually kill these builds after 5 hours.

I was also wondering if there is rsync that syncs back files from VM to the host - cannot this be possibly turned off as well? I don't need that syncing - maybe this is what blocks forever?

jacob-carlborg commented 1 year ago

I forked Blend2D to make it easier to debug the CI workflow. I disabled all matrix entries except for FreeBSD and switched to a macOS runner. So far I have not been able to reproduce the issue. Here's an example of a CI run [1]. As you can see, I've run it five times. Did I do something wrong?

> I was also wondering if there is rsync that syncs back files from VM to the host

Yes, the action syncs back files to the host.

> cannot this be possibly turned off as well

I guess I can add an option for that. Could you please create a separate issue for this?

> maybe this is what blocks forever

In your first example, that is what seems to be happening. In your second example it gets stuck after syncing: it's the final forced shutdown of the VM that times out. As an extra precaution, the action kills the VM in case it fails to shut down using the regular shutdown command. You can easily see this in the CI log by enabling timestamps.

[1] https://github.com/cross-platform-actions/blend2d/actions/runs/5899978065
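
The fallback described above (regular shutdown first, forced kill if that does not finish in time) can be sketched roughly as follows. This is only an illustration in Python, not the action's actual code: `stop_vm`, `request_shutdown` and `spawn_sleeper` are hypothetical names, and a sleeping child process stands in for the hypervisor.

```python
import subprocess
import sys
from typing import Callable


def stop_vm(proc: subprocess.Popen, request_shutdown: Callable[[], None],
            grace_seconds: float) -> str:
    """Ask the VM process to shut down, then kill it if it does not exit in time.

    Returns "graceful" if the process exited on its own, "killed" otherwise.
    """
    request_shutdown()  # stand-in for running the regular shutdown command
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced shutdown as a last resort
        # Bounded wait: raises TimeoutExpired instead of blocking forever
        # if even the kill does not take effect.
        proc.wait(timeout=grace_seconds)
        return "killed"


def spawn_sleeper() -> subprocess.Popen:
    """Hypothetical stand-in for a hypervisor process: a child that sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

The point of the second bounded wait is that a stuck kill surfaces as an error within `grace_seconds` rather than as a job that hangs for hours.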

jacob-carlborg commented 1 year ago

BTW, I see that you run all the *BSD workflows on Linux. I recommend running on macOS instead because it supports hardware-accelerated nested virtualization, which the Linux runners don't. You can force QEMU as the hypervisor on macOS using the hypervisor input [1].

[1] https://github.com/cross-platform-actions/action#inputs

kobalicek commented 1 year ago

I have changed the runners to use Linux and QEMU as that seemed to be more stable in my case.

kobalicek commented 11 months ago

BTW I have changed the runners to use macOS, but that doesn't solve the issue. It seems this doesn't really matter at all. I get very frequent build failures on FreeBSD because of this teardown issue.

BTW I know that there is a sync process to sync files back from VM after the run, cannot this be the source of the problem? Can this be possibly disabled by an option to avoid syncing back if I don't need that functionality?

jacob-carlborg commented 11 months ago

> BTW I know that there is a sync process to sync files back from VM after the run, cannot this be the source of the problem?

I guess that's likely if it fails when syncing back the files.

> Can this be possibly disabled by an option to avoid syncing back if I don't need that functionality?

Yeah, I guess so.

chipsenkbeil commented 11 months ago

@jacob-carlborg I've also noticed freezing on 0.19.1. When it does not freeze, I get an error related to syncing files back. Is there a way to disable syncing back to the host, or to ignore a failing teardown?

https://github.com/chipsenkbeil/service-manager-rs/actions/runs/6553345333/job/17798792946

jacob-carlborg commented 11 months ago

@chipsenkbeil in your case there's a clear error message. There are some files it doesn't have access to read.

jacob-carlborg commented 11 months ago

@kobalicek could you please try to enable debug output by setting the following variables: ACTIONS_RUNNER_DEBUG and ACTIONS_STEP_DEBUG. You can set them in the repository settings -> Security -> Secrets and variables -> Actions. Set the value to true. This will add the verbose flag to rsync, which might show some more information.

chipsenkbeil commented 11 months ago

> @chipsenkbeil in your case there's a clear error message. There are some files it doesn't have access to read.

Yes, and it's interesting because these are files generated by a compiler when running a build command. As a user, I wasn't expecting to encounter an error like this. I suppose my only option is to delete them before finishing, because otherwise the sync fails.

The freezing happens the majority of the time, which is why I flagged it here as a potentially reproducible case. Will try to delete the files before teardown and see if that helps.
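
To find out which generated files would trip up the sync before deleting anything, a small check along these lines could be run at the end of the build. It is a sketch under assumptions: `find_unreadable` is a hypothetical helper, it uses Python rather than anything the action provides, and the readability test is pluggable so it can be adapted.

```python
import os
from typing import Callable, List


def find_unreadable(root: str,
                    is_readable: Callable[[str], bool] = lambda p: os.access(p, os.R_OK),
                    ) -> List[str]:
    """Return files under root (paths relative to root, sorted) that fail the check.

    Directories that cannot be listed are silently skipped by os.walk.
    """
    unreadable = []
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            if not is_readable(path):
                unreadable.append(os.path.relpath(path, root))
    return sorted(unreadable)
```

Calling `find_unreadable(".")` from the working directory lists the candidates for removal or a permission fix.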

kobalicek commented 10 months ago

Today's failure looks like this:

Downloading disk image: https://github.com/cross-platform-actions/freebsd-builder/releases/download/v0.5.0/freebsd-13.2-x86-64.qcow2
  Downloading hypervisor: https://github.com/cross-platform-actions/resources/releases/download/v0.9.1/xhyve-macos.tar
  Downloading resources: https://github.com/cross-platform-actions/resources/releases/download/v0.9.1/resources-macos.tar
  /usr/bin/ssh-keygen -t ed25519 -f /tmp/resourcesaj07Fg/id_ed25519 -q -N 
  /usr/sbin/mkfile -n 40m /tmp/resourcesaj07Fg/res.raw
  Downloaded file: /Users/runner/work/_temp/b6800e99-af4a-4aa0-a1df-b796dbb9cdc2
  /usr/sbin/diskutil partitionDisk /dev/disk2 1 GPT fat32 RES 100%
  Started partitioning on disk2
  Unmounting disk
  Creating the partition map
  Downloaded file: /Users/runner/work/_temp/44cc5c40-40cd-4230-84e6-97ff3dc92e3c
  Waiting for partitions to activate
  Formatting disk2s1 as MS-DOS (FAT32) with name RES
  512 bytes per physical sector
  /dev/rdisk2s1: 76594 sectors in 76594 FAT32 clusters (512 bytes/cluster)
  bps=512 spc=1 res=32 nft=2 mid=0xf8 spt=32 hds=16 hid=2048 drv=0x80 bsec=77824 bspf=599 rdcl=2 infs=1 bkbs=6
  Mounting disk
  Finished partitioning on disk2
  /usr/bin/sudo umount /Volumes/RES
  /usr/bin/hdiutil detach /dev/disk2
  hdiutil: couldn't eject "disk2" - Resource busy

  /Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:574
                  error = new Error(`The process '${this.toolPath}' failed with exit code ${this.processExitCode}`);
  ^
  Error: The process '/usr/bin/hdiutil' failed with exit code 16
      at ExecState._setResult (/Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:574:1)
      at ExecState.CheckComplete (/Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:557:1)
      at ChildProcess.<anonymous> (/Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:451:1)
      at ChildProcess.emit (node:events:513:28)
      at maybeClose (node:internal/child_process:1100:16)
      at Socket.<anonymous> (node:internal/child_process:458:11)
      at Socket.emit (node:events:513:28)
      at Pipe.<anonymous> (node:net:301:12)

I think in the end this must be related to syncing the files.

kobalicek commented 10 months ago

Which is related to https://github.com/cross-platform-actions/action/issues/64

jacob-carlborg commented 10 months ago

> I think in the end this must be related to syncing the files.

@kobalicek if you get the error hdiutil: couldn't eject "disk2" - Resource busy, then it's not related to syncing files. It occurs before the VM has even been started. The action creates a secondary hard drive with an SSH key on it. For some reason it fails to eject that hard drive before starting the VM.
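
A transient "Resource busy" like this is often handled by retrying after a short delay. Whether the action does so is not stated in this thread, so the following is only a generic sketch in Python: `retry` and `FlakyEject` are hypothetical names, and `FlakyEject` merely simulates an eject that is busy for its first few calls.

```python
import time
from typing import Callable


def retry(operation: Callable[[], None], attempts: int, delay_seconds: float) -> int:
    """Call operation until it succeeds; return the number of attempts used.

    Re-raises the last OSError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            operation()
            return attempt
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds * attempt)  # wait a little longer each time
    raise ValueError("attempts must be at least 1")


class FlakyEject:
    """Hypothetical stand-in for an eject command that is busy for the first N calls."""

    def __init__(self, busy_calls: int) -> None:
        self.busy_calls = busy_calls
        self.calls = 0

    def __call__(self) -> None:
        self.calls += 1
        if self.calls <= self.busy_calls:
            # Error number picked to mirror the exit code in the log; illustrative only.
            raise OSError(16, "Resource busy")
```

With `FlakyEject(busy_calls=2)`, `retry(eject, attempts=5, delay_seconds=0.01)` succeeds on the third attempt; if the resource never frees up, the last error is raised instead of being swallowed.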

jacob-carlborg commented 10 months ago

@chipsenkbeil, @kobalicek I've created a new release which adds support for disabling file syncing: https://github.com/cross-platform-actions/action/releases/tag/v0.20.0.

chipsenkbeil commented 10 months ago

@jacob-carlborg fantastic! Thanks for rolling this out so quickly 😄

kobalicek commented 10 months ago

I would close this one - I don't have this problem at the moment, and I would open a new issue if I face a similar problem in the future.

manxorist commented 10 months ago

I am also seeing this issue, I think. It happens most often with FreeBSD 12.4 for me. See https://github.com/OpenMPT/openmpt/actions/runs/6665550333/job/18142362950 or https://github.com/OpenMPT/openmpt/actions/runs/6676795368/job/18146210087 for 2 failing runs.

jacob-carlborg commented 10 months ago

@kobalicek have you disabled file syncing? Ideally I would like to solve the issue without having to rely on disabling file syncing.

jacob-carlborg commented 10 months ago

@kobalicek @chipsenkbeil @manxorist I wonder if this could be related to how much memory the VM is using. I got another report that there might not be enough memory for the host https://github.com/cross-platform-actions/action/issues/68. Could you please try reducing the memory to see if there's a difference?

manxorist commented 10 months ago

@jacob-carlborg Using

        memory: 4G
        sync_files: runner-to-vm

I am still seeing hangs: https://github.com/OpenMPT/openmpt/actions/runs/6731368321/job/18295891777

jacob-carlborg commented 10 months ago

> I am still seeing hangs: OpenMPT/openmpt/actions/runs/6731368321/job/18295891777

@manxorist that's disappointing. In this case it's hanging when shutting down the VM.

manxorist commented 10 months ago

I switched FreeBSD to QEMU on macOS and the first 4 runs went without any problem so far. I will continue monitoring and report back if it indeed fixes the FreeBSD issue for me.

I also tried switching OpenBSD to QEMU on macOS, and I am seeing VM startup issues there. See #73.

jacob-carlborg commented 9 months ago

@kobalicek @manxorist @chipsenkbeil I've created a branch that skips shutting down the VM and just lets the action exit: https://github.com/cross-platform-actions/action/tree/no-vm-shutdown. It would be great if anyone could give it a try to see if it helps. Unfortunately I haven't been able to find the root cause but this might mitigate some of the problem.

manxorist commented 9 months ago

@jacob-carlborg

> I've created a branch that skips shutting down the VM and just lets the action exit: https://github.com/cross-platform-actions/action/tree/no-vm-shutdown.

I had 6 runs (3 times 13.2, 3 times 12.4) for now, all successful. So it appears to be a viable work-around.

manxorist commented 9 months ago

Well, ignore the last comment. I got confused about the various configurations and tested macOS/QEMU instead of macOS/xhyve.

I will re-test.

chipsenkbeil commented 9 months ago

> @kobalicek @manxorist @chipsenkbeil I've created a branch that skips shutting down the VM and just lets the action exit: https://github.com/cross-platform-actions/action/tree/no-vm-shutdown. It would be great if anyone could give it a try to see if it helps. Unfortunately I haven't been able to find the root cause but this might mitigate some of the problem.

I'll give it a try. Even with skipping the copying back of files, it was still hanging at times. What do I need to set after switching to this branch? Any specific flag?

manxorist commented 9 months ago

2 times 13.2 and 2 times 12.4 for now, all successful.

jacob-carlborg commented 9 months ago

@chipsenkbeil no flags, it's automatic. If you look at the output you can verify if it shuts down the VM or not. Here's an example of where it doesn't shut down the VM [1]. And in the next example [2], it shuts down the VM, you can see the output: Executing command inside VM: sudo shutdown -p now.

[1] https://github.com/cross-platform-actions/action/actions/runs/6928693321/job/18844968427#step:3:2046
[2] https://github.com/cross-platform-actions/action/actions/runs/6875647797/job/18699661036#step:3:2063
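
If you would rather check a downloaded job log than read through it, the same verification can be scripted. A minimal sketch in Python, assuming the log is available as text; `vm_was_shut_down` is a hypothetical helper, and only the marker string is taken from the output quoted above.

```python
SHUTDOWN_MARKER = "Executing command inside VM: sudo shutdown -p now"


def vm_was_shut_down(log_text: str) -> bool:
    """Return True if the job log shows the shutdown command being run in the VM."""
    return any(SHUTDOWN_MARKER in line for line in log_text.splitlines())
```

A substring match is used so that timestamps or other prefixes on each log line do not get in the way.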

chipsenkbeil commented 9 months ago

@jacob-carlborg switched over to the branch. Only one run thus far and it worked fine. Will jump in if it hangs again, but the repo using it has low volume of updates, so it may be a little while.

manxorist commented 9 months ago

> @kobalicek @manxorist @chipsenkbeil I've created a branch that skips shutting down the VM and just lets the action exit

As we already established in #67, for the majority of (or all) use cases the VMs are non-persistent and throw-away anyway, so is there a reason for properly shutting them down in the first place?

I think for testability and correctness' sake, there should always be a mode available with proper file syncing barriers and proper shutdown in place, but in the default case, nobody cares what happens with the VM after the build files have (optionally) been synced back.

jacob-carlborg commented 9 months ago

> is there a reason for properly shutting them down in the first place?

I was going to say "no, there's no reason", and I was planning to merge this branch regardless of whether it helps with this issue, because it would be a good change anyway: fewer things for the action to do means the job finishes sooner. But now I started thinking, what if a job performs some additional major steps after the VM step, then the VM will unnecessarily occupy resources like CPU and memory.

manxorist commented 9 months ago

> > is there a reason for properly shutting them down in the first place?

> But now I started thinking, what if a job performs some additional major steps after the VM step, then the VM will unnecessarily occupy resources like CPU and memory.

I guess that's a fair point that I did not consider. Still, for users who just care to run something like a test suite (my use case), it really does not matter what happens with the VM, and the whole action does nothing else after running things inside the VM. So a general option would probably be a good idea to have.

jacob-carlborg commented 9 months ago

> So a general option would probably be a good idea to have.

Yes, I agree. Perhaps default to not shutting down the VM? I think the only step I have after the VM step is uploading binaries to a GitHub release.

manxorist commented 9 months ago

> > So a general option would probably be a good idea to have.

> Yes, I agree.

> Perhaps default to not shutting down the VM?

Well, I think resource consumption for following steps is a valid concern and the default should be to properly shut down the VM, and skipping proper shutdown should only be optional.

jacob-carlborg commented 9 months ago

> Well, I think resource consumption for following steps is a valid concern and the default should be to properly shut down the VM, and skipping proper shutdown should only be optional.

Hmm, I'm also thinking ahead to this feature request: https://github.com/cross-platform-actions/action/issues/26. I'm trying to figure out what the API should look like. What you're suggesting would be the safest alternative, with no risk of breaking anything. But it would be more verbose if one used the action in multiple steps. I don't know how common that would be, or what to optimize for in the API.