adobe / aquarium-fish

Your best secure distributed heterogeneous dynamic compute resource manager for CI

VMX: Stale VMs blocking the driver usage #62

Open sparshev opened 1 month ago

sparshev commented 1 month ago

This has happened multiple times while running a Fish node: roughly every 2-3 months of active utilization, VMX starts to behave weirdly and fails to allocate the VM network interface. This has a negative impact on the Fish node because the allocation hangs in vmrun start. Even the timeout does not help, and since the state of the Application stays ELECTED, the Jenkins net plugin does not see any issue either.

To recover, an OS reboot usually helps, together with cleaning up the stale VMs via manual deallocation.

Steps to Reproduce

Unfortunately, it reproduces regularly, but there are no specific steps. Most probably it is an issue in VMware Fusion, so it is hard to find out what is causing it.

Platform and Version

OS: macOS 11.6.5 20G527
VMs: Ubuntu/Windows, possibly Mac too
Fish: Aquarium Fish v0.7.2 (240223.211700)
VMware Fusion: 13.0.1

Sample Code that illustrates the problem

Logs taken while reproducing problem

sparshev commented 1 month ago

The timeout problem looks related to these issues:

In short: if the subcommand starts another process (and vmrun could do that in theory), killing vmrun on timeout does not mean the child process is killed too. Since we attach to stderr/stdout to read the output, the readers wait for those pipes to close, which will not happen until all child processes have exited.

The proposed resolutions seem to rely on Linux-only behavior, so we need to check whether this solution works on macOS as well.
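To make the mechanism above concrete, here is a minimal sketch of one common POSIX mitigation: start the command in its own session (and therefore its own process group) and, on timeout, kill the whole group instead of only the direct child. It is written in Python for brevity and is not Fish's code. The function name `run_with_group_timeout` is hypothetical, and the `sh -c 'sleep 30 & wait'` command is only a stand-in for a vmrun call that leaves a child holding the output pipes. It relies on `setsid` and `killpg`, which exist on both Linux and macOS, but it has not been checked against vmrun itself, and a child that moves itself into another session or process group would still escape.

```python
import os
import signal
import subprocess


def run_with_group_timeout(argv, timeout_s):
    """Run argv, killing its whole process group if it exceeds timeout_s.

    Returns (returncode, combined_output_bytes, timed_out).
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        # setsid() in the child: it becomes leader of a new session and a
        # new process group whose id equals proc.pid.
        start_new_session=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, False
    except subprocess.TimeoutExpired:
        try:
            # Kill every process still in the group, not just the direct
            # child, so nobody keeps the write end of the pipe open.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited on its own
        # Collect remaining output and reap the child; this returns once
        # the pipe is closed by all of its former holders.
        out, _ = proc.communicate()
        return proc.returncode, out, True


if __name__ == "__main__":
    # The shell starts a background sleep that inherits stdout, then waits.
    # Killing only the shell would leave the sleep holding the pipe open.
    rc, out, timed_out = run_with_group_timeout(
        ["sh", "-c", "sleep 30 & wait"], timeout_s=0.5
    )
    print("returncode:", rc, "timed out:", timed_out)
```

Killing the group addresses only the stuck pipe readers described above; it does not explain why VMX stops allocating the network interface in the first place.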