BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Clean up Vbox VMs #5424

Open davidpanderson opened 1 year ago

davidpanderson commented 1 year ago

(Win) When I open VirtualBox Manager, the VM list contains dozens of entries with names like boinc_091b568ebb08451b that point to nonexistent slot directories. These shouldn't be there; vboxwrapper should clean them up.

computezrmle commented 1 year ago

Do you remember which project you ran?

At least LHC@home is using this pattern for VM names (example): boinc_d0488831bc598c0b

Recent logs show that deregistering/removal works fine with vboxwrapper 26206 used there:

2023-12-05 10:28:50 (42792): Powering off VM.
2023-12-05 10:28:51 (42792): Successfully stopped VM.
2023-12-05 10:28:51 (42792): Deregistering VM. (boinc_d0488831bc598c0b, slot#33)
2023-12-05 10:28:51 (42792): Removing network bandwidth throttle group from VM.
2023-12-05 10:28:51 (42792): Removing VM from VirtualBox.
10:28:57 (42792): called boinc_finish(0)

Using a vboxwrapper instance to remove a VM that is not under its own control may cause trouble:

  1. Another vboxwrapper running concurrently may just have sent a request to register a fresh VM
  2. Another BOINC instance may also run multiple vboxwrapper instances

As long as those instances run under the same user account, their vboxmanage requests are queued by VirtualBox and eventually written to (or removed from) the same VirtualBox.xml file.

AenBleidd commented 1 year ago

I believe we still need some kind of clean-up that checks the following:

  1. VM was created X days ago
  2. VM has not been running for X days
  3. VM points to a slot directory that is now used by another task or is empty

There can always be situations where vboxwrapper fails to deregister or remove the VM, and the VM then stays stuck in VirtualBox Manager forever.

computezrmle commented 12 months ago

As for (1.): A typical VM entry in VirtualBox.xml looks like the one below. VirtualBox provides no information about when the VM was created:

<MachineEntry uuid="{e663f635-e077-4dea-be4a-287b325fc0dd}" src="/home/boinc3/BOINC_LHCVB/slots/1/boinc_a2df64699a65780d/boinc_a2df64699a65780d.vbox"/>
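For illustration, here is a minimal Python sketch of how such entries could be read out of VirtualBox.xml. It is not code from BOINC or vboxwrapper. The function names `machine_entries` and `vm_name` are hypothetical, the wrapper elements in `SAMPLE` are only a stand-in for the real file, and the sketch assumes POSIX-style paths as in the entry above.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def machine_entries(xml_text):
    """Return (uuid, src) pairs for every MachineEntry element in the XML text."""
    root = ET.fromstring(xml_text)
    entries = []
    for elem in root.iter():
        # Drop an XML namespace prefix if present: "{ns}MachineEntry" -> "MachineEntry".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "MachineEntry":
            uuid = elem.get("uuid", "").strip("{}")
            entries.append((uuid, elem.get("src", "")))
    return entries


def vm_name(src):
    """Derive the VM name from the .vbox file name (assumes POSIX-style paths)."""
    return PurePosixPath(src).stem


# Stand-in document: only the MachineEntry line is taken from the comment above,
# the surrounding elements are an assumption made for this example.
SAMPLE = """<VirtualBox>
  <Global>
    <MachineRegistry>
      <MachineEntry uuid="{e663f635-e077-4dea-be4a-287b325fc0dd}" src="/home/boinc3/BOINC_LHCVB/slots/1/boinc_a2df64699a65780d/boinc_a2df64699a65780d.vbox"/>
    </MachineRegistry>
  </Global>
</VirtualBox>"""

if __name__ == "__main__":
    for uuid, src in machine_entries(SAMPLE):
        print(uuid, vm_name(src), src)
```

Note that the entry carries only a UUID and a path, which is consistent with the point that no creation time is available from this file.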

As for (2.): vboxwrapper already implements a watchdog that ensures a stuck VM can be identified and shut down. It's entirely up to the projects whether to use it. Nonetheless, there can be situations (mostly after a crash) where the relationship between a registered VM and BOINC/vboxwrapper can't be restored, which leaves orphans. In general, I'm not aware of a method implemented in BOINC/vboxwrapper that reliably decides (from outside) whether a VM is stuck or is intentionally waiting for something to happen.

As for (3.): This is the most promising point. Assuming the BOINC client keeps track of the slot numbers and tells a fresh vboxwrapper to use slot n, this vboxwrapper should be authorized to clean up slot n (if not empty) and remove any VirtualBox object related to it. It needs to be ensured that this has no unwanted side effects, e.g. on running VMs.
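A rough Python sketch of the selection step for point (3.), under the assumption that the registered VMs are already available as (uuid, src) pairs and that paths are POSIX-style. The function name `vms_in_slot` and its parameters are hypothetical; the sketch only selects candidates and does not check whether a VM is running.

```python
from pathlib import PurePosixPath


def vms_in_slot(entries, slots_dir, slot):
    """Return the (uuid, src) pairs whose .vbox file lies under slots_dir/<slot>."""
    slot_dir = PurePosixPath(slots_dir) / str(slot)
    return [
        (uuid, src)
        for uuid, src in entries
        # Compare path components, not string prefixes, so that
        # slot 1 does not accidentally match slots/11.
        if slot_dir in PurePosixPath(src).parents
    ]


if __name__ == "__main__":
    # Made-up entries for illustration only.
    entries = [
        ("uuid-a", "/var/lib/boinc/slots/1/boinc_aaaa/boinc_aaaa.vbox"),
        ("uuid-b", "/var/lib/boinc/slots/11/boinc_bbbb/boinc_bbbb.vbox"),
    ]
    print(vms_in_slot(entries, "/var/lib/boinc/slots", 1))
```

On Windows, `PureWindowsPath` would be the analogous choice.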

AenBleidd commented 12 months ago

I think this should be a function of the BOINC client, completely decoupled from vboxwrapper, because it might happen that there have been no VBox tasks for quite a long time and we still need to clean up some orphan VMs. The BOINC client already has a mechanism that does some cleaning from time to time, so it should be extended at some point.

"Assuming the BOINC client keeps track of the slot numbers and tells a fresh vboxwrapper to use slot n."

It does.

davidpanderson commented 12 months ago

I agree that VM cleanup should be done by the client. How exactly should it do it?

Also, has anyone besides me seen this issue? If so, what project do the orphan VMs belong to?

computezrmle commented 12 months ago

VM names like "boinc_4e84e6a8a719072c" are used at least by LHC@home and cosmology@home. Hence, it can't be said which project left the orphans, or when.

As long as the orphan machine entries exist only in VirtualBox.xml, they may confuse a user looking through VirtualBox Manager, but they do not affect fresh BOINC work. In most cases they are remnants of a crash, typically caused by a power outage. They can safely be removed manually using VirtualBox Manager or by a script via VBoxManage. In the latter case, it must be ensured that the entry doesn't belong to a VM that is just about to be created.

Complaints about this were posted in the LHC@home forums in the past, but not recently.
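A scripted removal could look roughly like the Python sketch below. It parses the output of `VBoxManage list vms` (lines of the form `"name" {uuid}`) and builds `VBoxManage unregistervm` commands. The helper names, the `boinc_` prefix filter, and the `keep_names` parameter (names of VMs that are in use or about to be created) are assumptions for illustration. How to obtain `keep_names` reliably is exactly the open question, so the sketch defaults to a dry run that only prints the commands.

```python
import re
import subprocess

# One line of `VBoxManage list vms` output: "name" {uuid}
LINE_RE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]{36})\}$')


def parse_list_vms(output):
    """Return (name, uuid) pairs parsed from `VBoxManage list vms` output."""
    vms = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            vms.append((match.group("name"), match.group("uuid")))
    return vms


def unregister_commands(vms, keep_names):
    """Build unregister commands for boinc_* VMs that are not in keep_names."""
    return [
        ["VBoxManage", "unregistervm", uuid]
        for name, uuid in vms
        if name.startswith("boinc_") and name not in keep_names
    ]


def run(commands, dry_run=True):
    """Print the commands (dry run) or execute them one by one."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    listing = subprocess.run(
        ["VBoxManage", "list", "vms"], capture_output=True, text=True, check=True
    ).stdout
    # keep_names is left empty here; a real script would have to fill it.
    run(unregister_commands(parse_list_vms(listing), keep_names=set()))
```

Without further options, `unregistervm` only removes the registry entry and leaves any files on disk untouched.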

davidpanderson commented 12 months ago

Users can remove these entries manually. But it would be good if BOINC did it automatically.

Toby-Broom commented 10 months ago

I see it too. I just go in and clean them up by hand, or use vboxmanage manually to clean them up.

I think it was worse in the past than it is now, which would support computezrmle's comment, but that is just a feeling.

My hypothesis is that it could happen on a reboot of the computer. I run a shell script on Linux/Windows that waits for all the VMs to close down before rebooting, since the OS is not patient enough to wait ~2 min for the VMs to close down.

You could compare the entries in VirtualBox.xml to the list known to BOINC and remove the excess?
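The script itself is not shown in this thread. As a rough idea of the waiting part only, here is a Python sketch that polls `VBoxManage list runningvms` until no VM is running or a timeout expires. The function names, the 180-second timeout, and the 5-second polling interval are assumptions, not the actual script.

```python
import subprocess
import time


def running_vms():
    """Return the non-empty lines printed by `VBoxManage list runningvms`."""
    out = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def wait_for_vms(list_running=running_vms, timeout=180.0, interval=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Wait until no VM is running. Return True on success, False on timeout."""
    deadline = clock() + timeout
    while True:
        if not list_running():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    if wait_for_vms():
        print("No running VMs left, safe to reboot.")
    else:
        print("Timed out waiting for VMs to shut down.")
```

The listing function, sleep, and clock are passed in as parameters so that the loop can be exercised without VirtualBox installed.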