Open davidpanderson opened 1 year ago
Do you remember which project you ran?
At least LHC@home is using this pattern for VM names (example): boinc_d0488831bc598c0b
Recent logs show that deregistering/removal works fine with vboxwrapper 26206 used there:
2023-12-05 10:28:50 (42792): Powering off VM.
2023-12-05 10:28:51 (42792): Successfully stopped VM.
2023-12-05 10:28:51 (42792): Deregistering VM. (boinc_d0488831bc598c0b, slot#33)
2023-12-05 10:28:51 (42792): Removing network bandwidth throttle group from VM.
2023-12-05 10:28:51 (42792): Removing VM from VirtualBox.
10:28:57 (42792): called boinc_finish(0)
Using a vboxwrapper instance to remove a VM not under its own control may cause trouble:
As long as those instances run under the same user account their vboxmanage requests are queued by VirtualBox and finally written to/removed from the same VirtualBox.xml file.
I believe we still need some kind of clean-up that will check the following:
Because there can always be situations where vboxwrapper fails to deregister/remove the VM, and the VM then stays stuck in the VirtualBox Manager forever.
As for (1.)
A typical VM entry in VirtualBox.xml looks like this.
VirtualBox provides no information as to when it has been created:
<MachineEntry uuid="{e663f635-e077-4dea-be4a-287b325fc0dd}" src="/home/boinc3/BOINC_LHCVB/slots/1/boinc_a2df64699a65780d/boinc_a2df64699a65780d.vbox"/>
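For illustration, here is a minimal sketch of how such entries could be read from VirtualBox.xml. The function name machine_entries is made up for this example, and the code only assumes that entries are elements named MachineEntry with uuid and src attributes, as in the line above. It matches the tag name regardless of any XML namespace the file may declare.

```python
import xml.etree.ElementTree as ET


def machine_entries(xml_text):
    """Return (uuid, src) pairs for every MachineEntry element in xml_text.

    Hypothetical helper: only the element name and its uuid/src attributes
    are taken from the example entry; everything else is an assumption.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for elem in root.iter():
        # Drop a namespace prefix if present: "{ns}MachineEntry" -> "MachineEntry".
        if elem.tag.rsplit("}", 1)[-1] == "MachineEntry":
            uuid = elem.get("uuid", "").strip("{}")
            entries.append((uuid, elem.get("src", "")))
    return entries
```

As noted, the entry carries no creation time, so a script like this can only tell where the VM definition is supposed to live, not how old it is.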
As for (2.) There's already a watchdog implemented in vboxwrapper which ensures a stuck VM can be identified and shut down. It's entirely up to the project people whether to use it or not. Nonetheless, there can be situations (mostly after a crash) where the relationship between a registered VM and BOINC/vboxwrapper can't be restored. This leaves orphans. In general I'm not aware of a method implemented in BOINC/vboxwrapper to reliably decide (from outside) whether a VM got stuck or intentionally waits for something to happen.
As for (3.) Most promising point. Assuming the BOINC client keeps track of the slot numbers and tells a fresh vboxwrapper to use slot n. Then this vboxwrapper should be authorized to clean up slot n (if not empty) and remove any VirtualBox object related to it. It needs to be ensured that this doesn't have unwanted side effects, e.g. on running VMs.
I think this should be a functionality of the BOINC client, completely decoupled from vboxwrapper, because it might happen that there were no VBox tasks for quite a long time and we still need to clean up some orphan VMs. There is currently a mechanism in the BOINC client that does some cleaning from time to time, so it should be extended at some point.
Assuming the BOINC client keeps track of the slot numbers and tells a fresh vboxwrapper to use slot n.
It does
I agree that VM cleanup should be done by the client. How exactly should it do it?
Also, has anyone besides me seen this issue? If so, what project do the orphan VMs belong to?
VM names like "boinc_4e84e6a8a719072c" are used at least by LHC@home and cosmology@home. Hence, it can't be said which project left the orphans nor when.
As long as the orphan machine entries are only in VirtualBox.xml they may confuse a user looking through VirtualBox Manager, but in fact they do not affect fresh BOINC work. In most cases they are remains after a crash, typically due to a power outage. They can safely be removed manually using the VirtualBox Manager or scripted via VBoxManage. In the latter case it must be ensured the entry doesn't belong to a VM that is just about to be created.
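A rough sketch of what the scripted variant could look like, assuming the usual `"name" {uuid}` line format of `VBoxManage list vms` and the `VBoxManage unregistervm` subcommand. The helper names (parse_list_vms, unregister_command, remove_vms) are invented for this example. It defaults to a dry run because of the caveat above: the script itself cannot know whether an entry belongs to a VM that is just being created.

```python
import re
import subprocess

# One line of `VBoxManage list vms` output is assumed to look like: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')


def parse_list_vms(output):
    """Return (name, uuid) pairs parsed from `VBoxManage list vms` output."""
    vms = []
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms.append((match.group("name"), match.group("uuid")))
    return vms


def unregister_command(uuid, delete_files=False):
    """Build the VBoxManage command line that unregisters one VM."""
    cmd = ["VBoxManage", "unregistervm", uuid]
    if delete_files:
        # --delete also removes the VM's files, not just the registry entry.
        cmd.append("--delete")
    return cmd


def remove_vms(uuids, dry_run=True):
    """Unregister the given VMs; with dry_run=True only print the commands."""
    for uuid in uuids:
        cmd = unregister_command(uuid)
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Deciding which UUIDs are actually orphans is deliberately left out here; that decision is the hard part.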
Complaints about that have been posted in the LHC@home forums in the past, but not recently.
Users can remove these entries manually. But it would be good if BOINC did it automatically.
I see it too. I just go in and clean them up by hand or manually use vboxmanage to clean them up.
I think it was worse in the past than now, but that's just a feeling, to support computezrmle's comment.
My hypothesis is that it could happen on a reboot of the computer. I run a shell script on Linux/Windows to wait for all the VMs to close down before rebooting, since the OS is not patient enough to wait ~2 min for the VMs to close down.
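The script itself isn't shown in this thread, so the following is only a guess at the idea, written in Python rather than shell: poll `VBoxManage list runningvms` until it reports nothing or a timeout expires. The function names and the default timeout and poll interval are assumptions, not taken from the actual script.

```python
import subprocess
import time


def running_vms():
    """Return the non-empty lines printed by `VBoxManage list runningvms`."""
    result = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def wait_for_vms(list_running=running_vms, timeout=180.0, poll=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Wait until no VM is running; return False if the timeout expires first.

    list_running, sleep and clock are injectable so the loop can be tested
    without VirtualBox being installed.
    """
    deadline = clock() + timeout
    while True:
        if not list_running():
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

A reboot wrapper would call wait_for_vms() after asking BOINC to stop its tasks and only reboot once it returns True.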
You could compare the entries in VirtualBox.xml to BOINC's known list and remove the excess?
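A sketch of that comparison, under several assumptions: the registry entries are already available as (uuid, src) pairs, BOINC can provide the slot directories it currently considers in use (the known_slot_dirs argument is hypothetical, no such client interface is confirmed in this thread), and paths are POSIX-style. Only VMs following the boinc_ naming pattern are considered, so a user's own VMs are never flagged.

```python
from pathlib import PurePosixPath


def find_orphans(entries, known_slot_dirs):
    """Return UUIDs of boinc_* VMs whose .vbox file is not below a known slot.

    entries:         iterable of (uuid, src) pairs from the VM registry
    known_slot_dirs: slot directories BOINC considers in use (hypothetical input)
    """
    known = {PurePosixPath(d) for d in known_slot_dirs}
    orphans = []
    for uuid, src in entries:
        path = PurePosixPath(src)
        # Skip anything that does not look like a BOINC-created VM.
        if not path.stem.startswith("boinc_"):
            continue
        # Keep the entry if its .vbox file lives below a slot BOINC knows about.
        if not any(slot in path.parents for slot in known):
            orphans.append(uuid)
    return orphans
```

This still carries the race mentioned earlier in the thread: a VM that is just being created for a slot BOINC has not yet reported would look like an orphan, so the list of known slots has to be current.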
(Win) when I open VirtualBox Manager, there are dozens of entries in the VM list with names like boinc_091b568ebb08451b, pointing to nonexistent slot directories. These shouldn't be there; should be cleaned up by vboxwrapper.