BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0
1.95k stars 439 forks source link

Add race mitigation lock to vboxwrapper #5571

Closed computezrmle closed 2 months ago

computezrmle commented 3 months ago

Introduction

Vboxwrapper allows to use VirtualBox multiattach/differencing disk images since version 26206. Attaching a disk in this mode can't be done in 1 step. Hence, a workaround suggested by VirtualBox had to be implemented in vboxwrapper and improved.

The required steps (in short):

  1. Create a VM
  2. Attach a virtual disk in normal mode to register it
  3. Handle errors like the "4.0 or later" error ("closemedium"; see below)
  4. Disconnect the disk from the VM but leave it in the registry
  5. Reconnect the disk in multiattach mode ("storageattach")

For more details see:

4602

4603

Long term experience with LHC@home shows that errors like these happen more often than expected:

Command: VBoxManage -q storageattach "boinc_086a07090bec38dd" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi"
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 781 of file VBoxManageStorageController.cpp
Command:
VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Output:
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media
VBoxManage: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee nsISupports
VBoxManage: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

They usually happen when

VirtualBox processes all incoming commands FIFO, hence "closemedium" and "storageattach" (normal/multiattach) from different vboxwrappers can easily be mixed.

Suggested Solution

This PR introduces a lock to protect the critical code sections. All vboxwrapper instances trying to modify the same parent disk entry must own the lock or wait. This is important when a VM gets registered as well as when a VM gets deregistered.

Also included in this PR

Tests done

Several series of concurrently starting VMs (CMS subproject from LHC@home)

computezrmle commented 3 months ago

While it builds fine on my local Linux system I get these errors on the github build system:

vbox_vboxmanage.o: In function `VBOX_VM::remove_race_mitigation_lock(int&, std::string&)':
/__w/boinc/boinc/samples/vboxwrapper/vbox_vboxmanage.cpp:2315: undefined reference to `shm_unlink'
vbox_vboxmanage.o: In function `VBOX_VM::set_race_mitigation_lock(int&, std::string&, std::string const&)':
/__w/boinc/boinc/samples/vboxwrapper/vbox_vboxmanage.cpp:2415: undefined reference to `shm_open'
vbox_vboxmanage.o: In function `VBOX_VM::remove_race_mitigation_lock(int&, std::string&)':
/__w/boinc/boinc/samples/vboxwrapper/vbox_vboxmanage.cpp:2315: undefined reference to `shm_unlink'
collect2: error: ld returned 1 exit status

Any idea what's wrong?

AenBleidd commented 3 months ago

@computezrmle, these two builds are building on Ubuntu 13.04, because we still need to provide the binaries for the old libc version: image I believe, it works on your machine because you're using more fresh version of one of the libraries

computezrmle commented 3 months ago

@AenBleidd I'm using openSUSE Tumbleweed [6.7.4-1-default|libc 2.39] Can I do anything to solve this?

AenBleidd commented 3 months ago

@computezrmle, try to add -lrt here at the very end between -lboinc and $(STDCPPTC) https://github.com/BOINC/boinc/blob/6cd441f6a6e8b8323a8103ec6e874a8a71908972/samples/vboxwrapper/Makefile#L80

computezrmle commented 2 months ago

A case is not handled correctly in is_disk_image_registered() when a VM restarts. Set the PR to draft to work out a solution.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 10.54%. Comparing base (6cd441f) to head (85bc766). Report is 38 commits behind head on master.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #5571 +/- ## ============================================ - Coverage 10.80% 10.54% -0.26% Complexity 1068 1068 ============================================ Files 279 279 Lines 36294 35869 -425 Branches 8409 8409 ============================================ - Hits 3920 3781 -139 + Misses 31980 31694 -286 Partials 394 394 ``` [see 90 files with indirect coverage changes](https://app.codecov.io/gh/BOINC/boinc/pull/5571/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BOINC)
computezrmle commented 2 months ago

A reliable volunteer from LHC@home and I tested the artifact from yesterday with real project tasks on Windows and Linux. They work as expected and solve the issues addressed in the first comment of this PR. Thus, it's ready for review now.