cloudfoundry / bosh-virtualbox-cpi-release

BOSH VirtualBox CPI
Apache License 2.0
6 stars 22 forks source link

Support newer VirtualBox releases 6.1+ (currently attach disk fails) #18

Closed ipa-bak closed 3 years ago

ipa-bak commented 4 years ago

TOC

Problem description

A bosh create-env with virtualbox-cpi fails.

Related:

Attempted action and error received

When deploying a bosh-lite with VirtualBox back-end using bosh create-env ..., then at some point bosh tries to attach a new disk to the running director VM. This fails with VirtualBox 6.1+:

Started deploying
  Creating VM for instance 'bosh/0' from stemcell 'sc-e0936c75-fae3-4da0-4da6-5269645395fe'... Finished (00:00:01)
  Waiting for the agent on VM 'vm-b9b883b9-fcfc-445f-7362-01abed26200b' to be ready... Finished (00:00:30)
  Creating disk... Finished (00:00:00)
  Attaching disk 'disk-9e7808be-db81-40a6-5960-570a491ef1b5' to VM 'vm-b9b883b9-fcfc-445f-7362-01abed26200b'... Failed (00:01:12) 
Failed deploying (00:01:44)

Cleaning up rendered CPI jobs... Finished (00:00:00)

Deploying:
  Creating instance 'bosh/0':
    Updating instance disks:
      Updating disks:
        Deploying disk:
          Mounting disk:
            Sending 'get_task' to the agent:
              Agent responded with error: Action Failed get_task: Task a200eef1-cb44-4439-414d-8bc2e3ee391c result: Mounting persistent disk: Getting real device path: Timed out getting real device path for '1ATA'

Analysis

To debug, I replaced the VBoxManage binary with a wrapper that logs all arguments before then calling the real thing. Logging all calls to VBoxManage shows, bosh / the vritualbpx-cpi tried to execute the following commands:

1 /usr/bin/VBoxManage import /home/cf-test-u1/.bosh_virtualbox_cpi/stemcells/sc-e0936c75-fae3-4da0-4da6-5269645395fe/image.ovf
2 /usr/bin/VBoxManage list vms
3 /usr/bin/VBoxManage modifyvm 4e1ac8cb-9f43-4fb5-86d7-148f17143670 --name sc-e0936c75-fae3-4da0-4da6-5269645395fe
4 /usr/bin/VBoxManage snapshot sc-e0936c75-fae3-4da0-4da6-5269645395fe take prepared-clone
5 /usr/bin/VBoxManage list natnetworks
6 /usr/bin/VBoxManage list hostonlyifs
7 /usr/bin/VBoxManage clonevm sc-e0936c75-fae3-4da0-4da6-5269645395fe --snapshot prepared-clone --options link --name vm-b9b883b9-fcfc-445f-7362-01abed26200b --uuid b9b883b9-fcfc-445f-7362-01abed26200b --register
8 /usr/bin/VBoxManage modifyvm vm-b9b883b9-fcfc-445f-7362-01abed26200b --name vm-b9b883b9-fcfc-445f-7362-01abed26200b --memory 6144 --cpus 4 --paravirtprovider default --audio none
9 /usr/bin/VBoxManage list natnetworks
10 /usr/bin/VBoxManage modifyvm vm-b9b883b9-fcfc-445f-7362-01abed26200b --nic1 natnetwork --nat-network1 NatNetwork --macaddress1 023C2A97CA55
11 /usr/bin/VBoxManage list hostonlyifs
12 /usr/bin/VBoxManage modifyvm vm-b9b883b9-fcfc-445f-7362-01abed26200b --nic2 hostonly --hostonlyadapter2 vboxnet1 --macaddress2 027AFB3CF70A
13 /usr/bin/VBoxManage createhd --filename /home/cf-test-u1/.bosh_virtualbox_cpi/disks/disk-7f1063f4-8ffa-4365-7955-249eed469cf0/disk.vmdk --size 16384 --format VMDK --variant Standard
14 /usr/bin/VBoxManage showvminfo vm-b9b883b9-fcfc-445f-7362-01abed26200b --machinereadable
15 /usr/bin/VBoxManage storageattach vm-b9b883b9-fcfc-445f-7362-01abed26200b --storagectl IDE --port 0 --device 1 --type hdd --medium /home/cf-test-u1/.bosh_virtualbox_cpi/disks/disk-7f1063f4-8ffa-4365-7955-249eed469cf0/disk.vmdk --mtype normal
16 /usr/bin/VBoxManage showvminfo vm-b9b883b9-fcfc-445f-7362-01abed26200b --machinereadable
17 /usr/bin/VBoxManage storageattach vm-b9b883b9-fcfc-445f-7362-01abed26200b --storagectl SCSI --port 0 --device 0 --type dvddrive --medium /home/cf-test-u1/.bosh_virtualbox_cpi/vms/vm-b9b883b9-fcfc-445f-7362-01abed26200b/env.iso
18 /usr/bin/VBoxManage startvm vm-b9b883b9-fcfc-445f-7362-01abed26200b --type headless
19 /usr/bin/VBoxManage createhd --filename /home/cf-test-u1/.bosh_virtualbox_cpi/disks/disk-9e7808be-db81-40a6-5960-570a491ef1b5/disk.vmdk --size 65536 --format VMDK --variant Standard
20 /usr/bin/VBoxManage showvminfo vm-b9b883b9-fcfc-445f-7362-01abed26200b --machinereadable
21 /usr/bin/VBoxManage setextradata vm-b9b883b9-fcfc-445f-7362-01abed26200b VBoxInternal2/SilentReconfigureWhilePaused 1
22 /usr/bin/VBoxManage showvminfo vm-b9b883b9-fcfc-445f-7362-01abed26200b --machinereadable
23 /usr/bin/VBoxManage controlvm vm-b9b883b9-fcfc-445f-7362-01abed26200b pause
24 /usr/bin/VBoxManage storageattach vm-b9b883b9-fcfc-445f-7362-01abed26200b --storagectl IDE --port 1 --device 0 --type hdd --medium /home/cf-test-u1/.bosh_virtualbox_cpi/disks/disk-9e7808be-db81-40a6-5960-570a491ef1b5/disk.vmdk --mtype normal
25 /usr/bin/VBoxManage controlvm vm-b9b883b9-fcfc-445f-7362-01abed26200b resume

Step 21 and subsequent steps 23-25 are not supported by recent VirtualBox version and have only worked by accident in the past, see here: https://www.virtualbox.org/ticket/19251#comment:6 and following comment 7 as well as comment 4 in the thread.

Proposed solution

The virtualbox-cpi should:

Executing above storageattach against a running (not paused) Virtualbox VM running Ubuntu (in this case an ancient 16.04) results in the following kernel log messages:

Sep 28 22:34:44 VBox kernel: [1325048.521701] ata4: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
Sep 28 22:34:44 VBox kernel: [1325048.521708] ata4: irq_stat 0x80400040, connection status changed
Sep 28 22:34:44 VBox kernel: [1325048.521711] ata4: SError: { PHYRdyChg DevExch }
Sep 28 22:34:44 VBox kernel: [1325048.521724] ata4: hard resetting link
Sep 28 22:34:45 VBox kernel: [1325049.254359] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) 
Sep 28 22:34:45 VBox kernel: [1325049.254509] ata4.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
Sep 28 22:34:45 VBox kernel: [1325049.254512] ata4.00: 8388608 sectors, multi 128: LBA48 NCQ (depth 31/32) 
Sep 28 22:34:45 VBox kernel: [1325049.254697] ata4.00: configured for UDMA/133
Sep 28 22:34:45 VBox kernel: [1325049.254720] ata4: EH complete
Sep 28 22:34:45 VBox kernel: [1325049.254929] scsi 3:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0    
 ANSI: 5
Sep 28 22:34:45 VBox kernel: [1325049.255685] sd 3:0:0:0: [sdb] 8388608 512-byte logical blocks: (4.29 GB/4.00 GiB
)
Sep 28 22:34:45 VBox kernel: [1325049.255699] sd 3:0:0:0: [sdb] Write Protect is off
Sep 28 22:34:45 VBox kernel: [1325049.255701] sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Sep 28 22:34:45 VBox kernel: [1325049.255721] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't
 support DPO or FUA
Sep 28 22:34:45 VBox kernel: [1325049.258446] sd 3:0:0:0: Attached scsi generic sg2 type 0
Sep 28 22:34:45 VBox kernel: [1325049.259967] sd 3:0:0:0: [sdb] Attached SCSI disk

As can be seen, the kernel picked up the state change and attached the new disk, in this case as sdb. The newly attached disk is visible in the /dev filesystem:

test@vbox:~$ ls -l /dev/disk/by* | grep -e sdb -e disk
/dev/disk/by-id:
lrwxrwxrwx 1 root root  9 Sep 28 22:34 ata-VBOX_HARDDISK_VB88c2ea74-ffc2bff7 -> ../../sdb
/dev/disk/by-path:
lrwxrwxrwx 1 root root  9 Sep 28 22:34 pci-0000:00:0d.0-ata-2 -> ../../sdb
/dev/disk/by-uuid:

A modern Uuntu 20.04 behaves exactly the same, except it suppresses the error message SError: { PHYRdyChg DevExch } about unclean device connect in the kernel log.

A disk is removed by VBoxManage storageattach <uuid> --storagectl <ctrl_name> --port <port_num> --device 0 --type hdd --medium none

Caveats

pellared commented 3 years ago

@cppforlife Are there any plans to fix this issue in the upcoming days? It is a blocker for MacOS BigSur support. Reference: https://github.com/cloudfoundry/bosh-deployment/issues/378

Alternatively, if you give me some guidelines so that I can try making a PR.

ipa-bak commented 3 years ago

@cppforlife Are there any plans to fix this issue in the upcoming days? It is a blocker for MacOS BigSur support. Reference: cloudfoundry/bosh-deployment#378

Alternatively, if you give me some guidelines so that I can try making a PR.

I also still have it on my ToDo stack to get this solved, but will not get to it in the next 2+ months. If you think you can do something, I would be happy to test and comment. Note: I am just a user of this; I am not in any way related to the project and have no idea if and how the actual owner(s) of this project accept PRs (two are open since 3 and 4 years).

That said, if you have any questions regarding my findings earlier in this conversation, feel free to ask here and "@"-mention me.

bosh-admin-bot commented 3 years ago

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

pellared commented 3 years ago

Anybody?

bgandon commented 3 years ago

So, this repo has been donated by Dmitriy to the Cloud Foundry Foundation, and now the BOSH teams are finally admin on this repository, which allows us to take part in maintaining it. Thus the "stale bbot", the comments, and new PR…

Thanks for the great detailed analysis of the issue and the valuable links. This should be fixed with #20. For the time being, I'll close this and we may re-open if relevant.