apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Storage issues on VMware 8.0U1 (8.0.1.0) #7572

Open weizhouapache opened 1 year ago

weizhouapache commented 1 year ago

To support VMware 8.0U1 (8.0.1.0), I made the following manual database changes:

INSERT IGNORE INTO `cloud`.`hypervisor_capabilities`
    (uuid, hypervisor_type, hypervisor_version, max_guests_limit, security_group_enabled,
     max_data_volumes_limit, max_hosts_per_cluster, storage_motion_supported, vm_snapshot_enabled)
VALUES (UUID(), 'VMware', '8.0.1.0', 1024, 0, 59, 64, 1, 1);

and

INSERT IGNORE INTO `cloud`.`guest_os_hypervisor`
    (uuid, hypervisor_type, hypervisor_version, guest_os_name, guest_os_id, created, is_user_defined)
SELECT UUID(), 'VMware', '8.0.1.0', guest_os_name, guest_os_id, utc_timestamp(), 0
FROM `cloud`.`guest_os_hypervisor`
WHERE hypervisor_type='VMware' AND hypervisor_version='8.0.0.1';
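The two statements above differ from a future version bump only in the version strings, so they could be generated rather than hand-edited. The following is a minimal sketch, not part of CloudStack: the function name gen_vmware_version_sql and its two arguments are hypothetical, and the capability values (1024, 0, 59, 64, 1, 1) are simply copied from the statements above. It only prints SQL; nothing is sent to the database.

```shell
# Hypothetical helper (not part of CloudStack): print the two INSERT statements
# for a new VMware version, copying guest OS mappings from a base version that
# already exists in guest_os_hypervisor.
gen_vmware_version_sql() {
  new_version="$1"    # e.g. 8.0.1.0
  base_version="$2"   # e.g. 8.0.0.1
  # Refuse empty arguments and anything other than digits and dots, because
  # the values are interpolated directly into the SQL text.
  if [ -z "$new_version" ] || [ -z "$base_version" ]; then
    echo "usage: gen_vmware_version_sql NEW_VERSION BASE_VERSION" >&2
    return 1
  fi
  case "${new_version}${base_version}" in
    *[!0-9.]*) echo "versions may contain only digits and dots" >&2; return 1 ;;
  esac
  cat <<EOF
INSERT IGNORE INTO cloud.hypervisor_capabilities (uuid, hypervisor_type, hypervisor_version, max_guests_limit, security_group_enabled, max_data_volumes_limit, max_hosts_per_cluster, storage_motion_supported, vm_snapshot_enabled) VALUES (UUID(), 'VMware', '${new_version}', 1024, 0, 59, 64, 1, 1);
INSERT IGNORE INTO cloud.guest_os_hypervisor (uuid, hypervisor_type, hypervisor_version, guest_os_name, guest_os_id, created, is_user_defined) SELECT UUID(), 'VMware', '${new_version}', guest_os_name, guest_os_id, utc_timestamp(), 0 FROM cloud.guest_os_hypervisor WHERE hypervisor_type='VMware' AND hypervisor_version='${base_version}';
EOF
}
```

Calling gen_vmware_version_sql 8.0.1.0 8.0.0.1 prints statements equivalent to the ones above, which can be reviewed before being run against the cloud database.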

However, I then ran into many storage-related issues.

COMPONENT NAME
VMware
CLOUDSTACK VERSION
4.18 + manual DB changes
weizhouapache commented 1 year ago

Regarding CKS, it is worth mentioning that:

From ACS 4.16 onwards, if a CKS cluster is to be deployed on VMware, the 'vmware.create.full.clone' configuration parameter will need to be set to true, so as to allow resizing of root volumes of the cluster nodes.

weizhouapache commented 1 year ago

by @rohityadavcloud

cc @weizhouapache @borisstoyanov @DaanHoogland @NuxRo I've started a discussion thread on the VMware forum - https://communities.vmware.com/t5/ESXi-Discussions/VMware-disk-errors-when-booting-on-ESXi-8-0u1a/m-p/2980935#M289426

I tried to set up an mbx template and can consistently reproduce the issues with VMware ESXi 8.0u1a (I think vCenter isn't the issue). I tried both NFS and a local datastore on ESXi 8.0u1 (using the latest build/ISO, VMware-VMvisor-Installer-8.0U1a-21813344.x86_64.iso).

weizhouapache commented 1 year ago

by @weizhouapache

I did some more testing. Here are the results of the actions (1) register template and (2) deploy VM:
VCSA 8.0U1 and ESXi 8.0b: works
Upgraded a host to 8.0c: works
Upgraded a host to 8.0 U1

2023-05-30 16:46:02,790 INFO [vmware.util.VmwareContext] (agentRequest-Handler-9:job-104/job-105, cmd: CopyCommand) Connected, conn: sun.net.www.protocol.https.DelegateHttpsURLConnection:https://10.0.32.197/nfc/52e7741f-89f7-ed9c-01df-eea4c3eb6911/disk-0.vmdk

I suspect it is caused by a change in ESXi 8.0 U1 that might cause data loss during NFC (Network File Copy). See https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-esxi-801-release-notes/index.html

New file type for OSDATA volumes on SSD devices: vSphere 8.0 Update 1 adds a new file system type, VMFSOS, specifically for the ESX-OSData system partition on local SSD devices, which allows you to continue using virtual flash resources on other devices. The new file type prevents cases when you format an ESX-OSData volume on a local SSD device, and fsType returns a file of type Virtual Flash File System (VFFS). As a result, the disk backing of the ESX-OSData volume is listed under the Virtual Flash resources in vCenter, but such a disk belongs to the ESX-OSData volume and is not a part of the Virtual Flash resource pool.

rohityadavcloud commented 1 year ago

@weizhouapache there's a new comment from a community member on the VMware community thread:

The issue you are experiencing is likely due to a change in the way that vSphere 8.0u1 handles storage. In 8.0u1, vSphere uses a new format for VMDK files, which is not compatible with older versions of vSphere. This is why you are not seeing the issue when you use 8.0 or older versions of vSphere.

There are a few things you can do to work around this issue:

You can upgrade your Apache CloudStack to a version that is compatible with vSphere 8.0u1.
You can create a new VMDK file in the older format. To do this, you will need to use the qemu-img command. For example, to create a 10GB VMDK file in the older format, you would use the following command:

qemu-img create -f raw /tmp/vmdk.raw 10G

Once you have created the new VMDK file, you can attach it to your VM and boot it up.

There also seems to be a new 8.0u2 release https://core.vmware.com/resource/whats-new-vsphere-8-update-2 ?

omurozlu commented 10 months ago

I'm having the same problem. Has anyone found a solution?

weizhouapache commented 10 months ago

@omurozlu no, we will revisit 8.0U1 and 8.0U2 support in 4.19.1/4.20.0.

weizhouapache commented 9 months ago

This issue still exists in 4.19.0.0 RC4, so VMware 8.0.1 is still unsupported.

cc @vladimirpetrov @shwstppr

alexandru-bagu commented 5 months ago

I have done some tests with CS 4.18.1, ESXi 8.0u2 (ESXi-8.0U2-22380479-standard) and vCenter 8.0u2, and it looks like it's somewhat working. My environment is hosted in KVM, and to keep the VMs from having issues with the underlying storage I had to use the RAW disk format instead of QCOW2. There are still issues when deploying templates, but oddly they are intermittent: sometimes deployment fails 8 times in a row and then works (disk locked issue).

After a while I started getting read-only filesystem errors on my routers though.

Is there anything that CloudStack can do to fix this issue? Where exactly is the problem?

weizhouapache commented 5 months ago

@alexandru-bagu I investigated the issue for several days last year, but unfortunately I could not find the root cause or a fix. Early this year I tested 4.20.0.0-SNAPSHOT with the new Debian 12 systemvm template (see #8497), and surprisingly it worked. I suspect the issue was caused by some Linux kernel changes, but I cannot confirm it. If that is true, some user VMs might be impacted as well.

I suggest using VMware 8.0 rather than 8.0u1/u2, which are not officially supported by ACS 4.19 but might be supported in ACS 4.20.

leolns commented 2 months ago

I was able to work around this problem by forcing an fsck on the first boot of the image. To do this, you need to run "touch /forcefsck" inside the image.

It worked on 8.0u3 with vSAN ESA.

Detailed description below:

# Run as root (losetup and mount need it)
tar -xf systemvmtemplate-4.19.1-vmware.ova
qemu-img convert -O raw systemvmtemplate-4.19.1-vmware-disk1.vmdk your_file.raw
# Attach the raw image with partition scanning; --show prints the loop device
# that was actually allocated, which is not necessarily /dev/loop0
LOOP=$(losetup -fP --show your_file.raw)
mount "${LOOP}p6" /mnt
touch /mnt/forcefsck
umount /mnt
losetup -d "$LOOP"
qemu-img convert -O vmdk -o subformat=streamOptimized your_file.raw systemvmtemplate-4.19.1-vmware-disk1.vmdk
ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk
# Edit systemvmtemplate-4.19.1-vmware.ovf: set ovf:size to the new image size in bytes reported by the "ls -la" above
sha256sum --tag systemvmtemplate-4.19.1-vmware.ovf > systemvmtemplate-4.19.1-vmware.mf
sha256sum --tag systemvmtemplate-4.19.1-vmware-disk1.vmdk >> systemvmtemplate-4.19.1-vmware.mf
tar -cvf systemvmtemplate-4.19.1-vmware.ova systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk systemvmtemplate-4.19.1-vmware.mf
# Publish the new template image in the UI
# Change the global setting router.template.vmware to the new image name
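The ovf:size edit in the procedure above was done by hand. It could probably be scripted along the following lines. This is only a sketch under stated assumptions: the function name update_ovf_size is hypothetical, GNU stat and GNU sed are assumed, and the OVF descriptor is assumed to reference the disk in a single-line File element that carries both ovf:href and ovf:size. Check the descriptor before relying on it.

```shell
# Hypothetical helper: set ovf:size in an OVF descriptor to the byte size of
# the given disk file. Assumes GNU stat and GNU sed, and that the <File>
# element referencing the disk sits on one line with both ovf:href and ovf:size.
update_ovf_size() {
  ovf="$1"     # path to the .ovf descriptor
  vmdk="$2"    # path to the regenerated .vmdk
  size=$(stat -c %s "$vmdk") || return 1   # size in bytes
  name=$(basename "$vmdk")
  # Rewrite ovf:size only on the line whose ovf:href names this disk file.
  sed -i "/ovf:href=\"${name}\"/ s/ovf:size=\"[0-9]*\"/ovf:size=\"${size}\"/" "$ovf"
}

# Example (run before regenerating the .mf checksums):
#   update_ovf_size systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk
```

The update has to happen before the sha256sum lines, because the manifest records the checksum of the edited descriptor.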
weizhouapache commented 2 months ago

Thanks a lot for the update, @leolns.

How long does fsck take in your environment?

leolns commented 2 months ago

It took only a few seconds to run and it only runs on the first boot.