Storage issues on VMware 8.0U1 (8.0.1.0)

weizhouapache commented 1 year ago

To support VMware 8.0U1 (8.0.1.0), I made the manual database changes below:

INSERT IGNORE INTO `cloud`.`hypervisor_capabilities` (uuid, hypervisor_type, hypervisor_version, max_guests_limit, security_group_enabled, max_data_volumes_limit, max_hosts_per_cluster, storage_motion_supported, vm_snapshot_enabled) values (UUID(), 'VMware', '8.0.1.0', 1024, 0, 59, 64, 1, 1);

and

INSERT IGNORE INTO `cloud`.`guest_os_hypervisor` (uuid,hypervisor_type, hypervisor_version, guest_os_name, guest_os_id, created, is_user_defined) SELECT UUID(),'VMware', '8.0.1.0', guest_os_name, guest_os_id, utc_timestamp(), 0  FROM `cloud`.`guest_os_hypervisor` WHERE hypervisor_type='VMware' AND hypervisor_version='8.0.0.1';
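
For anyone repeating this for another version (e.g. 8.0U2), here is a minimal sketch that only prints the two statements above with the version strings filled in. The function name `gen_vmware_version_sql` is hypothetical, the capability values (1024, 0, 59, 64, 1, 1) are simply copied from the statement above, and the backticks around identifiers are dropped to keep the shell quoting simple. Nothing is executed against the database.

```shell
#!/bin/sh
# gen_vmware_version_sql NEW_VERSION BASE_VERSION
# Prints (does not run) the two INSERT statements for NEW_VERSION, copying the
# guest OS mappings from BASE_VERSION. Hypothetical helper, not part of CloudStack.
gen_vmware_version_sql() {
    new_ver=$1
    base_ver=$2
    # Accept only digits and dots so the values are safe to embed in SQL.
    for v in "$new_ver" "$base_ver"; do
        case $v in
            ''|*[!0-9.]*) echo "invalid version: '$v'" >&2; return 1 ;;
        esac
    done
    cat <<EOF
INSERT IGNORE INTO cloud.hypervisor_capabilities (uuid, hypervisor_type, hypervisor_version, max_guests_limit, security_group_enabled, max_data_volumes_limit, max_hosts_per_cluster, storage_motion_supported, vm_snapshot_enabled) VALUES (UUID(), 'VMware', '$new_ver', 1024, 0, 59, 64, 1, 1);
INSERT IGNORE INTO cloud.guest_os_hypervisor (uuid, hypervisor_type, hypervisor_version, guest_os_name, guest_os_id, created, is_user_defined) SELECT UUID(), 'VMware', '$new_ver', guest_os_name, guest_os_id, utc_timestamp(), 0 FROM cloud.guest_os_hypervisor WHERE hypervisor_type='VMware' AND hypervisor_version='$base_ver';
EOF
}

# Example: gen_vmware_version_sql 8.0.1.0 8.0.0.1
```

The output can be reviewed and then pasted into a MySQL session on the management server database.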

However, I faced many issues, all related to storage:

System VMs and VRs boot into a read-only file system, but work fine after a soft reboot (Ctrl+Alt+Delete) or a hard reboot.
Sometimes a VM cannot be powered on; this mostly happens on the first VM deployment from a new template. This has been addressed by commit https://github.com/apache/cloudstack/pull/7380/commits/a2fcf0d66ad3962b61d5aa12a4b17b96a2cca840 in PR #7380.
Marvin test failure in test_internal_lb.py: the test works inside some VMs, but in other VMs it fails with the error below. This has been addressed by commit https://github.com/apache/cloudstack/pull/7380/commits/b1c08fddd6104fdd823411fbc1311fe2a136f307 in PR #7380.
sshClient: DEBUG: {Cmd: /usr/bin/wget -T3 -qO- --user=admin --password=password http://10.1.2.12:8081/admin?stats via Host: 10.0.52.187} {returns: ["/usr/bin/wget: '/usr/lib/libpcre.so.1' is not an ELF file", "/usr/bin/wget: can't load library 'libpcre.so.1'"]}
Kubernetes control/worker nodes have a read-only file system.
Kubernetes cluster is stuck at Starting.

Error cloning VM from template in primary storage

2023-04-29 08:30:05,771 ERROR [c.c.s.r.VmwareStorageProcessor] (DirectAgent-285:ctx-a6342678 10.0.32.132, job-2661/job-2662, cmd: CopyCommand) (logid:1e91ee05) Error cloning VM from template in primary storage: %sUnable to access file /vmfs/volumes/e243b6f2-2c50ea8e/c86c7187-363a-4b41-baa1-267b78ccdc69/c86c7187-363a-4b41-baa1-267b78ccdc69-000001.vmdk since it is locked
java.lang.RuntimeException: Unable to access file /vmfs/volumes/e243b6f2-2c50ea8e/c86c7187-363a-4b41-baa1-267b78ccdc69/c86c7187-363a-4b41-baa1-267b78ccdc69-000001.vmdk since it is locked
    at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:426)
    at com.cloud.hypervisor.vmware.mo.VirtualMachineMO.createFullClone(VirtualMachineMO.java:856)
    at com.cloud.storage.resource.VmwareStorageProcessor.createVMFullClone(VmwareStorageProcessor.java:772)
    at com.cloud.storage.resource.VmwareStorageProcessor.cloneVMFromTemplate(VmwareStorageProcessor.java:3836)
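
To see how often this happens and which disks are affected, a small sketch like the one below can pull the locked VMDK paths out of a management server log. It matches only the "Unable to access file ... since it is locked" wording from the log above. The function name `locked_vmdks` is hypothetical, and the log location in the usage comment is an assumption about a default install.

```shell
#!/bin/sh
# locked_vmdks LOGFILE
# Prints each distinct file path reported as locked, one per line.
locked_vmdks() {
    sed -n 's/.*Unable to access file \(.*\) since it is locked.*/\1/p' "$1" | sort -u
}

# Example (path is an assumption, adjust to your setup):
# locked_vmdks /var/log/cloudstack/management/management-server.log
```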

ISSUE TYPE

Bug Report

COMPONENT NAME

VMware

CLOUDSTACK VERSION

4.18 + manual DB changes

weizhouapache commented 1 year ago

Regarding CKS, it is worth mentioning that:

From ACS 4.16 onwards, if a CKS cluster is to be deployed on VMware, the 'vmware.create.full.clone' configuration parameter will need to be set to true, so as to allow resizing of root volumes of the cluster nodes.
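
As a rough sketch, the setting can be changed from the command line, assuming CloudMonkey (`cmk`) is installed and already configured against the management server. The wrapper name `set_full_clone` and the `CMK` override variable are hypothetical and only there to allow a dry run. The same change can be made under Global Settings in the UI.

```shell
#!/bin/sh
# set_full_clone [true|false]
# Sets the vmware.create.full.clone global setting through CloudMonkey.
# Assumption: cmk is on PATH and has a working profile.
# Set CMK=echo to print the command instead of running it.
set_full_clone() {
    value=${1:-true}
    case $value in
        true|false) ;;
        *) echo "value must be true or false" >&2; return 1 ;;
    esac
    "${CMK:-cmk}" update configuration name=vmware.create.full.clone value="$value"
}

# Example: set_full_clone true
```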

weizhouapache commented 1 year ago

by @rohityadavcloud

cc @weizhouapache @borisstoyanov @DaanHoogland @NuxRo I've started a discussion thread on VMware forum - https://communities.vmware.com/t5/ESXi-Discussions/VMware-disk-errors-when-booting-on-ESXi-8-0u1a/m-p/2980935#M289426

I tried to set up an mbx template and I can consistently reproduce the issues with ESXi 8.0u1a (I don't think vCenter is the problem). I tried both NFS and local datastore on ESXi 8.0u1 (using the latest build/ISO, VMware-VMvisor-Installer-8.0U1a-21813344.x86_64.iso).

weizhouapache commented 1 year ago

by @weizhouapache

I did some more testing. Here are the results of the two actions, (1) register template and (2) deploy VM:

VCSA 8.0U1 and ESXi 8.0b: works
Upgraded a host to 8.0c: works
Upgraded a host to 8.0 U1:
If only the 8.0 U1 host is enabled: does not work. If another 8.0c host is then enabled, deploy VM does not work either (on the same primary storage).
If only the 8.0c host is enabled: works. If another 8.0 U1 host is then enabled, deploy VM also works (on the same primary storage).

Thus, it looks like an issue introduced by the ESXi upgrade (between 8.0c and 8.0 U1). The only difference between the two cases is the host which handles the CopyCommand from secondary storage to primary storage.

2023-05-30 16:46:02,790 INFO [vmware.util.VmwareContext] (agentRequest-Handler-9:job-104/job-105, cmd: CopyCommand) Connected, conn: sun.net.www.protocol.https.DelegateHttpsURLConnection:https://10.0.32.197/nfc/52e7741f-89f7-ed9c-01df-eea4c3eb6911/disk-0.vmdk

I suspect it is caused by a change in ESXi 8.0 U1, which might cause data loss during NFC (Network File Copy). See https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-esxi-801-release-notes/index.html

New file type for OSDATA volumes on SSD devices: vSphere 8.0 Update 1 adds a new file system type, VMFSOS, specifically for the ESX-OSData system partition on local SSD devices, which allows you to continue using virtual flash resources on other devices. The new file type prevents cases when you format an ESX-OSData volume on a local SSD device, and fsType returns a file of type Virtual Flash File System (VFFS). As a result, the disk backing of the ESX-OSData volume is listed under the Virtual Flash resources in vCenter, but such a disk belongs to the ESX-OSData volume and is not a part of the Virtual Flash resource pool.

rohityadavcloud commented 1 year ago

@weizhouapache there's a new comment from a community member on the vmware community thread:

The issue you are experiencing is likely due to a change in the way that vSphere 8.0u1 handles storage. In 8.0u1, vSphere uses a new format for VMDK files, which is not compatible with older versions of vSphere. This is why you are not seeing the issue when you use 8.0 or older versions of vSphere.

There are a few things you can do to work around this issue:

You can upgrade your Apache CloudStack to a version that is compatible with vSphere 8.0u1.
You can create a new VMDK file in the older format. To do this, you will need to use the qemu-img command. For example, to create a 10GB VMDK file in the older format, you would use the following command:

qemu-img create -f raw /tmp/vmdk.raw 10G

Once you have created the new VMDK file, you can attach it to your VM and boot it up.

There also seems to be a new 8.0u2 release https://core.vmware.com/resource/whats-new-vsphere-8-update-2 ?

omurozlu commented 10 months ago

I'm having the same problem. Has anyone found a solution?

weizhouapache commented 10 months ago

@omurozlu no, we will revisit 8.0U1 and 8.0U2 support in 4.19.1/4.20.0.

weizhouapache commented 9 months ago

This issue still exists in 4.19.0.0 RC4, so VMware 8.0.1 is still unsupported.

cc @vladimirpetrov @shwstppr

alexandru-bagu commented 5 months ago

I have done some tests with CS 4.18.1, ESXi 8.0u2 (ESXi-8.0U2-22380479-standard) and vCenter 8.0u2 and it looks like it's somewhat working. My environment is hosted in KVM and for the VMs to not have issues with the underlying storage I had to use RAW disk format instead of QCOW2. There are still issues though when deploying templates but they are intermittent which is odd, sometimes it doesn't work for 8 attempts in a row then it works (disk locked issue).

After a while I started getting read-only filesystem errors on my routers though.

Is there anything that CloudStack can do to fix this issue? Where exactly is the problem?

weizhouapache commented 5 months ago

@alexandru-bagu I investigated the issue for some days last year; unfortunately, I could not find the root cause or a fix. Early this year I tested 4.20.0.0-SNAPSHOT with the new Debian 12 systemvm template (see #8497) and, surprisingly, it worked. I suspect the issue was caused by some Linux kernel changes, but I cannot confirm it. If that is true, some user VMs might be impacted as well.

I suggest using VMware 8.0, not 8.0u1/u2, which are not officially supported by ACS 4.19 but might be supported in ACS 4.20.

leolns commented 2 months ago

I was able to work around this problem by forcing an fsck on the first boot of the image. To do this, you need to run "touch /forcefsck" on the image's root file system.

It worked on 8.0u3 with vSAN ESA.

Detailed description below:

tar -xf systemvmtemplate-4.19.1-vmware.ova 
qemu-img convert -O raw systemvmtemplate-4.19.1-vmware-disk1.vmdk your_file.raw
losetup -fP your_file.raw
mount /dev/loop0p6 /mnt
touch /mnt/forcefsck
umount /mnt
losetup -d /dev/loop0
qemu-img convert -O vmdk -o subformat=streamOptimized your_file.raw systemvmtemplate-4.19.1-vmware-disk1.vmdk
ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk
# Changed ovf:size with the new image size in bytes from "ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk" inside file systemvmtemplate-4.19.1-vmware.ovf
sha256sum --tag systemvmtemplate-4.19.1-vmware.ovf > systemvmtemplate-4.19.1-vmware.mf
sha256sum --tag systemvmtemplate-4.19.1-vmware-disk1.vmdk >> systemvmtemplate-4.19.1-vmware.mf
tar -cvf systemvmtemplate-4.19.1-vmware.ova systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk systemvmtemplate-4.19.1-vmware.mf
# Published a new template image inside UI
# Changed global settings router.template.vmware to new image name
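
Two steps above are easy to get wrong when repeated by hand: the loop device is not always /dev/loop0, and the ovf:size edit is manual. Below is a minimal sketch of both, not a tested replacement for the procedure. The function names are hypothetical. Partition 6 and /mnt are taken from the commands above. `update_ovf_size` assumes the .ovf references a single disk, so every ovf:size attribute in the file is rewritten. `mark_forcefsck` needs root.

```shell
#!/bin/sh
# mark_forcefsck IMAGE [PARTITION] [MOUNTPOINT]
# Attaches the raw image to a free loop device, uses the device name that
# losetup reports (instead of assuming /dev/loop0), creates /forcefsck on the
# given partition, then unmounts and detaches. Requires root.
mark_forcefsck() {
    image=$1
    part=${2:-6}
    mnt=${3:-/mnt}
    loopdev=$(losetup -fP --show "$image") || return 1
    status=0
    if mount "${loopdev}p${part}" "$mnt"; then
        touch "$mnt/forcefsck" || status=1
        umount "$mnt" || status=1
    else
        status=1
    fi
    losetup -d "$loopdev" || status=1
    return "$status"
}

# update_ovf_size OVF VMDK
# Rewrites ovf:size="..." in OVF to the size of VMDK in bytes.
# Assumption: the OVF describes one disk only.
update_ovf_size() {
    ovf=$1
    vmdk=$2
    [ -f "$vmdk" ] || return 1
    size=$(wc -c < "$vmdk" | tr -d ' ')
    sed "s/ovf:size=\"[0-9]*\"/ovf:size=\"$size\"/" "$ovf" > "$ovf.tmp" &&
        mv "$ovf.tmp" "$ovf"
}

# Example:
# mark_forcefsck your_file.raw 6 /mnt
# update_ovf_size systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk
```

The manifest still has to be regenerated after the .ovf is changed, as in the sha256sum steps above.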

weizhouapache commented 2 months ago

Thanks a lot for the update, @leolns.

How long does fsck take in your environment?

leolns commented 2 months ago

It took only a few seconds to run and it only runs on the first boot.
