Open weizhouapache opened 1 year ago
Regarding CKS, it is worth mentioning that:
From ACS 4.16 onwards, if a CKS cluster is to be deployed on VMware, the 'vmware.create.full.clone' configuration parameter will need to be set to true, so as to allow resizing of root volumes of the cluster nodes.
by @rohityadavcloud
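A minimal sketch of how the setting quoted above could be changed from the command line, assuming CloudMonkey (`cmk`) is installed and already has a working profile for the management server. The helper name `set_global_setting` and the `APPLY` switch are hypothetical and only serve the illustration; by default the function prints the command rather than running it.

```shell
# Dry run by default: prints the CloudMonkey command instead of running it.
# Set APPLY=1 to execute it (assumes cmk is installed and configured).
set_global_setting() {
    local name="$1" value="$2"
    if [ "${APPLY:-0}" = "1" ]; then
        cmk update configuration name="$name" value="$value"
    else
        echo "cmk update configuration name=$name value=$value"
    fi
}

# Setting named in the note above, needed for CKS root volume resizing on VMware.
set_global_setting vmware.create.full.clone true
```

The same change can be made under Global Settings in the UI. Whether a management server restart is needed afterwards is not stated in this thread.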
cc @weizhouapache @borisstoyanov @DaanHoogland @NuxRo I've started a discussion thread on VMware forum - https://communities.vmware.com/t5/ESXi-Discussions/VMware-disk-errors-when-booting-on-ESXi-8-0u1a/m-p/2980935#M289426
I tried to set up an mbx template and can consistently reproduce the issues with VMware ESXi 8.0u1a (I don't think vCenter is the problem). I tried both NFS and local datastore on ESXi 8.0u1 (using the latest build/ISO, VMware-VMvisor-Installer-8.0U1a-21813344.x86_64.iso).
by @weizhouapache
I did some more testing. Here are the results of two actions, (1) register template and (2) deploy VM:
VCSA 8.0U1 and ESXi 8.0b: works
Upgraded a host to 8.0c: works
Upgraded a host to 8.1 U1
2023-05-30 16:46:02,790 INFO [vmware.util.VmwareContext] (agentRequest-Handler-9:job-104/job-105, cmd: CopyCommand) Connected, conn: sun.net.www.protocol.https.DelegateHttpsURLConnection:https://10.0.32.197/nfc/52e7741f-89f7-ed9c-01df-eea4c3eb6911/disk-0.vmdk
I suspect it is caused by a change in ESXi 8.0 U1 that might lead to data loss during NFC (Network File Copy). See https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-esxi-801-release-notes/index.html
New file type for OSDATA volumes on SSD devices: vSphere 8.0 Update 1 adds a new file system type, VMFSOS, specifically for the ESX-OSData system partition on local SSD devices, which allows you to continue using virtual flash resources on other devices. The new file type prevents cases when you format an ESX-OSData volume on a local SSD device, and fsType returns a file of type Virtual Flash File System (VFFS). As a result, the disk backing of the ESX-OSData volume is listed under the Virtual Flash resources in vCenter, but such a disk belongs to the ESX-OSData volume and is not a part of the Virtual Flash resource pool.
@weizhouapache there's a new comment from a community member on the vmware community thread:
The issue you are experiencing is likely due to a change in the way that vSphere 8.0u1 handles storage. In 8.0u1, vSphere uses a new format for VMDK files, which is not compatible with older versions of vSphere. This is why you are not seeing the issue when you use 8.0 or older versions of vSphere.
There are a few things you can do to work around this issue:
You can upgrade your Apache CloudStack to a version that is compatible with vSphere 8.0u1.
You can create a new VMDK file in the older format. To do this, you will need to use the qemu-img command. For example, to create a 10GB VMDK file in the older format, you would use the following command:
qemu-img create -f raw /tmp/vmdk.raw 10G
Once you have created the new VMDK file, you can attach it to your VM and boot it up.
There also seems to be a new 8.0u2 release https://core.vmware.com/resource/whats-new-vsphere-8-update-2 ?
I'm having the same problem. Has anyone found a solution?
@omurozlu no, we will revisit 8.0U1 and 8.0U2 support in 4.19.1/4.20.0.
This issue still exists in 4.19.0.0 RC4, so VMware 8.0.1 is still unsupported.
cc @vladimirpetrov @shwstppr
I have done some tests with CS 4.18.1, ESXi 8.0u2 (ESXi-8.0U2-22380479-standard) and vCenter 8.0u2 and it looks like it's somewhat working. My environment is hosted in KVM and for the VMs to not have issues with the underlying storage I had to use RAW disk format instead of QCOW2. There are still issues though when deploying templates but they are intermittent which is odd, sometimes it doesn't work for 8 attempts in a row then it works (disk locked issue).
After a while I started getting read-only filesystem errors on my routers though.
Is there anything that CloudStack can do to fix this issue? Where exactly is the problem?
@alexandru-bagu I investigated the issue for several days last year; unfortunately I could not find the root cause or a fix. Early this year I tested 4.20.0.0-SNAPSHOT with the new Debian 12 systemvm template (see #8497), and surprisingly it worked. I suspect the issue was caused by some Linux kernel changes, but I cannot confirm it. If that is true, some user VMs might be impacted as well.
I suggest using VMware 8.0, not 8.0u1/u2, which are not officially supported by ACS 4.19 but might be supported in ACS 4.20.
I was able to work around this problem by forcing an fsck on the first boot of the image. To do this, you need to run "touch /forcefsck" on the image's root filesystem.
It worked on 8.0u3 with vSAN ESA.
Detailed description below:
tar -xf systemvmtemplate-4.19.1-vmware.ova
qemu-img convert -O raw systemvmtemplate-4.19.1-vmware-disk1.vmdk your_file.raw
losetup -fP your_file.raw
mount /dev/loop0p6 /mnt
touch /mnt/forcefsck
umount /mnt
losetup -d /dev/loop0
qemu-img convert -O vmdk -o subformat=streamOptimized your_file.raw systemvmtemplate-4.19.1-vmware-disk1.vmdk
ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk
# Changed ovf:size with the new image size in bytes from "ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk" inside file systemvmtemplate-4.19.1-vmware.ovf
sha256sum --tag systemvmtemplate-4.19.1-vmware.ovf > systemvmtemplate-4.19.1-vmware.mf
sha256sum --tag systemvmtemplate-4.19.1-vmware-disk1.vmdk >> systemvmtemplate-4.19.1-vmware.mf
tar -cvf systemvmtemplate-4.19.1-vmware.ova systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk systemvmtemplate-4.19.1-vmware.mf
# Published a new template image inside UI
# Changed global settings router.template.vmware to new image name
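The steps above assume the image is attached as /dev/loop0 and leave the size lookup and manifest creation as manual steps. The sketch below wraps those three points in small helpers. It assumes GNU coreutils and util-linux on Linux; the function names are hypothetical, and the partition number (p6) is taken from the commands above rather than verified independently.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for the manual template-patching steps above.
set -euo pipefail

# Attach a raw image with partition scanning and print the loop device
# that was actually allocated, instead of assuming /dev/loop0. Needs root.
attach_image() {
    losetup --show -fP "$1"
}

# Print a file's size in bytes, i.e. the value to put into ovf:size.
file_size_bytes() {
    stat -c %s "$1"
}

# Write a manifest holding SHA256 entries for the given files,
# in the same "sha256sum --tag" format used in the steps above.
write_manifest() {
    local manifest="$1"; shift
    sha256sum --tag "$@" > "$manifest"
}

# Example use (as root), mirroring the steps above:
#   dev=$(attach_image your_file.raw)
#   mount "${dev}p6" /mnt && touch /mnt/forcefsck && umount /mnt
#   losetup -d "$dev"
#   file_size_bytes systemvmtemplate-4.19.1-vmware-disk1.vmdk
#   write_manifest systemvmtemplate-4.19.1-vmware.mf \
#       systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk
```

The manifest has to be written after the ovf:size edit, since it contains the checksum of the .ovf file.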
Thanks a lot for the update, @leolns.
How long does fsck take in your environment?
It took only a few seconds to run and it only runs on the first boot.
To support VMware 8.0U1 (8.0.1.0), I made some manual database changes below
and
However, I faced many storage-related issues:
System VMs and VRs boot into a read-only file system, but work fine after a soft reboot (Ctrl+Alt+Delete) or a hard reboot.
Sometimes a VM cannot be powered on; this mostly happens on the first VM deployment of a new template. This has been addressed by commit https://github.com/apache/cloudstack/pull/7380/commits/a2fcf0d66ad3962b61d5aa12a4b17b96a2cca840 in PR #7380.
Marvin test failure with test_internal_lb.py: it works inside some VMs, but in other VMs there is the error below:
sshClient: DEBUG: {Cmd: /usr/bin/wget -T3 -qO- --user=admin --password=password http://10.1.2.12:8081/admin?stats via Host: 10.0.52.187} {returns: ["/usr/bin/wget: '/usr/lib/libpcre.so.1' is not an ELF file", "/usr/bin/wget: can't load library 'libpcre.so.1'"]}
This has been addressed by commit https://github.com/apache/cloudstack/pull/7380/commits/b1c08fddd6104fdd823411fbc1311fe2a136f307 in PR #7380.
Kubernetes control/worker nodes have a read-only file system.
Kubernetes cluster is stuck at Starting.
Error cloning VM from template in primary storage.
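To check whether a system VM, VR, or Kubernetes node has come up with a read-only root file system, the mount options can be read from /proc/mounts inside the guest. This is a rough diagnostic sketch, not part of CloudStack; the function name `mount_mode` is hypothetical, and the optional second argument exists only so the parsing can be tried against a sample file.

```shell
# Print "ro" or "rw" for a mount point, reading a /proc/mounts-style file.
# If the mount point is listed more than once, the last entry wins.
# Returns non-zero if the mount point is not listed at all.
mount_mode() {
    local mountpoint="$1" mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            mode = "rw"
            for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
            last = mode
        }
        END { if (last != "") print last; else exit 1 }
    ' "$mounts_file"
}

# Example: run inside the guest after first boot.
#   mount_mode /
```

A result of "ro" right after the first boot, and "rw" after a reboot, would match the behaviour described in the list above.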