Cloudstack volume is not deleted when PVC is deleted

swhitman0 commented 1 year ago

ISSUE TYPE

Bug Report

COMPONENT NAME

Storage volumes, Kubernetes Service

CLOUDSTACK VERSION

4.17.2.0

CONFIGURATION

Advanced networking with vxlan isolation of guest networks
Kubernetes service enabled
Cloudstack CSI driver deployed

OS / ENVIRONMENT

Rocky Linnux 8.7

SUMMARY

When a Kubernetes persistent volume claim is deleted, the Cloudstack volume is not deleted, and an error message is written to the management log.

STEPS TO REPRODUCE

Install the CSI driver in the K8s cluster using the command :

kubectl apply -f https://github.com/apalia/cloudstack-csi-driver/releases/latest/download/manifest.yaml

Deploy this storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cloudstack-custom
provisioner: csi.cloudstack.apache.org
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
parameters:
csi.cloudstack.apache.org/disk-offering-id: <id of custom disk offering>

Deploy this pvs and pod:


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: example-pvc
spec:
storageClassName: cloudstack-custom
accessModes:
- ReadWriteOnce
resources:
requests:
  storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
containers:
- name: example
  image: busybox
  volumeMounts:
    - mountPath: "/data"
      name: example-volume
  stdin: true
  stdinOnce: true
  tty: true
volumes:
- name: example-volume
  persistentVolumeClaim:
    claimName: example-pvc

* Delete the pod and pvc
<!-- You can also paste gist.github.com links for larger files -->

##### EXPECTED RESULTS
<!-- What did you expect to happen when running the steps above? -->

I expected the Cloudstack storage volume backing the PVC would be deleted.


##### ACTUAL RESULTS
<!-- What actually happened? -->
The storage volume was not deleted, and the following error was printed to the management log:
<!-- Paste verbatim command output between quotes below -->

2023-03-05 14:18:39,144 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-92:ctx-3f7cf487 job-1657/job-1658 ctx-d6957fbc) (logid:fd682bcf) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: Could not detach volume. Probably the VM is in boot state at the moment 2023-03-05 14:18:39,144 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-92:ctx-3f7cf487 job-1657/job-1658 ctx-d6957fbc) (logid:fd682bcf) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: Could not detach volume. Probably the VM is in boot state at the moment 2023-03-05 14:18:39,144 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-92:ctx-3f7cf487 job-1657/job-1658) (logid:fd682bcf) Done with run of VM work job: com.cloud.vm.VmWorkDetachVolume for VM 188, job origin: 1657 2023-03-05 14:18:39,144 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-92:ctx-3f7cf487 job-1657/job-1658) (logid:fd682bcf) Unable to complete AsyncJobVO: {id:1658, userId: 4, accountId: 2, instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkDetachVolume, cmdInfo: rO0ABXNyAB9jb20uY2xvdWQudm0uVm1Xb3JrRGV0YWNoVm9sdW1lhvReA-s81AMCAAFMAAh2b2x1bWVJZHQAEExqYXZhL2xhbmcvTG9uZzt4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNlcklkSgAEdm1JZEwAC2hhbmRsZXJOYW1ldAASTGphdmEvbGFuZy9TdHJpbmc7eHAAAAAAAAAAAgAAAAAAAAAEAAAAAAAAALx0ABRWb2x1bWVBcGlTZXJ2aWNlSW1wbHNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAvA, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 128437426857854, completeMsid: null, lastUpdated: null, lastPolled: null, created: Sun Mar 05 14:18:36 EST 2023, removed: null}, job origin:1657 com.cloud.utils.exception.CloudRuntimeException: Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: Could not detach volume. Probably the VM is in boot state at the moment at com.cloud.storage.VolumeApiServiceImpl.orchestrateDetachVolumeFromVM(VolumeApiServiceImpl.java:2793) at com.cloud.storage.VolumeApiServiceImpl.orchestrateDetachVolumeFromVM(VolumeApiServiceImpl.java:4404) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:107) at com.cloud.storage.VolumeApiServiceImpl.handleVmWorkJob(VolumeApiServiceImpl.java:4435) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215) at com.sun.proxy.$Proxy231.handleVmWorkJob(Unknown Source) at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:102) at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620) at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48) at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102) at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45) at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829)

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

kiranchavala commented 1 year ago

@swhitman0

when i gave "kubectl delete pod <>" "kubectl delete pvc <>"

I was able to detach /delete the volume on cloud stack and delete the pods , there was no issue

Could you please check if the csi provider is running fine

kubectl get pods -n kube-system |grep cloudstack

cloudstack-csi-controller-5cf89485f6-xfnbm             3/3     Running   0              19m
cloudstack-csi-node-4w4kt                              2/2     Running   0              31m
cloudstack-csi-node-tl6nj                              2/2     Running   0              31m

also, i would request you to log the issue at this Git hub repo

https://github.com/apalia/cloudstack-csi-driver

swhitman0 commented 1 year ago

@kiranchavala

Thanks for the response. The csi provider is running fine - 3/3 Running - no restarts.

Instead of executing kubectl delete -f , I separated the commands by first deleting the pod and then the PVC, but unfortunately, I still have the problem.

I will file the issue in the apalla/cloudstack-csi-driver as you requested.

Would you happen to know how I could clean up these volumes in Cloudstack? When I press the "Detach disk" button, I get an error.

kiranchavala commented 1 year ago

@swhitman0

You can try to detach the volume from Cloudstack ui or from the api call

https://cloudstack.apache.org/api/apidocs-4.17/apis/detachVolume.html

detach

From the logs i see that the detachment failed because the vm-node was in booting stage

2023-03-05 14:18:39,144 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-92:ctx-3f7cf487 job-1657/job-1658 ctx-d6957fbc) (logid:fd682bcf) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: Could not detach volume. Probably the VM is in boot state at the moment

weizhouapache commented 1 year ago

@kiranchavala did you test with 4.17.2.0 or 4.18.0.0-RC ? The issue might have been fixed by #6508

kiranchavala commented 1 year ago

@weizhouapache i tested on 4.17.2 and did not hit the issue of volume detachment

weizhouapache commented 1 year ago

@kiranchavala OK. thanks.

I tested with 4.18.0-RC2 just now. The volume can be detached and removed when pod/pvc is removed. Even if I detach the volume without removing pod/pvc, the operation also succeed.

swhitman0 commented 1 year ago

@kiranchavala The error is different when trying to detach the volume from the Cloudstack UI or using the cli. The error is:

  "jobresult": {
    "errorcode": 530,
    "errortext": "Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: disk: /mnt/cb06f23d-d0cb-3b1d-b0b2-8b64898dde9f/5998e60f-f7b0-4f13-b4fb-13b74eb8aef7 is not attached before"
  },

weizhouapache commented 1 year ago

@kiranchavala The error is different when trying to detach the volume from the Cloudstack UI or using the cli. The error is:

  "jobresult": {
    "errorcode": 530,
    "errortext": "Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: disk: /mnt/cb06f23d-d0cb-3b1d-b0b2-8b64898dde9f/5998e60f-f7b0-4f13-b4fb-13b74eb8aef7 is not attached before"
  },

@swhitman0 it looks like the behavior is same as below

when remove a pvc, the volume was detached successfully, but got an error message: com.cloud.utils.exception.CloudRuntimeException: Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: Could not detach volume. Probably the VM is in boot state at the moment
when detach the volume, got an error message as below, which is because the volume has been detached actually in step 1.

Failed to detach volume pvc-0e82d1a8-9af2-40b3-b037-ebcf151febd3 from VM testk8s-node-186b0939378; com.cloud.exception.InternalErrorException: disk: /mnt/cb06f23d-d0cb-3b1d-b0b2-8b64898dde9f/5998e60f-f7b0-4f13-b4fb-13b74eb8aef7 is not attached before

now I believe this has been fixed by #6508 can you test with 4.18.0.0-RC2 if possible ?

swhitman0 commented 1 year ago

@kiranchavala Is there a Yum repo for release candidates? I have not yet built Cloudstack from source code.

weizhouapache commented 1 year ago

@swhitman0 yes, you can find the packages at: http://packages.shapeblue.com/cloudstack/upstream/testing/4.18.0.0-RC20230224T1301/

weizhouapache commented 1 year ago

@swhitman0 I have reproduced the issue in rocky8 environment. it works with centos7

2023-03-08 13:59:36,331 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-2:null) (logid:71e84de5) Detaching device: <disk  device='disk' type='file'>
<driver name='qemu' type='qcow2' cache='none' />
<source file='/mnt/28140a6d-390e-3e75-9cd1-431f2a8a777b/e6e94028-9772-4367-b80b-f42f689f535a'/>
<target dev='vdb' bus='virtio'/>
</disk>

2023-03-08 13:59:36,488 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-2:null) (logid:71e84de5) Failed to detach volume: e6e94028-9772-4367-b80b-f42f689f535a, due to
com.cloud.exception.InternalErrorException: Could not detach volume. Probably the VM is in boot state at the moment

I will test with 4.18.0.0-RC2

weizhouapache commented 1 year ago

@swhitman0 I did a bunch of testing on cloudstack 4.18, and got the following results

register 4.17.2 systemvm template as a normal template
- create a vm . detach volume when it is booting, failed with error message Failed to detach volume DATA-14 from VM debian11; com.cloud.exception.InternalErrorException: Could not detach volume after sending the command and waiting for [10000] milliseconds. Probably the VM does not support the sent detach command or the device is busy at the moment. Try again in a couple of minutes.

tried again, and got a different error Failed to detach volume DATA-14 from VM debian11; org.libvirt.LibvirtException: internal error: unable to execute QEMU command 'device_del': Device virtio-disk1 is already in the process of unplug

Got the same error when the vm is booted into login screen.

create a vm, wait till the login screen appear. succeed to detach volume

register 4.18.0 systemvm template as a normal template the results are same as 4.17.2 systemvm template

I have also tested 4.17.2 systemvm template on 4.17.2 environment, the results are almost same except the different error message.

considering everything works fine on centos7, I think it is an issue with rocky8/libvirt/qemu, not cloudstack. cc @swhitman0

weizhouapache commented 1 year ago

everything works fine on ubuntu22 as well

swhitman0 commented 1 year ago

@weizhouapache

Thanks for reproducing the problem on Rocky 8. Although many of our systems are centos7, we avoid it with new projects because it is near EOL. We are familiar with the Red Hat ecosystem - kickstart, Identity Management, ansible, etc. - making ubuntu22 a heavier lift, but it could be done.

Will this issue with Rocky 8, and I assume other RHEL 8-based OSes, be resolved?

weizhouapache commented 1 year ago

@swhitman0 I did a few testing on alma8. The issue can be reproduced in 4.17.2. everything works fine on 4.18.0.0-RC2

swhitman0 commented 1 year ago

@weizhouapache That is good to know. I'll upgrade to 4.18.0.0-RC2 at the first opportunity. Can I use the upgrade documentation for 4.17.2, or do you have draft 4.18 upgrade docs somewhere?

weizhouapache commented 1 year ago

@weizhouapache That is good to know. I'll upgrade to 4.18.0.0-RC2 at the first opportunity. Can I use the upgrade documentation for 4.17.2, or do you have draft 4.18 upgrade docs somewhere?

The upgrade process is similar as 4.17.2. You just need to backup database, install packages and restart cloudstack services (cloudstack-management or cloudstack-agent). The new systemvm template registration is not needed since 4.17.

swhitman0 commented 1 year ago

@weizhouapache

The upgrade was successful, and persistent volume claims can be created and deleted without issue on Rocky 8.7. Thanks for helping.

weizhouapache commented 1 year ago

@swhitman0 Great. Closing this issue then.

thanks @kiranchavala for the investigation and testing.

apache / cloudstack