ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

rbd-nbd: failed to validate encrypted pvc with error timed out waiting for the condition #2610

Closed: nixpanic closed this issue 2 years ago

nixpanic commented 2 years ago

Describe the bug

e2e testing failed with

STEP: create a PVC and bind it to an app using rbd-nbd mounter with encryption
Nov  2 11:37:17.630: INFO: waiting for kubectl (delete -f args []) to finish
Nov  2 11:37:17.630: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.185:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-7e6c1d0e delete -f -'
Nov  2 11:37:17.762: INFO: stderr: "warning: deleting cluster-scoped resources, not scoped to the provided namespace\n"
Nov  2 11:37:17.762: INFO: stdout: "storageclass.storage.k8s.io \"csi-rbd-sc\" deleted\n"
Nov  2 11:37:17.767: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fsid] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-rmbg5 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Nov  2 11:37:17.768: INFO: >>> kubeConfig: /root/.kube/config
Nov  2 11:37:19.955: INFO: Waiting up to &PersistentVolumeClaim{ObjectMeta:{rbd-pvc  rbd-8081    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*csi-rbd-sc,VolumeMode:nil,DataSource:nil,DataSourceRef:nil,},Status:PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},},} to be in Bound state
Nov  2 11:37:19.956: INFO: waiting for PVC rbd-pvc (0 seconds elapsed)
Nov  2 11:37:21.959: INFO: waiting for PVC rbd-pvc (2 seconds elapsed)
Nov  2 11:37:21.965: INFO: Waiting for PV pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace to bind to PVC rbd-pvc
Nov  2 11:37:21.965: INFO: Waiting up to timeout=10m0s for PersistentVolumeClaims [rbd-pvc] to have phase Bound
Nov  2 11:37:21.968: INFO: PersistentVolumeClaim rbd-pvc found and phase=Bound (2.445725ms)
Nov  2 11:37:21.968: INFO: Waiting up to 10m0s for PersistentVolume pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace to have phase Bound
Nov  2 11:37:21.970: INFO: PersistentVolume pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace found and phase=Bound (2.453205ms)
Nov  2 11:37:21.984: INFO: Waiting up to csi-rbd-demo-pod to be in Running state
Nov  2 11:47:21.994: FAIL: failed to validate encrypted pvc with error timed out waiting for the condition

Actual results

The following errors are repeatedly reported while doing a NodeStageVolume call:

I1102 11:37:37.811210   36472 utils.go:177] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 GRPC call: /csi.v1.Node/NodeStageVolume
I1102 11:37:37.811673   36472 utils.go:181] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["discard"]}},"access_mode":{"mode":1}},"volume_context":{"cephLogStrategy":"preserve","clusterID":"b4d40a47-9d3e-42ee-99ab-1b5764ee9852","csi.storage.k8s.io/pv/name":"pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"rbd-8081","encrypted":"true","imageFeatures":"layering","imageName":"csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108","journalPool":"replicapool","mapOptions":"debug-rbd=20,io-timeout=330","mounter":"rbd-nbd","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1635852566029-8081-rbd.csi.ceph.com"},"volume_id":"0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108"}
I1102 11:37:37.811936   36472 rbd_util.go:1316] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 setting disableInUseChecks: false image features: [layering] mounter: rbd-nbd
I1102 11:37:37.813530   36472 omap.go:87] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 got omap values: (pool="replicapool", namespace="", name="csi.volume.3d0a566e-3bd1-11ec-bd33-12dafb77c108"): map[csi.imageid:115e930a5abd csi.imagename:csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108 csi.volname:pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace csi.volume.encryptKMS:default csi.volume.owner:rbd-8081]
E1102 11:37:37.813786   36472 util.go:247] kernel 4.19.202 does not support required features
I1102 11:37:38.413445   36472 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 command succeeded: rbd [device list --format=json --device-type nbd]
I1102 11:37:38.434804   36472 rbd_attach.go:330] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 rbd: map mon rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789
I1102 11:37:39.177691   36472 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 command succeeded: rbd [--id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=***stripped*** --log-file /var/log/ceph/rbd-nbd-0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108.log map replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108 --device-type nbd --options try-netlink --options reattach-timeout=300 --options io-timeout=0]
I1102 11:37:39.177748   36472 nodeserver.go:401] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 rbd image: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108/replicapool was successfully mapped at /dev/nbd0
I1102 11:37:39.209052   36472 encryption.go:80] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 image replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108 encrypted state metadata reports "encryptionPrepared"
I1102 11:37:39.209085   36472 mount_linux.go:463] Attempting to determine if disk "/dev/nbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nbd0])
I1102 11:37:39.212830   36472 mount_linux.go:466] Output: ""
I1102 11:37:39.212879   36472 crypto.go:199] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 Encrypting device /dev/nbd0 with LUKS
E1102 11:37:40.079853   36472 encryption.go:204] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
I1102 11:37:40.637250   36472 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 command succeeded: rbd [unmap /dev/nbd0 --device-type nbd]
E1102 11:37:40.637440   36472 utils.go:186] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 GRPC error: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]

Expected behavior

Encrypting the /dev/nbd0 device should not fail, and NodeStageVolume should succeed.

Logs

The logs of the failed job are marked for keeping and can be found at mini-e2e_k8s-1.20/2974.

Kernel and kubelet messages from the minikube system logs:

Nov 02 11:37:39 minikube kernel: block nbd0: Connection timed out
Nov 02 11:37:39 minikube kernel: block nbd0: shutting down sockets
Nov 02 11:37:39 minikube kernel: print_req_error: I/O error, dev nbd0, sector 128
Nov 02 11:37:39 minikube kernel: print_req_error: I/O error, dev nbd0, sector 128
Nov 02 11:37:39 minikube kernel: Buffer I/O error on dev nbd0, logical block 16, async page read
Nov 02 11:37:39 minikube kernel: block nbd0: NBD_DISCONNECT
Nov 02 11:37:39 minikube kernel: block nbd0: Send disconnect failed -32
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: Dev nbd0: unable to read RDB block 0
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel:  nbd0: unable to read partition table
Nov 02 11:37:40 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, async page read
Nov 02 11:37:40 minikube kernel: Dev nbd0: unable to read RDB block 0
Nov 02 11:37:40 minikube kernel:  nbd0: unable to read partition table
Nov 02 11:37:40 minikube kubelet[4857]: E1102 11:37:40.637853    4857 csi_attacher.go:306] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
Nov 02 11:37:40 minikube kubelet[4857]: E1102 11:37:40.638292    4857 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 podName: nodeName:}" failed. No retries permitted until 2021-11-02 11:37:41.138229748 +0000 UTC m=+1086.160440397 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-778afe5f-9dcb-4762-88b1-8bcab7486ace\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108\") pod \"csi-rbd-demo-pod\" (UID: \"d5687432-6126-4bf2-885f-568c902d3ea6\") : rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]"
pkalever commented 2 years ago

Hit this with: https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e_k8s-1.21/runs/1657/nodes/91/steps/94/log/?start=0

https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e-helm_k8s-1.22/runs/932/nodes/97/steps/100/log/?start=0

https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e-helm_k8s-1.21/runs/1642/nodes/97/steps/100/log/?start=0

nixpanic commented 2 years ago

This is really being hit a lot; not constantly, but still very often.

Maybe we should skip the test for now, so that PRs do not need the frequent /retest ... anymore?
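
If we do skip it, something like the following Ginkgo sketch would work (illustrative only; it assumes the encryption check has its own It block, which is not exactly how the e2e code is structured):

package e2e

import (
    . "github.com/onsi/ginkgo"
)

var _ = Describe("RBD", func() {
    // Illustrative stand-in for the existing e2e case.
    It("create a PVC and bind it to an app using rbd-nbd mounter with encryption", func() {
        // Temporarily skipped until the LUKS formatting failure on /dev/nbd0
        // is understood; see #2610.
        Skip("rbd-nbd encryption validation is flaky, see #2610")
    })
})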

pkalever commented 2 years ago

@nixpanic this will be the first thing I look at once I'm back from holidays, i.e. 8th Nov. Why are we hitting this so frequently? What changed recently? This test has been around for a while now.

nixpanic commented 2 years ago

I do not know why we have started hitting this so frequently, and I am not sure what changed. It could be that a new Ceph base image ships an updated rbd-nbd or other components that handle failures differently.

My current suspicion is that something goes wrong when NodeStageVolume fails to detect the filesystem on the RBD image (kernel logs are now in the description of this issue). When NodeStageVolume fails after mapping the RBD image, the image should be unmapped so that the next attempt starts with a cleanly mapped RBD image. I think the already-mapped image is being reused in the retry, but that mapping has already returned I/O errors and continues to do so.

humblec commented 2 years ago

True, I have seen this hit on recent PRs too; not sure what changed recently though. For example: https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e-helm_k8s-1.22/runs/931/nodes/97/steps/100/log/?start=0

pkalever commented 2 years ago

Thanks for sharing your understanding @nixpanic, I will take a detailed look later as mentioned. If we are not able to solve this (say, by next weekend?), we will skip this test.

nixpanic commented 2 years ago

I have probably spotted the issue:

https://github.com/ceph/ceph-csi/blob/b95f3cdcbc1d3659afd48551e96744fd988db6bd/internal/rbd/nodeserver.go#L341-L359

On line 341 a transaction is created. This is passed to the deferred undoStagingTransaction() function when an error in the NodeStageVolume procedure is detected. So far, so good.

However, on line 356 a new transaction is returned. This new transaction is not used for the defer call.

So either a pointer to the transaction should be used and passed to undoStagingTransaction(), or the defer-on-error should be set up after calling stageTransaction(), removing the use of the transaction created on line 341.
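
To illustrate with a minimal, self-contained sketch (the names below are simplified stand-ins, not the actual ceph-csi types): arguments of a deferred call are evaluated when the defer statement runs, so a transaction passed by value at that point is frozen, and the one returned later by stageTransaction() never reaches the cleanup.

package main

import (
    "errors"
    "fmt"
)

// transaction is a simplified stand-in for the staging transaction tracked in
// NodeStageVolume (the real type lives in internal/rbd/nodeserver.go).
type transaction struct {
    devicePath string
}

// undoStagingTransaction stands in for the cleanup that should unmap the device.
func undoStagingTransaction(t transaction) {
    fmt.Printf("cleanup sees devicePath=%q\n", t.devicePath)
}

// stageTransaction pretends to map the image and then fail, e.g. in luksFormat.
func stageTransaction() (transaction, error) {
    return transaction{devicePath: "/dev/nbd0"}, errors.New("cryptsetup luksFormat failed")
}

// buggy mirrors the current flow: the transaction is passed as an argument to
// the deferred call, so the empty value created up front (line 341) is
// snapshotted and the transaction returned by stageTransaction() is ignored.
func buggy() error {
    trans := transaction{}
    var err error
    defer func(t transaction) {
        if err != nil {
            undoStagingTransaction(t) // prints devicePath="", /dev/nbd0 stays mapped
        }
    }(trans)

    trans, err = stageTransaction()
    return err
}

// fixed defers a closure that reads the variable (equivalently, a pointer to
// the transaction could be passed), so the cleanup sees the returned value.
func fixed() error {
    trans := transaction{}
    var err error
    defer func() {
        if err != nil {
            undoStagingTransaction(trans) // prints devicePath="/dev/nbd0"
        }
    }()

    trans, err = stageTransaction()
    return err
}

func main() {
    _ = buggy()
    _ = fixed()
}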

pkalever commented 2 years ago

@nixpanic yes, makes sense. Thanks!

Madhu-1 commented 2 years ago

E1102 11:37:40.079853 36472 encryption.go:204] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
I1102 11:37:40.637250 36472 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 command succeeded: rbd [unmap /dev/nbd0 --device-type nbd]
E1102 11:37:40.637440 36472 utils.go:186] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 GRPC error: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]

@nixpanic @pkalever #2618 is not the fix for this issue, is it? The defer is a safety net that unstages in the normal case, but what if the plugin is restarted before the defer runs?

pkalever commented 2 years ago

@nixpanic @pkalever #2618 is not the fix for this issue, is it? The defer is a safety net that unstages in the normal case, but what if the plugin is restarted before the defer runs?

We are hitting this even with #2618, as mentioned in the PR comments. Not sure why the cryptsetup format of the device is failing, which was not the case before.

I1103 14:28:32.431736   59770 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 command succeeded: rbd [--id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=***stripped*** --log-file /var/log/ceph/rbd-nbd-0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586.log map replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 --device-type nbd --options try-netlink --options reattach-timeout=300 --options io-timeout=0]
I1103 14:28:32.431783   59770 nodeserver.go:397] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 rbd image: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586/replicapool was successfully mapped at /dev/nbd0
I1103 14:28:32.467204   59770 encryption.go:80] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 image replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 encrypted state metadata reports "encryptionPrepared"
I1103 14:28:32.467251   59770 mount_linux.go:463] Attempting to determine if disk "/dev/nbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nbd0])
I1103 14:28:32.471026   59770 mount_linux.go:466] Output: ""
I1103 14:28:32.471098   59770 crypto.go:199] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 Encrypting device /dev/nbd0 with LUKS
E1103 14:28:41.461191   59770 encryption.go:204] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 failed to encrypt volume replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
I1103 14:28:42.017124   59770 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 command succeeded: rbd [unmap /dev/nbd0 --device-type nbd]
E1103 14:28:42.017320   59770 utils.go:186] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 GRPC error: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt volume replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]

Were there any behavioural code changes to LuksFormat recently?

Madhu-1 commented 2 years ago

@nixpanic @pkalever #2618 is not the fix for this issue, is it? The defer is a safety net that unstages in the normal case, but what if the plugin is restarted before the defer runs?

We are hitting this even with #2618, as mentioned in the PR comments. Not sure why the cryptsetup format of the device is failing, which was not the case before.

I1103 14:28:32.431736   59770 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 command succeeded: rbd [--id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=***stripped*** --log-file /var/log/ceph/rbd-nbd-0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586.log map replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 --device-type nbd --options try-netlink --options reattach-timeout=300 --options io-timeout=0]
I1103 14:28:32.431783   59770 nodeserver.go:397] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 rbd image: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586/replicapool was successfully mapped at /dev/nbd0
I1103 14:28:32.467204   59770 encryption.go:80] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 image replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 encrypted state metadata reports "encryptionPrepared"
I1103 14:28:32.467251   59770 mount_linux.go:463] Attempting to determine if disk "/dev/nbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nbd0])
I1103 14:28:32.471026   59770 mount_linux.go:466] Output: ""
I1103 14:28:32.471098   59770 crypto.go:199] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 Encrypting device /dev/nbd0 with LUKS
E1103 14:28:41.461191   59770 encryption.go:204] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 failed to encrypt volume replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
I1103 14:28:42.017124   59770 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 command succeeded: rbd [unmap /dev/nbd0 --device-type nbd]
E1103 14:28:42.017320   59770 utils.go:186] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 GRPC error: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt volume replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]

Were there any behavioural code changes to LuksFormat recently?

I don't think so. I sent https://github.com/ceph/ceph-csi/pull/2621 to log stderr for better error reporting.
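
For reference, a minimal sketch of what capturing stderr could look like (a hypothetical wrapper with assumed names, not the actual change from #2621); the cryptsetup arguments match the ones from the logs above, and the point is that the returned error carries the real failure reason instead of just "exit status 1":

package util

import (
    "bytes"
    "fmt"
    "os/exec"
    "strings"
)

// luksFormat is a hypothetical wrapper around the cryptsetup invocation seen
// in the logs above. On failure it includes stderr in the returned error, so
// the GRPC error reports why luksFormat failed.
func luksFormat(devicePath, passphrase string) error {
    cmd := exec.Command("cryptsetup", "-q", "luksFormat", "--type", "luks2",
        "--hash", "sha256", devicePath, "-d", "/dev/stdin")
    cmd.Stdin = strings.NewReader(passphrase)

    var stderr bytes.Buffer
    cmd.Stderr = &stderr

    if err := cmd.Run(); err != nil {
        return fmt.Errorf("cryptsetup luksFormat of %s failed: %w, stderr: %q",
            devicePath, err, stderr.String())
    }

    return nil
}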