Closed: nixpanic closed this issue 2 years ago
This is really being hit a lot; not constantly, but still very often. Maybe we should skip the test for now, so that PRs do not need the frequent /retest anymore?
@nixpanic this will be the first thing I look at once I'm back from holidays, i.e. 8th Nov. Why are we hitting this so frequently? What changed recently? I remember this test has been there for a while now.
I do not know why we started to hit this so frequently; I am not sure what changed. It could be that a new Ceph base image with an updated rbd-nbd or other components handles failures differently.
My current suspicion is that there is a problem when NodeStageVolume fails to detect the filesystem on the RBD image (kernel logs are now in the description of this issue). When NodeStageVolume fails after mapping the RBD image, the image should be unmapped, so that the next try can start with a cleanly mapped RBD image again. I think the already-mapped RBD image is reused in the retry, but that mapped image has already returned I/O errors, and continues to do so.
True, I did see this hit on recent PRs too. Not sure what changed recently, though. For example: https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e-helm_k8s-1.22/runs/931/nodes/97/steps/100/log/?start=0
Thanks for sharing your understanding, @nixpanic. I will take a detailed look later, as mentioned. If we are not able to solve this (say, by next weekend?), we will skip this test.
I have probably spotted the issue:

On line 341 a transaction is created. This is passed to the deferred undoStagingTransaction() function when an error in the NodeStageVolume procedure is detected. So far, so good.

However, on line 356 a new transaction is returned. This new transaction is not used for the defer call.

So either a pointer to the transaction could be used, with that pointer passed to undoStagingTransaction(), or the defer-after-error should be set after calling stageTransaction(), removing the use of the transaction from line 341.
@nixpanic yes, makes sense. Thanks!
E1102 11:37:40.079853 36472 encryption.go:204] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
I1102 11:37:40.637250 36472 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 command succeeded: rbd [unmap /dev/nbd0 --device-type nbd]
E1102 11:37:40.637440 36472 utils.go:186] ID: 7 Req-ID: 0001-0024-b4d40a47-9d3e-42ee-99ab-1b5764ee9852-0000000000000001-3d0a566e-3bd1-11ec-bd33-12dafb77c108 GRPC error: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt volume replicapool/csi-vol-3d0a566e-3bd1-11ec-bd33-12dafb77c108: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
@nixpanic @pkalever #2618 is not the fix for this issue, is it? defer is a safe way to do the unstage in the normal case, but what if the plugin is restarted before hitting the defer?
We are hitting this even with #2618, as mentioned in the PR comments. Not sure why cryptsetup formatting the device is failing, which was not the case before.
I1103 14:28:32.431736 59770 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 command succeeded: rbd [--id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=***stripped*** --log-file /var/log/ceph/rbd-nbd-0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586.log map replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 --device-type nbd --options try-netlink --options reattach-timeout=300 --options io-timeout=0]
I1103 14:28:32.431783 59770 nodeserver.go:397] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 rbd image: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586/replicapool was successfully mapped at /dev/nbd0
I1103 14:28:32.467204 59770 encryption.go:80] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 image replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 encrypted state metadata reports "encryptionPrepared"
I1103 14:28:32.467251 59770 mount_linux.go:463] Attempting to determine if disk "/dev/nbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nbd0])
I1103 14:28:32.471026 59770 mount_linux.go:466] Output: ""
I1103 14:28:32.471098 59770 crypto.go:199] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 Encrypting device /dev/nbd0 with LUKS
E1103 14:28:41.461191 59770 encryption.go:204] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 failed to encrypt volume replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
I1103 14:28:42.017124 59770 cephcmds.go:62] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 command succeeded: rbd [unmap /dev/nbd0 --device-type nbd]
E1103 14:28:42.017320 59770 utils.go:186] ID: 7 Req-ID: 0001-0024-8b62cc85-f86c-4761-8e3a-ed8a06725fc0-0000000000000002-4ba30f5c-3cb2-11ec-b55e-12aaf5056586 GRPC error: rpc error: code = Internal desc = failed to encrypt rbd image replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt volume replicapool/csi-vol-4ba30f5c-3cb2-11ec-b55e-12aaf5056586: failed to encrypt device /dev/nbd0 with LUKS: an error (exit status 1) occurred while running cryptsetup args: [-q luksFormat --type luks2 --hash sha256 /dev/nbd0 -d /dev/stdin]
Any behavioural code changes with LuksFormat recently?
I don't think so. Sent https://github.com/ceph/ceph-csi/pull/2621 to log stderr for better logging.
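Including a failed command's stderr in the returned error is what makes failures like this cryptsetup one diagnosable from the CSI logs. A minimal sketch of the idea in Go; runWithStderr is a hypothetical helper, not the actual code from #2621:

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runWithStderr runs an external command and, on failure, wraps the
// captured stderr into the returned error, so a log line such as
// "an error (exit status 1) occurred while running cryptsetup"
// also shows WHY the command failed.
func runWithStderr(name string, args ...string) error {
	var stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("%s failed: %w, stderr: %q",
			name, err, stderr.String())
	}
	return nil
}

func main() {
	// Example: a command that writes to stderr and exits non-zero.
	err := runWithStderr("sh", "-c", "echo boom >&2; exit 1")
	fmt.Println(err)
}
```

With this pattern, the GRPC error above would carry cryptsetup's own diagnostic text instead of only "exit status 1".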
Describe the bug

e2e testing failed with

Actual results

The following errors are repeatedly reported while doing a NodeStageVolume call:

Expected behavior

Encrypting the /dev/nbd0 device should not fail, and NodeStageVolume should succeed.

Logs

The logs of the failed job are marked for keeping and can be found at mini-e2e_k8s-1.20/2974.

minikube logs from log system status: