ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Fail to mount RBD PVC with error Internal desc = exit status 2 #1217

Closed Ayden-Franklin closed 4 years ago

Ayden-Franklin commented 4 years ago

Describe the bug

I deployed a Ceph cluster using cephadm successfully. I can create a StorageClass and PVC with CephFS, but RBD fails.

Environment details

Steps to reproduce

Steps to reproduce the behavior:

  1. Setup details: deploy a Kubernetes cluster, install cephadm and deploy the Ceph cluster (3 nodes), then try to use RBD following the guide
  2. Deployment to trigger the issue: Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[default-token-jn7tw data]: timed out waiting for the condition
  3. See error

Actual results

Pod status stays in ContainerCreating.

Expected behavior

Pod can mount the PVC.

Logs

Pod event:

Normal   Scheduled               18m                  default-scheduler        Successfully assigned mars/pod-with-raw-block-volume to 10.0.0.202
  Normal   SuccessfulAttachVolume  18m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4"
  Warning  FailedMount             9m30s (x2 over 11m)  kubelet, 10.0.0.202      Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data default-token-jn7tw]: timed out waiting for the condition
  Warning  FailedMapVolume         100s (x16 over 18m)  kubelet, 10.0.0.202      MapVolume.SetUpDevice failed for volume "pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4" : rpc error: code = Internal desc = exit status 2
  Warning  FailedMount             26s (x6 over 16m)    kubelet, 10.0.0.202      Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[default-token-jn7tw data]: timed out waiting for the condition

Kubelet log:

Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.733346    3119 clientconn.go:106] parsed scheme: ""
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.733358    3119 clientconn.go:106] scheme "" not registered, fallback to default scheme
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.733374    3119 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.733382    3119 clientconn.go:933] ClientConn switching balancer to "pick_first"
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.733423    3119 clientconn.go:882] blockingPicker: the picked transport is not ready, loop back to repick
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.736950    3119 clientconn.go:106] parsed scheme: ""
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.736960    3119 clientconn.go:106] scheme "" not registered, fallback to default scheme
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.736973    3119 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.736980    3119 clientconn.go:933] ClientConn switching balancer to "pick_first"
Jul 09 10:33:09 imagego-test-0002 kubelet[3119]: I0709 10:33:09.737004    3119 clientconn.go:882] blockingPicker: the picked transport is not ready, loop back to repick
Jul 09 10:33:10 imagego-test-0002 kubelet[3119]: E0709 10:33:10.086394    3119 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d podName: nodeName:}" failed. No retries permitted until 2020-07-09 10:35:12.086366695 +0800 CST m=+167228.245629512 (durationBeforeRetry 2m2s). Error: "MapVolume.SetUpDevice failed for volume \"pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d\") pod \"pod-with-raw-block-volume\" (UID: \"2b7a5741-2fee-4bc9-a440-c218112b686a\") : rpc error: code = Internal desc = exit status 2"
Jul 09 10:33:49 imagego-test-0002 kubelet[3119]: E0709 10:33:49.563137    3119 kubelet.go:1681] Unable to attach or mount volumes for pod "pod-with-raw-block-volume_mars(2b7a5741-2fee-4bc9-a440-c218112b686a)": unmounted volumes=[data], unattached volumes=[default-token-jn7tw data]: timed out waiting for the condition; skipping pod
Jul 09 10:33:49 imagego-test-0002 kubelet[3119]: E0709 10:33:49.563166    3119 pod_workers.go:191] Error syncing pod 2b7a5741-2fee-4bc9-a440-c218112b686a ("pod-with-raw-block-volume_mars(2b7a5741-2fee-4bc9-a440-c218112b686a)"), skipping: unmounted volumes=[data], unattached volumes=[default-token-jn7tw data]: timed out waiting for the condition
Jul 09 10:35:12 imagego-test-0002 kubelet[3119]: I0709 10:35:12.108777    3119 operation_generator.go:950] MapVolume.WaitForAttach entering for volume "pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d") pod "pod-with-raw-block-volume" (UID: "2b7a5741-2fee-4bc9-a440-c218112b686a") DevicePath ""
Jul 09 10:35:12 imagego-test-0002 kubelet[3119]: I0709 10:35:12.110941    3119 operation_generator.go:959] MapVolume.WaitForAttach succeeded for volume "pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d") pod "pod-with-raw-block-volume" (UID: "2b7a5741-2fee-4bc9-a440-c218112b686a") DevicePath "csi-fa9d3f5524a7cf5f62e8f3df5a8710426944870b866c57fd437624f1f2480c89"

csi-rbdplugin log:

I0709 02:06:23.305910   21535 utils.go:159] ID: 25 GRPC call: /csi.v1.Identity/Probe
I0709 02:06:23.305922   21535 utils.go:160] ID: 25 GRPC request: {}
I0709 02:06:23.306322   21535 utils.go:165] ID: 25 GRPC response: {}
I0709 02:06:38.310790   21535 utils.go:159] ID: 26 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0709 02:06:38.310803   21535 utils.go:160] ID: 26 GRPC request: {}
I0709 02:06:38.311247   21535 utils.go:165] ID: 26 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}}]}
I0709 02:06:38.314646   21535 utils.go:159] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d GRPC call: /csi.v1.Node/NodeStageVolume
I0709 02:06:38.314655   21535 utils.go:160] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"1e883026-c03d-11ea-9dd5-00163e101e9f","imageName":"csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d","journalPool":"rbd","pool":"rbd","storage.kubernetes.io/csiProvisionerIdentity":"1594195443622-8081-rbd.csi.ceph.com"},"volume_id":"0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d"}
I0709 02:06:38.315790   21535 rbd_util.go:806] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d setting disableInUseChecks on rbd volume to: false
I0709 02:06:38.316244   21535 omap.go:74] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d got omap values: (pool="rbd", namespace="", name="csi.volume.360233ac-c184-11ea-aa46-d6eec0877e2d"): map[csi.imageid:5ee19ac08555 csi.imagename:csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d csi.volname:pvc-72e30530-4472-451d-b7bb-cdf2d1cd9ac4]
E0709 02:06:38.316348   21535 util.go:250] kernel 4.18.0-147.8.1.el8_1.x86_64 does not support required features
I0709 02:06:38.358623   21535 rbd_util.go:322] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d executing [rbd task add flatten rbd/csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d --id kubernetes --keyfile=/tmp/csi/keys/keyfile-260375200 -m 10.0.0.201:6789,10.0.0.202:6789,10.0.0.203:6789] for image (csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d) using mon 10.0.0.201:6789,10.0.0.202:6789,10.0.0.203:6789, pool rbd
W0709 02:06:38.668018   21535 rbd_util.go:336] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d uncaught error while scheduling a task: exit status 2
E0709 02:06:38.668037   21535 rbd_util.go:464] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d failed to add task flatten for rbd/csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d : exit status 2
E0709 02:06:38.668142   21535 utils.go:163] ID: 27 Req-ID: 0001-0024-1e883026-c03d-11ea-9dd5-00163e101e9f-0000000000000002-360233ac-c184-11ea-aa46-d6eec0877e2d GRPC error: rpc error: code = Internal desc = exit status 2

Additional context

There was another issue yesterday, driver name rbd.csi.ceph.com not found in the list of registered CSI drivers. I read issue #1105, recreated the rbd-plugin pod, and then it worked fine.


Madhu-1 commented 4 years ago

@Macrame please use the latest version of cephcsi (v2.1.2) and let us know if the issue still exists.

Madhu-1 commented 4 years ago

From the log it looks like you are using the canary image. Let me take a look at it.

Ayden-Franklin commented 4 years ago

I am sorry, I am a bit puzzled. The version information: driver-registrar:

image: quay.io/k8scsi/csi-node-driver-registrar:v1.3.0

csi-rbdplugin

image: quay.io/cephcsi/cephcsi:canary

Is this the latest version? I cannot find a v2.1.2 version of the driver-registrar.

Madhu-1 commented 4 years ago

Yes, this is the latest version. Can you please provide the PVC template used for your testing?

Madhu-1 commented 4 years ago

And also please provide the rbd info of the image from the Ceph cluster.

Ayden-Franklin commented 4 years ago

I just used the example from the Ceph documentation, only changing the namespace and pool names. csi-rbd-sc.yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
   clusterID: 1e883026-c03d-11ea-9dd5-00163e101e9f
   pool: rbd
   csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
   csi.storage.k8s.io/provisioner-secret-namespace: mars
   csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
   csi.storage.k8s.io/node-stage-secret-namespace: mars
reclaimPolicy: Delete
mountOptions:
   - discard

raw-block-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
  namespace: mars
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd-sc

rbd info

rbd info csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d
rbd image 'csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d':
    size 1 GiB in 256 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 5ee19ac08555
    block_name_prefix: rbd_data.5ee19ac08555
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features: 
    flags: 
    create_timestamp: Thu Jul  9 09:33:37 2020
    access_timestamp: Thu Jul  9 09:33:37 2020
    modify_timestamp: Thu Jul  9 09:33:37 2020

Thanks for your response.

Madhu-1 commented 4 years ago

@dillaman if we don't provide the image features when creating an rbd image, will these features come as defaults in Octopus?

Madhu-1 commented 4 years ago

@Macrame can you please set imageFeatures: layering in the StorageClass, create a new PVC, and check whether that helps?

Madhu-1 commented 4 years ago

Here is the example https://github.com/ceph/ceph-csi/blob/4e82384b43478b4f034b41e90760ed1a9d962166/examples/rbd/storageclass.yaml#L30
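
For reference, a minimal sketch of recreating the reporter's StorageClass with only the layering feature (StorageClass parameters are immutable, so it has to be deleted and re-applied; the name, pool and clusterID below are copied from the csi-rbd-sc.yaml above and need adjusting for other clusters):

kubectl delete storageclass csi-rbd-sc
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: 1e883026-c03d-11ea-9dd5-00163e101e9f
  pool: rbd
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: mars
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: mars
reclaimPolicy: Delete
mountOptions:
  - discard
EOF

Only PVCs created after the recreation pick up the new parameter; existing images still need their extra features disabled by hand.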

Ayden-Franklin commented 4 years ago

I see. I am deleting them and creating them again, but the previous issue appears again.

MapVolume.SetUpDevice failed for volume "pvc-7fcd65f2-42c6-4ab1-a95d-e9eb27de22f4" : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name rbd.csi.ceph.com not found in the list of registered CSI drivers

I am restarting the rbd-plugin pod.

Madhu-1 commented 4 years ago

For the above error, check the driver-registrar logs; they might help.
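
If it helps, one way to pull those logs (the pod name is a placeholder; adjust the namespace to wherever the rbd nodeplugin DaemonSet runs):

kubectl get pods -o wide | grep csi-rbdplugin   # find the nodeplugin pod on the affected node
kubectl logs csi-rbdplugin-xxxxx -c driver-registrar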

Ayden-Franklin commented 4 years ago

This issue can be resolved by adding imageFeatures: layering. The driver name rbd.csi.ceph.com not found issue only occurs on some of the nodes; only one node can run the pod successfully. driver-registrar logs:

I0709 03:19:46.396924   27642 main.go:110] Version: v1.3.0-0-g6e9fff3e
I0709 03:19:46.396985   27642 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0709 03:19:46.397005   27642 connection.go:151] Connecting to unix:///csi/csi.sock
I0709 03:19:47.397487   27642 main.go:127] Calling CSI driver to discover driver name
I0709 03:19:47.397511   27642 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0709 03:19:47.397518   27642 connection.go:181] GRPC request: {}
I0709 03:19:47.400319   27642 connection.go:183] GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"canary"}
I0709 03:19:47.400617   27642 connection.go:184] GRPC error: <nil>
I0709 03:19:47.400625   27642 main.go:137] CSI driver name: "rbd.csi.ceph.com"
I0709 03:19:47.400676   27642 node_register.go:51] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I0709 03:19:47.400762   27642 node_register.go:60] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock

The log looks normal, but driver name rbd.csi.ceph.com not found in the list of registered CSI drivers still occurs.

Ayden-Franklin commented 4 years ago

Ok, I should open a new issue for driver name rbd.csi.ceph.com not found in the list of registered CSI drivers. Let me close this issue. Thanks for your help.

ankitladhania commented 4 years ago

@Madhu-1 I'm getting the same error: rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten kubernetes/csi-vol-4acc48c5-d61d-11ea-bc35-ba86e3e2c39d --id kubernetes --keyfile=stripped -m 10.8.0.14:6789,10.8.0.18:6789

But the cluster was working previously. I had to spawn a new node where I'm seeing this issue.

Any help is really appreciated.

My setup is exactly like that of Macrame but I'm not getting the driver name rbd.csi.ceph.com not found in the list of registered CSI drivers error

Madhu-1 commented 4 years ago

@ankitladhania do you have imageFeatures: layering in your StorageClass?

ankitladhania commented 4 years ago

No. But when I try to update it in the existing StorageClass, it does not allow me to.

How do I update the current storage? We have hundreds of users' storage data; I cannot delete it, we need to convert it.

Madhu-1 commented 4 years ago

Deleting and recreating only the StorageClass helps.

ankitladhania commented 4 years ago

I deleted and recreated the storageclass but the old volumes still give the same error.

MountVolume.MountDevice failed for volume "pvc-6194351b-3fd4-4f76-a1ec-2e5537709b86" : rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten kubernetes/csi-vol-271fdd03-c1b5-11ea-a1b3-7e6749822114 --id kubernetes --keyfile=stripped -m 10.8.0.14:6789,10.8.0.18:6789]

Madhu-1 commented 4 years ago

Please disable the image features other than layering on the older rbd images.

ankitladhania commented 4 years ago

Any way to disable them en masse? I have 464 volumes to do this with.

Madhu-1 commented 4 years ago

Any way to disable them en masse? I have 464 volumes to do this with.

You need to write a small script for it.

ankitladhania commented 4 years ago

Thanks @Madhu-1, wrote a small script to do the same:

for vol in $(rbd ls kubernetes); do
  # print the image name, then strip every feature except layering
  echo "$vol"
  rbd feature disable "kubernetes/$vol" exclusive-lock object-map fast-diff deep-flatten
done

@Madhu-1 - can you tell me the reason why this happened? Should we be promoting the images like this? My team read the code from the documentation that was provided; there could be a small mention to use "vx.x.x" for production.

Madhu-1 commented 4 years ago

My team read the code from the documentation that was provided; there could be a small mention to use "vx.x.x" for production.

This can happen if you don't provide imageFeatures: layering in the StorageClass; the image then gets created with the default options in Ceph.
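
A quick way to confirm which features an image actually got (the image name is taken from the rbd info output earlier in this thread):

rbd info rbd/csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d | grep features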

yangyongzhi7 commented 3 years ago

Which version should I use for my production?

Madhu-1 commented 3 years ago

Which version should I use for my production?

The latest release is always better; refer to https://github.com/ceph/ceph-csi/releases for the latest releases.

yangyongzhi7 commented 3 years ago

The same error message has arisen, just like this:

Warning  FailedMount  0s (x6 over 17s)  kubelet, 10.16.247.82  MountVolume.MountDevice failed for volume "pvc-b61c9537-57a5-4f58-86d2-1931e1565c3c" : rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten kubernetes/csi-vol-84e788e8-44fb-11eb-acbc-2a1bf25e3414 --id kubernetes --keyfile=stripped -m 1.1.1.1:6789]

Madhu-1 commented 3 years ago

Are you specifying imageFeatures in the StorageClass?

yangyongzhi7 commented 3 years ago

Are you specifying imageFeatures in the StorageClass?

No, I haven't added it yet. Let me try as you said; I will tell you the result as soon as I have finished.

yangyongzhi7 commented 3 years ago

Are you specifying imageFeatures in the StorageClass?

Yes, it really is as you said; the pod has been bound to a PV successfully. We should update the content about this issue to avoid confusing other users: https://docs.ceph.com/en/latest/rbd/rbd-kubernetes/

Thank you, much appreciated. BTW, Merry Christmas!

phucnv282 commented 1 year ago

Hi @Madhu-1, I faced the same issue: MountVolume.MountDevice failed for volume "pvc-xxx" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 22) occurred while running rbd args: [--id csi-rbd-node -m x.x.x.x:3300,x.x.x.x:3300,x.x.x.x:3300 --keyfile=***stripped*** map blockpool/csi-vol-xxxx --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed. I tried adding the imageFeatures option but it didn't help.

Madhu-1 commented 1 year ago

@phucnv282 check dmesg on the node and also please open a new issue with required logs.
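
For example, something like this on the node where the map failed usually surfaces the kernel-side rbd error:

dmesg -T | grep -iE 'rbd|libceph' | tail -n 20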

phucnv282 commented 1 year ago

I provisioned Ceph on K8s and connected to it as an external cluster. What does dmesg do?

phoerious commented 11 months ago

@Madhu-1 Same problem here. I've been googling like mad, but I cannot find a solution that works.

I am running Quincy and Kubernetes 1.28. The CSI driver is deployed via the Helm chart. I've created the following StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: rbd.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
mountOptions:
 - discard
parameters:
  clusterID: XXX
  pool: rbd.k8s-pvs
  imageFeatures: "layering"
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-rbd
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi-rbd
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi-rbd

I can also verify that the node-plugin pod with three containers is running on the node that I want to bind to, and there is an rbd.csi.ceph.com-reg.sock in /var/lib/kubelet/plugins_registry.

Provisioning the PV works flawlessly:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd

PVC events:

  Type    Reason                 Age    From                                                                                             Message
  ----    ------                 ----   ----                                                                                             -------
  Normal  ExternalProvisioning   7m34s  persistentvolume-controller                                                                      Waiting for a volume to be created either by the external provisioner 'rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  Provisioning           7m34s  rbd.csi.ceph.com_ceph-csi-rbd-provisioner-589cdbb858-5qj7t_e26cacee-5c25-4b46-afe2-9d7362844514  External provisioner is provisioning volume for claim "ceph-csi-rbd/rbd-pvc"
  Normal  ProvisioningSucceeded  7m34s  rbd.csi.ceph.com_ceph-csi-rbd-provisioner-589cdbb858-5qj7t_e26cacee-5c25-4b46-afe2-9d7362844514  Successfully provisioned volume pvc-3b2820f9-b137-4f12-baaf-b537f73fb56a

But binding fails:

apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: XXX
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false

Pod events:

  Type     Reason                  Age              From                     Message
  ----     ------                  ----             ----                     -------
  Normal   Scheduled               11s              default-scheduler        Successfully assigned ceph-csi-rbd/csi-rbd-demo-pod to XXX
  Normal   SuccessfulAttachVolume  11s              attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3b2820f9-b137-4f12-baaf-b537f73fb56a"
  Warning  FailedMount             0s (x4 over 4s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-3b2820f9-b137-4f12-baaf-b537f73fb56a" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name rbd.csi.ceph.com not found in the list of registered CSI drivers

I cannot find anything peculiar in the system logs and I've run out of ideas where to look. Please send help!

Madhu-1 commented 11 months ago

@phoerious a couple of things:

phoerious commented 11 months ago

Yes, it's vanilla k8s and --root-dir has not been set explicitly, so it's the default of /var/lib/kubelet, and everything else is also in there.

These are the relevant YAML snippets from kubectl get csidriver:

apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
  kind: CSIDriver
  metadata:
    annotations:
      meta.helm.sh/release-name: ceph-csi-cephfs
      meta.helm.sh/release-namespace: ceph-csi-cephfs
    creationTimestamp: "2023-11-06T16:48:56Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: cephfs.csi.ceph.com
    resourceVersion: "3417670156"
    uid: b3ead413-bc5f-4a51-89a0-2c32adafec31
  spec:
    attachRequired: false
    fsGroupPolicy: File
    podInfoOnMount: false
    requiresRepublish: false
    seLinuxMount: false
    storageCapacity: false
    volumeLifecycleModes:
    - Persistent
- apiVersion: storage.k8s.io/v1
  kind: CSIDriver
  metadata:
    annotations:
      meta.helm.sh/release-name: ceph-csi-rbd
      meta.helm.sh/release-namespace: ceph-csi-rbd
    creationTimestamp: "2023-11-06T14:54:56Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: rbd.csi.ceph.com
    resourceVersion: "3417415888"
    uid: 205d6822-1b64-4b38-80ed-a9bfc8733c88
  spec:
    attachRequired: true
    fsGroupPolicy: File
    podInfoOnMount: false
    requiresRepublish: false
    seLinuxMount: false
    storageCapacity: false
    volumeLifecycleModes:
    - Persistent
kind: List
metadata:
  resourceVersion: ""

(I skipped the Calico drivers)

The provisioner pods have 7/7 containers running and ready, the nodeplugin pods have 3/3 containers running and ready.

Madhu-1 commented 11 months ago

Is your kubelet --root-dir /var/lib/kubelet/? Can you check the root-dir of the kubelet process on the nodes? ps -ef | grep kubelet should help you.

Can you please double-check this and confirm the rootPath in kubelet?

Check that the attacher sidecar is running with the RBD provisioner pod without any problems.

please check this one as well.

Madhu-1 commented 11 months ago

The provisioner pods have 7/7 containers running and ready, the nodeplugin pods have 3/3 containers running and ready.

please check attacher sidecar container logs in the provisioner pod and ensure there are no errors as well.
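
Assuming the Helm chart's default naming, something like this pulls those logs (the deployment and container names may differ per release):

kubectl -n ceph-csi-rbd logs deploy/ceph-csi-rbd-provisioner -c csi-attacher --tail=100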

phoerious commented 11 months ago

Can you please double-check this and confirm the rootPath in kubelet?

Yes, it's running with --root-dir=/var/lib/kubelet

please check this one as well.

See my edit

The provisioner pods have 7/7 containers running and ready, the nodeplugin pods have 3/3 containers running and ready.

please check attacher sidecar container logs in the provisioner pod and ensure there are no errors as well.

There are three provisioner pods running in the cluster and the attacher logs of the first are:

I1106 15:56:09.791852       1 main.go:96] Version: v4.3.0
I1106 15:56:11.981414       1 common.go:111] Probing CSI driver for readiness
I1106 15:56:11.989139       1 common.go:111] Probing CSI driver for readiness
I1106 15:56:11.992841       1 leaderelection.go:245] attempting to acquire leader lease ceph-csi-rbd/external-attacher-leader-rbd-csi-ceph-com...
I1106 15:56:32.221588       1 leaderelection.go:255] successfully acquired lease ceph-csi-rbd/external-attacher-leader-rbd-csi-ceph-com
I1106 15:56:32.221827       1 controller.go:130] Starting CSI attacher

The other two are identical, but minus the acquired leader lease.

Madhu-1 commented 11 months ago

Okay, thanks for providing the details. Please provide the csi-rbdplugin pod logs (all containers) from the node where you have the problem.

phoerious commented 11 months ago

The problem occurs on all nodes, even after a reboot (the machines are RAM-booted, so I wouldn't expect that to make a difference anyway). I only added a nodeSelector to make this problem easier to debug. These are the logs of the containers:

driver-registrar:

I1106 16:42:12.765557   67794 main.go:167] Version: v2.8.0
I1106 16:42:12.765950   67794 main.go:168] Running node-driver-registrar in mode=registration
I1106 16:42:12.767668   67794 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1106 16:42:12.768556   67794 connection.go:164] Connecting to unix:///csi/csi.sock
I1106 16:42:16.837800   67794 main.go:199] Calling CSI driver to discover driver name
I1106 16:42:16.837854   67794 connection.go:193] GRPC call: /csi.v1.Identity/GetPluginInfo
I1106 16:42:16.837865   67794 connection.go:194] GRPC request: {}
I1106 16:42:16.848145   67794 connection.go:200] GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"v3.9.0"}
I1106 16:42:16.848167   67794 connection.go:201] GRPC error: <nil>
I1106 16:42:16.848185   67794 main.go:209] CSI driver name: "rbd.csi.ceph.com"
I1106 16:42:16.848230   67794 node_register.go:53] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I1106 16:42:16.848486   67794 node_register.go:62] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
I1106 16:42:16.848733   67794 node_register.go:92] Skipping HTTP server because endpoint is set to: ""

csi-rbdplugin:

I1106 16:42:16.681975   67849 cephcsi.go:199] Driver version: v3.9.0 and Git version: c6db73f0daef5570756b4257043c00bf58b5fd3e
I1106 16:42:16.706083   67849 cephcsi.go:276] Initial PID limit is set to -1
I1106 16:42:16.706181   67849 cephcsi.go:282] Reconfigured PID limit to -1 (max)
I1106 16:42:16.706246   67849 cephcsi.go:231] Starting driver type: rbd with name: rbd.csi.ceph.com
I1106 16:42:16.755941   67849 mount_linux.go:284] Detected umount with safe 'not mounted' behavior
I1106 16:42:16.756332   67849 rbd_attach.go:230] nbd module loaded
W1106 16:42:16.756432   67849 util.go:280] kernel 5.4.0-165-generic does not support required features
W1106 16:42:16.756454   67849 rbd_attach.go:240] kernel version "5.4.0-165-generic" doesn't support cookie feature
I1106 16:42:16.757359   67849 server.go:114] listening for CSI-Addons requests on address: &net.UnixAddr{Name:"/csi/csi-addons.sock", Net:"unix"}
I1106 16:42:16.757962   67849 server.go:126] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I1106 16:42:16.844352   67849 utils.go:195] ID: 1 GRPC call: /csi.v1.Identity/GetPluginInfo
I1106 16:42:16.847360   67849 utils.go:206] ID: 1 GRPC request: {}
I1106 16:42:16.847401   67849 identityserver-default.go:39] ID: 1 Using default GetPluginInfo
I1106 16:42:16.847528   67849 utils.go:212] ID: 1 GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"v3.9.0"}
...
[many more of the lines below]
I1107 12:43:17.721356   67849 utils.go:195] ID: 1202 GRPC call: /csi.v1.Identity/Probe
I1107 12:43:17.721535   67849 utils.go:206] ID: 1202 GRPC request: {}
I1107 12:43:17.721609   67849 utils.go:212] ID: 1202 GRPC response: {}
I1107 12:44:17.721596   67849 utils.go:195] ID: 1203 GRPC call: /csi.v1.Identity/Probe
I1107 12:44:17.721732   67849 utils.go:206] ID: 1203 GRPC request: {}
I1107 12:44:17.721779   67849 utils.go:212] ID: 1203 GRPC response: {}
I1107 12:45:17.709858   67849 utils.go:195] ID: 1204 GRPC call: /csi.v1.Identity/Probe
I1107 12:45:17.710017   67849 utils.go:206] ID: 1204 GRPC request: {}
I1107 12:45:17.710079   67849 utils.go:212] ID: 1204 GRPC response: {}
I1107 12:46:17.709068   67849 utils.go:195] ID: 1205 GRPC call: /csi.v1.Identity/Probe
I1107 12:46:17.709210   67849 utils.go:206] ID: 1205 GRPC request: {}
I1107 12:46:17.709273   67849 utils.go:212] ID: 1205 GRPC response: {}
I1107 12:47:17.721311   67849 utils.go:195] ID: 1206 GRPC call: /csi.v1.Identity/Probe
I1107 12:47:17.721438   67849 utils.go:206] ID: 1206 GRPC request: {}
I1107 12:47:17.721525   67849 utils.go:212] ID: 1206 GRPC response: {}

liveness-prometheus has no log output (probably not relevant anyway)

Madhu-1 commented 11 months ago

The logs of the driver-registrar should look like the following:

I1107 12:57:58.903924    5717 node_register.go:53] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I1107 12:57:58.904594    5717 node_register.go:62] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
I1107 12:57:58.905550    5717 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1107 12:57:59.074407    5717 main.go:102] Received GetInfo call: &InfoRequest{}
I1107 12:57:59.074743    5717 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/rbd.csi.ceph.com/registration"
I1107 12:57:59.111429    5717 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,

It looks like the CSI driver is not getting registered with kubelet. Please check the kubelet logs.
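
For example, assuming kubelet runs as a systemd unit on the node:

journalctl -u kubelet --since "1 hour ago" | grep -iE 'csi|registration|rbd.csi.ceph.com'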

phoerious commented 11 months ago

Found it! Plugins not being registered properly with kubelet was my assumption as well, but I didn't know what to look for. It turns out we had two inconsistent deployments between our clusters: both /var/lib/kubelet and another persistent location were being deployed as the root dir, but only one was actually used, and I checked the wrong one, so everything looked fine when in fact it wasn't. Thanks for the hint, now it's working!