Closed: Ayden-Franklin closed this issue 4 years ago.
@Macrame please use the latest version of cephcsi (v2.1.2) and let us know if the issue still exists.
From the log it looks like you are using the canary image. Let me take a look at it.
I am sorry, I was confused. Here is the version information:
driver-registrar:
image: quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
csi-rbdplugin:
image: quay.io/cephcsi/cephcsi:canary
Is this the latest version? I cannot find a v2.1.2 of the driver-registrar.
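For reference, one quick way to confirm which tags are actually deployed (a sketch; the namespace is an assumption, adjust to wherever ceph-csi runs):

# Print each pod with the images it runs, to confirm the deployed tags.
kubectl -n ceph-csi get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'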
Yeah, this is the latest version. Can you please provide the PVC template used for your testing?
And also provide the rbd info of the image from the Ceph cluster.
I just used the example from the Ceph documentation and changed the namespace and pool names.
csi-rbd-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: 1e883026-c03d-11ea-9dd5-00163e101e9f
  pool: rbd
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: mars
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: mars
reclaimPolicy: Delete
mountOptions:
  - discard
raw-block-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
  namespace: mars
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd-sc
rbd info
rbd info csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d
rbd image 'csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d':
    size 1 GiB in 256 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 5ee19ac08555
    block_name_prefix: rbd_data.5ee19ac08555
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags:
    create_timestamp: Thu Jul 9 09:33:37 2020
    access_timestamp: Thu Jul 9 09:33:37 2020
    modify_timestamp: Thu Jul 9 09:33:37 2020
Thanks for your response.
@dillaman if we don't provide the image features when creating an rbd image, will these features come as defaults in Octopus?
@Macrame can you please pass imageFeatures: layering in the storageclass, create a new PVC, and see?
I see. I tried deleting them and creating them again, but the previous issue appears again:
MapVolume.SetUpDevice failed for volume "pvc-7fcd65f2-42c6-4ab1-a95d-e9eb27de22f4" : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name rbd.csi.ceph.com not found in the list of registered CSI drivers
I am restarting the rbd-plugin pod.
For the above error, see the driver-registrar logs; they might help.
This issue can be resolved by adding imageFeatures: layering.
The driver name rbd.csi.ceph.com not found issue only occurs on some of the nodes. Only one node can run the pod successfully.
driver-registrar logs
I0709 03:19:46.396924 27642 main.go:110] Version: v1.3.0-0-g6e9fff3e
I0709 03:19:46.396985 27642 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0709 03:19:46.397005 27642 connection.go:151] Connecting to unix:///csi/csi.sock
I0709 03:19:47.397487 27642 main.go:127] Calling CSI driver to discover driver name
I0709 03:19:47.397511 27642 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0709 03:19:47.397518 27642 connection.go:181] GRPC request: {}
I0709 03:19:47.400319 27642 connection.go:183] GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"canary"}
I0709 03:19:47.400617 27642 connection.go:184] GRPC error: <nil>
I0709 03:19:47.400625 27642 main.go:137] CSI driver name: "rbd.csi.ceph.com"
I0709 03:19:47.400676 27642 node_register.go:51] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I0709 03:19:47.400762 27642 node_register.go:60] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
The log looks normal, but driver name rbd.csi.ceph.com not found in the list of registered CSI drivers still occurs.
OK, I should open a new issue for driver name rbd.csi.ceph.com not found in the list of registered CSI drivers. Let me close this issue. Thanks for your help.
@Madhu-1 I'm getting the same error: rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten kubernetes/csi-vol-4acc48c5-d61d-11ea-bc35-ba86e3e2c39d --id kubernetes --keyfile=stripped -m 10.8.0.14:6789,10.8.0.18:6789]
But the cluster was working previously. I had to spawn a new node where I'm seeing this issue.
Any help is really appreciated.
My setup is exactly like Macrame's, but I'm not getting the driver name rbd.csi.ceph.com not found in the list of registered CSI drivers error.
@ankitladhania do you have imageFeatures: layering in your storageclass?
No. And when I try to add it to the existing storage class, it doesn't allow me to.
How do I update the current storage? We have hundreds of users' storage data; I cannot delete it, we need to convert it.
Deleting and recreating only the storageclass helps; StorageClass parameters are immutable, so they cannot be edited in place.
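A minimal sketch of that recreate step, reusing the manifest from earlier in the thread with the one added parameter (the heredoc form is just illustrative):

# StorageClass parameters are immutable, so delete and recreate the
# object with imageFeatures added. Already-bound PVs keep working;
# only new PVCs pick up the new parameters.
kubectl delete storageclass csi-rbd-sc
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: 1e883026-c03d-11ea-9dd5-00163e101e9f
  pool: rbd
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: mars
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: mars
reclaimPolicy: Delete
mountOptions:
  - discard
EOF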
I deleted and recreated the storageclass but the old volumes still give the same error.
MountVolume.MountDevice failed for volume "pvc-6194351b-3fd4-4f76-a1ec-2e5537709b86" : rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten kubernetes/csi-vol-271fdd03-c1b5-11ea-a1b3-7e6749822114 --id kubernetes --keyfile=stripped -m 10.8.0.14:6789,10.8.0.18:6789]
Please disable the features other than layering on the older rbd images.
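For a single image, the command looks like this (pool and image name taken from the rbd info output earlier in the thread):

# Disable everything except layering on one existing image.
rbd feature disable rbd/csi-vol-360233ac-c184-11ea-aa46-d6eec0877e2d \
    exclusive-lock object-map fast-diff deep-flatten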
Any way to disable it en masse? Because I have 464 volumes to do this with.
You need to write a script for it.
Thanks @Madhu-1, wrote a small script to do the same:
for vol in $(rbd ls kubernetes); do
  echo "$vol"
  rbd feature disable "kubernetes/$vol" exclusive-lock object-map fast-diff deep-flatten
done
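One caveat worth adding: exclusive-lock, object-map, and fast-diff can be re-enabled on an image later if needed, but deep-flatten can only be disabled after image creation, never re-enabled.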
@Madhu-1 - can you tell me the reason why this happened? We shouldn't be promoting images like this. My team followed the code from the documentation that was provided; there could be a small mention to use "vx.x.x" for production.
This can happen if you don't provide imageFeatures: layering in the storageclass; the image will get created with the default options in Ceph.
Which version should I use for production?
The latest release is always better; refer to https://github.com/ceph/ceph-csi/releases for the latest releases.
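If it helps, one way to move off the canary tag without redeploying everything is to patch the image on the plugin DaemonSet (a sketch only; the namespace, DaemonSet, and container names are assumptions based on the default deployment, and the version shown is illustrative):

# Point the plugin container at a pinned release instead of the
# floating canary tag; pick the actual latest tag from the releases page.
kubectl -n ceph-csi set image daemonset/csi-rbdplugin \
    csi-rbdplugin=quay.io/cephcsi/cephcsi:v3.9.0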
The same error message has arisen, just like this:
Warning FailedMount 0s (x6 over 17s) kubelet, 10.16.247.82 MountVolume.MountDevice failed for volume "pvc-b61c9537-57a5-4f58-86d2-1931e1565c3c" : rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten kubernetes/csi-vol-84e788e8-44fb-11eb-acbc-2a1bf25e3414 --id kubernetes --keyfile=stripped -m 1.1.1.1:6789]
Are you specifying the imageFeatures in the storageclass?
No, I haven't added it yet. Let me try it as you said; I will report the outcome as soon as I've finished.
Yes, it really works as you said; the pod has been bound to a PV successfully. We should update the documentation about this issue to avoid confusing other users: https://docs.ceph.com/en/latest/rbd/rbd-kubernetes/
Thank you, much appreciated. BTW, Merry Christmas!
Hi @Madhu-1, I faced the same issue:
MountVolume.MountDevice failed for volume "pvc-xxx" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 22) occurred while running rbd args: [--id csi-rbd-node -m x.x.x.x:3300,x.x.x.x:3300,x.x.x.x:3300 --keyfile=***stripped*** map blockpool/csi-vol-xxxx --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
I tried adding the imageFeatures option, but it didn't help.
@phucnv282 check dmesg on the node and also please open a new issue with required logs.
I provisioned Ceph on K8s and connected to it as an external cluster. What does dmesg do?
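For context, dmesg prints the kernel ring buffer, which is where krbd map failures such as the "sysfs write failed" above are logged. A quick check on the affected node (the grep pattern is just illustrative):

# Show recent kernel messages from the rbd/libceph modules,
# where krbd map errors are reported.
dmesg -T | grep -iE 'rbd|libceph' | tail -n 20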
@Madhu-1 Same problem here. I've been googling like mad, but I cannot find a solution that works.
I am running Quincy and Kubernetes 1.28. The CSI driver is deployed via the Helm chart. I've created the following StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: rbd.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
mountOptions:
  - discard
parameters:
  clusterID: XXX
  pool: rbd.k8s-pvs
  imageFeatures: "layering"
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-rbd
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi-rbd
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi-rbd
I can also verify that the node-plugin pod with three containers is running on the node that I want to bind to, and there is an rbd.csi.ceph.com-reg.sock in /var/lib/kubelet/plugins_registry.
Provisioning the PV works flawlessly:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 7m34s persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal Provisioning 7m34s rbd.csi.ceph.com_ceph-csi-rbd-provisioner-589cdbb858-5qj7t_e26cacee-5c25-4b46-afe2-9d7362844514 External provisioner is provisioning volume for claim "ceph-csi-rbd/rbd-pvc"
Normal ProvisioningSucceeded 7m34s rbd.csi.ceph.com_ceph-csi-rbd-provisioner-589cdbb858-5qj7t_e26cacee-5c25-4b46-afe2-9d7362844514 Successfully provisioned volume pvc-3b2820f9-b137-4f12-baaf-b537f73fb56a
But binding fails:
apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: XXX
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
Logs:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned ceph-csi-rbd/csi-rbd-demo-pod to XXX
Normal SuccessfulAttachVolume 11s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-3b2820f9-b137-4f12-baaf-b537f73fb56a"
Warning FailedMount 0s (x4 over 4s) kubelet MountVolume.MountDevice failed for volume "pvc-3b2820f9-b137-4f12-baaf-b537f73fb56a" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name rbd.csi.ceph.com not found in the list of registered CSI drivers
I cannot find anything peculiar in the system logs and I've run out of ideas where to look. Please send help!
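One check worth doing at this point (an editor's suggestion, not from the thread): kubelet publishes the drivers it has actually registered in each node's CSINode object, so the "list of registered CSI drivers" from the error can be inspected directly ("XXX" being the node-name placeholder used above):

# Drivers kubelet has registered appear under spec.drivers;
# rbd.csi.ceph.com should be listed once registration succeeds.
kubectl get csinode XXX -o jsonpath='{.spec.drivers[*].name}'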
@phoerious couple of things:
1. Is your kubelet --root-dir /var/lib/kubelet/? ps -ef | grep kubelet should help you.
2. Please share the kubectl get csidriver -oyaml output.
3. Also check whether you have the attacher sidecar running with the RBD provisioner pod without any problem.
Yes, it's vanilla k8s and the --root-dir has not been set explicitly, so it's the default of /var/lib/kubelet and everything else is also in there.
These are the relevant YAML snippets from kubectl get csidriver:
apiVersion: v1
items:
  - apiVersion: storage.k8s.io/v1
    kind: CSIDriver
    metadata:
      annotations:
        meta.helm.sh/release-name: ceph-csi-cephfs
        meta.helm.sh/release-namespace: ceph-csi-cephfs
      creationTimestamp: "2023-11-06T16:48:56Z"
      labels:
        app.kubernetes.io/managed-by: Helm
      name: cephfs.csi.ceph.com
      resourceVersion: "3417670156"
      uid: b3ead413-bc5f-4a51-89a0-2c32adafec31
    spec:
      attachRequired: false
      fsGroupPolicy: File
      podInfoOnMount: false
      requiresRepublish: false
      seLinuxMount: false
      storageCapacity: false
      volumeLifecycleModes:
        - Persistent
  - apiVersion: storage.k8s.io/v1
    kind: CSIDriver
    metadata:
      annotations:
        meta.helm.sh/release-name: ceph-csi-rbd
        meta.helm.sh/release-namespace: ceph-csi-rbd
      creationTimestamp: "2023-11-06T14:54:56Z"
      labels:
        app.kubernetes.io/managed-by: Helm
      name: rbd.csi.ceph.com
      resourceVersion: "3417415888"
      uid: 205d6822-1b64-4b38-80ed-a9bfc8733c88
    spec:
      attachRequired: true
      fsGroupPolicy: File
      podInfoOnMount: false
      requiresRepublish: false
      seLinuxMount: false
      storageCapacity: false
      volumeLifecycleModes:
        - Persistent
kind: List
metadata:
  resourceVersion: ""
(I skipped the Calico drivers)
The provisioner pods have 7/7 containers running and ready, the nodeplugin pods have 3/3 containers running and ready.
Is your kubelet --root-dir /var/lib/kubelet/? Can you check the root-dir on the kubelet process on the nodes? ps -ef | grep kubelet should help you.
Can you please double-check this and confirm the rootPath in kubelet?
Check whether you have the attacher sidecar running with the RBD provisioner pod without any problem.
Please check this one as well.
The provisioner pods have 7/7 containers running and ready, the nodeplugin pods have 3/3 containers running and ready.
Please check the attacher sidecar container logs in the provisioner pod and ensure there are no errors as well.
Can you please double-check this and confirm the rootPath in kubelet?
Yes, it's running with --root-dir=/var/lib/kubelet
please check this one as well.
See my edit
The provisioner pods have 7/7 containers running and ready, the nodeplugin pods have 3/3 containers running and ready.
Please check the attacher sidecar container logs in the provisioner pod and ensure there are no errors as well.
There are three provisioner pods running in the cluster and the attacher logs of the first are:
I1106 15:56:09.791852 1 main.go:96] Version: v4.3.0
I1106 15:56:11.981414 1 common.go:111] Probing CSI driver for readiness
I1106 15:56:11.989139 1 common.go:111] Probing CSI driver for readiness
I1106 15:56:11.992841 1 leaderelection.go:245] attempting to acquire leader lease ceph-csi-rbd/external-attacher-leader-rbd-csi-ceph-com...
I1106 15:56:32.221588 1 leaderelection.go:255] successfully acquired lease ceph-csi-rbd/external-attacher-leader-rbd-csi-ceph-com
I1106 15:56:32.221827 1 controller.go:130] Starting CSI attacher
The other two are identical, but minus the acquired leader lease.
Okay, thanks for providing the details. Please provide the csi-rbdplugin pod logs (all containers) from the node where you have the problem.
The problem occurs on all nodes, even after a reboot (the machines are RAM-booted, so I wouldn't expect that to make a difference anyway). I only added a nodeSelector to make this problem easier to debug. These are the logs of the containers:
driver-registrar:
I1106 16:42:12.765557 67794 main.go:167] Version: v2.8.0
I1106 16:42:12.765950 67794 main.go:168] Running node-driver-registrar in mode=registration
I1106 16:42:12.767668 67794 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1106 16:42:12.768556 67794 connection.go:164] Connecting to unix:///csi/csi.sock
I1106 16:42:16.837800 67794 main.go:199] Calling CSI driver to discover driver name
I1106 16:42:16.837854 67794 connection.go:193] GRPC call: /csi.v1.Identity/GetPluginInfo
I1106 16:42:16.837865 67794 connection.go:194] GRPC request: {}
I1106 16:42:16.848145 67794 connection.go:200] GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"v3.9.0"}
I1106 16:42:16.848167 67794 connection.go:201] GRPC error: <nil>
I1106 16:42:16.848185 67794 main.go:209] CSI driver name: "rbd.csi.ceph.com"
I1106 16:42:16.848230 67794 node_register.go:53] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I1106 16:42:16.848486 67794 node_register.go:62] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
I1106 16:42:16.848733 67794 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
csi-rbdplugin:
I1106 16:42:16.681975 67849 cephcsi.go:199] Driver version: v3.9.0 and Git version: c6db73f0daef5570756b4257043c00bf58b5fd3e
I1106 16:42:16.706083 67849 cephcsi.go:276] Initial PID limit is set to -1
I1106 16:42:16.706181 67849 cephcsi.go:282] Reconfigured PID limit to -1 (max)
I1106 16:42:16.706246 67849 cephcsi.go:231] Starting driver type: rbd with name: rbd.csi.ceph.com
I1106 16:42:16.755941 67849 mount_linux.go:284] Detected umount with safe 'not mounted' behavior
I1106 16:42:16.756332 67849 rbd_attach.go:230] nbd module loaded
W1106 16:42:16.756432 67849 util.go:280] kernel 5.4.0-165-generic does not support required features
W1106 16:42:16.756454 67849 rbd_attach.go:240] kernel version "5.4.0-165-generic" doesn't support cookie feature
I1106 16:42:16.757359 67849 server.go:114] listening for CSI-Addons requests on address: &net.UnixAddr{Name:"/csi/csi-addons.sock", Net:"unix"}
I1106 16:42:16.757962 67849 server.go:126] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I1106 16:42:16.844352 67849 utils.go:195] ID: 1 GRPC call: /csi.v1.Identity/GetPluginInfo
I1106 16:42:16.847360 67849 utils.go:206] ID: 1 GRPC request: {}
I1106 16:42:16.847401 67849 identityserver-default.go:39] ID: 1 Using default GetPluginInfo
I1106 16:42:16.847528 67849 utils.go:212] ID: 1 GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"v3.9.0"}
...
[many more of the lines below]
I1107 12:43:17.721356 67849 utils.go:195] ID: 1202 GRPC call: /csi.v1.Identity/Probe
I1107 12:43:17.721535 67849 utils.go:206] ID: 1202 GRPC request: {}
I1107 12:43:17.721609 67849 utils.go:212] ID: 1202 GRPC response: {}
I1107 12:44:17.721596 67849 utils.go:195] ID: 1203 GRPC call: /csi.v1.Identity/Probe
I1107 12:44:17.721732 67849 utils.go:206] ID: 1203 GRPC request: {}
I1107 12:44:17.721779 67849 utils.go:212] ID: 1203 GRPC response: {}
I1107 12:45:17.709858 67849 utils.go:195] ID: 1204 GRPC call: /csi.v1.Identity/Probe
I1107 12:45:17.710017 67849 utils.go:206] ID: 1204 GRPC request: {}
I1107 12:45:17.710079 67849 utils.go:212] ID: 1204 GRPC response: {}
I1107 12:46:17.709068 67849 utils.go:195] ID: 1205 GRPC call: /csi.v1.Identity/Probe
I1107 12:46:17.709210 67849 utils.go:206] ID: 1205 GRPC request: {}
I1107 12:46:17.709273 67849 utils.go:212] ID: 1205 GRPC response: {}
I1107 12:47:17.721311 67849 utils.go:195] ID: 1206 GRPC call: /csi.v1.Identity/Probe
I1107 12:47:17.721438 67849 utils.go:206] ID: 1206 GRPC request: {}
I1107 12:47:17.721525 67849 utils.go:212] ID: 1206 GRPC response: {}
liveness-prometheus has no log output (probably not relevant anyway)
The logs of the driver-registrar should look like below:
I1107 12:57:58.903924 5717 node_register.go:53] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I1107 12:57:58.904594 5717 node_register.go:62] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
I1107 12:57:58.905550 5717 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1107 12:57:59.074407 5717 main.go:102] Received GetInfo call: &InfoRequest{}
I1107 12:57:59.074743 5717 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/rbd.csi.ceph.com/registration"
I1107 12:57:59.111429 5717 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,
It looks like kubelet is not registering the CSI driver: your registrar log stops after the registration server starts, and the Received GetInfo call / NotifyRegistrationStatus lines from kubelet never appear. Please check the kubelet logs.
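A sketch of that check, assuming kubelet runs under systemd:

# Watch kubelet for plugin-registration activity around the
# rbd.csi.ceph.com registration socket.
journalctl -u kubelet -f | grep -iE 'rbd.csi.ceph.com|registration'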
Found it! Plugins not being registered properly with kubelet was my assumption as well, but I didn't know what to look for. It turns out we had two inconsistent deployments between our clusters, so both /var/lib/kubelet and another persistent location were being deployed as the root dir, but only one was actually used, and I checked the wrong one, so everything looked fine when in fact it wasn't. Thanks for the hint, now it's working!
Describe the bug
I deployed a Ceph cluster using cephadm successfully. I can create a StorageClass and PVC with CephFS; with RBD, however, it failed.
Environment details
Mounter used for mounting PVC (for cephfs it's fuse or kernel; for rbd it's krbd or rbd-nbd): krbd (? not sure)
Steps to reproduce
Steps to reproduce the behavior:
Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[default-token-jn7tw data]: timed out waiting for the condition
Actual results
Pod status is still ContainerCreating.
Expected behavior
Pod can mount the PVC.
Logs
Pod event:
Kubelet log:
csi-rbdplugin log:
Additional context
There was another issue, driver name rbd.csi.ceph.com not found in the list of registered CSI drivers, yesterday. I read issue #1105 and then recreated the rbd-plugin pod; it worked fine.