Closed: pratik705 closed this issue 1 year ago
@pratik705 The error looks like a connection issue between the mount/map operation and the Ceph cluster. Can you restart the node and see if that fixes the problem?
Thanks for the reply, @Madhu-1
I can try restarting the worker node, but I am able to create a new pod/volume on the same node from the same Ceph backend [1]. Also, from the same node, I am able to connect to the mons/OSDs. All existing pods running on different nodes are stuck due to this issue. Do you still want me to restart the node?
[1]
root@rpck-ir14:/var/log/ceph# kubectl get pods -n test-velero -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alpha-nginx 1/1 Running 0 77m 10.20.2.17 172.22.0.149 <none> <none> <=== new pod(ceph-csi 3.6.1)
test-nginx 1/1 Running 0 37h 10.20.2.43 172.22.0.149 <none> <none> <=== existing pod(ceph-csi 3.5.1)
root@rpck-ir14:/var/log/ceph# kubectl exec -it alpha-nginx -n test-velero -- bash
root@alpha-nginx:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 435G 85G 350G 20% /
tmpfs 64M 0 64M 0% /dev
tmpfs 95G 0 95G 0% /sys/fs/cgroup
/dev/rbd1 11G 28K 11G 1% /mnt <<===
/dev/mapper/vglocal00-root00 435G 85G 350G 20% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 95G 12K 95G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 95G 0 95G 0% /proc/acpi
tmpfs 95G 0 95G 0% /proc/scsi
tmpfs 95G 0 95G 0% /sys/firmware
root@alpha-nginx:/# cd /mnt
root@alpha-nginx:/mnt# ls
abc lost+found
root@alpha-nginx:/mnt# echo "this is new file with ceph-csi v3.6.1" >new-file.txt
root@alpha-nginx:/mnt# ls
abc lost+found new-file.txt
root@alpha-nginx:/mnt# cat new-file.txt
this is new file with ceph-csi v3.6.1
root@rpck-ir16:/var/log/ceph# nc -vz 172.22.0.148 6810
Connection to 172.22.0.148 6810 port [tcp/*] succeeded!
root@rpck-ir16:/var/log/ceph# nc -vz 172.22.0.149 6789
Connection to 172.22.0.149 6789 port [tcp/*] succeeded!
@pratik705 Yes, please restart the node where the application pod is running, or scale all the applications down, wait for them to be fully down, and then scale them back up again.
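For reference, a rough sketch of the two workarounds suggested above, assuming the affected node is 172.22.0.149 and the application is managed by a Deployment named test-nginx in the test-velero namespace (names taken from the outputs in this thread; the Deployment name is an assumption, adjust to your environment):
# Option 1: drain the affected worker node and reboot it
kubectl cordon 172.22.0.149
kubectl drain 172.22.0.149 --ignore-daemonsets --delete-emptydir-data
# reboot the node, then allow it to schedule pods again
kubectl uncordon 172.22.0.149
# Option 2: scale the workloads down to zero and back up
kubectl scale deployment test-nginx -n test-velero --replicas=0
# wait until all the application pods are gone, then scale back up
kubectl scale deployment test-nginx -n test-velero --replicas=1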
@Madhu-1 it helped. I am able to access the pods and data. Thanks a lot for the workaround :-)
Is it a bug in the upgrade process?
It's not a bug, it's a connection problem; the clients might not have reconnected to the Ceph cluster. I have not seen this in any upgraded cluster.
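For anyone hitting the same symptom, a minimal sketch of how the kernel RBD client's connection state can be checked on the affected node (the debugfs path is generic; the exact fsid.client directory differs per cluster):
# list rbd images mapped on this node
rbd device list
# in-flight OSD requests of the kernel client; hung I/O piles up here
cat /sys/kernel/debug/ceph/*/osdc
# kernel messages about socket errors / reconnects
dmesg | grep -i -E 'libceph|rbd'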
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Describe the bug
Existing volumes are not usable after upgrading ceph-csi from 3.5.1 --> 3.6.1.
Environment details
Image/version of Ceph CSI driver : v3.6.1
Mounter used for mounting PVC (for CephFS it is fuse or kernel; for RBD it is krbd or rbd-nbd) :
Steps to reproduce
Steps to reproduce the behavior:
Actual results
Expected behavior
We should be able to access the data from the existing volumes after upgrading ceph-csi from 3.5.1 to 3.6.1.
Logs
If the issue is in PVC mounting, please attach complete logs of the containers below (see the example commands after this note).
csi-rbdplugin/csi-cephfsplugin and driver-registrar container logs from the plugin pod on the node where the mount is failing.
If required, attach dmesg logs.
Note: if it is an RBD issue, please provide only RBD-related logs; if it is a CephFS issue, please provide CephFS logs.
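As an illustration, the logs above could be gathered along these lines; the namespace and pod name are placeholders (pick the csi-rbdplugin pod scheduled on the node where the mount is failing):
# find the rbd plugin pod running on the affected node
kubectl get pods -n <csi-namespace> -o wide | grep csi-rbdplugin
# container logs from that pod
kubectl logs <csi-rbdplugin-pod> -n <csi-namespace> -c csi-rbdplugin
kubectl logs <csi-rbdplugin-pod> -n <csi-namespace> -c driver-registrar
# kernel messages from the node, if required
dmesg -T | tail -n 100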
Additional context
Ceph status:
root@rpck-ir14:~# kubectl exec -it test-nginx -n test-velero -- bash
root@test-nginx:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 435G 85G 350G 20% /
tmpfs 64M 0 64M 0% /dev
tmpfs 95G 0 95G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/mapper/vglocal00-root00 435G 85G 350G 20% /etc/hosts
/dev/rbd0 6.8G 28K 6.8G 1% /usr/share/nginx/html
tmpfs 95G 12K 95G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 95G 0 95G 0% /proc/acpi
tmpfs 95G 0 95G 0% /proc/scsi
tmpfs 95G 0 95G 0% /sys/firmware
root@test-nginx:/# cd /usr/share/nginx/html
root@test-nginx:/usr/share/nginx/html# ls
^^ hung
On the worker node where the pod is running, the processes are in "D" state:
root 2011442 0.0 0.0 3444 672 ? D+ Feb21 0:00 ls
root 2044814 0.0 0.0 3444 728 ? D+ Feb21 0:00 ls
root 2055402 0.0 0.0 3524 2340 ? D+ Feb21 0:00 ls -ltr /usr/share/nginx/html
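For completeness, a small sketch of how these uninterruptible-sleep ("D" state) processes can be listed on the worker node, and where one of them is blocked in the kernel (the PID comes from the output above):
# list processes currently in state D (uninterruptible sleep)
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# kernel stack of one hung process, to see what it is waiting on
cat /proc/2011442/stack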