ceph / ceph-csi

CSI driver for Ceph

Existing volumes are not usable after upgrading ceph-csi from 3.5.1 --> 3.6.1 #3687

Closed: pratik705 closed this issue 1 year ago

pratik705 commented 1 year ago

Describe the bug

Existing volumes are not usable after upgrading ceph-csi from 3.5.1 --> 3.6.1.

Environment details

Steps to reproduce

Steps to reproduce the behavior:

  1. Deploy ceph-csi 3.5.1.
  2. Create a PVC and attach it to a pod.
  3. Log in to the pod and write some data to the PVC.
  4. Upgrade ceph-csi to 3.6.1.
  5. Try accessing the data written in step 3.
  6. The process will hang.
  7. Try creating a new volume and attaching it to another pod.
  8. The operation will succeed and you can access the data (a command-level sketch of these steps follows the list).
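
For reference, a minimal command-level sketch of the steps above, assuming a Helm-based ceph-csi RBD deployment; the chart repo alias ceph-csi, release name ceph-csi-rbd, namespace ceph-csi, and the pvc/pod manifest names are placeholders rather than values from this report:

# 1. Deploy ceph-csi 3.5.1 (Helm is one option; values files omitted)
helm install ceph-csi-rbd ceph-csi/ceph-csi-rbd -n ceph-csi --version 3.5.1

# 2-3. Create a PVC, attach it to a pod, and write some data
kubectl apply -f pvc.yaml -f pod.yaml -n test-velero
kubectl exec -it test-nginx -n test-velero -- sh -c 'echo hello > /usr/share/nginx/html/abc'

# 4. Upgrade ceph-csi to 3.6.1
helm upgrade ceph-csi-rbd ceph-csi/ceph-csi-rbd -n ceph-csi --version 3.6.1

# 5-6. Reading the pre-upgrade data hangs
kubectl exec -it test-nginx -n test-velero -- ls /usr/share/nginx/html

# 7-8. A freshly created PVC/pod on the same node works fine
kubectl apply -f pvc-new.yaml -f pod-new.yaml -n test-velero
kubectl exec -it alpha-nginx -n test-velero -- ls /mnt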

Actual results

Access to data on volumes created before the upgrade hangs; the reading processes end up stuck in "D" state on the worker node.

Expected behavior

Existing volumes should remain usable after upgrading ceph-csi from 3.5.1 to 3.6.1.

Additional context


root@rpck-ir14:~# kubectl exec -it test-nginx -n test-velero -- bash
root@test-nginx:/# df -h
Filesystem                    Size  Used Avail Use% Mounted on
overlay                       435G   85G  350G  20% /
tmpfs                          64M     0   64M   0% /dev
tmpfs                          95G     0   95G   0% /sys/fs/cgroup
shm                            64M     0   64M   0% /dev/shm
/dev/mapper/vglocal00-root00  435G   85G  350G  20% /etc/hosts
/dev/rbd0                     6.8G   28K  6.8G   1% /usr/share/nginx/html
tmpfs                          95G   12K   95G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                          95G     0   95G   0% /proc/acpi
tmpfs                          95G     0   95G   0% /proc/scsi
tmpfs                          95G     0   95G   0% /sys/firmware
root@test-nginx:/# cd /usr/share/nginx/html
root@test-nginx:/usr/share/nginx/html# ls

^^ hung

On the worker node where the pod is running, the ls processes are stuck in "D" (uninterruptible sleep) state:

root     2011442  0.0  0.0   3444   672 ?  D+  Feb21  0:00 ls
root     2044814  0.0  0.0   3444   728 ?  D+  Feb21  0:00 ls
root     2055402  0.0  0.0   3524  2340 ?  D+  Feb21  0:00 ls -ltr /usr/share/nginx/html
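
As a general diagnostic sketch (generic commands, not output from this cluster), D-state processes and kernel-side RBD/libceph errors can be spotted on the node with:

# list processes stuck in uninterruptible sleep (usually blocked on I/O)
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# check the kernel log for rbd/libceph connection errors
dmesg -T | grep -iE 'libceph|rbd'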

Madhu-1 commented 1 year ago

@pratik705 Error 101 looks like a connection issue between the mount/map and the Ceph cluster. Can you restart the node and see if that fixes the problem?

pratik705 commented 1 year ago

Thanks for the reply, @Madhu-1

I can try restarting the worker node, but I am able to create a new pod/volume on the same node from the same Ceph backend [1]. Also, from the same node I am able to connect to the mons/OSDs. All existing pods running on different nodes are stuck due to this issue. Do you still want me to restart the node?

[1]

root@rpck-ir14:/var/log/ceph# kubectl get pods -n test-velero -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
alpha-nginx   1/1     Running   0          77m   10.20.2.17   172.22.0.149   <none>           <none> <=== new pod(ceph-csi 3.6.1)
test-nginx    1/1     Running   0          37h   10.20.2.43   172.22.0.149   <none>           <none>  <=== existing pod(ceph-csi 3.5.1)

root@rpck-ir14:/var/log/ceph# kubectl exec -it alpha-nginx -n test-velero -- bash
root@alpha-nginx:/# df -h
Filesystem                    Size  Used Avail Use% Mounted on
overlay                       435G   85G  350G  20% /
tmpfs                          64M     0   64M   0% /dev
tmpfs                          95G     0   95G   0% /sys/fs/cgroup
/dev/rbd1                      11G   28K   11G   1% /mnt                <<===
/dev/mapper/vglocal00-root00  435G   85G  350G  20% /etc/hosts
shm                            64M     0   64M   0% /dev/shm
tmpfs                          95G   12K   95G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                          95G     0   95G   0% /proc/acpi
tmpfs                          95G     0   95G   0% /proc/scsi
tmpfs                          95G     0   95G   0% /sys/firmware
root@alpha-nginx:/# cd /mnt
root@alpha-nginx:/mnt# ls
abc  lost+found
root@alpha-nginx:/mnt# echo "this is new file with ceph-csi v3.6.1" >new-file.txt
root@alpha-nginx:/mnt# ls
abc  lost+found  new-file.txt
root@alpha-nginx:/mnt# cat new-file.txt
this is new file with ceph-csi v3.6.1

root@rpck-ir16:/var/log/ceph# nc -vz 172.22.0.148 6810
Connection to 172.22.0.148 6810 port [tcp/*] succeeded!
root@rpck-ir16:/var/log/ceph# nc -vz 172.22.0.149 6789
Connection to 172.22.0.149 6789 port [tcp/*] succeeded!

Madhu-1 commented 1 year ago

@pratik705 Yes, please restart the node where the application pod is running, or scale down all the applications, wait until they are fully down, and then scale them back up again.
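
A sketch of these two workarounds, assuming the affected node is 172.22.0.149 (as in the outputs above) and that the applications are managed by Deployments; the Deployment name test-nginx is a placeholder:

# Option 1: drain the node, reboot it, and let the pods reschedule
kubectl cordon 172.22.0.149
kubectl drain 172.22.0.149 --ignore-daemonsets --delete-emptydir-data
# reboot the node out of band, then bring it back
kubectl uncordon 172.22.0.149

# Option 2: scale the affected workloads to zero, wait for the pods to
# terminate, then scale them back up
kubectl scale deployment test-nginx -n test-velero --replicas=0
kubectl get pods -n test-velero -w
kubectl scale deployment test-nginx -n test-velero --replicas=1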

pratik705 commented 1 year ago

@Madhu-1 It helped. I am able to access the pods and the data. Thanks a lot for the workaround :-)

Is it a bug in the upgrade process?

Madhu-1 commented 1 year ago

It's not a bug, it's a connection problem; the clients might not have reconnected to the Ceph cluster. I have not seen this in any upgraded cluster.
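
For anyone debugging the same symptom, the kernel RBD client's connection state can be inspected through the Ceph debugfs entries on the worker node (the paths below show the usual layout and require debugfs to be mounted; they are not output from this cluster):

# each in-kernel Ceph client gets a directory named <fsid>.client<id>
ls /sys/kernel/debug/ceph/

# in-flight (possibly stuck) OSD requests for each client
cat /sys/kernel/debug/ceph/*/osdc

# monitor session state for each client
cat /sys/kernel/debug/ceph/*/monc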

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.