ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Need help tracking down source of NFS mount bug blocking Rook CI #4251

Closed: BlaineEXE closed this issue 11 months ago

BlaineEXE commented 11 months ago

Describe the bug

I need some help trying to determine why NFS mounts began failing in Rook CI on October 27th. From what I have looked at so far, this doesn't correspond to any changes in Rook or Ceph-CSI. I'm hoping someone here might have a better idea of where to look. @Rakshith-R or @nixpanic perhaps?

Steps to reproduce

This is failing in all of Rook's CI runs starting October 27th. E.g., https://github.com/rook/rook/actions/runs/6786937018/job/18495134986

Actual results

The best indication of what is going wrong is the `No such file or directory` error returned by CSI when mounting the volume into the test pod, shown in the describe output below.

runner@fv-az619-734:~/work/rook/rook$ k describe pod/csinfs-demo-pod
Name:             csinfs-demo-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             fv-az619-734/10.1.0.9
Start Time:       Wed, 08 Nov 2023 18:32:00 +0000
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  web-server:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-swdlh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nfs-pvc
    ReadOnly:   false
  kube-api-access-swdlh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               6m23s                 default-scheduler        Successfully assigned default/csinfs-demo-pod to fv-az619-734
  Normal   SuccessfulAttachVolume  6m23s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb"
  Warning  FailedMount             2m10s (x9 over 6m4s)  kubelet                  MountVolume.SetUp failed for volume "pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb" : rpc error: code = Internal desc = nfs: failed to mount "rook-ceph-nfs-my-nfs-a:/0001-0009-rook-ceph-0000000000000001-03ec9b2d-0fd5-4689-a5cd-c8f0daf3bdf9" to "/var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi/pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb/mount" : an error (exit status 32) occurred while running nsenter args: [--net=/var/lib/kubelet/plugins/rook-ceph.nfs.csi.ceph.com/rook-ceph.net.ns -- mount -t nfs rook-ceph-nfs-my-nfs-a:/0001-0009-rook-ceph-0000000000000001-03ec9b2d-0fd5-4689-a5cd-c8f0daf3bdf9 /var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi/pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb/mount -o nolock] stderr: "mount.nfs: mounting rook-ceph-nfs-my-nfs-a:/0001-0009-rook-ceph-0000000000000001-03ec9b2d-0fd5-4689-a5cd-c8f0daf3bdf9 failed, reason given by server: No such file or directory\n"
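
To narrow down whether the failure is in the CSI plumbing or on the Ganesha/export side, one option is to re-run the exact mount by hand on the CI node, inside the plugin's network namespace, the same way ceph-csi invokes it via nsenter. A sketch, with the export path and netns file copied from the FailedMount event above, and `/mnt/nfs-test` as an arbitrary scratch mount point:

```
# Re-run the failing NFS mount manually inside the CSI plugin's network
# namespace. Export path and netns file are taken from the event above;
# /mnt/nfs-test is an arbitrary scratch directory, not part of the CI setup.
sudo mkdir -p /mnt/nfs-test
sudo nsenter --net=/var/lib/kubelet/plugins/rook-ceph.nfs.csi.ceph.com/rook-ceph.net.ns -- \
  mount -t nfs -o nolock \
  rook-ceph-nfs-my-nfs-a:/0001-0009-rook-ceph-0000000000000001-03ec9b2d-0fd5-4689-a5cd-c8f0daf3bdf9 \
  /mnt/nfs-test
```

If this fails with the same "No such file or directory", the server is rejecting the export path, and the CSI mount plumbing itself is probably fine.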

Expected behavior

Users should be able to mount NFS volumes in their pods.
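
For reference, the failing workload is essentially the stock NFS demo: a PVC from the `rook-nfs` StorageClass (the class name appears in the provisioner logs below) mounted by an nginx pod. A minimal sketch of such a reproducer, with names taken from the describe output above; the storage size and access mode are assumptions, not the exact CI manifest:

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes: [ReadWriteMany]   # assumed; typical for NFS demos
  resources:
    requests:
      storage: 1Gi               # assumed size
  storageClassName: rook-nfs
---
apiVersion: v1
kind: Pod
metadata:
  name: csinfs-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: nfs-pvc
EOF
```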

Logs

NFS Provisioner logs:

I1108 18:29:53.628203       1 main.go:93] Version : v1.8.0
I1108 18:30:00.213913       1 common.go:111] Probing CSI driver for readiness
I1108 18:30:00.240073       1 leaderelection.go:245] attempting to acquire leader lease rook-ceph/external-resizer-rook-ceph-nfs-csi-ceph-com...
I1108 18:30:00.251760       1 leaderelection.go:255] successfully acquired lease rook-ceph/external-resizer-rook-ceph-nfs-csi-ceph-com
I1108 18:30:00.253703       1 controller.go:255] Starting external resizer rook-ceph.nfs.csi.ceph.com
I1108 18:29:54.284671       1 csi-provisioner.go:154] Version: v3.5.0
I1108 18:29:54.284790       1 csi-provisioner.go:177] Building kube configs for running in cluster...
I1108 18:30:00.148600       1 common.go:111] Probing CSI driver for readiness
I1108 18:30:00.200650       1 csi-provisioner.go:302] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I1108 18:30:00.201564       1 leaderelection.go:245] attempting to acquire leader lease rook-ceph/rook-ceph-nfs-csi-ceph-com...
I1108 18:30:00.217522       1 leaderelection.go:255] successfully acquired lease rook-ceph/rook-ceph-nfs-csi-ceph-com
I1108 18:30:00.322099       1 clone_controller.go:66] Starting CloningProtection controller
I1108 18:30:00.322156       1 clone_controller.go:82] Started CloningProtection controller
I1108 18:30:00.322190       1 volume_store.go:97] Starting save volume queue
I1108 18:30:00.322049       1 controller.go:811] Starting provisioner controller rook-ceph.nfs.csi.ceph.com_csi-nfsplugin-provisioner-d546db784-44m2n_3a20f039-19ec-49ff-ade6-2fc1275e781a!
I1108 18:30:00.422803       1 controller.go:860] Started provisioner controller rook-ceph.nfs.csi.ceph.com_csi-nfsplugin-provisioner-d546db784-44m2n_3a20f039-19ec-49ff-ade6-2fc1275e781a!
I1108 18:31:59.417433       1 controller.go:1359] provision "default/nfs-pvc" class "rook-nfs": started
I1108 18:31:59.419248       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"nfs-pvc", UID:"26696fa1-6369-4245-8ca3-1fc41269c7bb", APIVersion:"v1", ResourceVersion:"3022", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/nfs-pvc"
I1108 18:31:59.677176       1 controller.go:1442] provision "default/nfs-pvc" class "rook-nfs": volume "pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb" provisioned
I1108 18:31:59.677204       1 controller.go:1455] provision "default/nfs-pvc" class "rook-nfs": succeeded
I1108 18:31:59.764973       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"nfs-pvc", UID:"26696fa1-6369-4245-8ca3-1fc41269c7bb", APIVersion:"v1", ResourceVersion:"3022", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb

NFS driver logs:

$ kr logs csi-rbdplugin-fbbck --all-containers
I1108 18:31:13.837902  171895 main.go:167] Version: v2.8.0
I1108 18:31:13.837999  171895 main.go:168] Running node-driver-registrar in mode=registration
I1108 18:31:17.707175  171895 node_register.go:53] Starting Registration Server at: /registration/rook-ceph.rbd.csi.ceph.com-reg.sock
I1108 18:31:17.707369  171895 node_register.go:62] Registration Server started at: /registration/rook-ceph.rbd.csi.ceph.com-reg.sock
I1108 18:31:17.707612  171895 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1108 18:31:17.917672  171895 main.go:102] Received GetInfo call: &InfoRequest{}
I1108 18:31:17.918083  171895 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/rook-ceph.rbd.csi.ceph.com/registration"
I1108 18:31:17.982437  171895 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
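
Note that the output above is from the node-driver-registrar sidecar of an RBD plugin pod, not the NFS nodeplugin container itself. The NFS plugin's own logs would presumably come from something like the following (the label and container names are assumptions based on Rook's usual naming, not verified here):

```
# Assumed naming: Rook's NFS nodeplugin pods carry the label app=csi-nfsplugin
# and the driver container is named csi-nfsplugin.
kubectl -n rook-ceph logs -l app=csi-nfsplugin -c csi-nfsplugin --tail=100
```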

Dmesg logs: https://gist.github.com/BlaineEXE/df404fb8d79fef06258a07c780abb382

I looked at the mount location on the CI host to see if there is a host-side problem, and there may be, but the behavior is strange to me. I can `sudo ls` the parent of the host mount location and see the directory that is supposed to exist, but `sudo ls` of that directory itself gives `No such file or directory`:

runner@fv-az619-734:~/work/rook/rook$ sudo ls /var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi
pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb
runner@fv-az619-734:~/work/rook/rook$ sudo ls /var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi/pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb
ls: cannot access '/var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi/pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb': No such file or directory

I tried listing again afterwards with more options, and the directory appears to be gone entirely, as though it is being created and deleted by something seemingly at random:

runner@fv-az619-734:~/work/rook/rook$ sudo ls -alFh /var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi/
# NO OUTPUT!
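
That on-again/off-again visibility is what a stale or half-torn-down mount point can look like. One way to check, as a sketch, is to stat the path directly (bypassing readdir) and see whether the kernel still has a mount entry for it:

```
# Distinguish "directory really gone" from "stale/broken mount":
sudo stat /var/lib/kubelet/pods/cc20b8a6-aca7-4709-a270-cc43b87f39d7/volumes/kubernetes.io~csi/pvc-26696fa1-6369-4245-8ca3-1fc41269c7bb
# Check whether the kernel still thinks something is mounted there:
grep pvc-26696fa1 /proc/mounts || echo "no mount entry for the PVC"
```
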
Rakshith-R commented 11 months ago

I think something changed in the Ceph image. Just changing the nfs-ganesha image led to a successful mount.

https://github.com/rook/rook/issues/13151#issuecomment-1798215739
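
If I understand Rook's wiring correctly, the Ganesha server pods run the same image as the rest of the cluster, so swapping the nfs-ganesha build means changing the Ceph image in the CephCluster spec. A sketch of pinning it; the tag below is a placeholder, not a known-good build:

```
# Pin the Ceph image (and therefore the bundled nfs-ganesha) in the Rook
# CephCluster spec. The tag is a placeholder for a build carrying Ganesha 5.7.
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:<tag-with-ganesha-5.7>"}}}'
```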

BlaineEXE commented 11 months ago

I was told that the Ganesha 5.7 version present in the most recent v18 images fixes that issue. I will investigate further.

[Update] You're right. The image running in CI still has Ganesha 5.6:

[root@rook-ceph-my-nfs /]# ganesha.nfsd -v
NFS-Ganesha Release = V5.6

It is using the quay.io/ceph/ceph:v18 image; however, I realized today that the ganesha packages differ between the v18 ARM image and the v18 x86 image:

ARM

[root@f011d07eaf05 /]# ganesha.nfsd -v
NFS-Ganesha Release = V5.7

x86

[root@180aaa0a8afe /]# ganesha.nfsd -v
NFS-Ganesha Release = V5.6
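
A quick way to confirm the per-architecture discrepancy is to pull both platform variants of the same tag and query the Ganesha version, assuming qemu/binfmt emulation is available for the non-native architecture:

```
# Compare the bundled nfs-ganesha across architectures of the same tag.
for arch in amd64 arm64; do
  echo "--- linux/${arch} ---"
  docker run --rm --platform "linux/${arch}" quay.io/ceph/ceph:v18 ganesha.nfsd -v
done
```
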
BlaineEXE commented 11 months ago

Closing this as solved, and I'll figure out how to get this fixed in ceph-container. Thanks, Rakshith!