Closed andreasvandaalen closed 7 months ago
Just a few observations that might be helpful:
I suggest you create a new namespace (you would usually not provision anything but Trident itself into the Trident namespace). Then apply a proper YAML (e.g. without any manual PV) into that namespace and check the result.
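With dynamic provisioning you only create the namespace and a PVC; Trident creates the PV and the backend volume for you. A minimal sketch of what I mean (namespace, PVC name, and storage class name are illustrative, not from your setup):

```yaml
# Sketch: dynamically provisioned PVC in its own namespace, no manual PV.
# Names here (trident-test, test-pvc, ontap-nas) are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: trident-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ontap-nas   # a class whose provisioner is csi.trident.netapp.io
```

If this binds and a matching volume shows up on the backend, the basic plumbing works.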
@wonderland Thank you for the observations and the hints. As it's getting late here, we'll revise the situation based on that tomorrow 👍 Much appreciated for your input!
It appears that we are only able to use TCP for communication between our test GKE cluster and the CVO. As a result I have two questions: can I add them here, or shall I create new issues?
We see that the GKE cluster nodes are NFS "ready". However, when trying to manually mount an NFS share from, e.g., an Ubuntu pod, doing so requires nfs-common installed and the rpcbind and rpc-statd services running.
```
root@ubuntu-v5:/# showmount -e [redacted]
Export list for [redacted]:
/                                                 (everyone)
/trident_pvc_2604074c_8aaa_4571_b225_81245cd221d0 (everyone)
/trident_pvc_2fa53c4e_c2ee_4c21_9601_81ab14a0922b (everyone)
etc...
```

Manual mounting works with TCP, so I've created a storage class for TCP:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: andreas-tcp
provisioner: csi.trident.netapp.io
mountOptions: ["rwx", "nfsvers=3", "proto=tcp"]
parameters:
  backendType: "ontap-nas"
```
and the PVC shows up:

```
kubectl get pvc -n andreas
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-andreas   Bound    pvc-2fa53c4e-c2ee-4c21-9601-81ab14a0922b   1Gi        RWO            andreas-tcp    85s
```
```
[redacted]::> vol show -fields create-time -volume *922b
vserver            volume                                           create-time
------------------ ------------------------------------------------ ------------------------
[redacted]         trident_pvc_2fa53c4e_c2ee_4c21_9601_81ab14a0922b Fri Dec 08 10:48:07 2023
```
However, when trying to use the PVC at pod creation, it ends up with exit status 32, e.g.:

```
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               33m                   default-scheduler        Successfully assigned andreas/ubuntu-v5 to gke-e-infra-gke-e-infra-gke-88607822-luiw
  Normal   SuccessfulAttachVolume  33m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2fa53c4e-c2ee-4c21-9601-81ab14a0922b"
  Warning  FailedMount             17m (x2 over 28m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc-andreas], unattached volumes=[kube-api-access-ns54g pvc-andreas]: timed out waiting for the condition
  Warning  FailedMount             10m (x8 over 30m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc-andreas], unattached volumes=[pvc-andreas kube-api-access-ns54g]: timed out waiting for the condition
  Warning  FailedMount             2m18s (x23 over 32m)  kubelet                  MountVolume.SetUp failed for volume "pvc-2fa53c4e-c2ee-4c21-9601-81ab14a0922b" : rpc error: code = Internal desc = error mounting NFS volume [redacted]:/trident_pvc_2fa53c4e_c2ee_4c21_9601_81ab14a0922b on mountpoint /var/lib/kubelet/pods/2c31700c-1bdc-49a5-a135-bff148115654/volumes/kubernetes.io~csi/pvc-2fa53c4e-c2ee-4c21-9601-81ab14a0922b/mount: exit status 32
```
I didn't expect I would have to add NFS services to the pod; is that a misunderstanding? And do you have other hints on what to look for regarding the "exit status 32"? The export rules are open, and I don't see further details that could hint at the cause of the inability to mount the PVC.
NFS would always run on TCP (UDP usage with NFS has been stopped decades ago). No need to specify it explicitly. Also, I've never seen the "rwx" NFS mount option, are you sure that is valid?
All storage access will always be at the node level. The mount is done by the worker node, then passed on to the container as a bind mount. Therefore no need to install any NFS packages inside the pod (though technically you could do that and access NFS in this way - but this is definitely not the K8s model for storage access!).
Unfortunately "exit status 32" is a pretty generic NFS error code. Usually something in the networking/connectivity area but hard to tell more from that code alone. You could SSH into the worker node and try to manually mount with verbose flags, that should give you more details. The output from the pod events gives you the full mount path (e.g. [redacted]:/trident_pvc_2fa53c4e_c2ee_4c21_9601_81ab14a0922b) so something like
```
mount -vvv -o vers=3 [redacted]:/trident_pvc_2fa53c4e_c2ee_4c21_9601_81ab14a0922b /mnt/test
```
> NFS would always run on TCP (UDP usage with NFS has been stopped decades ago). No need to specify it explicitly.
We started a couple of months ago with the documentation for 19 and later 21; I expect we initially adopted proto=udp from the examples there. That was easily adjusted. (There's an example I found through Google, from the documentation we used earlier, which reflects why we used it: https://github.com/NetApp/trident/blob/master/trident-installer/sample-input/storage-class-samples/storage-class-ontapnas-k8s1.8-mountoptions.yaml)
> Also, I've never seen the "rwx" NFS mount option, are you sure that is valid?
Those options are documented, e.g., here: https://github.com/NetAppDocs/trident/blob/main/trident-use/ontap-nas.adoc. And although the pod could only initialize and never actually start, when configured with "rwo", starting a second pod (which in our case would also only initialize because of the NFS issue) actually complains that the PVC is already in use by another pod. I'd say that part works well: with RWO the volume is only provisioned to one pod.
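If it helps untangle the two concepts: access modes like RWO/RWX are Kubernetes-level settings on the PVC, while mountOptions on the StorageClass are passed through to the NFS mount itself. A sketch of where each belongs (PVC name assumed from our setup):

```yaml
# Access modes are PVC/PV-level Kubernetes settings, not NFS mount options.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-andreas
spec:
  accessModes:
    - ReadWriteOnce   # "RWO"; "ReadWriteMany" would be "RWX"
  resources:
    requests:
      storage: 1Gi
  storageClassName: andreas-tcp   # mount options (nfsvers=3 etc.) live on this StorageClass
```

So "rwx" never belongs in mountOptions; the corresponding mount(8) option for a writable mount is simply "rw", which is the default anyway.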
> All storage access will always be at the node level. The mount is done by the worker node, then passed on to the container as a bind mount. Therefore no need to install any NFS packages inside the pod (though technically you could do that and access NFS in this way - but this is definitely not the K8s model for storage access!).
This clarification is very helpful (at least to me) for getting a better picture of how this should fundamentally work. What we did see is that Trident reports the NFS check on the nodes as passing. But while verifying the GCP/GKE cluster details, we see that next to the default "Container-Optimized OS (COS)" that we use, there are other node image options, like Ubuntu with support for NFS 😨
And on the cluster where we did a test with Filestore, NFS CSI drivers had been added to the COS nodes.
> Unfortunately "exit status 32" is a pretty generic NFS error code. Usually something in the networking/connectivity area but hard to tell more from that code alone. You could SSH into the worker node and try to manually mount with verbose flags, that should give you more details. The output from the pod events gives you the full mount path (e.g. [redacted]:/trident_pvc_2fa53c4e_c2ee_4c21_9601_81ab14a0922b) so something like
Unfortunately we don't have the option to SSH into the nodes because of COS. That's how we found out the above, namely that the Ubuntu image supplies NFS support and probably also lets us log in to the nodes. We wanted to try that today, but things went differently; we'll do so soon.
Thanks for the quick response, and thanks for the valuable hints and thoughts @wonderland!
We've removed the "rwx", changing

```
mountOptions: ["rwx", "nfsvers=3", "proto=tcp"]
```

to

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: andreas-tcp
provisioner: csi.trident.netapp.io
mountOptions: ["nfsvers=3", "proto=tcp"]
parameters:
  backendType: "ontap-nas"
```
For some reason we (specifically I) read and misinterpreted the "rwx". Now pod creation proceeds, the PVC mounts, and once on the shell we can see:
```
root@pv-pod:/usr/share/nginx/html# mount | grep nfs
[redacted]:/trident_pvc_9089df00_f6cd_470b_b1c1_5f37da50c6a0 on /usr/share/nginx/html type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=[redacted],mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=[redacted])
```
or
```
root@pv-pod:/usr/share/nginx/html# df -h /usr/share/nginx/html
Filesystem                                                    Size  Used Avail Use% Mounted on
[redacted]:/trident_pvc_9089df00_f6cd_470b_b1c1_5f37da50c6a0  1.0G  320K  1.0G   1% /usr/share/nginx/html
```
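For completeness, the pod consuming the PVC presumably looks something like this (a sketch; the pod name, image, mount path, and claim name are taken from the outputs above, the rest is assumed):

```yaml
# Sketch of the consuming pod; only names visible in the thread are reused.
apiVersion: v1
kind: Pod
metadata:
  name: pv-pod
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - name: html
          mountPath: /usr/share/nginx/html   # where the NFS volume shows up in the pod
  volumes:
    - name: html
      persistentVolumeClaim:
        claimName: pvc-andreas
```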
Thank you once again, we're happy having the share connected 💯
Describe the bug
Following a fresh Trident installation and backend creation according to the NetApp Trident Backend Configuration, we had an issue with volume creation and mounting on the NetApp backend (CVO).
Despite the backend's successful creation, the expected volume is never created on the backend. Instead, a "magic" volume appears mounted in the pod as the PVC, which is of type tmpfs rather than the anticipated shared volume from the NetApp backend.
The Trident operator, controller, and node pods fail to bind to the "ontap-nas" storage class and do not create a volume on the NetApp backend upon PV or PVC creation. Although the PV, PVC, and pod are successfully created, the NFS shared NetApp volume is not displayed in the pod.
Trident-controller pod logs show errors and warnings potentially related to this issue.
Environment
To Reproduce

```
kubectl apply -f merged_manifests.yml -n trident
```
Expected behavior
Upon successful creation of the Trident backend, it is expected that a volume would be created on the NetApp backend (CVO) that corresponds to any PV or PVC created in the Kubernetes cluster. The Trident operator, controller, and node pods should bind to the "ontap-nas" storage class and initiate the volume creation on the backend.
Once the PV, PVC, and pod are created, the NFS shared NetApp volume should be mounted in the pod and be visible when inspecting the pod's volume details. Thus, the expected behavior is a seamless creation and mounting of NetApp volumes in the Kubernetes pods through the Trident operator.
Storage Class:
Persistent Volume (PV)
Persistent Volume Claim (PVC)
the mount in pv-pod (the pod)
Additional context
We have errors, and some additional information from the trident-controller pod:
```
csi-attacher I1205 10:46:30.285060 1 connection.go:201] GRPC error: <nil>
csi-attacher I1205 10:47:30.287315 1 connection.go:201] GRPC error: <nil>
csi-attacher I1205 10:48:30.294277 1 connection.go:201] GRPC error: <nil>
trident-main time="2023-12-05T10:48:31Z" level=error msg="Trident-ACP version is empty." error="<nil>" logLayer=rest_frontend requestID=19bcda76-3268-4fa1-b7e1-6f2ae7ff0833 requestSource=REST workflow="core=version"
csi-attacher W1205 10:48:30.294433 1 csi_handler.go:173] Failed to repair volume handle for driver pd.csi.storage.gke.io: node handle has wrong number of elements; got 1, wanted 6 or more
csi-attacher I1205 10:48:30.294442 1 csi_handler.go:740] Found NodeID <redacted> in CSINode <redacted>
csi-attacher W1205 10:48:30.294461 1 csi_handler.go:173] Failed to repair volume handle for driver pd.csi.storage.gke.io: node handle has wrong number of elements; got 1, wanted 6 or more
trident-main time="2023-12-05T10:48:31Z" level=debug msg="REST API call received." Duration="10.192µs" Method=GET RequestURL=/trident/v1/version Route=GetVersion logLayer=rest_frontend requestID=19bcda76-3268-4fa1-b7e1-6f2ae7ff0833 requestSource=REST workflow="trident_rest=logger"
trident-main time="2023-12-05T10:48:31Z" level=debug msg="Getting Trident-ACP version." logLayer=rest_frontend requestID=19bcda76-3268-4fa1-b7e1-6f2ae7ff0833 requestSource=REST workflow="core=version"
trident-main time="2023-12-05T10:48:31Z" level=warning msg="ACP is not enabled." logLayer=rest_frontend requestID=19bcda76-3268-4fa1-b7e1-6f2ae7ff0833 requestSource=REST workflow="core=version"
trident-main time="2023-12-05T10:48:31Z" level=error msg="Trident-ACP version is empty." error="<nil>" logLayer=rest_frontend requestID=19bcda76-3268-4fa1-b7e1-6f2ae7ff0833 requestSource=REST workflow="core=version"
trident-main time="2023-12-05T10:48:31Z" level=debug msg="REST API call complete." Duration="978.427µs" Method=GET RequestURL=/trident/v1/version Route=GetVersion StatusCode=200 logLayer=rest_frontend requestID=19bcda76-3268-4fa1-b7e1-6f2ae7ff0833 requestSource=REST workflow="trident_rest=logger"
trident-main time="2023-12-05T10:48:37Z" level=debug msg="Node updated in cache." logLayer=csi_frontend name=<redacted> requestID=7a27627a-c480-446c-a35a-9addc41b1692 requestSource=Kubernetes workflow="node=update"
trident-main time="2023-12-05T10:48:37Z" level=debug msg="Node updated in cache." logLayer=csi_frontend name=<redacted> requestID=d33b8f54-6142-4e3d-bc0b-f9458a7c37ec requestSource=Kubernetes workflow="node=update"
trident-main time="2023-12-05T10:48:37Z" level=warning msg="K8S helper has no record of the updated storage class; instead it will try to create it." logLayer=csi_frontend name=ontapnasudp parameters="map[backendType:ontap-nas]" provisioner=csi.trident.netapp.io requestID=c2f7f1ba-2333-4d73-b142-ac072a9ef5fd requestSource=Kubernetes workflow="storage_class=update"
trident-main time="2023-12-05T10:48:37Z" level=debug msg="Node updated in cache." logLayer=csi_frontend name=gke-e-infra-gke-e-infra-gke-b083af15-zn8s requestID=86a4ae35-eeb7-4340-a415-929d93b662cf requestSource=Kubernetes workflow="node=update"
trident-main time="2023-12-05T10:48:37Z" level=debug msg="Node updated in cache." logLayer=csi_frontend name=gke-e-infra-gke-e-infra-gke-88607822-luiw requestID=8ea3a1cc-0807-4ae3-893e-ddb8f4ed4766 requestSource=Kubernetes workflow="node=update"
trident-main time="2023-12-05T10:48:37Z" level=warning msg="K8S helper could not add a storage class: object is being deleted: tridentstorageclasses.trident.netapp.io \"ontapnasudp\" already exists" logLayer=csi_frontend name=ontapnasudp parameters="map[backendType:ontap-nas]" provisioner=csi.trident.netapp.io requestID=c2f7f1ba-2333-4d73-b142-ac072a9ef5fd requestSource=Kubernetes workflow="storage_class=update"
trident-main time="2023-12-05T10:48:49Z" level=debug msg="Node updated in cache." logLayer=csi_frontend name=<redacted> requestID=e3db19b3-4afd-441d-85dd-37022ed9831b requestSource=Kubernetes workflow="node=update"
```