phhutter opened 8 months ago
We had the same issue after updating from OpenShift 4.13.34 to 4.14.19.
Trident version: 24.02.0
Kernel: 5.14.0-284.52.1.el9_2.x86_64
Container runtime: cri-o 1.27.4-6.1.rhaos4.14.gitd09e4c0.el9
Kubernetes version: v1.27.11+749fe1d
Kubernetes orchestrator: OpenShift 4.14.19
OS: CoreOS/RedHat 9.2
One pod was stuck in the ContainerCreating state. The events were:
Warning FailedMount 54s (x8 over 15m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory
After I restarted the pod of the DaemonSet trident-node-linux that was running on the same node as the failing pod, the error message was still in the log. But after I restarted the failing pod, the problem was solved. The JSON file was recreated automatically.
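For reference, the restart sequence in command form; the pod, node, and namespace names are placeholders, and this assumes Trident is installed in the trident namespace:

oc -n trident get pods -o wide | grep trident-node-linux | grep <NODE>   # find the node pod on the affected node
oc -n trident delete pod trident-node-linux-<xxxxx>                      # DaemonSet recreates it; this alone did not clear the error
oc -n <NAMESPACE> delete pod <POD>                                       # after the failing pod was recreated, the tracking JSON came back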
Hey @phhutter and @Xavier-0965, I'm looking into this bug. If possible, can you upload logs of both the controller and node pods with the log level set to debug, and also share any steps to reproduce this?
To set log-level to debug you can use the following command:
tridentctl update logconfig --log-level debug -n trident
Or set it to trace:
tridentctl update logconfig --log-level trace -n trident
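To collect the logs afterwards, something like the following should work (the controller deployment and container names may differ slightly depending on how Trident was installed, e.g. trident-controller vs. trident-csi):

oc -n trident logs deploy/trident-controller -c trident-main > trident-controller.log
oc -n trident logs trident-node-linux-<xxxxx> -c trident-main > trident-node.log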
Thanks!
Hey @shashank-netapp
I've fixed all affected PVCs by using the workaround mentioned in my initial comment, which now makes it nearly impossible to gather the requested debug logs. I did this because NetApp support told me that you already have a solution for it and that the fix will be delivered in the next release. ;-) Of course, I was puzzled, because whenever I asked about the root cause, no answer was ever given. I will probably rely on GitHub issues instead of NetApp support cases in the future.
Unfortunately, it doesn't seem reproducible to me. I've noticed the same issue on 4 clusters out of 30.
What surprises me is that for @Xavier-0965, a restart of the DaemonSet supposedly solved the issue. I also tried this when the problem occurred: I restarted the controller, operator, and DaemonSet without any luck. So it could also be that the problem mentioned by Xavier is unrelated.
Cheers
Hi @shashank-netapp, I had only one persistentVolume with that problem.
As the problem no longer occurs, I do not have any current logs. But at the time of the problem, I had the following events for the pod using the PV (oc describe pod POD). For corporate reasons, I have replaced the namespace, pod, and node names with placeholder text (e.g. NAMESPACE).
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 162m default-scheduler Successfully assigned NAMESPACE/POD to NODENAME
Normal SuccessfulAttachVolume 162m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a"
Warning NetworkNotReady 161m (x2 over 161m) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Warning FailedMount 161m (x23 over 161m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.trident.netapp.io not found in the list of registered CSI drivers
Warning FailedMount 11m (x83 over 161m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory
Warning FailedMount 87s (x71 over 159m) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Then I first deleted the pod to see if that would solve the problem. It did not. Here are the (same) events:
$ oc describe pod $pod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned NAMESPACE/POD to NODENAME
Warning FailedMount 109s (x6 over 13m) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Warning FailedMount 54s (x8 over 15m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory
Here is part of the log of trident-node-linux-kcm6s:
(On 16 April 2024 around 02:00 UTC, OpenShift was upgraded.)
time="2024-04-16T02:11:53Z" level=info msg="GRPC call: /csi.v1.Node/NodePublishVolume" audit=csi logLayer=csi_frontend requestID=7e86e260-e757-44c4-b7f7-371210dfc6c6 requestSource=CSI
time="2024-04-16T02:11:53Z" level=warning msg="Could not find JSON file: /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json." error="open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory" filepath=/var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json logLayer=csi_frontend requestID=7e86e260-e757-44c4-b7f7-371210dfc6c6 requestSource=CSI workflow="node_server=publish"
time="2024-04-16T02:11:53Z" level=error msg="GRPC error: rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory" logLayer=csi_frontend requestID=7e86e260-e757-44c4-b7f7-371210dfc6c6 requestSource=CSI
time="2024-04-16T02:11:53Z" level=info msg="GRPC call: /csi.v1.Node/NodeUnpublishVolume" audit=csi logLayer=csi_frontend requestID=8d38f4e5-63e0-4b98-aa90-dfe3200ca61e requestSource=CSI
time="2024-04-16T02:11:53Z" level=info msg="target path (/var/lib/kubelet/pods/e97fbc6a-1ed5-41a9-a175-a47fa4d4cfa8/volumes/kubernetes.io~csi/pvc-c476790e-1b18-4080-967d-99af41b1122a/mount) not found; volume is not mounted." Method=NodeUnpublishVolume Type=CSI_Node logLayer=csi_frontend requestID=8d38f4e5-63e0-4b98-aa90-dfe3200ca61e requestSource=CSI workflow="node_server=unpublish"
...
Then I deleted the pod trident-node-linux-kcm6s to restart it.
But the error message "no such file or directory" was still there.
After I restarted the pod $pod (which was consuming the PersistentVolume), it worked. The PV was successfully mounted.
Regards Xavier
Hi,
first of all: Thank you for providing the fix in the first post.
We are on 24.02.0 and OpenShift 4.14.20 and had the same issue for one PV with two pods. Restarting (deleting) the pods did not help at all. Creating the JSON file by hand, empty, did not work. Copying it without removing the "publishedTargetPaths" did not work either.
Describe the bug
I have encountered an issue after upgrading from OpenShift 4.12.x to OpenShift 4.14.x. Following the upgrade, as the updated nodes were brought back online, I noticed that certain NFS volumes could not be mounted, leaving the corresponding application pods in a "Pending" state. Below, I have attached the log from a Linux Trident DaemonSet pod, which indicates that Trident is looking for a status/tracking file under "/var/lib/trident/tracking/" for the PVC to be mounted, but cannot find it. This issue only affects some PVCs (5-10% of all PVCs); other PVCs from the same backend storage were mounted without any issues.
As a workaround, I manually copied the missing JSON tracking file from another remaining CoreOS node and deleted the "publishedTargetPaths" value. This temporarily allowed Trident to remount the volume.
Steps to temporarily fix it (a rough command sketch follows below):
1. Find and copy the tracking file /var/lib/trident/tracking/pvc-xxx.json from a remaining worker node.
2. Remove the value from publishedTargetPaths.
3. Let the corresponding Linux Trident node pod reconcile its value.
I also tried to delete the Operator/Controller and DaemonSet before manually creating the file, hoping this would resolve the issue. Unfortunately, this did not work.
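A rough command sketch of the copy-and-edit workaround above. Node names are placeholders, SSH access to the CoreOS nodes as the core user is assumed, and the exact shape of publishedTargetPaths in the tracking file is an assumption (adjust if it is not a JSON object):

PVC=pvc-c476790e-1b18-4080-967d-99af41b1122a
FILE=/var/lib/trident/tracking/${PVC}.json

# copy the tracking file from a node where it still exists
scp core@<healthy-node>:${FILE} ./${PVC}.json

# empty the publishedTargetPaths value
jq '.publishedTargetPaths = {}' ./${PVC}.json > ./${PVC}.fixed.json

# place it on the affected node; writing to /var/lib/trident needs root
scp ./${PVC}.fixed.json core@<broken-node>:/tmp/${PVC}.json
ssh core@<broken-node> "sudo mv /tmp/${PVC}.json ${FILE}"

# the trident-node-linux pod on that node should then reconcile the file
# and the volume can be mounted again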
Here is the log from the Trident DaemonSet pod:
Error-Message:
Environment
Provide accurate information about the environment to help us reproduce the issue.
We have been using Trident for 3-4 years now and have never encountered this error before.
-- EDIT -- We also face the same issue with Trident Version 24.02.0.