phhutter opened 8 months ago
We had the same issue after updating from OpenShift 4.13.34 to 4.14.19.
Trident version: 24.02.0
Kernel: 5.14.0-284.52.1.el9_2.x86_64
Container runtime: cri-o 1.27.4-6.1.rhaos4.14.gitd09e4c0.el9
Kubernetes version: v1.27.11+749fe1d
Kubernetes orchestrator: OpenShift 4.14.19
OS: CoreOS/RedHat 9.2
One pod was stuck in the ContainerCreating state. The events were:
Warning FailedMount 54s (x8 over 15m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory
After I restarted the pod of the DaemonSet trident-node-linux that was running on the same node as the failing pod, the error message was still in the log. But after I restarted the failing pod, the problem was solved. The JSON file was recreated automatically.
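For reference, the restart sequence in command form; the pod, node, and namespace names are placeholders, and this assumes Trident is installed in the trident namespace:

oc -n trident get pods -o wide | grep trident-node-linux | grep <NODE>   # find the node pod on the affected node
oc -n trident delete pod trident-node-linux-<xxxxx>                      # DaemonSet recreates it; this alone did not clear the error
oc -n <NAMESPACE> delete pod <POD>                                       # after the failing pod was recreated, the tracking JSON came back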
Hey @phhutter and @Xavier-0965, I'm looking into this bug. If possible, can you upload logs of both the controller and node pods with the log level set to debug, and also share any steps to reproduce this?
To set log-level to debug you can use the following command:
tridentctl update logconfig --log-level debug -n trident
Or set it to trace:
tridentctl update logconfig --log-level trace -n trident
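To collect the logs afterwards, something like the following should work (the controller deployment and container names may differ slightly depending on how Trident was installed, e.g. trident-controller vs. trident-csi):

oc -n trident logs deploy/trident-controller -c trident-main > trident-controller.log
oc -n trident logs trident-node-linux-<xxxxx> -c trident-main > trident-node.log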
Thanks!
Hey @shashank-netapp
I've fixed all affected PVCs by using the workaround mentioned in my initial comment, which now makes it nearly impossible to gather the requested debug logs. I did this because NetApp support told me that you already have a solution for it and that the fix will be delivered in the next release. ;-) Of course, I was puzzled, because whenever I asked about the root cause, no answer was ever given. I will probably rely on GitHub issues instead of NetApp support cases in the future.
Unfortunately, it doesn't seem reproducible to me. I've noticed the same issue on 4 clusters out of 30.
What surprises me is that for @Xavier-0965, a restart of the DaemonSet supposedly solved the issue. I also tried this when the problem occurred: I restarted the controller, operator, and DaemonSet without any luck. So it could also be that the problem mentioned by Xavier is unrelated.
Cheers
Hi @shashank-netapp, I had only one persistentVolume with that problem.
As the problem no longer occurs, I do not have any current logs. But at the time of the problem, I had the following events for the pod using the PV (oc describe pod POD). For corporate reasons, I have replaced the namespace, pod, and node names with placeholder text (e.g. NAMESPACE).
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 162m default-scheduler Successfully assigned NAMESPACE/POD to NODENAME
Normal SuccessfulAttachVolume 162m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a"
Warning NetworkNotReady 161m (x2 over 161m) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Warning FailedMount 161m (x23 over 161m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.trident.netapp.io not found in the list of registered CSI drivers
Warning FailedMount 11m (x83 over 161m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory
Warning FailedMount 87s (x71 over 159m) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Then I first deleted the pod to see if that would solve the problem. It did not. Here are the (same) events:
$ oc describe pod $pod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned NAMESPACE/POD to NODENAME
Warning FailedMount 109s (x6 over 13m) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Warning FailedMount 54s (x8 over 15m) kubelet MountVolume.SetUp failed for volume "pvc-c476790e-1b18-4080-967d-99af41b1122a" : rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory
Here is part of the log of trident-node-linux-kcm6s:
(On 16 April 2024 around 02:00 UTC, OpenShift was upgraded.)
time="2024-04-16T02:11:53Z" level=info msg="GRPC call: /csi.v1.Node/NodePublishVolume" audit=csi logLayer=csi_frontend requestID=7e86e260-e757-44c4-b7f7-371210dfc6c6 requestSource=CSI
time="2024-04-16T02:11:53Z" level=warning msg="Could not find JSON file: /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json." error="open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory" filepath=/var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json logLayer=csi_frontend requestID=7e86e260-e757-44c4-b7f7-371210dfc6c6 requestSource=CSI workflow="node_server=publish"
time="2024-04-16T02:11:53Z" level=error msg="GRPC error: rpc error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-c476790e-1b18-4080-967d-99af41b1122a.json: no such file or directory" logLayer=csi_frontend requestID=7e86e260-e757-44c4-b7f7-371210dfc6c6 requestSource=CSI
time="2024-04-16T02:11:53Z" level=info msg="GRPC call: /csi.v1.Node/NodeUnpublishVolume" audit=csi logLayer=csi_frontend requestID=8d38f4e5-63e0-4b98-aa90-dfe3200ca61e requestSource=CSI
time="2024-04-16T02:11:53Z" level=info msg="target path (/var/lib/kubelet/pods/e97fbc6a-1ed5-41a9-a175-a47fa4d4cfa8/volumes/kubernetes.io~csi/pvc-c476790e-1b18-4080-967d-99af41b1122a/mount) not found; volume is not mounted." Method=NodeUnpublishVolume Type=CSI_Node logLayer=csi_frontend requestID=8d38f4e5-63e0-4b98-aa90-dfe3200ca61e requestSource=CSI workflow="node_server=unpublish"
...
Then I deleted the pod trident-node-linux-kcm6s to restart it.
But the error message "no such file or directory" was still there.
After I restarted the pod $pod (which was consuming the PersistentVolume), it worked. The PV was successfully mounted.
Regards Xavier
Hi,
first of all: Thank you for providing the fix in the first post.
We are on 24.02.0 and OpenShift 4.14.20 and had the same issue for one PV with two pods. Restarting (deleting) the pods did not help at all. Creating the JSON file by hand, empty, did not work. Copying it without removing the "publishedTargetPaths" did not work either.
Describe the bug
I have encountered an issue after upgrading from OpenShift 4.12.x to OpenShift 4.14.x. Following the upgrade, as the updated nodes were brought back online, I noticed that certain NFS volumes could not be mounted, leaving the corresponding application pods in a "Pending" state. Below, I have attached the log from a Linux Trident DaemonSet pod, which indicates that Trident is looking for a status/tracking file under "/var/lib/trident/tracking/" for the PVC to be mounted, but cannot find it. This issue only affects some PVCs (5-10% of all PVCs); other PVCs from the same backend storage were mounted without any issues.
As a workaround, I manually copied the missing JSON tracking file from another remaining CoreOS node and deleted the "publishedTargetPaths" value. This temporarily allowed Trident to remount the volume.
Steps to temporarily fix it (a rough command sketch follows below):
1. Find and copy the tracking file /var/lib/trident/tracking/pvc-xxx.json from a remaining worker node.
2. Remove the value from publishedTargetPaths.
3. Let the corresponding Linux Trident node pod reconcile its value.
I also tried to delete the Operator/Controller and DaemonSet before manually creating the file, hoping this would resolve the issue. Unfortunately, this did not work.
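A rough command sketch of the copy-and-edit workaround above. Node names are placeholders, SSH access to the CoreOS nodes as the core user is assumed, and the exact shape of publishedTargetPaths in the tracking file is an assumption (adjust if it is not a JSON object):

PVC=pvc-c476790e-1b18-4080-967d-99af41b1122a
FILE=/var/lib/trident/tracking/${PVC}.json

# copy the tracking file from a node where it still exists
scp core@<healthy-node>:${FILE} ./${PVC}.json

# empty the publishedTargetPaths value
jq '.publishedTargetPaths = {}' ./${PVC}.json > ./${PVC}.fixed.json

# place it on the affected node; writing to /var/lib/trident needs root
scp ./${PVC}.fixed.json core@<broken-node>:/tmp/${PVC}.json
ssh core@<broken-node> "sudo mv /tmp/${PVC}.json ${FILE}"

# the trident-node-linux pod on that node should then reconcile the file
# and the volume can be mounted again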
Here is the log from the Trident DaemonSet pod:
Error-Message:
Environment
Provide accurate information about the environment to help us reproduce the issue.
We have been using Trident for 3-4 years now and have never encountered this error before.
-- EDIT -- We also face the same issue with Trident Version 24.02.0.