NetApp / trident

Storage orchestrator for containers
Apache License 2.0

PVCs fail to mount on a node but it previously worked - context deadline exceeded #922

Open rusLukasRath opened 2 months ago

rusLukasRath commented 2 months ago

Describe the bug

Trident PVCs could be mounted normally on a worker node, but after some time, for an unknown reason, they can no longer be mounted on that particular node. Pods trying to mount a Trident PVC there fail with the error message: "context deadline exceeded"

The exact same PVC can still be mounted on other worker nodes. This issue happens with all Trident PVCs, old and newly created after the issue started. Restarting the trident-node pod on said worker node does not fix the issue.
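
For anyone trying to confirm the same per-node pattern, here is a minimal sketch that groups failed-mount events by the node that reported them. It assumes the input is the JSON produced by `kubectl get events --all-namespaces -o json`, that the kubelet reports these failures with the reason `FailedMount`, and that the node name is found in `source.host`. The function name `failed_mounts_by_node` is made up for this illustration and is not part of Trident or kubectl.

```python
import json
from collections import Counter

# Error text quoted in this issue; matched case-insensitively.
ERROR_TEXT = "context deadline exceeded"


def failed_mounts_by_node(events_json: str) -> Counter:
    """Count FailedMount events containing ERROR_TEXT, grouped by node.

    Assumption: events_json is a Kubernetes List object with an "items"
    array of core/v1 Event objects (hypothetical helper, not a Trident API).
    """
    events = json.loads(events_json).get("items") or []
    counts = Counter()
    for event in events:
        if event.get("reason") != "FailedMount":
            continue
        message = (event.get("message") or "").lower()
        if ERROR_TEXT not in message:
            continue
        # Assumption: kubelet-emitted events carry the node name in source.host.
        node = (event.get("source") or {}).get("host") or "<unknown>"
        counts[node] += 1
    return counts
```

If the problem is isolated to one node as described above, the returned counter should be dominated by that node's name. For example, after saving the events to a file, `failed_mounts_by_node(open("events.json").read()).most_common()` lists the nodes from most to fewest matching events.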

Mounting the NetApp shares manually on said node works completely fine.

Environment

Provide accurate information about the environment to help us reproduce the issue.

To Reproduce

Unknown

Expected behavior

Trident PVCs should be mountable at all times.

Additional context

The cluster on which this problem occurs runs all of our GitLab Runner build jobs. Dozens of build jobs run on it simultaneously, and multiple jobs that want to mount the same Trident PVCs often start at the same time.

Attached is the log of the trident-node pod on the node, taken before we terminated it and started a new one: trident-node-linux-t5f54.txt

MallocArray commented 1 week ago

We are also starting to see this. We recently changed the SVM name in our TridentBackendConfig and things were running OK. We then upgraded to OpenShift 4.16.18, and as it restarted pods, several of them hit the same "context deadline exceeded" message and won't mount.

Not sure if it is related to the 4.16 upgrade, the fact that we updated the backend, or unrelated entirely. We are running Trident v24.06.1, installed via Helm chart.