NetApp / trident

Storage orchestrator for containers
Apache License 2.0

msg="GRPC error: rpc error: code = NotFound desc = node XXXXX was not found" #328

Closed titansmc closed 4 years ago

titansmc commented 4 years ago

Describe the bug: Following the basic example in the documentation, the volume fails to attach to the Pod.

Environment

[root@k3n trident-installer]# ./tridentctl -n trident get backend
+----------------------+----------------+--------------------------------------+--------+---------+
|         NAME         | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+----------------------+----------------+--------------------------------------+--------+---------+
| ontapnas_10.11.5.186 | ontap-nas      | 57a270cb-051a-4107-8146-1111111e7a5 | online |       2 |
+----------------------+----------------+--------------------------------------+--------+---------+

[root@k3n trident-installer]# ./tridentctl -n trident  version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 19.10.0        | 19.10.0        |
+----------------+----------------+

Docker

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:26:28 2019
  OS/Arch:          linux/amd64
  Experimental:     false

k8s version

[root@k3n trident-installer]# kubectl  version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
[root@k3n trident-installer]# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k1m.domain.com   Ready    master   28d   v1.15.5
k3n.domain.com   Ready    <none>   28d   v1.15.5
k4n.domain.com   Ready    <none>   28d   v1.15.5

To Reproduce: Follow the basic example in the documentation.

Expected behavior: The created volume is attached to the Pod.

Additional context: I also see errors related to iSCSI in the logs, which I believe we are not using.

time="2019-12-12T09:47:06Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-12-12T09:47:06Z" level=info msg="Trident bootstrapped successfully."
time="2019-12-12T09:47:06Z" level=info msg="Activating plain CSI helper frontend."
time="2019-12-12T09:47:06Z" level=info msg="Activating CSI frontend."
time="2019-12-12T09:47:06Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2019-12-12T09:47:06Z" level=error msg="Error gathering initiator names."
time="2019-12-12T09:47:06Z" level=error msg="Could not get iSCSI initiator name." error="exit status 1"

kmwm3 commented 4 years ago

I am having the same issue.

Docker version: 18.06.2-ce
K8s version: 1.16.3
Trident version: 19.10
Storage driver: ontap-nas

tridentctl logs

"Node info not found." node=<node_name>
"GRPC error: rpc error: code = NotFound desc = node <node_name> was not found"

kubectl describe pod (for the pod that's requesting the PVC)

AttachVolume.Attach failed for volume "pvc-245d157b-f450-4fed-8e0b-29affcb6d53b" : rpc error: code = NotFound desc = node <node_name> was not found

I think this may have something to do with an old install that did not clean up properly. How can we completely remove Trident to try again? I have tried clearing out the Trident entries in /var/lib/kubelet and in /var/lib/trident, but to no avail so far.

balaramesh commented 4 years ago

@titansmc and @kmwm3 can you share some more info on your k8s environment? Are you running vanilla k8s? What's the underlying OS on your underlying nodes?

titansmc commented 4 years ago

I am using CentOS 7 deployed through kubespray. Cheers.

kmwm3 commented 4 years ago

> @titansmc and @kmwm3 can you share some more info on your k8s environment? Are you running vanilla k8s? What's the underlying OS on your underlying nodes?

I am running vanilla k8s on RHEL 7.7.

teramucho commented 4 years ago

I have the same issue.

The problem is that Trident did not register all of my cluster nodes. According to the log, only some of the cluster nodes joined, so PVCs mount only on those specific nodes and fail on all the others.


time="2020-02-04T09:18:10Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2020-02-04T09:18:10Z" level=debug msg="REST API call received." duration="1.523µs" method=PUT requestID=bosjdknr0f3d5tg4cl0g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master02
time="2020-02-04T09:18:10Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master02
time="2020-02-04T09:18:10Z" level=debug msg="REST API call complete." duration=6.158862ms method=PUT requestID=bosjdknr0f3d5tg4cl0g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master02
time="2020-02-04T09:18:17Z" level=debug msg="REST API call received." duration="2.491µs" method=GET requestID=bosjdmfr0f3d5tg4cl10 route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:17Z" level=debug msg="REST API call complete." duration="161.897µs" method=GET requestID=bosjdmfr0f3d5tg4cl10 route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:34Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2020-02-04T09:18:34Z" level=debug msg="REST API call received." duration="1.538µs" method=PUT requestID=bosjdqnr0f3d5tg4cl1g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master03
time="2020-02-04T09:18:34Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master03
time="2020-02-04T09:18:34Z" level=debug msg="REST API call complete." duration=5.725727ms method=PUT requestID=bosjdqnr0f3d5tg4cl1g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master03
time="2020-02-04T09:18:58Z" level=debug msg="Storage class updated in cache." name=nfs-client parameters="map[backendType:ontap-nas snapshots:true]" provisioner=csi.trident.netapp.io
time="2020-02-04T09:19:08Z" level=debug msg="REST API call received." duration="3.05µs" method=POST requestID=bosje37r0f3d5tg4cl20 route=AddBackend uri=/trident/v1/backend
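
Editorial note: the registration events in a log like this one can be pulled out mechanically. The following is a minimal Python sketch, not part of Trident. It assumes the log has one key=value entry per line in the layout quoted in this thread, and the helper names `parse_line` and `registered_nodes` are hypothetical. It lists the nodes for which the controller logged "Added a new node.".

```python
import shlex


def parse_line(line):
    """Split one key=value log line into a dict (quoted values may contain spaces)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def registered_nodes(log_text):
    """Return node names from 'Added a new node.' entries, in order of first appearance."""
    nodes = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        fields = parse_line(line)
        if fields.get("msg") == "Added a new node." and "node" in fields:
            if fields["node"] not in nodes:
                nodes.append(fields["node"])
    return nodes


# Sample lines copied from the log in this comment (some fields trimmed).
sample = '''
time="2020-02-04T09:18:10Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master02
time="2020-02-04T09:18:17Z" level=debug msg="REST API call received." duration="2.491µs" method=GET route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:34Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master03
'''
print(registered_nodes(sample))
```

Any cluster node that never appears in this list is a candidate for the registration failure discussed in this issue.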

gnarl commented 4 years ago

@teramucho, Kubernetes calls Trident's API to add the node once it is successfully registered. If a node in the cluster isn't added to Trident then that node may not have properly registered. Check the Trident node and driver registrar sidecar logs for errors. Also, check the kubelet logs. If this doesn't resolve your issue please contact NetApp Support.

gnarl commented 4 years ago

All, a fix was just merged to address a situation where K8s DNS is not configured properly, which can lead to the error reported in this issue. Trident patches that contain the fix will be released in the near future. Thanks for your patience.

gnarl commented 4 years ago

This issue was fixed with the Trident 20.01.1 release.

presidenten commented 4 years ago

@gnarl Still got the issue on one of our clusters:

 $ tridentctl -n trident get backend
+------------------+----------------+--------------------------------------+--------+---------+
|       NAME       | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+------------------+----------------+--------------------------------------+--------+---------+
| <redacted>       | ontap-nas      | <redacted>                           | online |       1 |
+------------------+----------------+--------------------------------------+--------+---------+
$ tridentctl -n trident version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 20.01.1        | 20.01.0        |
+----------------+----------------+

Trident can't find a few of the nodes in the cluster:

time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"

Any ideas what to try to get them up and running?

These machines were correctly connected before. We have since reinstalled the cluster (as training for new ops), and now the nodes don't get added anymore.
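
Editorial note: when the controller log is long, it helps to reduce the repeated errors to a per-node count. The sketch below is an illustration only. It assumes the error lines have exactly the format quoted in this comment, and `NOT_FOUND` and `missing_nodes` are hypothetical names.

```python
import re
from collections import Counter

# Matches the "Node info not found." error quoted in this thread and captures the node name.
NOT_FOUND = re.compile(r'msg="Node info not found\." node=(\S+)')


def missing_nodes(log_text):
    """Count 'Node info not found.' errors per node name."""
    return Counter(NOT_FOUND.findall(log_text))


# Sample lines copied from the log in this comment.
sample = '''
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
'''
for node, count in sorted(missing_nodes(sample).items()):
    print(node, count)
```

The resulting node names are the ones whose registration with the Trident controller is worth checking first.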

ramancde commented 4 years ago

Is there any update on this issue? Do we have a fix?

bigg01 commented 4 years ago

I have this problem too, running OCP4: 80% of the nodes are working and the other 20% fail.

./tridentctl version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 20.04.0        | 20.04.0        |
+----------------+----------------+

Server Version: 4.4.9
Kubernetes Version: v1.17.1+912792b

The node is missing because its Trident object was not created (checked with "oc get tridentnode").
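
Editorial note: the check described here amounts to a set difference between the cluster's node names and the names of the Trident node objects. A minimal sketch follows, assuming the Trident node objects carry the same names as the Kubernetes nodes (an assumption, though consistent with the node names in the logs in this thread). The function name `unregistered` and the sample node names are hypothetical.

```python
def unregistered(kubernetes_nodes, trident_nodes):
    """Return Kubernetes node names that have no matching Trident node object."""
    return sorted(set(kubernetes_nodes) - set(trident_nodes))


# Hypothetical names; in practice both lists would be copied from the cluster,
# for example from the node list and from the "oc get tridentnode" output.
kubernetes_nodes = ["worker-1", "worker-2", "worker-3", "worker-4", "worker-5"]
trident_nodes = ["worker-1", "worker-2", "worker-3", "worker-4"]
print(unregistered(kubernetes_nodes, trident_nodes))
```

With the sample lists, only `worker-5` lacks a Trident node object, which matches the "80% work, 20% fail" pattern reported here.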

gnarl commented 4 years ago

Hi @presidenten, @ramancde, and @bigg01 we've investigated the issue and have not been able to reproduce it. If you see the issue again please contact NetApp support and provide Trident logs so that we can determine what is causing the issue.

torirevilla commented 4 years ago

There are two likely scenarios why Trident does not find a Kubernetes node. It can be because of a networking issue within Kubernetes or a DNS issue. The Trident node daemonset that runs on each Kubernetes node must be able to communicate with the Trident controller to register the node with Trident. If networking changes occurred after Trident was installed this problem may only be observed with new Kubernetes nodes that are added to the cluster.

khatrig commented 4 years ago

> There are two likely scenarios why Trident does not find a Kubernetes node. It can be because of a networking issue within Kubernetes or a DNS issue. The Trident node daemonset that runs on each Kubernetes node must be able to communicate with the Trident controller to register the node with Trident. If networking changes occurred after Trident was installed this problem may only be observed with new Kubernetes nodes that are added to the cluster.

This matches the kind of issue I am facing. Only newly added nodes won't register with Trident. I tried restarting the Trident pods and removing and re-adding the impacted nodes, but nothing helps. There have been no networking changes on the cluster, and I don't see any networking or DNS related issues on the cluster.

Any pointers on how I can investigate this further?

oleimann commented 4 years ago

The same error about not finding the node (not registered with the Trident controller) seems to happen with K8s 1.17 and Trident 20.07 when the Kubernetes Autoscaler adds a node to bring a pod in. As a consequence, the PV for the pod doesn't get attached and the Pod stays Pending. Do nodes in the "free pool" need to be prepared with Trident somehow, so that the daemon is available when the node starts up and it can register?

gnarl commented 4 years ago

@khatrig and @oleimann,

As indicated above we haven't been able to reproduce this issue yet. Please open a case with NetApp support so that we can collect additional information.

To open a case with NetApp, please go to https://mysupport.netapp.com/site/.

khatrig commented 4 years ago

In my case, it turned out to be an issue with DNS on some nodes: the trident-csi pod running on those nodes could not resolve the trident-csi.trident service and hence could not register the node.
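
Editorial note: a name-resolution check of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a Trident tool: it assumes a Python interpreter is available somewhere that shares the affected pod's DNS configuration, and `can_resolve` is a hypothetical helper. The service name `trident-csi.trident` is the one mentioned in this comment.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address.

    The resolver argument is injectable so the function can be tested offline.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        return False


# Example call (performs a real DNS lookup, so it is left commented out):
# print(can_resolve("trident-csi.trident"))
```

If the call returns False on the failing nodes but True on the working ones, that points to the per-node DNS problem described here rather than to Trident itself.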

gnarl commented 4 years ago

@khatrig thanks for updating this issue.

gnarl commented 4 years ago

For everyone who encountered this issue: it was determined that either a DNS or a networking issue kept the Trident node DaemonSet from registering with the Trident controller. Commit 8e51987 improves the Info log message to help Trident users resolve this registration issue.