NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

grpc: addrConn.createTransport failed to connect #20

Closed: syncharny closed this issue 2 years ago

syncharny commented 2 years ago

Hello,

The workers in my Helm deployment cannot seem to start up. There are three workers: two on nodes that have GPUs and one on a node that doesn't. Here is the error I find in the worker logs:

$ kubectl logs -n node-feature-discovery nfd-worker-dvrhk
INFO: 2022/02/25 05:16:42 parsed scheme: ""
INFO: 2022/02/25 05:16:42 scheme "" not registered, fallback to default scheme
2022/02/25 05:16:42 Node Feature Discovery Worker v0.6.0
2022/02/25 05:16:42 NodeName: 'my_node'
INFO: 2022/02/25 05:16:42 ccResolverWrapper: sending update to cc: {[{nfd-master:8080 0  <nil>}] <nil>}
INFO: 2022/02/25 05:16:42 ClientConn switching balancer to "pick_first"
WARNING: 2022/02/25 05:17:02 grpc: addrConn.createTransport failed to connect to {nfd-master:8080 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
WARNING: 2022/02/25 05:17:23 grpc: addrConn.createTransport failed to connect to {nfd-master:8080 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
2022/02/25 05:17:42 ERROR: failed to connect: context deadline exceeded
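
The "dial tcp: i/o timeout" could mean either that the DNS lookup for nfd-master never completes or that the TCP connection to port 8080 is silently dropped. One way to tell the two apart from inside the cluster (the nicolaka/netshoot image is just a convenient choice here, and the nfd-master name is taken from the address in the logs) is to run a throwaway debug pod in the same namespace and test name resolution and TCP reachability separately:

$ kubectl run -n node-feature-discovery netcheck --rm -it --image=nicolaka/netshoot --restart=Never -- sh
~ # nslookup nfd-master
~ # nc -zv -w 5 nfd-master 8080

If nslookup succeeds but nc times out, the problem is pod-to-pod networking or a network policy rather than the nfd-master process itself.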

The master appears happy; it has the following logs:

$ kubectl logs -n node-feature-discovery nfd-master-64996ffdfc-4lt82
2022/02/25 05:13:01 Node Feature Discovery Master v0.6.0
2022/02/25 05:13:01 NodeName: 'my_node'
2022/02/25 05:13:01 gRPC server serving on port: 8080
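
The master log only shows that the gRPC server process is listening; it doesn't show that the Service in front of it is wired up. A quick sanity check (assuming the Service is named nfd-master, as the address the worker dials suggests) is to confirm the Service exists and has an endpoint pointing at the master pod:

$ kubectl get svc,endpoints -n node-feature-discovery nfd-master

An empty ENDPOINTS column would point to a label-selector mismatch rather than a network problem.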

Any ideas on how I can troubleshoot this? I can port-forward into the master service, but I'm not sure how to verify that the gRPC server is actually running. I've also seen a log where the hostname of the master service resolved to the correct IP address, but the connection still timed out. Looking for some advice on what to look for.
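
For the port-forward check, a raw TCP probe is usually enough to show whether the server is accepting connections; grpcurl can go one step further if it is installed locally, though it only helps if the server enables gRPC reflection, which NFD may not. A sketch, again assuming the Service is named nfd-master:

$ kubectl port-forward -n node-feature-discovery svc/nfd-master 8080:8080 &
$ nc -zv localhost 8080
$ grpcurl -plaintext localhost:8080 list   # only works if the server exposes reflection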

Thanks

syncharny commented 2 years ago

Turned out to be a configuration issue with Calico.
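
The specific Calico misconfiguration isn't described here, but for anyone hitting the same symptom, a reasonable first pass is to check that the Calico pods are healthy and that no NetworkPolicy is blocking traffic to the master's port:

$ kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
$ kubectl get networkpolicy -A

Cross-node pod traffic failing while same-node traffic works is a classic sign of an encapsulation (IP-in-IP/VXLAN) or firewall mismatch in the CNI configuration.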