NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

nvidia-dcgm-exporter error in an IPv6-only environment #288

Closed zbsarashki closed 2 years ago

zbsarashki commented 2 years ago

Hello,

Are nvidia-dcgm and nvidia-dcgm-exporter expected to work in a strictly IPv6-only environment?

Thanks, Babak

gpu-operator version 1.8.1. Failure with nvidia-dcgm-exporter:

```
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-nnmpn
time="2021-11-30T15:37:19Z" level=info msg="Starting dcgm-exporter"
time="2021-11-30T15:37:19Z" level=info msg="Attemping to connect to remote hostengine at abcd:204::2:5555"
time="2021-11-30T15:37:24Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
```

And checking the dcgm we have:

```
$ kubectl exec -it -n gpu-operator-resources nvidia-dcgm-l8b2l -c nvidia-dcgm-ctr -- /usr/bin/ps -ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    0:00 /usr/bin/nv-hostengine -n -b 0.0.0.0
   65 pts/0    Rs+    0:00 /usr/bin/ps -ax
```

And:

```
$ kubectl get pods -n gpu-operator-resources nvidia-dcgm-l8b2l --template={{.status.podIP}}; echo ''
abcd:206::8e22:765f:6121:eb68
```

shivamerla commented 2 years ago

@zbsarashki Can you check whether the NODE_IP env var is set correctly on the dcgm-exporter pod? Also, can you try adding hostNetwork: true to both the dcgm and dcgm-exporter DaemonSets and verify?
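
A rough sketch of both checks (the pod and DaemonSet names below are taken from this thread; whether printenv exists in the container, and whether the operator reconciles a manual patch away, are assumptions):

```
# Inspect the NODE_IP env var injected into the dcgm-exporter container
kubectl exec -n gpu-operator-resources nvidia-dcgm-exporter-nnmpn -- printenv NODE_IP

# Temporarily switch both DaemonSets to host networking for the test
kubectl patch daemonset -n gpu-operator-resources nvidia-dcgm \
  --type merge -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
kubectl patch daemonset -n gpu-operator-resources nvidia-dcgm-exporter \
  --type merge -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
```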

zbsarashki commented 2 years ago

NODE_IP points to the node (management interface). The result of setting hostNetwork: true is identical to the failure case:

```
[sysadmin@controller-0 debug(keystone_admin)]$ kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-dkbtm
time="2021-12-01T23:54:01Z" level=info msg="Starting dcgm-exporter"
time="2021-12-01T23:54:01Z" level=info msg="Attemping to connect to remote hostengine at abcd:204::2:5555"
time="2021-12-01T23:54:06Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
```

shivamerla commented 2 years ago

@zbsarashki We will look into this with IPv6. In the meantime, please run kubectl edit clusterpolicy and change dcgm.enabled=false. With this, dcgm-exporter will use the embedded hostengine rather than a separate dcgm pod.
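
For reference, a sketch of that workaround, assuming the default ClusterPolicy instance name cluster-policy used by the Helm chart:

```
# Interactive edit: set spec.dcgm.enabled to false
kubectl edit clusterpolicy cluster-policy

# Or apply the same change non-interactively
kubectl patch clusterpolicy cluster-policy \
  --type merge -p '{"spec":{"dcgm":{"enabled":false}}}'
```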

dbugit commented 2 years ago

The logs from nvidia-dcgm-exporter are exactly what I saw in #294. In my case it ended up being a bad iptables rule that was causing packets on my GPU node to be dropped silently. I suggest double-checking your firewall and iptables rules.
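
A generic way to look for rules that silently drop IPv6 traffic on the GPU node (not the specific rule from #294, just a starting point):

```
# List all IPv6 filter rules with packet counters; non-zero counts on
# DROP/REJECT rules point at traffic being discarded
sudo ip6tables -L -n -v --line-numbers

# Quick scan of the full rule set for drop/reject targets,
# including chains added by kube-proxy or the CNI plugin
sudo ip6tables -S | grep -iE 'drop|reject'
```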