Closed: zbsarashki closed this issue 2 years ago
@zbsarashki Can you check if the NODE_IP env var is set correctly on the dcgm-exporter pod? Also, can you try adding hostNetwork: true to both the dcgm and dcgm-exporter DaemonSets and verify?
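Not from the thread, but a sketch of how both suggestions could be checked; the pod and DaemonSet names are taken from the logs and gpu-operator defaults above and may differ in your cluster (also note the operator may revert manual DaemonSet patches):

```shell
# Print NODE_IP as seen inside the exporter pod (pod name is an example; use your own)
kubectl exec -n gpu-operator-resources nvidia-dcgm-exporter-dkbtm -- printenv NODE_IP

# Add hostNetwork: true to both DaemonSets via a merge patch
for ds in nvidia-dcgm nvidia-dcgm-exporter; do
  kubectl patch daemonset "$ds" -n gpu-operator-resources \
    --type merge -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
done
```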
NODE_IP points to the node (management interface). The result of setting hostNetwork: true is identical to the failure case:
[sysadmin@controller-0 debug(keystone_admin)]$ kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-dkbtm
time="2021-12-01T23:54:01Z" level=info msg="Starting dcgm-exporter"
time="2021-12-01T23:54:01Z" level=info msg="Attemping to connect to remote hostengine at abcd:204::2:5555"
time="2021-12-01T23:54:06Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
@zbsarashki we will look into this with IPv6. In the meantime, please run kubectl edit clusterpolicy and set dcgm.enabled=false. With this, dcgm-exporter will use its embedded hostengine rather than a separate pod.
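For reference, a non-interactive equivalent of that edit (a sketch; "cluster-policy" is the usual gpu-operator ClusterPolicy name, which is an assumption here and can be confirmed with kubectl get clusterpolicy):

```shell
# Disable the standalone DCGM pod so dcgm-exporter falls back to its embedded hostengine
# "cluster-policy" is assumed to be the ClusterPolicy resource name in this cluster
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"dcgm":{"enabled":false}}}'
```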
The logs from nvidia-dcgm-exporter are exactly what I saw in #294. In my case it ended up being a bad iptables rule that was silently dropping packets on my GPU node. I suggest double-checking your firewall and iptables rules.
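One way to look for such a rule (a sketch; in an IPv6-only cluster the relevant rules live in ip6tables, and the per-rule counters are what reveal silently dropped traffic):

```shell
# List IPv6 filter rules with packet/byte counters; watch for DROP/REJECT lines
# whose counters keep climbing while dcgm-exporter retries its connection
sudo ip6tables -L -v -n --line-numbers

# Check the NAT table as well, in case traffic is rewritten before it is dropped
sudo ip6tables -t nat -L -v -n
```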
Hello,
Are nvidia-dcgm and nvidia-dcgm-exporter expected to work in a strictly IPv6 environment?
Thanks, Babak
gpu-operator version 1.8.1

Failure with nvidia-dcgm-exporter:

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-nnmpn
time="2021-11-30T15:37:19Z" level=info msg="Starting dcgm-exporter"
time="2021-11-30T15:37:19Z" level=info msg="Attemping to connect to remote hostengine at abcd:204::2:5555"
time="2021-11-30T15:37:24Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
And checking the dcgm we have:

$ kubectl exec -it -n gpu-operator-resources nvidia-dcgm-l8b2l -c nvidia-dcgm-ctr -- /usr/bin/ps -ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    0:00 /usr/bin/nv-hostengine -n -b 0.0.0.0
   65 pts/0    Rs+    0:00 /usr/bin/ps -ax
And:

$ kubectl get pods -n gpu-operator-resources nvidia-dcgm-l8b2l --template={{.status.podIP}}; echo ''
abcd:206::8e22:765f:6121:eb68
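To confirm whether the hostengine port is reachable over IPv6 from the exporter pod, a quick probe along these lines might help (a sketch; nc is assumed to be available in the exporter image, which may not hold, and the pod name and IPv6 address are the ones shown above):

```shell
# -z: connect-only scan, -v: verbose, -w5: 5-second timeout
# A "succeeded"/"open" result means the port is reachable over IPv6
kubectl exec -n gpu-operator-resources nvidia-dcgm-exporter-nnmpn -- \
  sh -c 'nc -zv -w5 abcd:206::8e22:765f:6121:eb68 5555'
```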