Hey,
Going through the live system configuration, I noticed that `gpu-operator-node-feature-discovery-worker-conf` contains an incorrect device class whitelist:
According to the PCI-SIG specifications, base class `03` is Display controller; within class `03`, subclass `00` is VGA-compatible controller and subclass `02` is 3D controller. Class `02` is Network controller: an empty subclass matches any subclass, subclass `00` is Ethernet controller, and subclass `07` is InfiniBand controller.
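The whitelist matching described above can be sketched in a few lines of Python. The whitelist below is a hypothetical illustration of the misconfiguration, not necessarily the operator's exact default:

```python
def matches_whitelist(device_class: str, whitelist: list[str]) -> bool:
    """Prefix match as described above: a two-digit entry such as "02"
    matches any subclass of that base class, while a four-digit entry
    such as "0302" matches one class/subclass pair exactly."""
    return any(device_class.startswith(entry) for entry in whitelist)

# Hypothetical whitelist illustrating the misconfiguration described above:
whitelist = ["02", "03", "0200", "0207"]

print(matches_whitelist("0200", whitelist))  # Ethernet controller: True
print(matches_whitelist("0207", whitelist))  # InfiniBand controller: True
print(matches_whitelist("0302", whitelist))  # 3D controller: True
```

With such a whitelist, every network device on the node passes the filter alongside the GPUs.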
So the provided configuration translates to:
The result of this misconfiguration can be observed in the logs of the `gpu-operator-node-feature-discovery-worker` pods: they try to gather data about both Ethernet and InfiniBand devices, which should be gathered by the network-operator, not the gpu-operator, and should be filtered out by `deviceClassWhitelist`:
```
kubectl logs -n gpu-operator gpu-operator-node-feature-discovery-worker-7ndj5 | head -n 5
E0526 21:58:37.810614 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/eno3/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811725 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811789 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f1/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812141 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812180 1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v1/speed: invalid argument" attributeName="speed"
```
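As a quick cross-check that the interfaces above sit under PCI base class `02`, one can read each interface's class code directly from sysfs. A minimal Python sketch, assuming a standard Linux sysfs layout (interface names and paths differ per host):

```python
import glob
import os

def base_class(pci_class: str) -> str:
    # sysfs reports e.g. "0x020000"; the first two hex digits after "0x"
    # are the PCI base class (here 02, Network controller).
    return pci_class.removeprefix("0x")[:2]

# List each network interface with its PCI base class; interfaces without
# a backing PCI device (e.g. loopback, virtual devices) are skipped.
for iface in sorted(glob.glob("/sys/class/net/*")):
    class_file = os.path.join(iface, "device", "class")
    if os.path.exists(class_file):
        with open(class_file) as f:
            pci_class = f.read().strip()
        print(os.path.basename(iface), pci_class, "-> base class", base_class(pci_class))
```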
With such filters it seems that `gpu-operator-node-feature-discovery` is configured to gather both GPU and network data, while the latter should be handled by https://github.com/Mellanox/network-operator (see the similar issue https://github.com/Mellanox/network-operator/issues/957). This configuration can be found here: https://github.com/NVIDIA/gpu-feature-discovery/blob/main/deployments/helm/gpu-feature-discovery/values.yaml#L84
In my opinion, `deviceClassWhitelist` for gpu-feature-discovery should contain only the `0300` and `0302` entries.

Thank you,
Franciszek