NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

Incorrect deviceClassWhitelist configuration is provided #78

Closed fprzewozny closed 5 months ago

fprzewozny commented 5 months ago

Hey, while going through a live system's configuration I noticed that the gpu-operator-node-feature-discovery-worker-conf ConfigMap contains an incorrect device class whitelist:

apiVersion: v1
data:
 nfd-worker.conf: |-
  sources:
   pci:
    deviceClassWhitelist:
    - "02"
    - "0200"
    - "0207"
    - "0300"
    - "0302"
    deviceLabelFields:
    - vendor
kind: ConfigMap

According to the PCI-SIG specification, base class 03 is Display controller; within it, subclass 00 is VGA-compatible controller and subclass 02 is 3D controller. Base class 02 is Network controller: an entry with no subclass matches any network controller, subclass 00 is Ethernet controller, and subclass 07 is InfiniBand controller.

So the configuration provided with the operator translates to:

deviceClassWhitelist:
- "02"    # Any network controller
- "0200"  # Ethernet controller
- "0207"  # InfiniBand Controller
- "0300"  # VGA-compatible controller
- "0302"  # 3D controller

With these filters, gpu-operator-node-feature-discovery appears to be configured to gather both GPU and network data (the latter should be done by https://github.com/Mellanox/network-operator, which has a similar issue: https://github.com/Mellanox/network-operator/issues/957). In my opinion, deviceClassWhitelist should contain only entries from class 03 (Display).

The result of this misconfiguration can be observed in the logs of the gpu-operator-node-feature-discovery-worker pods: the worker tries to gather data about both Ethernet and InfiniBand devices, which should be gathered by network-operator, not gpu-operator. Those devices should be filtered out by deviceClassWhitelist:

kubectl logs -n gpu-operator gpu-operator-node-feature-discovery-worker-7ndj5 | head -n 5
E0526 21:58:37.810614       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/eno3/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811725       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811789       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f1/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812141       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812180       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v1/speed: invalid argument" attributeName="speed"

This configuration can be found here: https://github.com/NVIDIA/gpu-feature-discovery/blob/main/deployments/helm/gpu-feature-discovery/values.yaml#L84

In my opinion, deviceClassWhitelist for gpu-feature-discovery should contain only the 0300 and 0302 entries.
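For illustration, the proposed whitelist could look like the fragment below. This is only a sketch: it reuses the nfd-worker.conf section quoted earlier in this issue with the network entries dropped, assumes the remaining keys stay unchanged, and has not been validated against the chart's values.yaml.

```yaml
# Sketch of the proposed nfd-worker.conf fragment (an assumption, not a tested change):
# only display-controller classes remain in the whitelist.
sources:
  pci:
    deviceClassWhitelist:
    - "0300"  # VGA-compatible controller
    - "0302"  # 3D controller
    deviceLabelFields:
    - vendor
```

With this whitelist, the network classes (02, 0200, 0207) are no longer listed, so PCI labelling of network devices would be left to network-operator.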

Thank you, Franciszek

fprzewozny commented 5 months ago

https://github.com/NVIDIA/k8s-device-plugin/issues/729