Mellanox / network-operator

Mellanox Network Operator
Apache License 2.0

Incorrect deviceClassWhitelist configuration is provided #957

Open fprzewozny opened 4 months ago

fprzewozny commented 4 months ago

Hey, while going through a live system's configuration I noticed that the network-operator-node-feature-discovery-worker-conf ConfigMap contains an incorrect device class whitelist:

apiVersion: v1
data:
 nfd-worker.conf: |-
  sources:
   pci:
    deviceClassWhitelist:
    - "0300"
    - "0302"
    deviceLabelFields:
    - vendor
kind: ConfigMap

According to the PCI-SIG specifications, base class 03 is Display controller; within that class, subclass 00 is VGA-compatible controller and subclass 02 is 3D controller. So the configuration provided with the operator translates to:

    deviceClassWhitelist:
    - "0300"  # VGA-compatible controller
    - "0302"  # 3D controller

With these filters, network-operator-node-feature-discovery appears to be configured to gather GPU data. That should be done by, for example, https://github.com/NVIDIA/gpu-feature-discovery, which has a similar configuration issue (I will link it here once it's created). In my opinion, deviceClassWhitelist should contain only entries from base class 02 (Network controller).

In the code repository, the setting can be found here: https://github.com/Mellanox/network-operator/blob/17d04f562e4edc81b21b965e87064638bef78c91/hack/templates/values/values.template#L49 and https://github.com/Mellanox/network-operator/blob/17d04f562e4edc81b21b965e87064638bef78c91/deployment/network-operator/values.yaml#L49

In my opinion, deviceClassWhitelist for network-operator should contain only the 0200 (Ethernet controller) and 0207 (InfiniBand controller) entries.
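For illustration, the proposal above could look roughly like the following nfd-worker.conf fragment. This is only a sketch of the suggested change, not the configuration the operator ships; the class names in the comments are taken from the public PCI class code list, and the rest of the structure simply mirrors the ConfigMap quoted earlier.

```yaml
# Sketch of the proposed whitelist (an assumption, not the shipped default).
sources:
  pci:
    deviceClassWhitelist:
      - "0200"  # base class 02 (Network controller), subclass 00: Ethernet controller
      - "0207"  # base class 02 (Network controller), subclass 07: InfiniBand controller
    deviceLabelFields:
      - vendor
```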

Thank you, Franciszek

fprzewozny commented 4 months ago

Created a bug against gpu-feature-discovery as well: https://github.com/NVIDIA/k8s-device-plugin/issues/729

adrianchiris commented 3 months ago

Hi @fprzewozny, we use the NFD (Node Feature Discovery) NodeFeature API [1] and deploy a NodeFeatureRule [2][3] object that triggers NFD to label the node with the labels required by network-operator (feature.node.kubernetes.io/pci-15b3.present).

We keep GPUs in deviceClassWhitelist so that NFD exposes its default GPU-related labels. That is needed when using the NVIDIA GPU Operator, because we expect only one instance of NFD to be deployed in the cluster.

[1] https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/customization-guide.html#nodefeature-custom-resource
[2] https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/customization-guide.html#nodefeaturerule-custom-resource
[3] https://github.com/Mellanox/network-operator/blob/b14d5a299aca901d6a20881e16b8b9d77f37a19b/deployment/network-operator/templates/nodefeaturerules.yaml#L3
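To make the NodeFeatureRule mechanism described above more concrete, here is a minimal sketch of a rule that sets the pci-15b3.present label when a PCI device with vendor ID 15b3 is found. It is not a copy of the operator's template at [3]; the metadata name and rule name are hypothetical, and the field layout follows the NodeFeatureRule format in the NFD customization guide [2].

```yaml
# Minimal sketch, assuming the nfd.k8s-sigs.io/v1alpha1 NodeFeatureRule API.
# The names "example-nic-rule" and "example 15b3 PCI rule" are hypothetical.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: example-nic-rule
spec:
  rules:
    - name: "example 15b3 PCI rule"
      labels:
        # Label named in the comment above.
        "feature.node.kubernetes.io/pci-15b3.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            # Match any discovered PCI device whose vendor ID is 15b3.
            vendor: {op: In, value: ["15b3"]}
```

A rule like this produces the network-operator label through the rule itself, which is consistent with the explanation that the whitelist entries are kept for the default GPU-related labels rather than for NIC detection.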