Open fprzewozny opened 4 months ago
Created a bug against gpu-feature-discovery as well: https://github.com/NVIDIA/k8s-device-plugin/issues/729
Hi @fprzewozny, we use the NFD (Node Feature Discovery) NodeFeature API [1] and deploy a NodeFeatureRule [2][3] object that triggers NFD to label the node with the labels required by network-operator (`feature.node.kubernetes.io/pci-15b3.present`).
We keep GPUs in `deviceClassWhitelist` so that NFD exposes the default GPU-related labels. That is needed when using the NVIDIA GPU Operator, the reason being that we expect only one instance of NFD to be deployed in the cluster.
[1] https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/customization-guide.html#nodefeature-custom-resource
[2] https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/customization-guide.html#nodefeaturerule-custom-resource
[3] https://github.com/Mellanox/network-operator/blob/b14d5a299aca901d6a20881e16b8b9d77f37a19b/deployment/network-operator/templates/nodefeaturerules.yaml#L3
Hey,

Going through a live system configuration, I noticed that `network-operator-node-feature-discovery-worker-conf` contains an incorrect device class whitelist.

According to the PCI-SIG specifications, base class `03` is Display controller, subclass `00` of class `03` is VGA-compatible controller, and subclass `02` of class `03` is 3D controller. So the configuration provided with the operator also matches display devices (VGA-compatible and 3D controllers), not only network devices.

With such filters, it seems that `network-operator-node-feature-discovery` is configured to gather GPU data. That should be done by, e.g., https://github.com/NVIDIA/gpu-feature-discovery, which has a similar configuration issue that I will link here once it's created. In my opinion, `deviceClassWhitelist` should contain entries only from the `02` class (Network).

In the code repo it can be found here: https://github.com/Mellanox/network-operator/blob/17d04f562e4edc81b21b965e87064638bef78c91/hack/templates/values/values.template#L49 and https://github.com/Mellanox/network-operator/blob/17d04f562e4edc81b21b965e87064638bef78c91/deployment/network-operator/values.yaml#L49
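
To make the class-code reading above easier to check, here is a minimal Python sketch that decodes whitelist entries into PCI base class and subclass names and flags the ones outside the Network class. The lookup tables cover only the codes discussed in this issue, the helper names (`describe`, `non_network_entries`) are made up for illustration, and `example_whitelist` is a hypothetical list, not a copy of the operator's actual values.

```python
# Subset of PCI class codes relevant to this issue (base class and
# base class + subclass), following the PCI-SIG class code table.
PCI_BASE_CLASSES = {
    "02": "Network controller",
    "03": "Display controller",
}
PCI_SUBCLASSES = {
    "0200": "Ethernet controller",
    "0207": "InfiniBand controller",
    "0300": "VGA-compatible controller",
    "0302": "3D controller",
}


def describe(entry: str) -> str:
    """Return a human-readable description of a 2- or 4-digit class entry."""
    base = PCI_BASE_CLASSES.get(entry[:2], "unknown base class")
    if len(entry) == 2:
        # A 2-digit entry names the base class only, so it covers every subclass.
        return f"{entry}: {base} (all subclasses)"
    sub = PCI_SUBCLASSES.get(entry[:4], "unknown subclass")
    return f"{entry}: {base} / {sub}"


def non_network_entries(whitelist):
    """Return the entries whose base class is not 02 (Network controller)."""
    return [entry for entry in whitelist if not entry.startswith("02")]


# Hypothetical whitelist, for illustration only.
example_whitelist = ["0200", "0207", "0300", "0302"]

for entry in example_whitelist:
    print(describe(entry))

print("Not network devices:", non_network_entries(example_whitelist))
```

For the hypothetical list above, the entries flagged as non-network are `0300` and `0302`, which are the display-class codes discussed in this issue.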

In my opinion, `deviceClassWhitelist` for network-operator should contain only the `0200` and `0207` entries.

Thank you,
Franciszek