Closed: zhouhao3 closed this issue 1 year ago.
@zhouhao3 did you customize the NFD CR spec by any chance? We look for the following labels with vendor ID `10de`:

```
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-0302_10de.present": "true",
"feature.node.kubernetes.io/pci-0300_10de.present": "true",
```
```yaml
node-feature-discovery:
  worker:
    config:
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "0200"
            - "0207"
            - "0300"
            - "0302"
          deviceLabelFields:
            - vendor
```
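The way `deviceLabelFields` maps to node label names can be sketched as follows. This is a simplified illustration, not NFD's actual implementation: it assumes the selected attributes appear in a fixed order (class before vendor) joined by `_`, which matches the `pci-0302_10de.present` labels shown above. The `pci_present_label` helper and the sample device are hypothetical.

```python
def pci_present_label(device, label_fields):
    """Sketch of how NFD composes a pci `.present` label from the
    configured deviceLabelFields (simplified; not NFD's real code)."""
    # Attributes appear in a fixed order in the label name, joined by "_",
    # regardless of the order they are listed in deviceLabelFields.
    order = ["class", "vendor", "device"]
    parts = [device[f] for f in order if f in label_fields]
    return f"feature.node.kubernetes.io/pci-{'_'.join(parts)}.present"

# Hypothetical sample device: class 0302 (3D controller), vendor 10de (NVIDIA)
a100 = {"class": "0302", "vendor": "10de"}
print(pci_present_label(a100, ["class"]))            # pci-0302.present
print(pci_present_label(a100, ["vendor", "class"]))  # pci-0302_10de.present
```

This is why adding `vendor` to `deviceLabelFields` changes which labels show up on the node.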
@shivamerla Yes, the details are below. Can you help me check whether there is a problem? Thanks!
```yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd1
spec:
  instance: "" # instance is empty by default
  topologyupdater: false # False by default
  operand:
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.11
    imagePullPolicy: Always
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        cpu:
          cpuid:
            # NOTE: whitelist has priority over blacklist
            attributeBlacklist:
              - "BMI1"
              - "BMI2"
              - "CLMUL"
              - "CMOV"
              - "CX16"
              - "ERMS"
              - "F16C"
              - "HTT"
              - "LZCNT"
              - "MMX"
              - "MMXEXT"
              - "NX"
              - "POPCNT"
              - "RDRAND"
              - "RDSEED"
              - "RDTSCP"
              - "SGX"
              - "SSE"
              - "SSE2"
              - "SSE3"
              - "SSE4.1"
              - "SSE4.2"
              - "SSSE3"
            attributeWhitelist:
        kernel:
          configOpts:
            - "NO_HZ"
            - "X86"
            - "DMI"
        pci:
          deviceClassWhitelist:
            - "0200"
            - "0300"
            - "0302"
          deviceLabelFields:
            - "class"
```
@zhouhao3 can you add `vendor` to the `deviceLabelFields` list?
@cdesiniotis thanks, I tried adding `vendor`, but it still didn't solve the problem.
In addition, the information on my worker is as follows:
```console
$ oc describe node | egrep 'Roles|pci' | grep -v master
Roles: worker
feature.node.kubernetes.io/pci-0200.present=true
feature.node.kubernetes.io/pci-0200.sriov.capable=true
feature.node.kubernetes.io/pci-0300.present=true
feature.node.kubernetes.io/pci-0302.present=true
feature.node.kubernetes.io/pci-0302.sriov.capable=true
```
So I don't think `vendor` should need to be added.
@shivamerla @cdesiniotis could you help me check whether there is any other solution? Thanks!
@zhouhao3 we expect NFD to add the right label with the vendor ID. Can you try to uninstall and re-install with the following NFD CR instance? We should see one of the following labels. Once this label is added, you can see that gpu-operator will load the other operands.

```
"feature.node.kubernetes.io/pci-0302_10de.present": "true",
"feature.node.kubernetes.io/pci-0300_10de.present": "true",
```
```yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd1
spec:
  instance: "" # instance is empty by default
  topologyupdater: false # False by default
  operand:
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.11
    imagePullPolicy: Always
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        cpu:
          cpuid:
            # NOTE: whitelist has priority over blacklist
            attributeBlacklist:
              - "BMI1"
              - "BMI2"
              - "CLMUL"
              - "CMOV"
              - "CX16"
              - "ERMS"
              - "F16C"
              - "HTT"
              - "LZCNT"
              - "MMX"
              - "MMXEXT"
              - "NX"
              - "POPCNT"
              - "RDRAND"
              - "RDSEED"
              - "RDTSCP"
              - "SGX"
              - "SSE"
              - "SSE2"
              - "SSE3"
              - "SSE4.1"
              - "SSE4.2"
              - "SSSE3"
            attributeWhitelist:
        kernel:
          configOpts:
            - "NO_HZ"
            - "X86"
            - "DMI"
        pci:
          deviceClassWhitelist:
            - "0200"
            - "0300"
            - "0302"
          deviceLabelFields:
            - "vendor"
            - "class"
```
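After re-installing, one way to check for the vendor-qualified labels is to dump the node object and filter its labels. This is a minimal sketch; the sample labels below are hypothetical stand-ins for real `oc get node <name> -o json` output:

```python
import json

# Hypothetical sample standing in for `oc get node <name> -o json` output,
# matching the labels discussed in this thread.
node_json = json.dumps({
    "metadata": {
        "labels": {
            "feature.node.kubernetes.io/pci-0302_10de.present": "true",
            "feature.node.kubernetes.io/pci-0200.present": "true",
        }
    }
})

labels = json.loads(node_json)["metadata"]["labels"]
# Keep only labels that carry the NVIDIA vendor ID 10de.
nvidia_labels = {k: v for k, v in labels.items() if "_10de" in k}
print(nvidia_labels)  # only the vendor-qualified NVIDIA label remains
```

If a `pci-0302_10de.present` or `pci-0300_10de.present` label shows up, the gpu-operator should proceed to deploy its operands.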
@shivamerla It works, thanks!
I built an OCP 4.11 cluster and used a bare-metal machine with an A100 GPU as a worker.
Deploying the earlier resources succeeded, but problems occurred at the last step, deploying nvidia-gpu-operator. Although nvidia-gpu-operator runs successfully, according to its log it does not find the GPU on the worker, so it does not deploy the subsequent CRs on the worker.
But from the following information we can see that the GPU device has been discovered on the worker; we can confirm that `pci-0302` corresponds to the A100.
I noticed that the documentation says the platforms supported by OpenShift do not include bare metal. I'd like to confirm whether bare metal is still unsupported as of now.