NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Failed to deploy GPU Operator v22.9.0 in OpenShift on bare metal #438

Closed · zhouhao3 closed 1 year ago

zhouhao3 commented 1 year ago

I built an OCP 4.11 cluster and used a bare-metal machine with an A100 GPU as a worker.

Deploying the earlier resources succeeded, but a problem occurred at the last step, deploying nvidia-gpu-operator. Although the operator itself runs successfully, its log shows that it does not find a GPU on the worker, so it does not deploy the subsequent CRs there.

oc logs -f -n nvidia-gpu-operator -lapp=gpu-operator
1.6679920299890828e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ServiceMonitor": "nvidia-node-status-exporter", "Namespace": "nvidia-gpu-operator"}
1.6679920299926214e+09  INFO    controllers.ClusterPolicy       No GPU node in the cluster, do not create DaemonSets    {"DaemonSet": "nvidia-node-status-exporter", "Namespace": "nvidia-gpu-operator"}
1.6679920299926882e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"PrometheusRule": "nvidia-node-status-exporter-alerts"}
1.6679920299998875e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-node-status-exporter", "status": "ready"}
1.6679920300108845e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-vgpu-manager", "status": "disabled"}
1.6679920300175796e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-vgpu-device-manager", "status": "disabled"}
1.667992030026974e+09   INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-sandbox-validation", "status": "disabled"}
1.6679920300375054e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-vfio-manager", "status": "disabled"}
1.6679920300471733e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-sandbox-device-plugin", "status": "disabled"}
1.6679920300472524e+09  INFO    controllers.ClusterPolicy       No GPU node found, watching for new nodes to join the cluster.  {"hasNFDLabels": true}

But the following output shows that the GPU device has been discovered on the worker:

oc describe node | egrep 'Roles|pci' | grep -v master
Roles:              worker
                    feature.node.kubernetes.io/pci-0200.present=true
                    feature.node.kubernetes.io/pci-0200.sriov.capable=true
                    feature.node.kubernetes.io/pci-0300.present=true
                    feature.node.kubernetes.io/pci-0302.present=true
                    feature.node.kubernetes.io/pci-0302.sriov.capable=true

We can confirm that pci-0302 corresponds to the A100.

I noticed that the documentation's list of supported OpenShift platforms does not include bare metal. Can you confirm whether bare metal is still unsupported as of now?

shivamerla commented 1 year ago

@zhouhao3 did you customize the NFD CR spec by any chance? We look for the following labels, with vendor ID 10de, which NFD generates from the worker config shown after them:

    "feature.node.kubernetes.io/pci-10de.present":      "true",
    "feature.node.kubernetes.io/pci-0302_10de.present": "true",
    "feature.node.kubernetes.io/pci-0300_10de.present": "true",
node-feature-discovery:
  worker:
    config:
      sources:
        pci:
          deviceClassWhitelist:
          - "02"
          - "0200"
          - "0207"
          - "0300"
          - "0302"
          deviceLabelFields:
          - vendor
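To make the difference concrete, here is a small sketch (with hand-written sample label strings, not live cluster output) of what NFD emits for PCI class 0302 with and without `vendor` in `deviceLabelFields`, and which form gpu-operator's vendor-ID check would match:

```shell
# Sample label as produced with deviceLabelFields: [class] only
class_only='feature.node.kubernetes.io/pci-0302.present=true'
# Sample label as produced with deviceLabelFields: [vendor, class]
vendor_class='feature.node.kubernetes.io/pci-0302_10de.present=true'

# gpu-operator only matches labels qualified with NVIDIA's vendor ID (10de):
echo "$class_only"   | grep -q '10de' && echo "class-only matches"   || echo "class-only does not match"
echo "$vendor_class" | grep -q '10de' && echo "vendor+class matches" || echo "vendor+class does not match"
```

With class-only labels the grep finds no `10de`, which is why the operator logs "No GPU node in the cluster" even though the node clearly has the device.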
zhouhao3 commented 1 year ago

did you customize NFD CR spec by any chance?

@shivamerla Yes, here are the details. Can you help check whether there is a problem? Thanks!

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd1
spec:
  instance: "" # instance is empty by default
  topologyupdater: false # False by default
  operand:
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.11
    imagePullPolicy: Always
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        cpu:
          cpuid:
      #     NOTE: whitelist has priority over blacklist
            attributeBlacklist:
              - "BMI1"
              - "BMI2"
              - "CLMUL"
              - "CMOV"
              - "CX16"
              - "ERMS"
              - "F16C"
              - "HTT"
              - "LZCNT"
              - "MMX"
              - "MMXEXT"
              - "NX"
              - "POPCNT"
              - "RDRAND"
              - "RDSEED"
              - "RDTSCP"
              - "SGX"
              - "SSE"
              - "SSE2"
              - "SSE3"
              - "SSE4.1"
              - "SSE4.2"
              - "SSSE3"
            attributeWhitelist:
        kernel:
          configOpts:
            - "NO_HZ"
            - "X86"
            - "DMI"
        pci:
          deviceClassWhitelist:
          - "0200"
          - "0300"
          - "0302"
          deviceLabelFields:
          - "class"
cdesiniotis commented 1 year ago

@zhouhao3 can you add vendor to the deviceLabelFields list?

zhouhao3 commented 1 year ago

can you add vendor to the deviceLabelFields list?

@cdesiniotis thanks, I tried adding vendor, but it still didn't solve the problem.

In addition, the labels on my worker are as follows:

oc describe node | egrep 'Roles|pci' | grep -v master
Roles:              worker
                    feature.node.kubernetes.io/pci-0200.present=true
                    feature.node.kubernetes.io/pci-0200.sriov.capable=true
                    feature.node.kubernetes.io/pci-0300.present=true
                    feature.node.kubernetes.io/pci-0302.present=true
                    feature.node.kubernetes.io/pci-0302.sriov.capable=true

So I don't think vendor needs to be added.

zhouhao3 commented 1 year ago

@shivamerla @cdesiniotis Could you help me check whether there is another solution? Thanks!

shivamerla commented 1 year ago

@zhouhao3 we expect NFD to add the right label with the vendor ID. Can you try to uninstall and re-install with the following NFD CR instance? We should then see one of the labels below; once such a label is added, gpu-operator will load the other operands.

    "feature.node.kubernetes.io/pci-0302_10de.present": "true",
    "feature.node.kubernetes.io/pci-0300_10de.present": "true",
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd1
spec:
  instance: "" # instance is empty by default
  topologyupdater: false # False by default
  operand:
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.11
    imagePullPolicy: Always
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        cpu:
          cpuid:
      #     NOTE: whitelist has priority over blacklist
            attributeBlacklist:
              - "BMI1"
              - "BMI2"
              - "CLMUL"
              - "CMOV"
              - "CX16"
              - "ERMS"
              - "F16C"
              - "HTT"
              - "LZCNT"
              - "MMX"
              - "MMXEXT"
              - "NX"
              - "POPCNT"
              - "RDRAND"
              - "RDSEED"
              - "RDTSCP"
              - "SGX"
              - "SSE"
              - "SSE2"
              - "SSE3"
              - "SSE4.1"
              - "SSE4.2"
              - "SSSE3"
            attributeWhitelist:
        kernel:
          configOpts:
            - "NO_HZ"
            - "X86"
            - "DMI"
        pci:
          deviceClassWhitelist:
          - "0200"
          - "0300"
          - "0302"
          deviceLabelFields:
          - "vendor"
          - "class"
zhouhao3 commented 1 year ago

@shivamerla It works, thanks!