NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Installation result: some daemonsets are not installed, others are installed on too many nodes #728

Closed. johnzheng1975 closed this issue 1 month ago

johnzheng1975 commented 4 months ago


1. Quick Debug Information

2. Issue or feature description

Expected all GPU Operator daemonsets to be scheduled only on the GPU node, one pod each. Instead, several daemonsets have no pods at all, and gpu-operator-node-feature-discovery-worker runs on all five nodes, including non-GPU nodes.

3. Steps to reproduce the issue

In my EKS cluster there are five nodes; one of them is a GPU node.

I assumed:

  1. All daemonsets are installed on the GPU node.
  2. Each daemonset runs only one pod on each GPU node.
$ k get nodes  --show-labels | grep gpu
ip-10-202-124-101.us-west-2.compute.internal   Ready    <none>   7d3h    v1.29.3-eks-ae9a62a   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=g4dn.xlarge,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0907d69dfb1db08a5,eks.amazonaws.com/nodegroup=sandbox-uw2-blue-gpu,eks.amazonaws.com/sourceLaunchTemplateId=lt-0d2878791585d9223,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,feature.node.kubernetes.io/cpu-cpuid.ADX=true,feature.node.kubernetes.io/cpu-cpuid.AESNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX2=true,feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true,feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true,feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true,feature.node.kubernetes.io/cpu-cpuid.AVX512F=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true,feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true,feature.node.kubernetes.io/cpu-cpuid.AVX=true,feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true,feature.node.kubernetes.io/cpu-cpuid.FMA3=true,feature.node.kubernetes.io/cpu-cpuid.FXSR=true,feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true,feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true,feature.node.kubernetes.io/cpu-cpuid.LAHF=true,feature.node.kubernetes.io/cpu-cpuid.MOVBE=true,feature.node.kubernetes.io/cpu-cpuid.MPX=true,feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true,feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true,feature.node.kubernetes.io/cpu-cpuid.SYSEE=true,feature.node.kubernetes.io/cpu-cpuid.X87=true,feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true,feature.node.kubernetes.io/cpu-cpuid.XSAVE=true,feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true,feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true,feature.node.kubernetes.io/cpu-cpuid.XSAVES=true,feature.node.kubernetes.io/cpu-hardware_multithreading=true,feature.node.kubernetes.io/cpu-model.family=6,feature.node.kubernetes.io/cpu-model.id=85,feature.node.kubernetes.io/cpu-model.vendor_id=Intel,feature.node.kubernetes.io/kernel-config.NO_HZ=true,feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true,feature.node.kubernetes.io/kernel-version.full=5.10.214-202.855.amzn2.x86_64,feature.node.kubernetes.io/kernel-version.major=5,feature.node.kubernetes.io/kernel-version.minor=10,feature.node.kubernetes.io/kernel-version.revision=214,feature.node.kubernetes.io/pci-10de.present=true,feature.node.kubernetes.io/pci-1d0f.present=true,feature.node.kubernetes.io/storage-nonrotationaldisk=true,feature.node.kubernetes.io/system-os_release.ID=amzn,feature.node.kubernetes.io/system-os_release.VERSION_ID.major=2,feature.node.kubernetes.io/system-os_release.VERSION_ID=2,k8s.amazonaws.com/accelerator=nvidia-tesla-t4,k8s.io/cloud-provider-aws=3a787d0120dc9b8c791e2fc2c9e7613b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-202-124-101.us-west-2.compute.internal,kubernetes.io/os=linux,node-group=gpu,node.kubernetes.io/instance-type=g4dn.xlarge,nvidia.com/cuda.driver-version.full=535.161.08,nvidia.com/cuda.driver-version.major=535,nvidia.com/cuda.driver-version.minor=161,nvidia.com/cuda.driver-version.revision=08,nvidia.com/cuda.driver.major=535,nvidia.com/cuda.driver.minor=161,nvidia.com/cuda.driver.rev=08,nvidia.com/cuda.runtime-version.full=12.2,nvidia.com/cuda.runtime-version.major=12,nvidia.com/cuda.runtime-version.minor=2,nvidia.com/cuda.runtime.major=12,nvidia.com/cuda.runtime.minor=2,nvidia.com/gfd.timestamp=1717054304,nvidia.com/gpu-driver-upgrad
e-state=upgrade-done,nvidia.com/gpu.compute.major=7,nvidia.com/gpu.compute.minor=5,nvidia.com/gpu.count=1,nvidia.com/gpu.deploy.container-toolkit=true,nvidia.com/gpu.deploy.dcgm-exporter=true,nvidia.com/gpu.deploy.dcgm=true,nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/gpu.deploy.driver=pre-installed,nvidia.com/gpu.deploy.gpu-feature-discovery=true,nvidia.com/gpu.deploy.node-status-exporter=true,nvidia.com/gpu.deploy.operator-validator=true,nvidia.com/gpu.family=turing,nvidia.com/gpu.machine=g4dn.xlarge,nvidia.com/gpu.memory=15360,nvidia.com/gpu.present=true,nvidia.com/gpu.product=Tesla-T4,nvidia.com/gpu.replicas=1,nvidia.com/gpu.sharing-strategy=none,nvidia.com/mig.capable=false,nvidia.com/mig.strategy=single,nvidia.com/mps.capable=false,topology.ebs.csi.aws.com/zone=us-west-2a,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a

However,

  1. I found that "nvidia-device-plugin-mps-control-daemon", "nvidia-driver-daemonset", and "nvidia-mig-manager" have no pods.
  2. I found that gpu-operator-node-feature-discovery-worker has 5 pods and no "Node Selector". Why does this need to be installed on non-GPU nodes? Thanks. (A quick way to inspect the daemonsets is sketched after this list.)
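A minimal way to confirm this from the cluster side, assuming the operator was installed into a namespace named gpu-operator (adjust to your deployment):

# List the operator's daemonsets; DESIRED=0 means no node currently matches that
# component's node selector (for example, the per-node nvidia.com/gpu.deploy.* labels).
$ kubectl get daemonsets -n gpu-operator

# Show the node selector each daemonset keys on.
$ kubectl get daemonsets -n gpu-operator -o custom-columns=NAME:.metadata.name,NODESELECTOR:.spec.template.spec.nodeSelector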


4. Information to attach (optional if deemed irrelevant)

cdesiniotis commented 1 month ago

I found gpu-operator-node-feature-discovery-worker has 5 pods and no "Node Selector"; why does this need to be installed on non-GPU nodes?

Node Feature Discovery labels nodes with their hardware features and system configuration. The GPU Operator depends on these labels to know which worker nodes have GPUs. If you would like to restrict which nodes the NFD worker pods get scheduled to, you can configure a node selector in the NFD Helm values, as sketched below.
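For example, a rough sketch of restricting the NFD workers to nodes that already carry the node-group=gpu label visible in your node labels above. The node-feature-discovery.worker.nodeSelector value path assumes the NFD subchart keys used by recent gpu-operator charts, and the release name and namespace are assumptions:

$ helm upgrade gpu-operator nvidia/gpu-operator \
    -n gpu-operator \
    --reuse-values \
    --set node-feature-discovery.worker.nodeSelector.node-group=gpu

Note that if the NFD workers are restricted this way, any new GPU node must already carry that label when it joins the cluster, otherwise it will never be labeled and the operator will not deploy its components there.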

I found "nvidia-device-plugin-mps-control-daemon", "nvidia-driver-daemonset", "nvidia-mig-manager " has no pods

If drivers are pre-installed on your GPU nodes, you can explicitly disable the GPU Operator-managed driver by setting driver.enabled=false in ClusterPolicy; that will prevent the nvidia-driver-daemonset from getting created. Similarly, if you don't have any MIG-capable GPUs in your cluster, you can explicitly disable the mig-manager component by setting migManager.enabled=false in ClusterPolicy; that will prevent the nvidia-mig-manager daemonset from getting created.
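For example, a minimal sketch of applying both settings at upgrade time (release name and namespace are assumptions; the equivalent fields can also be edited directly in the ClusterPolicy spec, whose default instance is typically named cluster-policy):

$ helm upgrade gpu-operator nvidia/gpu-operator \
    -n gpu-operator \
    --reuse-values \
    --set driver.enabled=false \
    --set migManager.enabled=false

# Alternatively, patch the existing ClusterPolicy in place:
$ kubectl patch clusterpolicy cluster-policy --type merge \
    -p '{"spec": {"driver": {"enabled": false}, "migManager": {"enabled": false}}}'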