NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

Issue with K8 Cluster not detecting GPUs #1257

Closed: mlahir1 closed this issue 1 year ago

mlahir1 commented 1 year ago

The K8s cluster does not detect GPUs. Running the playbook below fails while installing the NVIDIA driver:

ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml

  results:
  - |-
    Loaded plugins: langpacks, nvidia, product-id, search-disabled-repos
    Resolving Dependencies
    --> Running transaction check
    ---> Package nvidia-driver-branch-515.x86_64 3:515.86.01-1.el7 will be installed
    --> Processing Dependency: nvidia-driver-branch-515-NVML(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-branch-515-NvFBCOpenGL(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-branch-515-cuda(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-branch-515-cuda-libs(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-branch-515-devel(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-branch-515-libs(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-kmod = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-modprobe-branch-515(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: nvidia-xconfig-branch-515(x86-64) = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: libnvidia-glcore.so.515.86.01()(64bit) for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Processing Dependency: libnvidia-tls.so.515.86.01()(64bit) for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    --> Running transaction check
    ---> Package kmod-nvidia-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    --> Processing Dependency: kmod-nvidia-latest-dkms = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-515.48.07-1.el7.x86_64
    ---> Package kmod-nvidia-latest-dkms.x86_64 3:515.86.01-1.el7 will be an update
    ---> Package nvidia-driver-branch-515-NVML.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-driver-branch-515-NvFBCOpenGL.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-driver-branch-515-cuda.x86_64 3:515.86.01-1.el7 will be installed
    --> Processing Dependency: nvidia-persistenced-branch-515 = 3:515.86.01 for package: 3:nvidia-driver-branch-515-cuda-515.86.01-1.el7.x86_64
    ---> Package nvidia-driver-branch-515-cuda-libs.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-driver-branch-515-devel.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-driver-branch-515-libs.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-modprobe-branch-515.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-xconfig-branch-515.x86_64 3:515.86.01-1.el7 will be installed
    --> Running transaction check
    ---> Package nvidia-driver-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    --> Processing Dependency: nvidia-driver = 3:515.48.07 for package: 3:nvidia-libXNVCtrl-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver = 3:515.48.07 for package: 3:nvidia-settings-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver(x86-64) = 3:515.48.07 for package: 3:nvidia-libXNVCtrl-devel-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-modprobe-latest-dkms-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-cuda-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-NvFBCOpenGL-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-NVML-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-devel-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-libs-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-xconfig-latest-dkms-515.48.07-1.el7.x86_64
    --> Processing Dependency: nvidia-driver-latest-dkms(x86-64) = 3:515.48.07 for package: 3:nvidia-driver-latest-dkms-cuda-libs-515.48.07-1.el7.x86_64
    ---> Package nvidia-driver-latest-dkms.x86_64 3:525.85.12-1.el7 will be an update
    --> Processing Dependency: kmod-nvidia-latest-dkms = 3:525.85.12 for package: 3:nvidia-driver-latest-dkms-525.85.12-1.el7.x86_64
    ---> Package nvidia-persistenced-branch-515.x86_64 3:515.86.01-1.el7 will be installed
    --> Running transaction check
    ---> Package kmod-nvidia-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package kmod-nvidia-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package kmod-nvidia-latest-dkms.x86_64 3:515.86.01-1.el7 will be an update
    --> Processing Dependency: nvidia-kmod = 3:515.86.01 for package: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
    ---> Package kmod-nvidia-latest-dkms.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-driver-latest-dkms-NVML.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-driver-latest-dkms-NVML.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-driver-latest-dkms-NvFBCOpenGL.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-driver-latest-dkms-NvFBCOpenGL.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-driver-latest-dkms-cuda.x86_64 3:515.48.07-1.el7 will be updated
    --> Processing Dependency: nvidia-driver-latest-dkms-cuda = 3:515.48.07 for package: 3:nvidia-persistenced-latest-dkms-515.48.07-1.el7.x86_64
    ---> Package nvidia-driver-latest-dkms-cuda.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-driver-latest-dkms-cuda-libs.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-driver-latest-dkms-cuda-libs.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-driver-latest-dkms-devel.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-driver-latest-dkms-devel.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-driver-latest-dkms-libs.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-driver-latest-dkms-libs.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-libXNVCtrl.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-libXNVCtrl.x86_64 3:515.48.07-1.el7 will be obsoleted
    ---> Package nvidia-libXNVCtrl.x86_64 3:525.85.12-1.el7 will be obsoleting
    ---> Package nvidia-libXNVCtrl-devel.x86_64 3:515.48.07-1.el7 will be obsoleted
    ---> Package nvidia-libXNVCtrl-devel.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-libXNVCtrl-devel.x86_64 3:525.85.12-1.el7 will be obsoleting
    ---> Package nvidia-modprobe-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-modprobe-latest-dkms.x86_64 3:525.85.12-1.el7 will be an update
    ---> Package nvidia-settings.x86_64 3:515.48.07-1.el7 will be obsoleted
    ---> Package nvidia-settings.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-settings.x86_64 3:525.85.12-1.el7 will be obsoleting
    ---> Package nvidia-xconfig-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-xconfig-latest-dkms.x86_64 3:525.85.12-1.el7 will be an update
    --> Running transaction check
    ---> Package kmod-nvidia-open-dkms.x86_64 3:515.86.01-1.el7 will be installed
    ---> Package nvidia-persistenced-latest-dkms.x86_64 3:515.48.07-1.el7 will be updated
    ---> Package nvidia-persistenced-latest-dkms.x86_64 3:525.85.12-1.el7 will be an update
    --> Processing Conflict: 3:kmod-nvidia-open-dkms-515.86.01-1.el7.x86_64 conflicts kmod-nvidia-latest-dkms
    --> Processing Conflict: 3:nvidia-driver-latest-dkms-525.85.12-1.el7.x86_64 conflicts nvidia-driver > 525.85.12
    --> Processing Conflict: 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64 conflicts nvidia-driver > 515.86.01
    --> Finished Dependency Resolution
    NVIDIA: No kernel module package kmod-nvidia-branch-515 for kernel-3.10.0-1160.76.1.el7.x86_64 and 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64 found. Ignoring the new kernel
     You could try using --skip-broken to work around the problem
     You could try running: rpm -Va --nofiles --nodigest
fatal: [edc-gpublx004-04.prod.walmart.com]: FAILED! => changed=false 
  changes:
    installed:
    - nvidia-driver-branch-515
  msg: |-
    Error: kmod-nvidia-open-dkms conflicts with 3:kmod-nvidia-latest-dkms-525.85.12-1.el7.x86_64
    Error: nvidia-driver-branch-515 conflicts with 3:nvidia-driver-latest-dkms-525.85.12-1.el7.x86_64
    Error: nvidia-driver-latest-dkms conflicts with 3:nvidia-driver-branch-515-515.86.01-1.el7.x86_64
  rc: 1
  results:
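The output above is long, and the lines that matter are the final `Error: ... conflicts with ...` entries. As a rough sketch for pulling just those out of a saved copy of the playbook output, something like the following might help. The function name `list_conflicts` and the file name `playbook.log` are hypothetical and not part of deepops.

```shell
#!/usr/bin/env bash
# Sketch: reduce yum output to its package conflicts.
# list_conflicts is a hypothetical helper, not a deepops or yum command.

# Read yum/Ansible output on stdin and print every
# "Error: X conflicts with Y" line as "X vs Y".
list_conflicts() {
  sed -n 's/^[[:space:]]*Error: \(.*\) conflicts with \(.*\)$/\1 vs \2/p'
}

# Example usage, assuming the output was saved to playbook.log:
#   list_conflicts < playbook.log
```

Run against the log in this issue, this would list three conflicts, all between the `latest-dkms` packages and the `branch-515` / `open-dkms` packages.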
supertetelman commented 1 year ago

Is there any additional debug information you can provide? From the looks of the error, either there was a problem with the NVIDIA driver already installed on the node, or the OS is unable to find a suitable driver to install.
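As an illustration of the kind of debug information that could help here, the sketch below collects the running kernel, the `nvidia-smi` output, the loaded NVIDIA kernel modules, and the installed NVIDIA packages on a node. The function name `collect_gpu_debug` and the section labels are assumptions made for this example; the underlying commands (`uname`, `nvidia-smi`, `lsmod`, `rpm`) are standard tools, and each one is skipped if it is not present.

```shell
#!/usr/bin/env bash
# Sketch: gather basic NVIDIA driver state on a GPU node.
# collect_gpu_debug is a hypothetical helper, not part of deepops.

collect_gpu_debug() {
  local pkgs

  echo "== kernel =="
  uname -r

  echo "== nvidia-smi =="
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed (driver may not be loaded)"
  else
    echo "nvidia-smi not found"
  fi

  echo "== loaded nvidia kernel modules =="
  if command -v lsmod >/dev/null 2>&1; then
    lsmod | grep -i '^nvidia' || echo "none loaded"
  else
    echo "lsmod not found"
  fi

  echo "== installed nvidia packages =="
  if command -v rpm >/dev/null 2>&1; then
    # Collect first so an empty result can be reported explicitly.
    pkgs="$(rpm -qa 2>/dev/null | grep -i nvidia | sort || true)"
    if [ -n "$pkgs" ]; then
      echo "$pkgs"
    else
      echo "none installed"
    fi
  else
    echo "rpm not found"
  fi
}

collect_gpu_debug
```

Comparing the kernel version with the installed `kmod-nvidia-*` packages shows whether a kernel module exists for the running kernel, and the package list shows which driver packages are already on the node.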

On the GPU nodes can you answer the following questions:

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.