NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

GPU Operator unable to identify FIPS compliant Ubuntu Kernel for TKG on AWS #152

Open voor opened 3 years ago

voor commented 3 years ago

1. Issue or feature description

$ uname -a
Linux ip-10-150-3-29.us-gov-east-1.compute.internal 4.15.0-2036-aws-fips #38-Ubuntu SMP Wed Jan 20 02:20:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ k logs nvidia-driver-daemonset-8t4mg -n gpu-operator-resources

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 4.15.0-2036-aws-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
$ k get po -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-operator-7cdbb67dcb-wjq2q              1/1     Running            0          94s
nvidia-container-toolkit-daemonset-25cw2   0/1     Init:0/1           0          85s
nvidia-container-toolkit-daemonset-69tjh   0/1     Init:0/1           0          84s
nvidia-driver-daemonset-8t4mg              0/1     Error              3          90s
nvidia-driver-daemonset-qgngv              0/1     CrashLoopBackOff   3          90s
$ apt-cache search linux-headers
linux-headers-5.4.0-1035-aws - Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
linux-headers-5.4.0-1037-aws - Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
linux-libc-dev - Linux Kernel Headers for development
linux-headers-4.15.0-2036-aws-fips - Linux kernel headers for version 4.15.0 on 64 bit x86 SMP
linux-headers-aws-fips - FIPS 140-2 Linux kernel headers for AWS
linux-headers-aws - Linux kernel headers for Amazon Web Services (AWS) systems.

Labels on node:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
    nfd.node.kubernetes.io/extended-resources: ""
    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-cpuid.AVX512VL,cpu-cpuid.AVX512VNNI,cpu-cpuid.FMA3,cpu-cpuid.HYPERVISOR,cpu-cpuid.MPX,cpu-hardware_multithreading,kernel-config.NO_HZ,kernel-config.NO_HZ_IDLE,kernel-version.full,kernel-version.major,kernel-version.minor,kernel-version.revision,pci-0300_1d0f.present,pci-0302_10de.present,storage-nonrotationaldisk,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major,system-os_release.VERSION_ID.minor
    nfd.node.kubernetes.io/worker.version: v0.7.0
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 10.150.3.24/24
    projectcalico.org/IPv4IPIPTunnelAddr: 100.102.42.192
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2021-02-19T01:08:51Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g4dn.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-gov-east-1
    failure-domain.beta.kubernetes.io/zone: us-gov-east-1b
    feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512BW: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512CD: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512VL: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.FMA3: "true"
    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR: "true"
    feature.node.kubernetes.io/cpu-cpuid.MPX: "true"
    feature.node.kubernetes.io/cpu-hardware_multithreading: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE: "true"
    feature.node.kubernetes.io/kernel-version.full: 4.15.0-2036-aws-fips
    feature.node.kubernetes.io/kernel-version.major: "4"
    feature.node.kubernetes.io/kernel-version.minor: "15"
    feature.node.kubernetes.io/kernel-version.revision: "0"
    feature.node.kubernetes.io/pci-0300_1d0f.present: "true"
    feature.node.kubernetes.io/pci-0302_10de.present: "true"
    feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
    feature.node.kubernetes.io/system-os_release.ID: ubuntu
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "18.04"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "18"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "04"
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-150-3-24.us-gov-east-1.compute.internal
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: g4dn.xlarge
    nvidia.com/gpu.present: "true"
    topology.kubernetes.io/region: us-gov-east-1
    topology.kubernetes.io/zone: us-gov-east-1b

2. Steps to reproduce the issue

Run the GPU Operator on a FIPS-compliant Ubuntu 18.04 AMI. (If you do not have access to Tanzu Kubernetes Grid, this can be achieved with Ubuntu Pro Advantage on AWS.)

shivamerla commented 3 years ago

@voor It looks like the kernel headers for the AWS kernel are not found in the apt repository lists we used for the build:

RUN echo "deb [arch=amd64] http://archive.ubuntu.com/ubuntu/ bionic main universe" > /etc/apt/sources.list && \
    echo "deb [arch=amd64] http://archive.ubuntu.com/ubuntu/ bionic-updates main universe" >> /etc/apt/sources.list && \
    echo "deb [arch=amd64] http://archive.ubuntu.com/ubuntu/ bionic-security main universe" >> /etc/apt/sources.list && \
    usermod -o -u 0 -g 0 _apt

Can you attach the /etc/apt/sources.list file from your host?
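
A quick way to compare what the host's apt sources can see with what the container's sources can see is to look up the headers package for the running kernel. The sketch below assumes the lookup is for a package named linux-headers-<kernel release>, which matches the package names in the apt-cache output earlier in this thread; the helper names headers_pkg and check_headers are made up for illustration and are not part of the operator.

```shell
#!/bin/sh
# Sketch: check whether the configured apt sources know a headers package
# for a given kernel. headers_pkg and check_headers are hypothetical helpers.

# Build the headers package name for a kernel release string.
headers_pkg() {
  echo "linux-headers-$1"
}

# Report whether apt can resolve the headers package for the given kernel.
check_headers() {
  pkg=$(headers_pkg "$1")
  if ! command -v apt-cache >/dev/null 2>&1; then
    echo "apt-cache not available; cannot check $pkg"
    return 2
  fi
  if apt-cache show "$pkg" >/dev/null 2>&1; then
    echo "found: $pkg"
  else
    echo "missing: $pkg"
    return 1
  fi
}

# Check the running kernel; '|| true' keeps the script going on a miss.
check_headers "$(uname -r)" || true
```

Running this on the host and again inside a container that uses only the three bionic entries above should show whether the FIPS headers package is visible from each set of sources.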

voor commented 3 years ago

You will want to launch Ubuntu Pro 18.04 LTS from the AWS Marketplace; that will give you access to the Ubuntu Advantage repository that contains the FIPS-compliant AWS kernels.

drawsmcgraw commented 3 years ago

@shivamerla Thanks for the attention to this issue. Just adding my +1 here. Would love to see this "just work".

shivamerla commented 3 years ago

@drawsmcgraw @voor Sorry for the delay on this. Can you try creating a ConfigMap (say, repo-config) in the gpu-operator-resources namespace from the host's /etc/apt/sources.list file, and then install the operator passing --set driver.repoConfig.ConfigMapName=repo-config --set driver.repoConfig.destinationDir=/etc/apt/sources.list.d?
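
The suggestion above might look roughly like the following. This is a sketch under assumptions, not a verified procedure: the chart reference nvidia/gpu-operator assumes a Helm repository alias named "nvidia", the sources.list path assumes the file was copied from the FIPS host to the machine running kubectl, and the --set option names are copied verbatim from the comment. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the suggested workaround. Defaults to a dry run that only prints
# the commands; set DRY_RUN=0 to execute them against the current cluster.

NAMESPACE="gpu-operator-resources"
CONFIGMAP="repo-config"
SOURCES_LIST="/etc/apt/sources.list"   # assumption: copied from the FIPS host
CHART="nvidia/gpu-operator"            # assumption: Helm repo alias "nvidia"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the ConfigMap from the host's apt sources file.
create_repo_config() {
  run kubectl create configmap "$CONFIGMAP" \
    -n "$NAMESPACE" \
    --from-file="$SOURCES_LIST"
}

# Install the operator, pointing the driver container at the ConfigMap.
install_operator() {
  # Option names are copied verbatim from the comment above.
  run helm install --wait --generate-name "$CHART" \
    --set "driver.repoConfig.ConfigMapName=$CONFIGMAP" \
    --set "driver.repoConfig.destinationDir=/etc/apt/sources.list.d"
}

create_repo_config
install_operator
```

Reviewing the printed commands first and then rerunning with DRY_RUN=0 keeps the ConfigMap creation and the install as two visible steps.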