GoogleCloudPlatform / cos-gpu-installer

Scripts to build and use a container to install GPU drivers on Container-Optimized OS images
Apache License 2.0

Failing installation to GKE driver ver 440.64 #52

Open mikhno-s opened 4 years ago

mikhno-s commented 4 years ago

Hey, I tried to install a new version of the NVIDIA driver (440.64) on GKE, but the installer failed with the following output:

COS_DOWNLOAD_GCS=https://storage.googleapis.com/cos-tools
+ COS_KERNEL_SRC_GIT=https://chromium.googlesource.com/chromiumos/third_party/kernel
+ COS_KERNEL_SRC_ARCHIVE=kernel-src.tar.gz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ARCHIVE=toolchain.tar.xz
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_DIR=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=440.64
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO    2020-05-27 21:20:11 UTC] PRELOAD: false
[INFO    2020-05-27 21:20:11 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO    2020-05-27 21:20:11 UTC] Running on COS build id 12371.175.0
[INFO    2020-05-27 21:20:11 UTC] Checking if third party kernel modules can be installed
[INFO    2020-05-27 21:20:11 UTC] Checking cached version
[INFO    2020-05-27 21:20:11 UTC] Did not find cached version, building the drivers...
[INFO    2020-05-27 21:20:11 UTC] Downloading GPU installer ...
/usr/local/nvidia /
[INFO    2020-05-27 21:20:11 UTC] Downloading from https://storage.googleapis.com/nvidia-drivers-us-public/tesla/440.64/NVIDIA-Linux-x86_64-440.64.run
/
ls: cannot access '/build/usr/src/linux': No such file or directory
[INFO    2020-05-27 21:20:11 UTC] Kernel sources not found locally, downloading
[INFO    2020-05-27 21:20:11 UTC] Kernel source archive download URL: https://storage.googleapis.com/cos-tools/12371.175.0/kernel-src.tar.gz
/build/usr/src/linux /
/
/build/usr/src/linux /

real    0m0.694s
user    0m0.178s
sys     0m0.334s
/
[INFO    2020-05-27 21:20:18 UTC] Setting up compilation environment
[INFO    2020-05-27 21:20:18 UTC] Obtaining toolchain_env file from https://storage.googleapis.com/cos-tools/12371.175.0/toolchain_env

real    0m0.025s
user    0m0.016s
sys     0m0.002s
[INFO    2020-05-27 21:20:18 UTC] /build/cos-tools: bin
lib
toolchain.tar.xz
usr
[INFO    2020-05-27 21:20:18 UTC] Found existing toolchain package. Skipping download and installation
[INFO    2020-05-27 21:20:18 UTC] Configuring environment variables for cross-compilation
[INFO    2020-05-27 21:20:18 UTC] Configuring installation directories
/usr/local/nvidia /
[INFO    2020-05-27 21:20:18 UTC] Updating container's ld cache
/
[INFO    2020-05-27 21:20:18 UTC] Configuring kernel sources
/build/usr/src/linux /
/bin/sh: 1: x86_64-cros-linux-gnu-clang: Permission denied
  HOSTCC  scripts/basic/fixdep
/bin/sh: 1: x86_64-cros-linux-gnu-clang: Permission denied
  HOSTCC  scripts/kconfig/conf.o
  YACC    scripts/kconfig/zconf.tab.c
  LEX     scripts/kconfig/zconf.lex.c
  HOSTCC  scripts/kconfig/zconf.tab.o
  HOSTLD  scripts/kconfig/conf
scripts/kconfig/conf  --olddefconfig Kconfig
./scripts/gcc-version.sh: 26: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 27: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 29: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 26: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 27: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 29: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
init/Kconfig:17: syntax error
init/Kconfig:16: invalid option
./scripts/clang-version.sh: 15: ./scripts/clang-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-plugin.sh: 11: ./scripts/gcc-plugin.sh: x86_64-cros-linux-gnu-clang: Permission denied
make[1]: *** [olddefconfig] Error 1
scripts/kconfig/Makefile:69: recipe for target 'olddefconfig' failed
make: *** [olddefconfig] Error 2
Makefile:531: recipe for target 'olddefconfig' failed

I am using this manifest:

https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/cos/daemonset-preloaded.yaml

hsharrison commented 4 years ago

Have you had any luck with this?

CarlosRDomin commented 3 years ago

(In case anyone comes across the same problem)
I solved it by updating the installer image to gcr.io/cos-cloud/cos-gpu-installer@sha256:8d86a652759f80595cafed7d3dcde3dc53f57f9bc1e33b27bc3cfa7afea8d483 and setting NVIDIA_DRIVER_VERSION to 450.51.06:

kubectl patch daemonset nvidia-driver-installer --namespace kube-system --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","image":"gcr.io/cos-cloud/cos-gpu-installer@sha256:8d86a652759f80595cafed7d3dcde3dc53f57f9bc1e33b27bc3cfa7afea8d483","imagePullPolicy":"IfNotPresent","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"450.51.06"}]}]}}}}'

(Source: https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-nvidia-v450.yaml)
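
For anyone who needs a different image digest or driver version, the one-line patch above can be split into a small helper that builds the JSON from two arguments. This is only a sketch: `build_patch` and `apply_patch` are hypothetical names, not part of cos-gpu-installer, and it assumes the DaemonSet is named `nvidia-driver-installer` in `kube-system`, as in the command above. Whether a given image and driver version combination works on your COS build is not something this checks.

```shell
#!/bin/sh
# Hypothetical helper: prints the same patch body as the kubectl command above,
# with the installer image and driver version supplied as arguments.
build_patch() {
  image=$1
  driver_version=$2
  printf '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","image":"%s","imagePullPolicy":"IfNotPresent","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"%s"}]}]}}}}' \
    "$image" "$driver_version"
}

# Hypothetical wrapper: applies the generated patch to the DaemonSet.
# Assumes kubectl is already configured for the target cluster.
apply_patch() {
  kubectl patch daemonset nvidia-driver-installer \
    --namespace kube-system \
    --patch "$(build_patch "$1" "$2")"
}
```

To reproduce the fix from the comment above, call `apply_patch` with the image reference and driver version quoted there (`450.51.06`). Running `build_patch` on its own first lets you inspect the JSON before anything is sent to the cluster.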