Closed dchaley closed 3 weeks ago
We'll need to figure out how to install GPU drivers. On Vertex AI & Batch this is automated via the --installGpuDrivers flag.
We grabbed this script from one of our Batch jobs:
# This script helps to install GPU drivers.
# It has been tested successfully with the following operating systems:
# - Debian GNU/Linux 10 (buster, amd64 built on 20230809, supports Shielded VM features)
# - Debian GNU/Linux 11 (bullseye, amd64 built on 20240415, supports Shielded VM features)
# - CentOS 7 (x86_64 built on 20230809, supports Shielded VM features)
# - Rocky 8 (x86_64 built on 20240111, supports Shielded VM features)
# It may or may not work correctly with other operating systems and versions.
install_gpu_driver() {
  if [ -f /etc/rocky-release ]; then
    gsutil cp gs://nvidia-drivers-us-public/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run NVIDIA-Linux-x86_64-550.54.15.run
    sudo sh NVIDIA-Linux-x86_64-550.54.15.run -s --no-drm
  else
    gsutil cp gs://nvidia-drivers-us-public/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run NVIDIA-Linux-x86_64-550.54.15.run
    sudo sh NVIDIA-Linux-x86_64-550.54.15.run -s --dkms
  fi
  nvidia-smi
}
gpu_driver_installed() {
  echo "[BATCH GPU Driver Installation]: Checking for existing GPU driver installation."
  if nvidia-smi; then
    echo "[BATCH GPU Driver Installation]: Found GPU driver."
    return 0
  fi
  echo "[BATCH GPU Driver Installation]: No GPU driver found, will install."
  return 1
}
gpu_driver_installed_for_cos() {
  echo "[BATCH GPU Driver Installation]: Checking for existing GPU driver installation."
  if /var/lib/nvidia/bin/nvidia-smi; then
    echo "[BATCH GPU Driver Installation]: Found GPU driver."
    return 0
  fi
  echo "[BATCH GPU Driver Installation]: No GPU driver found, will install."
  return 1
}
install_gpu_driver_packages() {
  OSID="$(. /etc/os-release && echo "$ID")"
  if [ ! "$OSID" = "debian" ] && [ ! -f /etc/centos-release ] && [ ! -f /etc/rocky-release ] && ! grep -qi cos /etc/os-release; then
    echo "[Batch GPU Drivers Installation] Warning: this script has not been fully tested on the current operation system, it may fail."
  fi
  if [ -f /etc/centos-release ] || [ -f /etc/rocky-release ]; then
    # On CentOS or Rocky.
    if gpu_driver_installed; then
      exit 0
    fi
    if [ ! -d /batch ]; then
      mkdir -p /opt/google/gpu-installer
      echo 1 >> /opt/google/gpu-installer/deps_installed.flag
      yum clean all
      yum update -y --skip-broken
      # CentOS does not have pre-installed python3
      # Rocky Linux installs drivers without dkms
      if [ -f /etc/centos-release ]; then
        yum install -y dkms python3
        yum install -y "kernel-devel-uname-r == $(uname -r)" "kernel-headers-uname-r == $(uname -r)"
      fi
      if [ -f /etc/rocky-release ]; then
        kernel_version=$(uname -r) && dnf install -y "kernel-devel-$kernel_version" "kernel-headers-$kernel_version"
      fi
      yum install -y epel-release pciutils gcc make acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig
      CLOUDSDK_PYTHON=/usr/bin/python3 gsutil cp gs://batch-agent-prod-us/gpu-drivers-installation-tool/prepare_gpu_drivers_installation.py prepare_gpu_drivers_installation.py
      sudo python3 prepare_gpu_drivers_installation.py
      rm -f prepare_gpu_drivers_installation.py
    fi
    install_gpu_driver
  elif grep -qi cos /etc/os-release; then
    # On COS.
    if gpu_driver_installed_for_cos; then
      exit 0
    fi
    cos-extensions install gpu -- -version=latest
    # Make the driver installation path executable by re-mounting it.
    mount --bind /var/lib/nvidia /var/lib/nvidia
    mount -o remount,exec /var/lib/nvidia
    /var/lib/nvidia/bin/nvidia-smi
  else
    # This is the default behavior for Debian and all the other OSes.
    if gpu_driver_installed; then
      exit 0
    fi
    if [ ! -d /batch ]; then
      CLOUDSDK_PYTHON=/usr/bin/python3 gsutil cp gs://batch-agent-prod-us/gpu-drivers-installation-tool/prepare_gpu_drivers_installation.py prepare_gpu_drivers_installation.py
      sudo python3 prepare_gpu_drivers_installation.py
      rm -f prepare_gpu_drivers_installation.py
    fi
    install_gpu_driver
  fi
}
# Retry a given function n times with exponential back off.
# function signature: retry retryTimes description functionName
retry() {
  local retries=$1
  local count=0
  local description=$2
  local wait=1
  until "$3"; do
    exit=$?
    # If failed, wait 3 ** count seconds until next retry.
    wait=$(($wait*3))
    count=$(($count + 1))
    if [ $count -lt $retries ]; then
      echo "[Batch Action] $description exited $exit (retried $count/$retries), retrying in $wait seconds."
      sleep $wait
    else
      echo "[Batch Action] $description exited $exit (retried $count/$retries), no more retries left."
      return $exit
    fi
  done
  echo "[Batch Action] $description succeeded."
  return 0
}
retry 4 "GPU Driver installation" install_gpu_driver_packages
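To get a feel for how long the retry wrapper can stall a startup script, here is a small sketch that replays only the back-off arithmetic from retry() without sleeping or installing anything. The helper name backoff_schedule is made up for illustration and is not part of the Batch script.

```shell
#!/bin/bash
# Sketch only: mirrors the wait/count arithmetic of retry() above.
# backoff_schedule is a hypothetical helper, not part of the Batch script.
backoff_schedule() {
  retries=$1
  delay=1
  count=0
  total=0
  while :; do
    # Same update order as retry(): triple the delay, then bump the counter.
    delay=$((delay * 3))
    count=$((count + 1))
    if [ "$count" -lt "$retries" ]; then
      echo "after failure $count: sleep ${delay}s"
      total=$((total + delay))
    else
      # Last attempt failed: retry() returns here without sleeping.
      break
    fi
  done
  echo "total: ${total}s across $retries attempts"
}

backoff_schedule 4
```

With the budget of 4 used above, the delays are 3, 9 and 27 seconds, so a persistently failing install adds 39 seconds of sleeping on top of the four install attempts themselves.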
This appears to fetch a driver at a hardcoded version (550.54.15) rather than picking one for the machine. There's also a standalone installer script we can run instead, according to this page:
https://github.com/GoogleCloudPlatform/compute-gpu-installation/releases/tag/cuda-installer-v1.1.0
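A rough sketch of what a startup script built on that release might look like. Assumptions to verify against the release page: that the release ships an asset named cuda_installer.pyz, and that it accepts an install_driver subcommand. The run helper and the function name are invented here. It prints the commands by default; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/bash
# Sketch only. ASSET and the install_driver subcommand are assumptions;
# check them against the release page before relying on this.
RELEASE_TAG="cuda-installer-v1.1.0"
ASSET="cuda_installer.pyz"
BASE_URL="https://github.com/GoogleCloudPlatform/compute-gpu-installation/releases/download"

# Hypothetical helper: echo the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_driver_with_cuda_installer() {
  # Same idea as gpu_driver_installed: skip if nvidia-smi already works.
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "GPU driver already present, skipping."
    return 0
  fi
  run curl -fsSL -o "/tmp/${ASSET}" "${BASE_URL}/${RELEASE_TAG}/${ASSET}"
  run sudo python3 "/tmp/${ASSET}" install_driver
}

install_driver_with_cuda_installer
```

If the installer needs a reboot partway through, the startup script would run again on the next boot, which is why the nvidia-smi check comes first.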
Current status: we were able to create a VM with the container; however, it's taking a long time to load the container. The logs are spammed with download progress. Can we silence that and speed it up?
If it works, it should run CPU-only prediction.
Next step: run again, with the startup script we're testing per above.
One observation we made: it took a very long time to download & extract the container, and we got interactive-style logs the whole way. Meaning, we got logs saying things like [========> ] for the progress bars. It would probably speed things up if we can turn this off somehow.
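One possible way to quiet this, assuming the image is fetched with docker pull from our own startup script (if the VM's container launcher does the pull for us, this won't apply). The --quiet flag suppresses the per-layer progress output. Whether it actually shortens the download is untested. The pull_quietly name and the image path are placeholders.

```shell
#!/bin/bash
# Sketch only: assumes the startup script itself runs `docker pull`.
# IMAGE is a hypothetical placeholder, not our real image path.
IMAGE="${IMAGE:-example-region-docker.pkg.dev/example-project/example-repo/example-image:latest}"

pull_quietly() {
  # --quiet drops the progress bars and prints only the final image reference.
  docker pull --quiet "$1"
}

# Usage (left commented out so the sketch has no side effects):
# pull_quietly "$IMAGE"
```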
The GCE node ended up getting preempted 😅 after 40 minutes of still downloading the container.
In environments where only GCE is available (no Batch, no Vertex AI) we need to use the container the old-fashioned way: creating a VM and loading the container onto it.
This issue is to develop & document the process for doing this.
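As a starting point for the documentation, here is a sketch of creating the VM with a driver-install startup script attached. Every name, zone, machine type and GPU type below is a placeholder, and install-gpu-drivers.sh stands for whichever startup script we settle on. It prints the gcloud command by default; set DRY_RUN=0 to run it.

```shell
#!/bin/bash
# Sketch only: all values are hypothetical placeholders.
VM_NAME="${VM_NAME:-gpu-container-test}"
ZONE="${ZONE:-us-central1-a}"
MACHINE_TYPE="${MACHINE_TYPE:-n1-standard-8}"
GPU_TYPE="${GPU_TYPE:-nvidia-tesla-t4}"
STARTUP_SCRIPT="${STARTUP_SCRIPT:-install-gpu-drivers.sh}"

# Hypothetical helper: echo the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

create_gpu_vm() {
  # GPU VMs cannot live-migrate, hence --maintenance-policy=TERMINATE.
  # Debian 11 is one of the OSes the Batch script lists as tested.
  run gcloud compute instances create "$VM_NAME" \
    --zone="$ZONE" \
    --machine-type="$MACHINE_TYPE" \
    --accelerator="type=${GPU_TYPE},count=1" \
    --maintenance-policy=TERMINATE \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --boot-disk-size=100GB \
    --metadata-from-file="startup-script=${STARTUP_SCRIPT}"
}

create_gpu_vm
```

For the CPU-only test run, dropping the --accelerator and --maintenance-policy lines should be enough.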