Run container by hand on GCE instance

dchaley commented 3 weeks ago

In environments where only GCE is available (no Batch, no Vertex AI), we need to use the container the old-fashioned way: getting a VM and loading it up.

This issue is to develop & document the process to do this.

dchaley commented 3 weeks ago

We'll need to figure out how to install GPU drivers. On Vertex AI & Batch this is automated via the --installGpuDrivers flag.

We grabbed this script from one of our Batch jobs:

# This script helps to install GPU drivers.
# It has been tested successfully with the following operating systems:
# - Debian GNU/Linux 10 (buster, amd64 built on 20230809, supports Shielded VM features)
# - Debian GNU/Linux 11 (bullseye, amd64 built on 20240415, supports Shielded VM features)
# - CentOS 7 (x86_64 built on 20230809, supports Shielded VM features)
# - Rocky 8 (x86_64 built on 20240111, supports Shielded VM features)
# It may or may not work correctly with other operating systems and versions.

install_gpu_driver() {
  if [ -f /etc/rocky-release ]; then
    gsutil cp gs://nvidia-drivers-us-public/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run NVIDIA-Linux-x86_64-550.54.15.run
    sudo sh NVIDIA-Linux-x86_64-550.54.15.run -s --no-drm
  else
    gsutil cp gs://nvidia-drivers-us-public/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run NVIDIA-Linux-x86_64-550.54.15.run
    sudo sh NVIDIA-Linux-x86_64-550.54.15.run -s --dkms
  fi
  nvidia-smi
}

gpu_driver_installed() {
  echo "[BATCH GPU Driver Installation]: Checking for existing GPU driver installation."
  if nvidia-smi; then
    echo "[BATCH GPU Driver Installation]: Found GPU driver."
    return 0
  fi
  echo "[BATCH GPU Driver Installation]: No GPU driver found, will install."
  return 1
}

gpu_driver_installed_for_cos() {
  echo "[BATCH GPU Driver Installation]: Checking for existing GPU driver installation."
  if /var/lib/nvidia/bin/nvidia-smi; then
    echo "[BATCH GPU Driver Installation]: Found GPU driver."
    return 0
  fi
  echo "[BATCH GPU Driver Installation]: No GPU driver found, will install."
  return 1
}

install_gpu_driver_packages() {
  OSID="$(. /etc/os-release && echo "$ID")"
  if [ ! "$OSID" = "debian" ] && [ ! -f /etc/centos-release ] && [ ! -f /etc/rocky-release ] && ! grep -qi cos /etc/os-release; then
    echo "[Batch GPU Drivers Installation] Warning: this script has not been fully tested on the current operation system, it may fail."
  fi
  if [ -f /etc/centos-release ] || [ -f /etc/rocky-release ]; then
    # On CentOS or Rocky.
    if gpu_driver_installed; then
      exit 0
    fi
    if [ ! -d /batch ]; then
      mkdir -p /opt/google/gpu-installer
      echo 1 >> /opt/google/gpu-installer/deps_installed.flag
      yum clean all
      yum update -y --skip-broken
      # CentOS does not have pre-installed python3
      # Rocky Linux installs drivers without dkms
      if [ -f /etc/centos-release ]; then
        yum install -y dkms python3
        yum install -y "kernel-devel-uname-r == $(uname -r)" "kernel-headers-uname-r == $(uname -r)"
      fi
      if [ -f /etc/rocky-release ]; then
        kernel_version=$(uname -r) && dnf install -y "kernel-devel-$kernel_version" "kernel-headers-$kernel_version"
      fi
      yum install -y epel-release pciutils gcc make acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig
      CLOUDSDK_PYTHON=/usr/bin/python3 gsutil cp gs://batch-agent-prod-us/gpu-drivers-installation-tool/prepare_gpu_drivers_installation.py prepare_gpu_drivers_installation.py
      sudo python3 prepare_gpu_drivers_installation.py
      rm -f prepare_gpu_drivers_installation.py
    fi
    install_gpu_driver
  elif grep -qi cos /etc/os-release; then
    # On COS.
    if gpu_driver_installed_for_cos; then
      exit 0
    fi
    cos-extensions install gpu -- -version=latest
    # Make the driver installation path executable by re-mounting it.
    mount --bind /var/lib/nvidia /var/lib/nvidia
    mount -o remount,exec /var/lib/nvidia
    /var/lib/nvidia/bin/nvidia-smi
  else
    # This is the default behavior for Debian and all the other OSes.
    if gpu_driver_installed; then
      exit 0
    fi
    if [ ! -d /batch ]; then
      CLOUDSDK_PYTHON=/usr/bin/python3 gsutil cp gs://batch-agent-prod-us/gpu-drivers-installation-tool/prepare_gpu_drivers_installation.py prepare_gpu_drivers_installation.py
      sudo python3 prepare_gpu_drivers_installation.py
      rm -f prepare_gpu_drivers_installation.py
    fi
    install_gpu_driver
  fi
}

# Retry a given function n times with exponential back off.
# function signature: retry retryTimes description functionName
retry() {
  local retries=$1
  local count=0
  local description=$2
  local wait=1
  until "$3"; do
    exit=$?
    # If failed, wait 3 ** count seconds until next retry.
    wait=$(($wait*3))
    count=$(($count + 1))
    if [ $count -lt $retries ]; then
      echo "[Batch Action] $description exited $exit (retried $count/$retries), retrying in $wait seconds."
      sleep $wait
    else
      echo "[Batch Action] $description exited $exit (retried $count/$retries), no more retries left."
      return $exit
    fi
  done
  echo "[Batch Action] $description succeeded."
  return 0
}

retry 4 "GPU Driver installation" install_gpu_driver_packages

The Batch script appears to fetch a driver at a hardcoded version (550.54.15) rather than picking one for the machine. There's also a standalone installer, separate from Batch, that we can run instead, according to this page:

https://github.com/GoogleCloudPlatform/compute-gpu-installation/releases/tag/cuda-installer-v1.1.0
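If we end up reusing the Batch script rather than the standalone installer, the hardcoded version could be pulled into one variable. A minimal sketch, assuming the bucket layout seen in the script holds for other versions too (not verified); `DRIVER_VERSION`, `driver_file` and `driver_url` are made-up names:

```shell
#!/bin/sh
# Sketch: derive the driver file name and bucket path from one version value
# instead of repeating it on every line. The bucket and path layout are
# copied from the Batch script; whether other versions exist there is an
# assumption.
DRIVER_VERSION="${DRIVER_VERSION:-550.54.15}"

driver_file() {
  # $1: driver version, e.g. 550.54.15
  echo "NVIDIA-Linux-x86_64-$1.run"
}

driver_url() {
  # $1: driver version
  echo "gs://nvidia-drivers-us-public/tesla/$1/$(driver_file "$1")"
}

# Example (not executed here):
#   gsutil cp "$(driver_url "$DRIVER_VERSION")" "$(driver_file "$DRIVER_VERSION")"
```

Changing the version then becomes a one-line edit, or an environment variable set on the VM.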

dchaley commented 3 weeks ago

Current status: we were able to create a VM with the container; however, it's taking a long time to load the container. The logs are spammed with download progress. Can we silence that and speed it up?

If it works, it should run CPU-only prediction.
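For the record, creating the CPU-only VM could look roughly like the sketch below. It only prints the command so it can be reviewed before running. The instance name, zone, machine type and image path are placeholders, and using `create-with-container` is an assumption about how the VM was made:

```shell
#!/bin/sh
# Placeholder values: replace with the real project settings.
INSTANCE_NAME="${INSTANCE_NAME:-deepcell-manual}"
ZONE="${ZONE:-us-central1-a}"
MACHINE_TYPE="${MACHINE_TYPE:-n1-standard-8}"
CONTAINER_IMAGE="${CONTAINER_IMAGE:-REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG}"

build_create_command() {
  # Print the command instead of running it (dry run).
  # No --accelerator flag: this is the CPU-only variant.
  printf '%s ' gcloud compute instances create-with-container "$INSTANCE_NAME" \
    "--zone=$ZONE" \
    "--machine-type=$MACHINE_TYPE" \
    "--container-image=$CONTAINER_IMAGE"
  printf '\n'
}

build_create_command
```

A GPU variant would additionally need an accelerator attached and the driver install discussed above.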

Next step: run again, with the startup script we're testing per above.
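One thing to keep in mind when rerunning: the Batch script wraps the install in `retry 4`, so a failing install gets retried with a growing delay before boot continues. The sketch below just replays the arithmetic of that `retry` function (delay triples after each failure, no sleep after the last attempt) without sleeping; `backoff_schedule` is a made-up name:

```shell
#!/bin/sh
# Replay the wait arithmetic of the Batch script's retry() helper for the
# worst case where every attempt fails. Nothing is actually run or slept.
backoff_schedule() {
  retries=$1
  count=0
  delay=1
  total=0
  while :; do
    delay=$((delay * 3))
    count=$((count + 1))
    if [ "$count" -lt "$retries" ]; then
      echo "attempt $count failed: sleep ${delay}s"
      total=$((total + delay))
    else
      echo "attempt $count failed: give up (slept ${total}s in total)"
      return 0
    fi
  done
}

backoff_schedule 4
```

With four retries this prints sleeps of 3s, 9s and 27s, so 39 seconds of waiting at most, on top of however long each install attempt itself takes.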

dchaley commented 3 weeks ago

One observation we made: it took a very long time to download & extract the container, and we got interactive-style logs the whole way. That is, the logs were full of progress-bar lines like [========> ]. It would probably speed things up if we could turn this off somehow.
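Two possible ways to quiet this down, sketched under the assumption that our own startup script pulls the image with the Docker CLI (if the VM's container launcher does the pull, the first option may not apply). `pull_quietly` and `strip_progress_bars` are made-up helper names, and whether this actually speeds up the download is untested:

```shell
#!/bin/sh
# Option 1: ask docker not to print per-layer progress at all.
pull_quietly() {
  # $1: image reference. --quiet suppresses the verbose progress output.
  docker pull --quiet "$1"
}

# Option 2: filter logs that are already noisy. Drops lines containing a
# progress bar such as "[========>      ]" and keeps everything else.
# Note: grep -v exits non-zero when no lines are left after filtering.
strip_progress_bars() {
  grep -v '\[=*>* *]'
}

# Example (not executed here):
#   pull_quietly "REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG"
#   some_noisy_command 2>&1 | strip_progress_bars
```

The filter only cleans up what we read; it doesn't change how much gets written to the serial console in the first place.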

dchaley commented 3 weeks ago

The GCE node ended up getting preempted 😅 after 40 minutes, while it was still downloading the container.

dchaley / deepcell-imaging

Run container by hand on GCE instance #242