GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Update nvidia-driver-installer pull policy for init container #354

Open konturn opened 6 months ago

konturn commented 6 months ago

I've run into an issue where node maintenance on GPU nodes prevents the driver installer daemonset from starting up again. Specifically, our issue looks like this:

  1. GCP schedules maintenance for our H100 node, which we cannot prevent. We’re using the termination maintenance policy here, so the node gets stopped.
  2. The node gets restarted, and GCP tries to reattach the local SSDs from before but cannot. These local SSDs hold the containerd image storage (via a symlink) and the Nvidia driver storage, which means that all the images are wiped from the node.
  3. The daemonset which exposes GPUs on the node cannot start, since the image no longer exists on the node and the pull policy is set to ‘Never’.

The fix here entails self-managing a modified version of the daemonset that has the adjusted pull policy. The GKE documentation should link to a daemonset that's able to work properly after node maintenance events.
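
As a rough illustration of the pull-policy adjustment described above, here is a minimal Python sketch that rewrites a DaemonSet manifest held as a plain dictionary. The helper name `relax_init_pull_policy`, the replacement policy `IfNotPresent`, and the trimmed-down example manifest (container name and image included) are assumptions made for illustration; the thread does not state which policy the change uses, and this is not the actual manifest from the repository.

```python
import copy


def relax_init_pull_policy(daemonset, new_policy="IfNotPresent"):
    """Return a copy of a DaemonSet manifest in which no init container
    uses imagePullPolicy 'Never'.

    new_policy is an assumption; 'Always' is the other valid alternative.
    """
    patched = copy.deepcopy(daemonset)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    for container in pod_spec.get("initContainers", []):
        if container.get("imagePullPolicy") == "Never":
            container["imagePullPolicy"] = new_policy
    return patched


# Hypothetical, trimmed-down manifest used only to exercise the helper.
example = {
    "kind": "DaemonSet",
    "spec": {
        "template": {
            "spec": {
                "initContainers": [
                    {
                        "name": "driver-installer",       # hypothetical name
                        "image": "example/installer:tag",  # hypothetical image
                        "imagePullPolicy": "Never",
                    }
                ]
            }
        }
    },
}

patched = relax_init_pull_policy(example)
print(patched["spec"]["template"]["spec"]["initContainers"][0]["imagePullPolicy"])
```

With any policy other than `Never`, the kubelet is allowed to pull the image again once the node's local image store has been wiped, which is the situation in step 2. Whether the image reference is actually pullable from a registry has to be checked separately.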

google-cla[bot] commented 6 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.