bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Nvidia container-runtime API for GPU allocation #4052

Closed: ytsssun closed this 2 months ago

ytsssun commented 3 months ago

Co-authored-by: Monirul Islam

Revives: https://github.com/bottlerocket-os/bottlerocket/pull/3994

Description of changes: This PR exposes two new APIs that allow customers to configure the values of `accept-nvidia-visible-devices-as-volume-mounts` and `accept-nvidia-visible-devices-envvar-when-unprivileged` for the NVIDIA container runtime.

We previously made volume mounts the default mechanism for injecting NVIDIA GPUs (https://github.com/bottlerocket-os/bottlerocket/pull/3718). This PR lets users opt in to the previous behavior, in which unprivileged pods get access to all GPUs when NVIDIA_VISIBLE_DEVICES=all is set, and makes both behaviors configurable.
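For operators who prefer to configure this at boot rather than through the API, the same opt-in can presumably be expressed in Bottlerocket TOML user data (a minimal sketch using the setting paths below; this exact snippet is illustrative and not taken from the PR):

    [settings.kubernetes.nvidia.container-runtime]
    visible-devices-as-volume-mounts = false
    visible-devices-envvar-when-unprivileged = true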

| Bottlerocket setting | Impact | Value | What it means |
| --- | --- | --- | --- |
| `settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts` | Changes the `accept-nvidia-visible-devices-as-volume-mounts` value for the k8s container toolkit | `true` / `false`, default: `true` | Adjusting `visible-devices-as-volume-mounts` alters how GPUs are detected and integrated within container environments. Setting it to `true` makes the NVIDIA runtime recognize GPU devices listed in the `NVIDIA_VISIBLE_DEVICES` environment variable and mount them as volumes, so applications in the container can use the GPUs as if they were local resources. |
| `settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged` | Sets the `accept-nvidia-visible-devices-envvar-when-unprivileged` value of the NVIDIA container runtime for the k8s variant | `true` / `false`, default: `false` | When set to `false`, unprivileged containers are prevented from accessing all GPU devices on the host by default. If `NVIDIA_VISIBLE_DEVICES=all` is set in the container image and `visible-devices-envvar-when-unprivileged` is `true`, all GPUs on the host become accessible to the container regardless of the limits set via `nvidia.com/gpu`. That can allocate more GPUs to a pod than intended, affecting resource scheduling and isolation. |
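To illustrate the risk described above, a hypothetical unprivileged pod could request every GPU through the environment variable alone (a sketch; the pod name and image tag are assumptions, not from the PR):

    $ kubectl run gpu-all --restart=Never \
        --image=nvidia/cuda:12.3.1-base-ubuntu22.04 \
        --env="NVIDIA_VISIBLE_DEVICES=all" -- nvidia-smi

With `visible-devices-envvar-when-unprivileged=true`, `nvidia-smi` in this pod would report every GPU on the host even though no `nvidia.com/gpu` resources were requested; with the default of `false`, the environment variable is ignored for unprivileged pods.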

Testing done:

  1. Built an AMI for the NVIDIA variant. Verified the settings get picked up with their default values.

    $ apiclient get settings.kubernetes.nvidia.container-runtime
    {
      "settings": {
        "kubernetes": {
          "nvidia": {
            "container-runtime": {
              "visible-devices-as-volume-mounts": true,
              "visible-devices-envvar-when-unprivileged": false
            }
          }
        }
      }
    }
  2. Opted in to the previous behavior to allow unprivileged NVIDIA device access.

    $ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts=false
    $ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged=true
    $ apiclient get settings.kubernetes.nvidia.container-runtime
    {
      "settings": {
        "kubernetes": {
          "nvidia": {
            "container-runtime": {
              "visible-devices-as-volume-mounts": false,
              "visible-devices-envvar-when-unprivileged": true
            }
          }
        }
      }
    }
  3. Verified that the nvidia-container-runtime config exists.

    
    $ cat /etc/nvidia-container-runtime/config.toml
    accept-nvidia-visible-devices-as-volume-mounts = true
    accept-nvidia-visible-devices-envvar-when-unprivileged = false

    [nvidia-container-cli]
    root = "/"
    path = "/usr/bin/nvidia-container-cli"
    environment = []
    ldconfig = "@/sbin/ldconfig"



- [x] Migration Test
Tested migration from 1.20.1 to the new version.
Tested migration back to 1.20.1.
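For reference, a round-trip like this is typically exercised with the standard apiclient update workflow (a sketch; the PR does not show the exact commands used for the test):

    $ apiclient update check
    $ apiclient update apply
    $ apiclient reboot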

**Terms of contribution:**

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.
ytsssun commented 2 months ago

Closing this PR since it conflicts with the core-kit migration. We will submit new PRs soon to accommodate the new core-kit setup.