bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Add Nvidia GPU Time-slicing support #3985

Closed monirul closed 4 months ago

monirul commented 5 months ago

Issue number:

Closes #

Description of changes: NVIDIA GPUs support time-slicing, which lets users share a GPU among a larger number of workloads by dividing the GPU's time into slices. Each workload gets a turn to use the GPU within its allocated time slice. This is similar to how a CPU time-slices between processes, and it helps keep the GPU from sitting idle. This PR contains the changes required for Bottlerocket to enable time-slicing for Kubernetes.

This PR introduces two Bottlerocket API settings:

| Bottlerocket setting | Impact | Value | What it means |
| --- | --- | --- | --- |
| settings.kubernetes.nvidia.device-plugin.max-sharing-per-gpu | Sets the replicas value of the device plugin for the time-sliced resources | integer, default: 0 | When the value is greater than 0, time-slicing is enabled. |
| settings.kubernetes.nvidia.device-plugin.rename-shared-gpu | Sets the renameByDefault value of the device plugin for the time-sliced resources | true or false, default: false | When set to false, the shared GPU's resource name is unchanged. When set to true, .shared is appended to the resource name; for example, nvidia.com/gpu becomes nvidia.com/gpu.shared. |
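
To make the mapping concrete, here is a rough sketch of the sharing section the device plugin config might contain for max-sharing-per-gpu=4 and rename-shared-gpu=false. It assumes the two values are copied directly into replicas and renameByDefault, as the descriptions above suggest; the value 4 is purely illustrative, and the rendering logic itself is not shown in this description.

```yaml
# Hypothetical rendering, assuming a direct copy of the two settings:
#   max-sharing-per-gpu = 4      -> replicas: 4
#   rename-shared-gpu   = false  -> renameByDefault: false
sharing:
  timeSlicing:
    renameByDefault: false   # resource keeps the name nvidia.com/gpu
    resources:
    - name: "nvidia.com/gpu"
      replicas: 4            # each physical GPU is advertised 4 times
```

With this configuration, a node with one physical GPU would be expected to advertise nvidia.com/gpu: 4 rather than nvidia.com/gpu.shared.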

Testing done:

bash-5.1# apiclient set settings.kubernetes.nvidia.device-plugin.max-sharing-per-gpu=10
[root@admin]# cat .bottlerocket/rootfs/etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: "volume-mounts"
    deviceIDStrategy: "index"
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 10

$ kubectl describe node ip-192-168-68-216.us-west-2.compute.internal          
Name:               ip-192-168-68-216.us-west-2.compute.internal
...
Capacity:
  cpu:                    8
  ephemeral-storage:      18366Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 32458088Ki
  nvidia.com/gpu.shared:  10
  pods:                   58
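
For reference, a workload could then request one time slice through the renamed resource. The manifest below is a minimal sketch and not part of the testing done here: the pod name, container name, and image are placeholders, and it assumes the node advertises nvidia.com/gpu.shared as in the output above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-example          # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda                      # hypothetical container name
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image; any CUDA-capable image should work
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu.shared: 1    # one of the 10 shared replicas
```

Up to 10 such pods should be schedulable on this node at once, since its capacity lists nvidia.com/gpu.shared: 10.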

Note: Migration test is still in progress. I will update once the test is complete.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.