NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Use mps on kubernetes #467

Open somelaoda opened 2 years ago

somelaoda commented 2 years ago

I'm trying to use the MPS service on Kubernetes with nvidia-docker.

Docker version: 19.03.13
NVIDIA driver: 495.44
CUDA: 11.5
Image: NGC tensorflow:21.11

I have started nvidia-cuda-mps-control on the host machine, and hostIPC and hostPID are both set when the containers are started with nvidia-docker.

The process in the container can find the nvidia-cuda-mps-control process, but the per-process memory limit does not take effect, no matter whether I use

export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=512MB" or set_default_device_pinned_mem_limit

How can I make MPS work correctly across multiple containers?
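
Roughly, the host-side setup looks like this (a minimal sketch; the pipe/log directories are just examples, and set_default_device_pinned_mem_limit needs CUDA >= 11.5):

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps       # example location, shared with the containers
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d                           # start the MPS control daemon on the host

# try to cap pinned device memory per client, either via the control daemon ...
echo "set_default_device_pinned_mem_limit 0 1G" | nvidia-cuda-mps-control
# ... or via the environment of each client process
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=512MB"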

klueska commented 2 years ago

We do not officially support MPS in nvidia-docker or Kubernetes. Some users have been able to get it to work in the past, but there is no supported way to do it at the moment.

That said, we do plan to add official support for MPS in the next few months, as part of an overall improved "GPU sharing initiative" that will unify the experience for GPU sharing through CUDA multiplexing, MPS, and/or MIG.

ghokun commented 2 years ago

You can use this project for now: https://github.com/awslabs/aws-virtual-gpu-device-plugin

I added support for per-client memory restrictions in my fork (see its README); it only works for CUDA >= 11.5: https://github.com/kuartis/kuartis-virtual-gpu-device-plugin

flixr commented 2 years ago

@klueska that would be great! Is there any ticket or other resource where we can follow roadmap/progress on this "GPU sharing initiative"?

somelaoda commented 2 years ago

@klueska is there any good news on this? I'm looking forward to it ~ 😄

troycheng commented 1 year ago

Is there any further progress on official support for MPS?

romainrossi commented 1 year ago

Same here, a much-needed feature. Any progress?

ettelr commented 8 months ago

Any update on this thread? Is anyone using MPS in Kubernetes?

prattcmp commented 8 months ago

Strange that this has been ignored for so long...

elezar commented 8 months ago

This is something that is under active development. We don't have a concrete release date yet, but are targeting the first quarter of 2024.

prattcmp commented 8 months ago

> This is something that is under active development. We don't have a concrete release date yet, but are targeting the first quarter of 2024.

2024Q1 would be great, even as a beta version.

klueska commented 6 months ago

We just released an RC for the next version of the k8s-device-plugin with support for MPS: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.15.0-rc.1?tab=readme-ov-file#with-cuda-mps

We would appreciate it if people could try this out and give any feedback before the final release in a few weeks.
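
For anyone who wants a quick starting point, a standalone install could look roughly like this (release name, namespace, and replica count are just examples; the README linked above is the authoritative reference):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update

# example config: share every GPU as 10 MPS replicas
cat > mps-values.yaml <<'EOF'
config:
  default: default
  map:
    default: |-
      version: v1
      sharing:
        mps:
          resources:
          - name: nvidia.com/gpu
            replicas: 10
EOF

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.15.0-rc.1 \
  -f mps-values.yaml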

ettelr commented 6 months ago

@klueska I see the following note in the link you sent: "Note: Sharing with MPS is currently not supported on devices with MIG enabled." Is it planned to be supported on GPUs that are not MIG-enabled (like L40, L40S)? If yes, will that come soon?

hrbasic commented 6 months ago

Hey, I'm trying to deploy version v0.15.0-rc.1, but I'm getting an error:

I0228 15:18:51.597975      31 main.go:279] Retrieving plugins.
I0228 15:18:51.598008      31 factory.go:104] Detected NVML platform: found NVML library
I0228 15:18:51.598047      31 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0228 15:18:51.619657      31 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0228 15:18:51.619670      31 main.go:208] Failed to start one or more plugins. Retrying in 30s...

Running with config:

{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "renameByDefault": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "rename": "nvidia.com/gpu.shared",
          "devices": "all",
          "replicas": 10
        }
      ]
    }
  }
}

All pods are up and running:

gpu-feature-discovery-x59tw                                   2/2     Running     0          30m
gpu-operator-5bd8fb6df5-r2jrq                                 1/1     Running     0          103m
gpu-operator-node-feature-discovery-gc-78b479ccc6-kf8nk       1/1     Running     0          103m
gpu-operator-node-feature-discovery-master-569bfcd8bc-z6whl   1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-4bnwr              1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-d5cmh              1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-fc4vs              1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-ktsl9              1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-lm2gv              1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-mhmjv              1/1     Running     0          103m
gpu-operator-node-feature-discovery-worker-w4mgz              1/1     Running     0          103m
nvidia-container-toolkit-daemonset-qj5p6                      1/1     Running     0          100m
nvidia-cuda-validator-slg25                                   0/1     Completed   0          95m
nvidia-dcgm-exporter-kr84r                                    1/1     Running     0          100m
nvidia-device-plugin-2svlb                                    2/2     Running     0          30m
nvidia-driver-daemonset-sznx9                                 1/1     Running     0          103m
nvidia-mig-manager-78hrq                                      1/1     Running     0          100m
nvidia-operator-validator-nhz56                               1/1     Running     0          100m

GPU:

  *-display
       description: 3D controller
       product: GA100 [A100 PCIe 40GB]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:13:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm bus_master cap_list fb
       configuration: depth=32 driver=nvidia latency=248 mode=1280x800 visual=truecolor xres=1280 yres=800
       resources: iomemory:1fe00-1fdff iomemory:1ff00-1feff irq:16 memory:fb000000-fbffffff memory:1fe000000000-1fefffffffff memory:1ff000000000-1ff001ffffff

I'm using the 535.154.05 driver deployed with gpu-operator on Rocky 8.9. Any idea what could be the root cause?

klueska commented 6 months ago

It seems you are running with the GPU operator. Support for MPS with the operator will be available in the next operator release.

If you want to test things out before then, you can disable deployment of the device plugin and GFD as part of the operator deployment, and instead install the v0.15.0-rc.1 device-plugin helm chart separately.
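
Concretely, assuming the operator was installed directly with helm (release names, namespaces, and the repo alias below are just examples), that could look something like:

# disable the operator-managed device plugin and GFD
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set devicePlugin.enabled=false \
  --set gfd.enabled=false

# install the v0.15.0-rc.1 plugin separately, with its own GFD and an MPS
# sharing config such as the one shown earlier in this thread
helm upgrade -i nvdp nvdp/nvidia-device-plugin -n nvidia-device-plugin \
  --create-namespace --version 0.15.0-rc.1 \
  --set gfd.enabled=true \
  -f mps-values.yaml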

hrbasic commented 6 months ago

Thanks for the answer, I've already deployed these separately.

gpu-operator:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: gpu-operator

resources:
  - namespace.yaml

namespace: gpu-operator

helmCharts:
  - name: gpu-operator
    repo: https://nvidia.github.io/gpu-operator
    releaseName: gpu-operator
    namespace: gpu-operator
    valuesFile: values.yaml
    version: 23.9.1

values for operator:

driver:
  repository: my-repo.com/nvidia
  version: 535.154.05
  imagePullPolicy: Always
  imagePullSecrets: 
    - image-pull-secret
  useOpenKernelModules: true

gfd:
  enabled: false

mig:
  strategy: "none"

operator:
  imagePullSecrets: 
    - image-pull-secret

devicePlugin:
  enabled: false

device-plugin:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: nvidia-device-plugin

namespace: gpu-operator

helmCharts:
  - name: nvidia-device-plugin
    repo: https://nvidia.github.io/k8s-device-plugin
    releaseName: nvidia-device-plugin
    namespace: gpu-operator
    valuesFile: values.yaml
    version: 0.15.0-rc.1

values for plugin:

config:
  default: default
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          renameByDefault: true
          resources:
          - name: nvidia.com/gpu
            replicas: 10

But I'll try this once more on a new node, since it's currently deployed on a node that already had "old" drivers installed by the operator.

klueska commented 6 months ago

OK, that should work then (it's the same way I have tested things locally). Note that there may be a few transient failures in the plugin while it waits for the MPS daemonset to start up (because it won't come online until GFD has applied a label indicating that it should be there).
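
If it helps while waiting, a rough way to check whether that label has landed yet (node name and namespace are placeholders; the exact MPS label key depends on the GFD/plugin release):

# look for MPS-related labels applied by GFD on the node ...
kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep -i mps
# ... and watch the MPS control daemon and plugin pods come up
kubectl get pods -n <plugin-namespace> | grep -iE 'mps|device-plugin'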

hrbasic commented 6 months ago

Just wanted to inform you that I successfully configured and deployed the device-plugin with MPS. I disabled NFD in the gpu-operator helm chart and enabled it in the device-plugin installation. Additionally, I had to restart/delete the nvidia-device-plugin-gpu-feature-discovery pod. I assume that restart was necessary because I installed both helm charts simultaneously, or maybe it would eventually have applied the labels like you mentioned. Thanks for the help, I'll keep you informed if any issues arise during the testing phase.

igorgad commented 6 months ago

Hello. I can confirm that installing the device-plugin version 0.15.0-rc.1 alongside gpu-operator works with the following procedure.

  1. Install gpu-operator with nvdp and nfd
  2. Upgrade to disable nvdp and nfd
  3. Install nvdp with gfd enabled

One slight problem I'm facing is that it segfaults if a pod has /dev/shm mounted and tries to allocate GPU memory, as in the following example. Is there a workaround to avoid using /dev/shm for the MPS daemon communication?

apiVersion: v1
kind: Pod
metadata:
  name: testshm
spec:
  volumes:
    - emptyDir:
        medium: Memory
      name: shared-mem
  containers:
  - name: testshm
    image: nvidia/cuda:12.3.1-base-ubuntu20.04
    command: ["tail", "-f", "/dev/null"]
    volumeMounts:
      - mountPath: /dev/shm
        name: shared-mem
    resources:
      limits:
        nvidia.com/gpu: 1

Thanks!

cdesiniotis commented 6 months ago

@igorgad you do not need to manually mount /dev/shm in your pod spec. The device-plugin, as part of its AllocateResponse, will make sure all the entities required for MPS get included in the container. Can you verify your example pod works when you remove the shared-mem volumeMount?

elezar commented 6 months ago

> @igorgad you do not need to manually mount /dev/shm in your pod spec. The device-plugin, as part of its AllocateResponse, will make sure all the entities required for MPS get included in the container. Can you verify your example pod works when you remove the shared-mem volumeMount?

To clarify: using MPS does require a /dev/shm to be set up, and this is used by the MPS Control Daemon to allow for communication. The infrastructure added to the device plugin to support MPS automatically creates a tmpfs and mounts it at /dev/shm for any containers that require MPS. This means that the additional /dev/shm that you are requesting is overriding the /dev/shm that contains the information controlled by the MPS control daemon -- causing the segfaults.
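
One way to see this from inside a container that requests the shared resource (pod name and namespace are placeholders):

# inspect the tmpfs the plugin mounts at /dev/shm and the MPS control daemon's files in it
kubectl exec -n <namespace> <mps-pod> -- df -h /dev/shm
kubectl exec -n <namespace> <mps-pod> -- ls -l /dev/shm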

igorgad commented 6 months ago

Hey @cdesiniotis and @elezar, thanks for clarifying it.

I can confirm that it works properly without the shared-mem volume mounted on the pod. However, it's common to mount a memory-backed volume on /dev/shm to increase the amount of shared memory available to Python multiprocessing and PyTorch dataloaders. The tmpfs mounted at /dev/shm by the device plugin is 64 MB, which is too small for many workloads.

elezar commented 6 months ago

We have an issue to track making the shm size configurable. Would this be able to address your use case? What are typical values for the shared memory size?

igorgad commented 6 months ago

Yep. Sounds good. I generally set the shared memory size of the pod to the amount of memory requested by the pod. But I guess that's not feasible in the device plugin context. Therefore, I would say that 8GB should be enough for most workloads.

ettelr commented 4 months ago

Any update on this one? Is shm configurable?

klueska commented 4 months ago

Future versions of MPS will not depend on /dev/shm at all, making it unnecessary to inject /dev/shm. Until then (meaning on any existing driver) this issue will persist.

ettelr commented 4 months ago

> Future versions of MPS will not depend on /dev/shm at all, making it unnecessary to inject /dev/shm. Until then (meaning on any existing driver) this issue will persist.

This means that, currently, PyTorch workloads are not runnable on a system with NVIDIA MPS deployed via the device plugin, since almost every PyTorch workload needs more shared memory than the amount currently injected by the chart.

elezar commented 4 months ago

@ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?

ettelr commented 4 months ago

> @ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?

Yes, that should work; we would just use it instead of injecting the shm volume ourselves.

ZYWNB666 commented 2 months ago

If I want to use time-slicing in k8s, do I need to enable MPS on the node hosts?

ettelr commented 1 month ago

> If I want to use time-slicing in k8s, do I need to enable MPS on the node hosts?

> @ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?

Hi @elezar, @klueska

Is there any update on the shm item? Is it already configurable from the chart? Is it still required for the MPS daemon?

ettelr commented 1 month ago

> @ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?

Hi, any update on making the MPS shm size configurable? We cannot use it like this; each of our workloads uses a different shm size.

ettelr commented 1 month ago

> @ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?

Hi, any update on making the MPS shm size configurable? We cannot use it like this; each of our workloads uses a different shm size.

@klueska @elezar

zhiyxu commented 1 month ago

> @igorgad you do not need to manually mount /dev/shm in your pod spec. The device-plugin, as part of its AllocateResponse, will make sure all the entities required for MPS get included in the container. Can you verify your example pod works when you remove the shared-mem volumeMount?

> To clarify: using MPS does require a /dev/shm to be set up, and this is used by the MPS Control Daemon to allow for communication. The infrastructure added to the device plugin to support MPS automatically creates a tmpfs and mounts it at /dev/shm for any containers that require MPS. This means that the additional /dev/shm that you are requesting is overriding the /dev/shm that contains the information controlled by the MPS control daemon -- causing the segfaults.

@elezar @klueska Hi, I found a file named cuda.shm.0.xx.1 in /dev/shm (where xx varies for each file), and this file is required when using MPS. The file is very small, only 4 KB. Detailed information is provided below. Do you know the specific content and purpose of this file?

$ stat /dev/shm/cuda.shm.0.xx.1
  File: /dev/shm/cuda.shm.0.xx.1
  Size: 4096        Blocks: 8          IO Block: 4096   regular file
Device: 15h/21d Inode: 1008193229  Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-08-09 14:57:42.483693717 +0800
Modify: 2024-08-09 14:29:23.214025824 +0800
Change: 2024-08-09 14:29:23.214025824 +0800
 Birth: -

Additionally, this path seems to be hardcoded, as the file must be located in /dev/shm. Is there any way to configure the location of this file?