NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

MPS Memory limits confusion #764

Open · RonanQuigley opened this issue 3 months ago

RonanQuigley commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

I've configured MPS on an NVIDIA L40S with 10 replicas.

As per the MPS daemon logs, a default device pinned memory limit of 4606M (roughly 4.5GiB) has been set.

I0612 11:23:33.074061      53 main.go:187] Retrieving MPS daemons.
I0612 11:23:33.153182      53 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu"
I0612 11:23:33.218453      53 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
[2024-06-12 10:28:13.702 Control    69] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2024-06-12 10:28:13.702 Control    69] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2024-06-12 10:28:13.725 Control    69] Accepting connection...
[2024-06-12 10:28:13.725 Control    69] NEW UI
[2024-06-12 10:28:13.725 Control    69] Cmd:set_default_device_pinned_mem_limit 0 4606M
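
For reference, the 4606M figure looks like the card's total framebuffer split evenly across the configured replicas (my assumption; quick check against the nvidia-smi output further down):

# sanity check: total memory / replicas vs the limit in the MPS control log
total_mib = 46068   # total memory reported by nvidia-smi for the L40S
replicas = 10       # sharing.mps replicas from values.yaml
print(total_mib / replicas)   # 4606.8, matching the 4606M default pinned mem limit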

However, if I look at this from the point of view of a client:

import torch 
torch.cuda.get_device_properties(torch.device('cuda'))
# _CudaDeviceProperties(name='NVIDIA L40S', major=8, minor=9, total_memory=45589MB, multi_processor_count=14)

Only the set_default_active_thread_percentage of 10 appears to be respected: the multi_processor_count drops from 142 to 14, but total_memory still reports the full card (45589MB) rather than anything close to the 4606M limit.
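
As a cross-check from the same client, comparing the device properties against torch.cuda.mem_get_info() (which wraps cudaMemGetInfo) might show where the limit does or does not surface; a sketch, with the expected values being my assumption:

import torch

# total_memory comes from cudaGetDeviceProperties and reports the whole card here
props = torch.cuda.get_device_properties(0)
print(props.total_memory // 1024**2)   # ~45589, i.e. the full L40S

# free/total as seen by cudaMemGetInfo -- checking whether this reflects the 4606M limit
free, total = torch.cuda.mem_get_info(0)
print(free // 1024**2, total // 1024**2)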

Here's some additional info from the application pod:

printenv | grep CUDA
CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe

echo "get_default_device_pinned_mem_limit 0" | nvidia-cuda-mps-control
4G 
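
The same query can also be run from Python, together with the active thread percentage (a sketch; I'm assuming get_default_active_thread_percentage is supported by this MPS version, and CUDA_MPS_PIPE_DIRECTORY is already set in the pod env as shown above):

import subprocess

# send control commands over stdin, the same way as the echo pipeline above
for cmd in ("get_default_device_pinned_mem_limit 0",
            "get_default_active_thread_percentage"):
    out = subprocess.run(["nvidia-cuda-mps-control"], input=cmd + "\n",
                         capture_output=True, text=True)
    print(cmd, "->", out.stdout.strip())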

Why does nvidia-cuda-mps-control report one value for the memory limit while PyTorch reports something else? This doesn't look right to me, but maybe I'm missing something. For comparison, when I use MIG on an A100, the total_memory returned by PyTorch reflects the MIG instance rather than the total VRAM of the card.
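
One more diagnostic I can run from the client pod (a rough sketch; the assumption being that, if the MPS pinned memory limit is enforced at allocation time, an allocation past the 4606M limit should fail even though total_memory reports the whole card):

import torch

# try to allocate ~6 GiB, well past the 4606M default pinned mem limit
try:
    x = torch.empty(6 * 1024**3, dtype=torch.uint8, device='cuda')
    print("allocation of ~6 GiB succeeded")
except RuntimeError as err:
    print("allocation failed:", err)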

# values.yaml
nodeSelector: {
  nvidia.com/gpu: "true"
}

gfd: 
  enabled: true
  nameOverride: gpu-feature-discovery
  namespaceOverride: {{ nvidia_plugin.namespace }}
  nodeSelector: {
    nvidia.com/gpu: "true"
  }

nfd:
  master:
    nodeSelector: {
      nvidia.com/gpu: "true"
    }
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  worker:
    nodeSelector: {
      nvidia.com/gpu: "true"
    }

config: 
  default: "default"
  map:
    default: |-
    ls400: |-
      version: v1
      sharing:
        mps:
          resources:
          - name: nvidia.com/gpu
            replicas: 10    

Additional information that might help better understand your environment and reproduce the bug:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:BE:00.0 Off |                    0 |
| N/A   32C    P8             35W /  350W |      35MiB /  46068MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     70756      C   nvidia-cuda-mps-server                         28MiB |
+-----------------------------------------------------------------------------------------+
github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.