NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Workloads hang under MPS mode, except cuda-sample:vectoradd #679

Open haitwang-cloud opened 3 weeks ago

haitwang-cloud commented 3 weeks ago

1. Quick Debug Information

2. Issue or feature description

To confirm that the MPS feature can meet our needs, I am running a comprehensive test suite for further analysis.

I've been testing the MPS feature offered by the recently released device plugin v0.15.0 on our Kubernetes bare-metal node equipped with 3 V100s. However, only the standard cuda-sample:vectoradd test runs successfully; the other test cases remain stuck in a hang state and never progress.

✅ Passed case: cuda-sample:vectoradd (details below)

3. Information to attach (optional if deemed irrelevant)

I am using the following ConfigMap to enable MPS on our GPU bare-metal node:

```console
# k apply -f ntr-mps-cm.yaml
```

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ntr-mps-cm
  namespace: ssdl-vgpu
data:
  any: |
    version: v1
    flags:
      failOnInitError: true
      nvidiaDriverRoot: "/run/nvidia/driver/"
      plugin:
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```
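
For context: with 3 physical V100s and `replicas: 4`, I expect the node to advertise 12 shareable `nvidia.com/gpu` devices once the plugin registers. A sketch of the expected node status (not captured output):

```yaml
# Expected node allocatable (sketch, not actual kubectl output):
# 3 physical V100s x 4 MPS replicas = 12 shareable devices
status:
  allocatable:
    nvidia.com/gpu: "12"
```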

And I use the following command to install the latest device plugin:

```shell
helm install --wait k8s-vgpu nvdp/nvidia-device-plugin \
  --namespace ssdl-vgpu \
  --version 0.15.0 \
  --set config.name=ntr-time-cm \
  --set compatWithCPUManager=true
```
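
Each test workload then requests a single shared replica, roughly as in the sketch below (pod name and image tag are illustrative; the effective config shown in the logs has `failRequestsGreaterThanOne: true`, so every container asks for exactly one device):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-vectoradd-test                 # illustrative name
  namespace: ssdl-vgpu
spec:
  restartPolicy: OnFailure
  containers:
    - name: vectoradd
      # illustrative cuda-sample image tag
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1                # one of the 4 MPS replicas per physical GPU
```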

And the related logs of the MPS control daemon and the device plugin are below.

MPS control daemon logs

```
2024-04-23T06:04:59.093748766Z I0423 06:04:59.092447      55 main.go:78] Starting NVIDIA MPS Control Daemon 435bfb70
2024-04-23T06:04:59.093839331Z commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
2024-04-23T06:04:59.093853836Z I0423 06:04:59.092736      55 main.go:55] "Starting NVIDIA MPS Control Daemon" version=<
2024-04-23T06:04:59.093867206Z  435bfb70
2024-04-23T06:04:59.093878874Z  commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
2024-04-23T06:04:59.093889630Z  >
2024-04-23T06:04:59.093900576Z I0423 06:04:59.092874      55 main.go:107] Starting OS watcher.
2024-04-23T06:04:59.094030621Z I0423 06:04:59.093463      55 main.go:121] Starting Daemons.
2024-04-23T06:04:59.094062157Z I0423 06:04:59.093551      55 main.go:164] Loading configuration.
2024-04-23T06:04:59.094686515Z I0423 06:04:59.094616      55 main.go:172] Updating config with default resource matching patterns.
2024-04-23T06:04:59.094774486Z I0423 06:04:59.094718      55 main.go:183] 
2024-04-23T06:04:59.094792480Z Running with config:
2024-04-23T06:04:59.094796229Z {
2024-04-23T06:04:59.094799702Z   "version": "v1",
2024-04-23T06:04:59.094802996Z   "flags": {
2024-04-23T06:04:59.094806479Z     "migStrategy": "none",
2024-04-23T06:04:59.094809741Z     "failOnInitError": true,
2024-04-23T06:04:59.094813226Z     "nvidiaDriverRoot": "/run/nvidia/driver/",
2024-04-23T06:04:59.094816715Z     "gdsEnabled": null,
2024-04-23T06:04:59.094820702Z     "mofedEnabled": null,
2024-04-23T06:04:59.094823952Z     "useNodeFeatureAPI": null,
2024-04-23T06:04:59.094827235Z     "plugin": {
2024-04-23T06:04:59.094830383Z       "passDeviceSpecs": null,
2024-04-23T06:04:59.094833529Z       "deviceListStrategy": [
2024-04-23T06:04:59.094836708Z         "envvar"
2024-04-23T06:04:59.094840338Z       ],
2024-04-23T06:04:59.094844148Z       "deviceIDStrategy": "uuid",
2024-04-23T06:04:59.094847589Z       "cdiAnnotationPrefix": null,
2024-04-23T06:04:59.094850905Z       "nvidiaCTKPath": null,
2024-04-23T06:04:59.094854382Z       "containerDriverRoot": null
2024-04-23T06:04:59.094857781Z     }
2024-04-23T06:04:59.094861393Z   },
2024-04-23T06:04:59.094870585Z   "resources": {
2024-04-23T06:04:59.094873870Z     "gpus": [
2024-04-23T06:04:59.094877070Z       {
2024-04-23T06:04:59.094882321Z         "pattern": "*",
2024-04-23T06:04:59.094885531Z         "name": "nvidia.com/gpu"
2024-04-23T06:04:59.094888738Z       }
2024-04-23T06:04:59.094891938Z     ]
2024-04-23T06:04:59.094895130Z   },
2024-04-23T06:04:59.094898375Z   "sharing": {
2024-04-23T06:04:59.094901556Z     "timeSlicing": {},
2024-04-23T06:04:59.094904765Z     "mps": {
2024-04-23T06:04:59.094908027Z       "failRequestsGreaterThanOne": true,
2024-04-23T06:04:59.094911202Z       "resources": [
2024-04-23T06:04:59.094914378Z         {
2024-04-23T06:04:59.094917668Z           "name": "nvidia.com/gpu",
2024-04-23T06:04:59.094920946Z           "devices": "all",
2024-04-23T06:04:59.094924462Z           "replicas": 4
2024-04-23T06:04:59.094927756Z         }
2024-04-23T06:04:59.094931025Z       ]
2024-04-23T06:04:59.094934399Z     }
2024-04-23T06:04:59.094937659Z   }
2024-04-23T06:04:59.094940881Z }
2024-04-23T06:04:59.094944445Z I0423 06:04:59.094737      55 main.go:187] Retrieving MPS daemons.
2024-04-23T06:04:59.367188348Z I0423 06:04:59.366922      55 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu"
2024-04-23T06:04:59.443650558Z I0423 06:04:59.443468      55 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
2024-04-23T06:04:59.445568425Z [2024-04-23 06:04:59.389 Control    72] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
2024-04-23T06:04:59.445618049Z [2024-04-23 06:04:59.389 Control    72] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
2024-04-23T06:04:59.445627888Z [2024-04-23 06:04:59.415 Control    72] Accepting connection...
2024-04-23T06:04:59.445635791Z [2024-04-23 06:04:59.415 Control    72] NEW UI
2024-04-23T06:04:59.445643954Z [2024-04-23 06:04:59.415 Control    72] Cmd:set_default_device_pinned_mem_limit 0 4096M
2024-04-23T06:04:59.445651895Z [2024-04-23 06:04:59.415 Control    72] UI closed
2024-04-23T06:04:59.445659400Z [2024-04-23 06:04:59.418 Control    72] Accepting connection...
2024-04-23T06:04:59.445667180Z [2024-04-23 06:04:59.418 Control    72] NEW UI
2024-04-23T06:04:59.445674877Z [2024-04-23 06:04:59.418 Control    72] Cmd:set_default_device_pinned_mem_limit 1 4096M
2024-04-23T06:04:59.445682790Z [2024-04-23 06:04:59.419 Control    72] UI closed
2024-04-23T06:04:59.445690190Z [2024-04-23 06:04:59.439 Control    72] Accepting connection...
2024-04-23T06:04:59.445697598Z [2024-04-23 06:04:59.439 Control    72] NEW UI
2024-04-23T06:04:59.445705544Z [2024-04-23 06:04:59.439 Control    72] Cmd:set_default_device_pinned_mem_limit 2 4096M
2024-04-23T06:04:59.445713416Z [2024-04-23 06:04:59.439 Control    72] UI closed
2024-04-23T06:04:59.445720849Z [2024-04-23 06:04:59.442 Control    72] Accepting connection...
2024-04-23T06:04:59.445728295Z [2024-04-23 06:04:59.442 Control    72] NEW UI
2024-04-23T06:04:59.445735742Z [2024-04-23 06:04:59.442 Control    72] Cmd:set_default_active_thread_percentage 25
2024-04-23T06:04:59.445749210Z [2024-04-23 06:04:59.442 Control    72] 25.0
2024-04-23T06:04:59.445757584Z [2024-04-23 06:04:59.442 Control    72] UI closed
2024-04-23T06:05:26.660115113Z [2024-04-23 06:05:26.659 Control    72] Accepting connection...
2024-04-23T06:05:26.660166122Z [2024-04-23 06:05:26.659 Control    72] NEW UI
2024-04-23T06:05:26.660179349Z [2024-04-23 06:05:26.659 Control    72] Cmd:get_default_active_thread_percentage
2024-04-23T06:05:26.660189426Z [2024-04-23 06:05:26.659 Control    72] 25.0
```

Nvidia-device-plugin logs

```
I0423 06:04:56.479370      39 main.go:279] Retrieving plugins.
I0423 06:04:56.480927      39 factory.go:104] Detected NVML platform: found NVML library
I0423 06:04:56.480999      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0423 06:04:56.558122      39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0423 06:04:56.558152      39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
I0423 06:05:26.587365      39 main.go:315] Stopping plugins.
I0423 06:05:26.587429      39 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0423 06:05:26.587496      39 main.go:200] Starting Plugins.
I0423 06:05:26.587508      39 main.go:257] Loading configuration.
I0423 06:05:26.588327      39 main.go:265] Updating config with default resource matching patterns.
I0423 06:05:26.588460      39 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/run/nvidia/driver/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 4
        }
      ]
    }
  }
}
I0423 06:05:26.588478      39 main.go:279] Retrieving plugins.
I0423 06:05:26.588515      39 factory.go:104] Detected NVML platform: found NVML library
I0423 06:05:26.588559      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0423 06:05:26.660603      39 server.go:176] "MPS daemon is healthy" resource="nvidia.com/gpu"
I0423 06:05:26.661346      39 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0423 06:05:26.662857      39 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0423 06:05:26.673960      39 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

```

✅ Passed case

I run the cuda-samples vectorAdd test and check the nvidia-smi output:

```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0              25W / 250W |     34MiB / 16384MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off | 00000000:AF:00.0 Off |                    0 |
| N/A   31C    P0              27W / 250W |     34MiB / 16384MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P0              27W / 250W |     34MiB / 16384MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1127738      C   nvidia-cuda-mps-server                       30MiB |
|    0   N/A  N/A   1397691    M+C   /cuda-samples/vectorAdd                      18MiB |
|    0   N/A  N/A   1397696    M+C   /cuda-samples/vectorAdd                      10MiB |
|    1   N/A  N/A   1127738      C   nvidia-cuda-mps-server                       30MiB |
|    1   N/A  N/A   1397689    M+C   /cuda-samples/vectorAdd                      94MiB |
|    1   N/A  N/A   1397714    M+C   /cuda-samples/vectorAdd                      10MiB |
|    2   N/A  N/A   1127738      C   nvidia-cuda-mps-server                       30MiB |
|    2   N/A  N/A   1397712    M+C   /cuda-samples/vectorAdd                      10MiB |
+---------------------------------------------------------------------------------------+

```

❌ Failed case

I am using this classification.ipynb notebook for an end-to-end verification of MPS with TensorFlow, but the pod has been hanging for 60 minutes without any response, and the logs do not provide more details.
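
The notebook pod requests the shared GPU in the same way; a rough sketch is below (names and image are illustrative, and `runAsUser: 1000` is only an assumption based on the `user 1000` client that appears in the MPS daemon log further down):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-classification-notebook         # illustrative name
  namespace: ssdl-vgpu
spec:
  securityContext:
    runAsUser: 1000                         # assumption: matches the "user 1000" MPS client in the log
  containers:
    - name: notebook
      image: tensorflow/tensorflow:latest-gpu-jupyter   # illustrative image
      ports:
        - containerPort: 8888               # Jupyter default port
      resources:
        limits:
          nvidia.com/gpu: 1                 # one MPS replica
```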

classification.ipynb logs

```
2024-04-25 02:10:19.499819: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I 2024-04-25 02:10:20.630 ServerApp] Connecting to kernel 562db3f6-327c-4adc-83e9-735fb4d0042c.
[I 2024-04-25 02:10:36.700 ServerApp] Connecting to kernel 562db3f6-327c-4adc-83e9-735fb4d0042c.
```


And here is the current MPS control daemon log:

```
[2024-04-25 02:10:31.040 Control    74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.153 Control    74] Accepting connection...
[2024-04-25 02:10:31.153 Control    74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.335 Control    74] Accepting connection...
[2024-04-25 02:10:31.336 Control    74] User did not send valid credentials
[2024-04-25 02:10:31.336 Control    74] Accepting connection...
[2024-04-25 02:10:31.336 Control    74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.344 Control    74] Accepting connection...
[2024-04-25 02:10:31.344 Control    74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.389 Control    74] Accepting connection...
[2024-04-25 02:10:31.389 Control    74] User did not send valid credentials
[2024-04-25 02:10:31.389 Control    74] Accepting connection...
[2024-04-25 02:10:31.389 Control    74] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-04-25 02:10:31.535 Control    74] Accepting connection...
[2024-04-25 02:10:31.535 Control    74] User did not send valid credentials
[2024-04-25 02:10:31.536 Control    74] Accepting connection...
[2024-04-25 02:10:31.536 Control    74] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-04-25 02:10:31.590 Control    74] Server 91 exited with status 0
[2024-04-25 02:10:31.590 Control    74] Starting new server 7246 for user 0
[2024-04-25 02:10:31.618 Control    74] Accepting connection...
[2024-04-25 02:10:31.884 Control    74] NEW SERVER 7246: Ready
[2024-04-25 02:10:31.893 Control    74] Accepting connection...
[2024-04-25 02:10:31.893 Control    74] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-04-25 02:18:33.152 Control    74] Accepting connection...
[2024-04-25 02:18:33.152 Control    74] User did not send valid credentials
[2024-04-25 02:18:33.152 Control    74] Accepting connection...
[2024-04-25 02:18:33.152 Control    74] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
```

haitwang-cloud commented 3 weeks ago

@elezar Could you PTAL?

haitwang-cloud commented 3 weeks ago

I already tried `echo get_default_active_thread_percentage | nvidia-cuda-mps-control` from https://github.com/NVIDIA/k8s-device-plugin/issues/647; everything looks fine, but the hanging issue still happens.

```console
root@gpu-pod:/# echo get_default_active_thread_percentage | nvidia-cuda-mps-control
25.0
```
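
For reference, I run this check from inside a GPU pod (`gpu-pod` in the prompt above). Below is a minimal sketch of that kind of debug pod, assuming it requests one shared device so that the plugin wires it up to the MPS pipe at `/mps/nvidia.com/gpu/pipe` reported in the control daemon log (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                             # illustrative name
  namespace: ssdl-vgpu
spec:
  containers:
    - name: debug
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1                 # one MPS replica
```

The query above can then be run via `kubectl exec` into that pod.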