NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

Multi mps-daemon with Tensorflow #61

Open anaconda2196 opened 2 years ago

anaconda2196 commented 2 years ago

Hi,

GPUs: 4× Tesla V100

I have started an MPS daemon per physical GPU (a sketch of the launch follows the nvidia-smi output below):

# ps -ef | grep mps
root      4258 34958  0 14:19 pts/2    00:00:00 grep --color=auto mps
root      6487 27180  0 09:14 ?        00:00:07 nvidia-cuda-mps-server
root     27170     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     27175     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     27180     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     27185     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     38635 27175  0 13:04 ?        00:00:01 nvidia-cuda-mps-server
root     56671 27170  0 13:25 ?        00:00:01 nvidia-cuda-mps-server
root     62115 27185  0 13:10 ?        00:00:01 nvidia-cuda-mps-server
# nvidia-smi
Tue May 10 14:18:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    42W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     56671      C   nvidia-cuda-mps-server             27MiB |
|    1   N/A  N/A     38635      C   nvidia-cuda-mps-server             27MiB |
|    2   N/A  N/A      6487      C   nvidia-cuda-mps-server             27MiB |
|    3   N/A  N/A     62115      C   nvidia-cuda-mps-server             27MiB |
+-----------------------------------------------------------------------------+
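For reference, the four control daemons shown above come from a per-GPU launch roughly like the following. This is a sketch: the per-GPU subdirectory names under /tmp/nvidia/ and /var/log/nvidia-mps/ are my own convention, not something the driver mandates.

# one control daemon per physical GPU, each with its own pipe and log directory
for gpu in 0 1 2 3; do
    nvidia-smi -i $gpu -c EXCLUSIVE_PROCESS       # matches the "E. Process" compute mode above
    mkdir -p /tmp/nvidia/mps_$gpu /var/log/nvidia-mps/mps_$gpu
    CUDA_VISIBLE_DEVICES=$gpu \
    CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia/mps_$gpu \
    CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps/mps_$gpu \
    nvidia-cuda-mps-control -d
done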

If I deploy a sample k8s application pod as below, it works.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-gpu-demo
spec:
  hostIPC: true
  restartPolicy: OnFailure
  containers:
  - name: cuda-gpu-demo
    image: my-image
    command:
    - "/bin/sh"
    - "-c"
    args:
    - for i in {0..40}; do echo $i; /usr/bin/binomialOptions; sleep 1; done
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
      - name: mps
        mountPath: /tmp/nvidia/
  volumes:
    - name: mps
      hostPath:
        path: /tmp/nvidia/
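MPS clients find their daemon through the named pipes in CUDA_MPS_PIPE_DIRECTORY, which is what the hostPath mount above exposes; hostIPC: true keeps the container in the host's IPC namespace, which is commonly recommended for MPS in containers. A quick way to confirm which pipe directory a running pod actually uses (pod name as above; a sanity check, not part of the deployment):

kubectl exec cuda-gpu-demo -- env | grep CUDA_MPS
kubectl exec cuda-gpu-demo -- ls -l /tmp/nvidia/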

As you can see below, the process shows up on device 1 with type M+C:

nvidia-smi 
Tue May 10 13:14:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   32C    P0    56W / 300W |    161MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     11999    M+C   /usr/bin/binomialOptions          131MiB |
|    1   N/A  N/A     38635      C   nvidia-cuda-mps-server             27MiB |
|    2   N/A  N/A      6487      C   nvidia-cuda-mps-server             27MiB |
|    3   N/A  N/A     62115      C   nvidia-cuda-mps-server             27MiB |
+-----------------------------------------------------------------------------+

However, whenever I try to run a TensorFlow Python script in a Jupyter notebook, it does not connect to the MPS server, and nothing shows up in the logs under /var/log/nvidia-mps (server.log | control.log).

I have already set CUDA_MPS_PIPE_DIRECTORY as an environment variable, and I have mounted the host "/tmp/nvidia/" directory, where the per-GPU CUDA_MPS_PIPE_DIRECTORY directories are created (snippet below).
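Concretely, the notebook pod points the client at one of those directories along these lines (a sketch; the mps_0 subdirectory name is a placeholder from my layout, and it has to match a GPU that is actually visible inside the pod):

    env:
    - name: CUDA_MPS_PIPE_DIRECTORY
      value: /tmp/nvidia/mps_0   # hypothetical per-GPU subdirectory
    volumeMounts:
    - name: mps
      mountPath: /tmp/nvidia/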

Issue:

2022-05-11 00:27:26.583002: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
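For reference, this is how I sanity-check what the client actually sees from inside the notebook container (a sketch; assumes python3 and tensorflow are on the image's PATH):

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "CUDA_MPS_PIPE_DIRECTORY=$CUDA_MPS_PIPE_DIRECTORY"
ls -l "$CUDA_MPS_PIPE_DIRECTORY"   # should contain the daemon's control pipe
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"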