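Tue May 10 13:14:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   32C    P0    56W / 300W |    161MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     11999    M+C   /usr/bin/binomialOptions          131MiB |
|    1   N/A  N/A     38635      C   nvidia-cuda-mps-server             27MiB |
|    2   N/A  N/A      6487      C   nvidia-cuda-mps-server             27MiB |
|    3   N/A  N/A     62115      C   nvidia-cuda-mps-server             27MiB |
+-----------------------------------------------------------------------------+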
GPUs: 4x Tesla V100
I have started one MPS control daemon per physical GPU.
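A minimal sketch of what "one daemon per physical GPU" amounts to, assuming each daemon is pinned to a single device through CUDA_VISIBLE_DEVICES and given its own pipe directory under /tmp/nvidia. The `mps_<index>` subdirectory naming and the helper names are hypothetical and only illustrate the idea; the actual paths on the host may differ. The launch call is left as a function that is not invoked here, since it needs the NVIDIA driver on the host.

```python
import os
import subprocess


def mps_daemon_env(gpu_index, base_dir="/tmp/nvidia"):
    """Environment for one MPS control daemon bound to one physical GPU.

    The per-GPU subdirectory name (mps_<index>) is an assumption made
    for this sketch, not a path taken from the post.
    """
    return {
        "CUDA_VISIBLE_DEVICES": str(gpu_index),
        "CUDA_MPS_PIPE_DIRECTORY": f"{base_dir}/mps_{gpu_index}",
    }


def all_daemon_envs(gpu_count, base_dir="/tmp/nvidia"):
    """One environment per GPU, in device order."""
    return [mps_daemon_env(i, base_dir) for i in range(gpu_count)]


def start_daemon(env):
    """Start one control daemon in the background (`-d`) with the given env."""
    os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
    subprocess.run(
        ["nvidia-cuda-mps-control", "-d"],
        env={**os.environ, **env},
        check=True,
    )
```

On the 4-GPU host described here, `all_daemon_envs(4)` yields four environments; a client that should talk to the daemon for a given GPU needs the same CUDA_MPS_PIPE_DIRECTORY value as that daemon.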
If I deploy a sample Kubernetes application pod, it works: as the nvidia-smi output above shows, the sample process (/usr/bin/binomialOptions) is running on GPU 1 with type M+C.
However, whenever I run a TensorFlow Python script in a Jupyter notebook, it does not connect to the MPS server, and nothing shows up in the logs under /var/log/nvidia-mps (server.log, control.log).
I have already set CUDA_MPS_PIPE_DIRECTORY as an environment variable and mounted the host directory "/tmp/nvidia/", where the CUDA_MPS_PIPE_DIRECTORY for each physical GPU is created. The script fails with:
2022-05-11 00:27:26.583002: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
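One thing worth ruling out is that the notebook kernel process does not see the same environment and mounts as the pod's shell. The following is a minimal sketch of a check that could be run in the first notebook cell, before TensorFlow is imported, since the CUDA environment variables only take effect if they are set before the CUDA context is created. The function name `check_mps_client_env` is hypothetical, and the check is a heuristic under these assumptions rather than a definitive diagnosis of the error above.

```python
import os


def check_mps_client_env(environ=None):
    """Return a list of likely problems with the MPS client environment.

    Heuristic only: it inspects environment variables and the pipe
    directory, and does not talk to the MPS daemon itself.
    """
    environ = os.environ if environ is None else environ
    problems = []

    pipe_dir = environ.get("CUDA_MPS_PIPE_DIRECTORY")
    if not pipe_dir:
        problems.append("CUDA_MPS_PIPE_DIRECTORY is not set in this process")
    elif not os.path.isdir(pipe_dir):
        problems.append(f"pipe directory {pipe_dir!r} is not visible here")
    elif not os.listdir(pipe_dir):
        # A running control daemon creates files in its pipe directory.
        problems.append(f"pipe directory {pipe_dir!r} is empty")

    # An empty CUDA_VISIBLE_DEVICES hides every GPU from the process.
    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible is not None and visible.strip() == "":
        problems.append("CUDA_VISIBLE_DEVICES is set but empty")

    return problems
```

Calling `check_mps_client_env()` with no arguments inspects the kernel's own environment; an empty list means none of these particular problems were found, not that MPS is guaranteed to be reachable.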