NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

"CUDA unknown error" when using pytorch, and recovered by restarting the nvidia plugin pod #319

Open chxk opened 2 years ago

chxk commented 2 years ago

1. Issue or feature description

I run PyTorch processes in GPU pods scheduled through the device plugin, and occasionally they fail with "CUDA unknown error". After I killed the nvidia-device-plugin pod on the host (the nvidia-device-plugin DaemonSet then started a new pod), the problem went away.
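For reference, the workaround described above (deleting the plugin pod on the affected node so the DaemonSet recreates it) could be scripted roughly as follows. This is only a sketch: the namespace `kube-system` and the label `name=nvidia-device-plugin-ds` are assumptions about how the plugin was deployed and are not stated in this issue, and the helper names `build_restart_command` and `restart_plugin_pod` are hypothetical.

```python
import subprocess

# Assumptions (not stated in the issue): the plugin runs in the kube-system
# namespace and its pods carry the label name=nvidia-device-plugin-ds.
# Adjust both to match the actual DaemonSet manifest.
ASSUMED_NAMESPACE = "kube-system"
ASSUMED_SELECTOR = "name=nvidia-device-plugin-ds"


def build_restart_command(node_name, namespace=ASSUMED_NAMESPACE,
                          selector=ASSUMED_SELECTOR):
    """Return the kubectl argv that deletes the plugin pod on one node."""
    return [
        "kubectl", "delete", "pod",
        "--namespace", namespace,
        "--selector", selector,
        "--field-selector", f"spec.nodeName={node_name}",
    ]


def restart_plugin_pod(node_name, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = build_restart_command(node_name)
    print(" ".join(cmd))
    if not dry_run:
        # The DaemonSet controller recreates the deleted pod on the same node.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `restart_plugin_pod("<node-name>")` only prints the command; passing `dry_run=False` actually runs it. This automates the workaround but does not address whatever causes the error in the first place.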

2. Steps to reproduce the issue

python
>>> import torch
>>> a=torch.Tensor(1)
>>> a.cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python3.7.6/lib/python3.7/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
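Because the failure is intermittent, it may help to detect it programmatically rather than by hand. The sketch below wraps the same initialization as the reproduction and classifies the result by matching the error text from the traceback above. The helper names `probe_cuda` and `torch_cuda_init` are hypothetical, and matching on the message string is an assumption about how to recognize this error, not a documented PyTorch API.

```python
def probe_cuda(init_fn):
    """Call init_fn and classify the outcome.

    Returns "ok", "unknown_error" (the message seen in this issue),
    or "other_error" for any other RuntimeError.
    """
    try:
        init_fn()
    except RuntimeError as exc:
        # Assumption: the substring below is enough to identify this failure.
        if "CUDA unknown error" in str(exc):
            return "unknown_error"
        return "other_error"
    return "ok"


def torch_cuda_init():
    """Force CUDA initialization the same way the reproduction does."""
    import torch  # imported lazily so probe_cuda itself does not need torch

    torch.Tensor(1).cuda()
```

Inside an affected pod, `probe_cuda(torch_cuda_init)` would be expected to return `"unknown_error"`; the result could then drive an alert or a container health check.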

3. Relevant information

  1. python 3.7.6
  2. torch 1.7.1
  3. cuda 10.0
  4. kubernetes v1.17.4
  5. k8s-device-plugin v0.9.0 deployed by a daemonset
  6. GPU: 8*V100

How can I avoid this problem?

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.