"CUDA unknown error" when using pytorch, and recovered by restarting the nvidia plugin pod

1. Issue or feature description

I run PyTorch processes in GPU pods scheduled through the device plugin, and occasionally they fail with "CUDA unknown error". After I killed the nvidia-device-plugin pod on the host (the nvidia-device-plugin daemonset then started a new one), the problem went away.

2. Steps to reproduce the issue

python
>>> import torch
>>> a=torch.Tensor(1)
>>> a.cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python3.7.6/lib/python3.7/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
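
To narrow down what the container can still see when the error appears, a small diagnostic along these lines might help. This is only a sketch, not a fix: the helper name `collect_cuda_diagnostics` is hypothetical, it assumes the GPU device nodes are exposed as `/dev/nvidia*`, and it assumes that `NVIDIA_VISIBLE_DEVICES` is how the devices are passed to the container in this setup.

```python
import glob
import importlib.util
import os


def collect_cuda_diagnostics():
    """Gather basic facts about GPU visibility inside the container.

    Hypothetical helper for illustration; it only reads state and
    does not attempt to repair anything.
    """
    info = {
        # Environment variables commonly used to select GPUs; None means unset.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        # Device nodes currently visible (assumes the /dev/nvidia* naming).
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,
        "device_count": None,
        "error": None,
    }
    if info["torch_installed"]:
        try:
            import torch

            info["cuda_available"] = torch.cuda.is_available()
            info["device_count"] = torch.cuda.device_count()
        except Exception as exc:  # record the failure instead of crashing
            info["error"] = repr(exc)
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Running it once in a healthy pod and once in a failing pod, then comparing the two outputs, could show whether the device nodes or environment variables differ between the two states.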

3. Relevant information

python 3.7.6
torch 1.7.1
cuda 10.0
kubernetes v1.17.4
k8s-device-plugin v0.9.0 deployed by a daemonset
GPU: 8*V100

How can I avoid this problem?

NVIDIA / k8s-device-plugin

"CUDA unknown error" when using pytorch, and recovered by restarting the nvidia plugin pod #319
