1. Issue or feature description
I run PyTorch processes in a GPU pod scheduled through the NVIDIA device plugin, and occasionally hit a "CUDA unknown error". After I killed the nvidia-device-plugin pod on the host (the nvidia-device-plugin DaemonSet then started a new one), the problem went away.
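For reference, the workaround I used (deleting the device plugin pod on the affected node so the DaemonSet recreates it) can be sketched roughly as below. The namespace `kube-system`, the label selector `name=nvidia-device-plugin-ds`, and the node name are assumptions that depend on how the plugin was deployed, and the helper names are hypothetical.

```python
import subprocess


def build_restart_command(node, namespace="kube-system",
                          selector="name=nvidia-device-plugin-ds"):
    """Build the kubectl argv that deletes the device plugin pod on one node.

    namespace and selector are assumed defaults; adjust them to match
    the labels of your own nvidia-device-plugin DaemonSet.
    """
    return [
        "kubectl", "delete", "pod",
        "--namespace", namespace,
        "--selector", selector,
        # Restrict the deletion to pods scheduled on the affected node.
        "--field-selector", f"spec.nodeName={node}",
    ]


def restart_device_plugin(node, dry_run=True, **kwargs):
    """Return the command; run it only when dry_run is False."""
    cmd = build_restart_command(node, **kwargs)
    if not dry_run:
        # The DaemonSet controller recreates the deleted pod.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `restart_device_plugin("my-gpu-node")` with the default `dry_run=True` only returns the command, so it can be inspected before anything is deleted.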
2. Steps to reproduce the issue
python
>>> import torch
>>> a=torch.Tensor(1)
>>> a.cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python3.7.6/lib/python3.7/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
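The reproduction above can also be wrapped in a small check that reports the failure instead of raising, which may help detect the bad state inside a pod. This is only a sketch: `probe_cuda` is a hypothetical helper, and it assumes the failure always surfaces as a `RuntimeError` from CUDA initialisation, as in the traceback.

```python
def probe_cuda(init_fn=None):
    """Try to initialise CUDA and return (ok, message).

    init_fn is an optional callable used in place of the real
    PyTorch call, so the check can be exercised without a GPU.
    """
    if init_fn is None:
        def init_fn():
            # Imported lazily so this module loads without torch installed.
            import torch
            # Same operation as the reproduction: move a tensor to the GPU.
            torch.zeros(1).cuda()
    try:
        init_fn()
    except RuntimeError as exc:
        # "CUDA unknown error" arrives here as a RuntimeError.
        return False, str(exc)
    return True, "CUDA initialised"
```

A health check inside the pod could call `probe_cuda()` and exit non-zero when the first element is `False`.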
3. Relevant information
How can I avoid this problem?