JosephKJ / OWOD

(CVPR 2021 Oral) Open World Object Detection
https://josephkj.in
Apache License 2.0

Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU #81

Closed JohnWuzh closed 2 years ago

JohnWuzh commented 2 years ago

Running "replicate.sh" fails. The same problem occurs when running "python tools/train_net.py --num-gpus 4 --dist-url='tcp://127.0.0.1:52133' --config-file ./configs/OWOD/t1/t1_val.yaml SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.01 OWOD.TEMPERATURE 1.5 OUTPUT_DIR "./output/t1_final"" directly. The error is:

image

Running "nvidia-smi" afterwards displays the following message: "Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"

Looking forward to your response! Thank you very much!

JosephKJ commented 2 years ago

Hi @ia-heng: this seems to be an NVIDIA driver issue. Please check with your system administrator. Thank you.
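
For anyone hitting the same message, one way to confirm that the failure lies at the driver or hardware level rather than in the training code is to check "nvidia-smi" before launching a run. The following is only a minimal sketch, not part of the OWOD repository: the function names `gpu_lost` and `check_gpus` are hypothetical, and it assumes that matching the substring "GPU is lost" (taken from the message quoted in this issue) is enough to detect the condition.

```python
import shutil
import subprocess

# Substring taken from the nvidia-smi message quoted in this issue.
# Assumption: other driver versions word the message the same way.
LOST_MARKER = "GPU is lost"


def gpu_lost(smi_output: str) -> bool:
    """Return True if the given nvidia-smi output reports a lost GPU."""
    return LOST_MARKER in smi_output


def check_gpus(timeout_s: float = 30.0) -> str:
    """Run nvidia-smi and return 'ok', 'lost', or 'unavailable'."""
    # nvidia-smi is not on PATH: nothing can be checked.
    if shutil.which("nvidia-smi") is None:
        return "unavailable"
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable nvidia-smi is treated as unavailable.
        return "unavailable"
    # The message may appear on either stream, so inspect both.
    if gpu_lost(result.stdout + result.stderr):
        return "lost"
    return "ok" if result.returncode == 0 else "unavailable"


if __name__ == "__main__":
    print(check_gpus())
```

If the script prints "lost" before any training has started, the problem is independent of OWOD, and the reboot suggested by the "nvidia-smi" message (followed by a check with the system administrator if it recurs) is the next step.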