facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

CUDA error: no kernel image is available for execution on the device #65

Closed guancheng817 closed 4 years ago

guancheng817 commented 4 years ago

Hi, thanks for your great codebase. When I run

python tools/run_net.py --cfg configs/AVA/SLOWFAST_32x2_R50_SHORT.yaml NUM_GPUS 2

the following error appears. It looks like a compilation error in detectron2? Thanks for your help.

[INFO: train_net.py: 288]: Start epoch: 1
Traceback (most recent call last):
  File "tools/run_net.py", line 152, in <module>
    main()
  File "tools/run_net.py", line 124, in main
    daemon=False,
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/mnt/slowfast/slowfast/utils/multiprocessing.py", line 50, in run
    func(cfg)
  File "/mnt/slowfast/tools/train_net.py", line 294, in train
    train_epoch(train_loader, model, optimizer, train_meter, cur_epoch, cfg)
  File "/mnt/slowfast/tools/train_net.py", line 64, in train_epoch
    preds = model(inputs, meta["boxes"])
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/slowfast/slowfast/models/video_model_builder.py", line 372, in forward
    x = self.head(x, bboxes)
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/slowfast/slowfast/models/head_helper.py", line 115, in forward
    out = roi_align(out, bboxes)
  File "/home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/slowfast/detectron2_repo/detectron2/layers/roi_align.py", line 95, in forward
    input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
  File "/mnt/slowfast/detectron2_repo/detectron2/layers/roi_align.py", line 20, in forward
    input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: no kernel image is available for execution on the device (ROIAlign_forward_cuda at /mnt/slowfast/detectron2_repo/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:361)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6d (0x7fdef37dae7d in /home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: detectron2::ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa04 (0x7fdeba271b04 in /mnt/slowfast/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: detectron2::ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xbc (0x7fdeba1f934c in /mnt/slowfast/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x5cdba (0x7fdeba20bdba in /mnt/slowfast/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x5cebe (0x7fdeba20bebe in /mnt/slowfast/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x5755b (0x7fdeba20655b in /mnt/slowfast/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #6: _PyMethodDef_RawFastCallKeywords + 0x264 (0x564e51e08c34 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #7: _PyCFunction_FastCallKeywords + 0x21 (0x564e51e08d51 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x4ebc (0x564e51e750ac in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #9: _PyFunction_FastCallDict + 0x10b (0x564e51db91db in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #10: THPFunction_apply(_object*, _object*) + 0xa26 (0x7fdf1517c826 in /home/gc/anaconda3/envs/slowfast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: _PyMethodDef_RawFastCallKeywords + 0x1e0 (0x564e51e08bb0 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #12: _PyCFunction_FastCallKeywords + 0x21 (0x564e51e08d51 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x4784 (0x564e51e74974 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #14: _PyFunction_FastCallDict + 0x10b (0x564e51db91db in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #15: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #16: PyObject_Call + 0x6e (0x564e51dcaa3e in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x1f3a (0x564e51e7212a in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x564e51db81b9 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #19: _PyFunction_FastCallDict + 0x1d5 (0x564e51db92a5 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #20: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #21: <unknown function> + 0x16a2da (0x564e51e0f2da in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #22: _PyObject_FastCallKeywords + 0x49b (0x564e51e1019b in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x4a86 (0x564e51e74c76 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #24: _PyFunction_FastCallDict + 0x10b (0x564e51db91db in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #25: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #26: PyObject_Call + 0x6e (0x564e51dcaa3e in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1f3a (0x564e51e7212a in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x2f9 (0x564e51db81b9 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #29: _PyFunction_FastCallDict + 0x1d5 (0x564e51db92a5 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #30: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #31: <unknown function> + 0x16a2da (0x564e51e0f2da in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #32: _PyObject_FastCallKeywords + 0x49b (0x564e51e1019b in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x52e6 (0x564e51e754d6 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x564e51db81b9 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #35: _PyFunction_FastCallDict + 0x1d5 (0x564e51db92a5 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #36: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #37: PyObject_Call + 0x6e (0x564e51dcaa3e in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x1f3a (0x564e51e7212a in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x564e51db81b9 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #40: _PyFunction_FastCallDict + 0x1d5 (0x564e51db92a5 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #41: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #42: <unknown function> + 0x16a2da (0x564e51e0f2da in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #43: PyObject_Call + 0x6e (0x564e51dcaa3e in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x1f3a (0x564e51e7212a in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #45: _PyEval_EvalCodeWithName + 0x2f9 (0x564e51db81b9 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #46: _PyFunction_FastCallDict + 0x1d5 (0x564e51db92a5 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #47: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #48: PyObject_Call + 0x6e (0x564e51dcaa3e in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x1f3a (0x564e51e7212a in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #50: _PyEval_EvalCodeWithName + 0x2f9 (0x564e51db81b9 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #51: _PyFunction_FastCallDict + 0x1d5 (0x564e51db92a5 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #52: _PyObject_Call_Prepend + 0x63 (0x564e51dd7e33 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #53: <unknown function> + 0x16a2da (0x564e51e0f2da in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #54: _PyObject_FastCallKeywords + 0x49b (0x564e51e1019b in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x4a86 (0x564e51e74c76 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #56: _PyEval_EvalCodeWithName + 0xab8 (0x564e51db8978 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #57: _PyFunction_FastCallKeywords + 0x387 (0x564e51e08437 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x416 (0x564e51e70606 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #59: _PyFunction_FastCallKeywords + 0xfb (0x564e51e081ab in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x416 (0x564e51e70606 in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #61: _PyFunction_FastCallDict + 0x10b (0x564e51db91db in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x1f3a (0x564e51e7212a in /home/gc/anaconda3/envs/slowfast/bin/python)
frame #63: _PyFunction_FastCallDict + 0x10b (0x564e51db91db in /home/gc/anaconda3/envs/slowfast/bin/python)

guancheng817 commented 4 years ago

My environment: Ubuntu 16.04, PyTorch 1.3.1, cudatoolkit 10.0, cuDNN 7.6, running in an Anaconda env. Thanks.

guancheng817 commented 4 years ago

I solved this by using only one GPU, the 1070 Ti (the other one is a 980 Ti). But I have run into another problem. When I run

python tools/run_net.py --cfg configs/AVA/c2/SLOWFAST_32x2_R101_50_50_v2.1.yaml NUM_GPUS 1 TRAIN.ENABLE False TEST.CHECKPOINT_FILE_PATH model_zoo/SLOWFAST_32x2_R101_50_50_v2.1.pkl TEST.BATCH_SIZE 1

the following error appears:
Traceback (most recent call last):
  File "tools/run_net.py", line 152, in <module>
    main()
  File "tools/run_net.py", line 147, in main
    test(cfg=cfg)
  File "/mnt/slowfast/tools/test_net.py", line 182, in test
    perform_test(test_loader, model, test_meter, cfg)
  File "/mnt/slowfast/tools/test_net.py", line 77, in perform_test
    ori_boxes.detach().cpu(),
UnboundLocalError: local variable 'ori_boxes' referenced before assignment

How should I handle ori_boxes when using 1 GPU? Thanks for your help.

haooooooqi commented 4 years ago

Hi @guancheng817, thanks for using the codebase! The issue you found is a frequently reported one in detectron2; you should be able to find some useful suggestions here.
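
One common cause of "no kernel image is available" is a compiled extension built for a different GPU architecture than the one it runs on, which would be consistent with the mixed 1070 Ti / 980 Ti machine reported above. The sketch below is only a diagnostic aid, not part of SlowFast or detectron2: `arch_list` and `detect_capabilities` are hypothetical helper names, and it assumes the installed PyTorch extension builder honors the `TORCH_CUDA_ARCH_LIST` environment variable when detectron2 is rebuilt.

```python
from typing import Iterable, List, Tuple


def arch_list(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Build a TORCH_CUDA_ARCH_LIST value covering every (major, minor) pair."""
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities() -> List[Tuple[int, int]]:
    """Return the compute capability of each visible GPU.

    Returns an empty list when torch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    caps = detect_capabilities()
    print("Detected compute capabilities:", caps)
    if len(set(caps)) > 1:
        # Assumption: rebuilding the extension with this value set in the
        # environment makes it include kernels for every listed architecture.
        print("Mixed architectures; consider TORCH_CUDA_ARCH_LIST=" + repr(arch_list(caps)))
```

For the two cards in this thread (GTX 980 Ti, compute capability 5.2, and GTX 1070 Ti, compute capability 6.1), `arch_list([(6, 1), (5, 2)])` gives `5.2;6.1`, which could be exported before recompiling detectron2.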

Regarding the second issue you described, we recommend using more than one GPU (as we do in the default setting). If you prefer to run AVA with 1 GPU, you can find something useful here and here.
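
An UnboundLocalError like the one on `ori_boxes` typically means the variable is assigned only inside a branch that is skipped, here presumably a multi-GPU-only branch. The sketch below illustrates that pattern and one way around it; it is not the actual test_net.py code. The function name, the `meta["ori_boxes"]` and `meta["metadata"]` keys, and the `all_gather` callable are all assumptions made for illustration.

```python
def gather_detection_outputs(preds, meta, num_gpus, all_gather=None):
    """Return (preds, ori_boxes, metadata) for any GPU count.

    Hypothetical helper: the names mirror the traceback in this thread,
    not the real implementation in tools/test_net.py.
    """
    # Read the per-batch values unconditionally, so they are defined
    # even when the multi-GPU branch below is skipped.
    ori_boxes = meta["ori_boxes"]
    metadata = meta["metadata"]

    if num_gpus > 1 and all_gather is not None:
        # Only the cross-process gather depends on having several GPUs.
        preds, ori_boxes, metadata = all_gather([preds, ori_boxes, metadata])

    return preds, ori_boxes, metadata
```

With `num_gpus=1` the inputs are returned unchanged, so a later call such as `ori_boxes.detach().cpu()` would have a defined variable to work on.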

guancheng817 commented 4 years ago

Thank you very much.