WXinlong / SOLO

SOLO and SOLOv2 for instance segmentation, ECCV 2020 & NeurIPS 2020.

no kernel image is available for execution on the device at src/sigmoid_focal_loss_cuda.cu:128 #186

Open rivercn opened 3 years ago

rivercn commented 3 years ago

### Hardware environment:

1. CentOS 7 server
2. System CUDA version (official install): CUDA 10.0
3. conda runtime environment: Python 3, PyTorch 1.4, cudatoolkit 10.1

    (PSLI) [zhusong@localhost SOLO]$ nvidia-smi
    Fri Aug 13 10:56:36 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
    | 29%   45C    P0    63W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
    | 33%   47C    P0    63W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
    | 32%   46C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
    | 35%   49C    P0    59W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

### Error output:

    loading annotations into memory...
    Done (t=13.39s)
    creating index...
    index created!
    2021-08-13 10:47:28,953 - mmdet - INFO - Start running, host: zhusong@localhost.localdomain, work_dir: /home/zhusong/project/SOLO/work_dirs/decoupled_solo_light_release_r50_fpn_8gpu_3x
    2021-08-13 10:47:28,953 - mmdet - INFO - workflow: [('train', 1)], max: 36 epochs
    /home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/functional.py:2506: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
      "See the documentation of nn.Upsample for details.".format(mode))
    THCudaCheck FAIL file=mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss_cuda.cu line=128 error=209 : no kernel image is available for execution on the device
    Traceback (most recent call last):
      File "tools/train.py", line 125, in <module>
        main()
      File "tools/train.py", line 121, in main
        timestamp=timestamp)
      File "/home/zhusong/project/SOLO/mmdet/apis/train.py", line 111, in train_detector
        timestamp=timestamp)
      File "/home/zhusong/project/SOLO/mmdet/apis/train.py", line 297, in _non_dist_train
        runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/mmcv-0.2.16-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 364, in run
        epoch_runner(data_loaders[i], **kwargs)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/mmcv-0.2.16-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 268, in train
        self.model, data_batch, train_mode=True, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/apis/train.py", line 78, in batch_processor
        losses = model(**data)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/core/fp16/decorators.py", line 49, in new_func
        return old_func(*args, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/models/detectors/base.py", line 142, in forward
        return self.forward_train(img, img_meta, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/models/detectors/single_stage_ins.py", line 78, in forward_train
        *loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
      File "/home/zhusong/project/SOLO/mmdet/models/anchor_heads/decoupled_solo_light_head.py", line 258, in loss
        loss_cate = self.loss_cate(flatten_cate_preds, flatten_cate_labels, avg_factor=num_ins + 1)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/models/losses/focal_loss.py", line 79, in forward
        avg_factor=avg_factor)
      File "/home/zhusong/project/SOLO/mmdet/models/losses/focal_loss.py", line 37, in sigmoid_focal_loss
        loss = _sigmoid_focal_loss(pred, target, gamma, alpha)
      File "/home/zhusong/project/SOLO/mmdet/ops/sigmoid_focal_loss/sigmoid_focal_loss.py", line 19, in forward
        gamma, alpha)
    RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss_cuda.cu:128


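To verify the setup I followed the post below, and its example code runs correctly: https://heary.cn/posts/PyTorch%E6%8A%A5CUDA-error-no-kernel-image-is-available-for-execution-on-the-device%E9%97%AE%E9%A2%98%E8%A7%A3%E5%86%B3/

I have not yet tested a lower PyTorch version. Is there some other fix inside sigmoid_focal_loss_cuda.cu? It looks like a problem in how the two focal loss hyperparameters are computed.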

onepiece010938 commented 2 years ago

same error plz

onepiece010938 commented 2 years ago

I fixed it. The error was caused by changing my environment (GPU from a 1080 to an A4000). Just remove the build files under SOLOv2 and rebuild.
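The fix above fits the usual meaning of this error: the compiled extension holds no kernel for the GPU's compute capability, so a build made on one card (a GTX 1080 is 6.1) cannot run on a different one (an A4000 is 8.6). Here is a minimal sketch of how one might check for that mismatch before rebuilding. The helper names are hypothetical and not part of SOLO or mmdet. It assumes the extension is built through `torch.utils.cpp_extension`, which in recent PyTorch releases reads the `TORCH_CUDA_ARCH_LIST` environment variable. The check is deliberately simplified and ignores PTX forward compatibility.

```python
def arch_list_value(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST value, e.g. '6.1;8.6'."""
    unique = sorted(set(tuple(c) for c in capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def has_native_kernel(device_capability, built_capabilities):
    """Simplified check: was the extension compiled for exactly this capability?

    Assumption: embedded PTX, which could be JIT-compiled for a newer GPU,
    is ignored here.
    """
    return tuple(device_capability) in {tuple(c) for c in built_capabilities}


def detect_capabilities():
    """Return the compute capability of each visible GPU, or [] without CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]


# Example with the GPUs mentioned in this thread (capabilities hard-coded):
old_build = [(6, 1)]   # extension built while a GTX 1080 was installed
new_gpu = (8, 6)       # RTX A4000
if not has_native_kernel(new_gpu, old_build):
    print("Rebuild needed; for example set TORCH_CUDA_ARCH_LIST="
          + arch_list_value([new_gpu]))
```

On a real machine, `detect_capabilities()` would supply the list for `arch_list_value`. Deleting the old build output first matters because a stale compiled object may otherwise be reused.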

dragonhaha commented 2 years ago

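I hit the same problem in a Colab environment, and the method from the comment above did not solve it. I replaced the contents of mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss.cpp and mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss.cpp with the contents at the following link, then rebuilt mmdet, and it finally ran successfully. link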

Environment:

- CUDA available: True
- CUDA_HOME: /usr/local/cuda
- NVCC: Build cuda_11.1.TC455_06.29190527_0
- GPU 0: Tesla P100-PCIE-16GB
- GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
- PyTorch: 1.10.0+cu111
- PyTorch compiling details: PyTorch built with:
- TorchVision: 0.11.1+cu111
- OpenCV: 4.1.2
- MMCV: 0.2.16
- MMDetection: 1.0.0+95f3732
- MMDetection Compiler: GCC 7.5
- MMDetection CUDA Compiler: 11.1
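When comparing environment reports like the one above, one thing worth checking is whether the CUDA version PyTorch was built against matches the toolkit that compiled the extension. Below is a minimal sketch that parses the two strings from such a report. The function names are hypothetical, and the parsing assumes the `+cuXYZ` wheel suffix and the `cuda_X.Y` prefix shown in this thread. Nothing in the thread establishes that a version mismatch caused the error, so this is only a sanity check.

```python
import re


def cuda_from_torch_tag(torch_version):
    """Extract (major, minor) from a version like '1.10.0+cu111', else None."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version; the rest is the major version.
    return int(digits[:-1]), int(digits[-1])


def cuda_from_nvcc_build(build_string):
    """Extract (major, minor) from a string like 'cuda_11.1.TC455_...', else None."""
    match = re.search(r"cuda_(\d+)\.(\d+)", build_string)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def versions_match(torch_version, nvcc_build):
    """True only if both versions parse and are equal."""
    torch_cuda = cuda_from_torch_tag(torch_version)
    nvcc_cuda = cuda_from_nvcc_build(nvcc_build)
    return torch_cuda is not None and torch_cuda == nvcc_cuda


# Values copied from the environment report in this comment:
print(versions_match("1.10.0+cu111", "cuda_11.1.TC455_06.29190527_0"))
```

For the report above, both strings parse to CUDA 11.1, so the toolkit versions agree. A version string without a `+cu` suffix (for example a conda install) cannot be checked this way and returns `None`.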