rivercn commented 3 years ago

### Hardware environment:

1. CentOS 7 server
2. System (official) CUDA version: CUDA 10.0
3. conda runtime environment: Python 3, PyTorch 1.4, cudatoolkit 10.1
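
The list above mentions two CUDA versions (a system CUDA 10.0 and a conda cudatoolkit 10.1), so it may be worth confirming which toolkit PyTorch was built against and which one a locally compiled extension would pick up. A minimal sketch, assuming a PyTorch install that exposes `torch.version.cuda`, `torch.cuda.get_device_capability` and `torch.utils.cpp_extension.CUDA_HOME`; the function name `cuda_build_report` is made up for illustration:

```python
def cuda_build_report() -> dict:
    """Collect the CUDA-related facts that matter when compiling extensions.

    Returns {"torch": None} if PyTorch cannot be imported at all.
    """
    try:
        import torch
        from torch.utils.cpp_extension import CUDA_HOME
    except ImportError:
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        # CUDA version the PyTorch binary itself was built against.
        "torch_cuda": torch.version.cuda,
        # Toolkit location that extension builds are expected to use.
        "cuda_home": CUDA_HOME,
        "devices": [],
    }
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(index)
            capability = torch.cuda.get_device_capability(index)
            report["devices"].append((name, capability))
    return report


if __name__ == "__main__":
    for key, value in cuda_build_report().items():
        print(f"{key}: {value}")
```

If `torch_cuda` and the toolkit under `cuda_home` disagree, that is only a hint, not proof, that the extension was compiled with different settings from the ones PyTorch expects.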

    (PSLI) [zhusong@localhost SOLO]$ nvidia-smi
    Fri Aug 13 10:56:36 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
    | 29%   45C    P0    63W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
    | 33%   47C    P0    63W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
    | 32%   46C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
    | 35%   49C    P0    59W / 250W |      0MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

### Error output:

    loading annotations into memory...
    Done (t=13.39s)
    creating index...
    index created!
    2021-08-13 10:47:28,953 - mmdet - INFO - Start running, host: zhusong@localhost.localdomain, work_dir: /home/zhusong/project/SOLO/work_dirs/decoupled_solo_light_release_r50_fpn_8gpu_3x
    2021-08-13 10:47:28,953 - mmdet - INFO - workflow: [('train', 1)], max: 36 epochs
    /home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/functional.py:2506: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
      "See the documentation of nn.Upsample for details.".format(mode))
    THCudaCheck FAIL file=mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss_cuda.cu line=128 error=209 : no kernel image is available for execution on the device
    Traceback (most recent call last):
      File "tools/train.py", line 125, in <module>
        main()
      File "tools/train.py", line 121, in main
        timestamp=timestamp)
      File "/home/zhusong/project/SOLO/mmdet/apis/train.py", line 111, in train_detector
        timestamp=timestamp)
      File "/home/zhusong/project/SOLO/mmdet/apis/train.py", line 297, in _non_dist_train
        runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/mmcv-0.2.16-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 364, in run
        epoch_runner(data_loaders[i], **kwargs)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/mmcv-0.2.16-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 268, in train
        self.model, data_batch, train_mode=True, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/apis/train.py", line 78, in batch_processor
        losses = model(**data)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/core/fp16/decorators.py", line 49, in new_func
        return old_func(*args, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/models/detectors/base.py", line 142, in forward
        return self.forward_train(img, img_meta, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/models/detectors/single_stage_ins.py", line 78, in forward_train
        *loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
      File "/home/zhusong/project/SOLO/mmdet/models/anchor_heads/decoupled_solo_light_head.py", line 258, in loss
        loss_cate = self.loss_cate(flatten_cate_preds, flatten_cate_labels, avg_factor=num_ins + 1)
      File "/home/zhusong/.conda/envs/PSLI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhusong/project/SOLO/mmdet/models/losses/focal_loss.py", line 79, in forward
        avg_factor=avg_factor)
      File "/home/zhusong/project/SOLO/mmdet/models/losses/focal_loss.py", line 37, in sigmoid_focal_loss
        loss = _sigmoid_focal_loss(pred, target, gamma, alpha)
      File "/home/zhusong/project/SOLO/mmdet/ops/sigmoid_focal_loss/sigmoid_focal_loss.py", line 19, in forward
        gamma, alpha)
    RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss_cuda.cu:128
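
"No kernel image is available" generally means the compiled extension contains no code the GPU can run: neither native code for its architecture nor PTX the driver could JIT-compile. The sketch below encodes that compatibility rule in plain Python. The function names and the example build lists are hypothetical, since the thread does not say which architectures the mmdet ops were actually compiled for. The TITAN Xp is a compute capability 6.1 device.

```python
from typing import Iterable, Tuple


def parse_arch(name: str) -> Tuple[str, int, int]:
    """Split 'sm_61' or 'compute_61' into (kind, major, minor)."""
    kind, _, digits = name.partition("_")
    if kind not in ("sm", "compute") or len(digits) < 2 or not digits.isdigit():
        raise ValueError(f"unrecognised architecture name: {name!r}")
    return kind, int(digits[:-1]), int(digits[-1])


def kernel_image_available(device_cc: Tuple[int, int],
                           built_for: Iterable[str]) -> bool:
    """Return True if a binary built for `built_for` can run on `device_cc`."""
    dev_major, dev_minor = device_cc
    for name in built_for:
        kind, major, minor = parse_arch(name)
        if kind == "sm":
            # Native code runs only within the same major version,
            # on a device with an equal or newer minor revision.
            if major == dev_major and minor <= dev_minor:
                return True
        else:
            # PTX can be JIT-compiled for any equal or newer capability.
            if (major, minor) <= (dev_major, dev_minor):
                return True
    return False


TITAN_XP = (6, 1)

# Hypothetical build lists, for illustration only.
print(kernel_image_available(TITAN_XP, ["sm_70", "sm_75"]))  # False
print(kernel_image_available(TITAN_XP, ["sm_60"]))           # True
print(kernel_image_available((8, 6), ["sm_61"]))             # False
```

Under this rule, a build that targets only newer or only older major architectures than the installed GPU would fail the same way as the log above.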


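For verification I followed the post below, and its example code runs fine: https://heary.cn/posts/PyTorch%E6%8A%A5CUDA-error-no-kernel-image-is-available-for-execution-on-the-device%E9%97%AE%E9%A2%98%E8%A7%A3%E5%86%B3/ I have not tested a lower PyTorch version yet. Is there some other fix in /sigmoid_focal_loss_cuda.cu? It looks like a problem with how the two focal loss hyperparameters are computed.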

onepiece010938 commented 2 years ago

same error plz

onepiece010938 commented 2 years ago

I fixed it. The error was caused by a change of environment (the GPU went from a 1080 to an A4000). Just remove the build files under SOLOv2 and rebuild.
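
A rough sketch of the "remove the build files" step, assuming the stale artifacts are a top-level `build/` directory plus compiled `.so` modules left under `mmdet/` by an in-place build. That layout is an assumption, and the helper names are made up for illustration. The default is a dry run, so nothing is deleted until the list has been checked.

```python
import shutil
from pathlib import Path
from typing import List


def find_build_artifacts(repo_root: Path) -> List[Path]:
    """List the build directory and compiled extension modules, if any."""
    found: List[Path] = []
    build_dir = repo_root / "build"
    if build_dir.is_dir():
        found.append(build_dir)
    package_dir = repo_root / "mmdet"
    if package_dir.is_dir():
        # In-place builds leave .so files next to the Python sources.
        found.extend(sorted(package_dir.rglob("*.so")))
    return found


def remove_build_artifacts(repo_root: Path, dry_run: bool = True) -> List[Path]:
    """Return the artifacts found; delete them only when dry_run is False."""
    artifacts = find_build_artifacts(repo_root)
    if not dry_run:
        for path in artifacts:
            if path.is_dir():
                shutil.rmtree(path)
            else:
                path.unlink()
    return artifacts


if __name__ == "__main__":
    for path in remove_build_artifacts(Path("."), dry_run=True):
        print("would remove:", path)
```

After the cleanup, rebuild with whatever install command was used originally, run on the machine that has the new GPU so the ops are compiled in the current environment.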

dragonhaha commented 2 years ago

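I hit the same problem in a Colab environment, and the method from the comment above did not solve it. I replaced mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss.cpp and mmdet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss.cpp entirely with the contents from the link below, then rebuilt mmdet, and it finally ran successfully. link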

Environment attached:
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GPU 0: Tesla P100-PCIE-16GB
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.0+cu111
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1+cu111
OpenCV: 4.1.2
MMCV: 0.2.16
MMDetection: 1.0.0+95f3732
MMDetection Compiler: GCC 7.5
MMDetection CUDA Compiler: 11.1
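
The "NVCC architecture flags" entry in the dump above can be read mechanically. The sketch below extracts the native targets from that string, copied from this comment, and checks for `sm_60`, which matches the Tesla P100 (compute capability 6.0). Note that these flags describe the PyTorch binary only. They say nothing about how the mmdet ops were compiled, so this is a sanity check on PyTorch itself rather than a diagnosis. The helper name `native_targets` is made up for illustration.

```python
import re
from typing import Set

# Copied from the environment dump in this comment.
NVCC_FLAGS = (
    "-gencode;arch=compute_37,code=sm_37;"
    "-gencode;arch=compute_50,code=sm_50;"
    "-gencode;arch=compute_60,code=sm_60;"
    "-gencode;arch=compute_70,code=sm_70;"
    "-gencode;arch=compute_75,code=sm_75;"
    "-gencode;arch=compute_80,code=sm_80;"
    "-gencode;arch=compute_86,code=sm_86"
)


def native_targets(flags: str) -> Set[str]:
    """Return the sm_XX targets named in a -gencode flag string."""
    return set(re.findall(r"code=(sm_\d+)", flags))


targets = native_targets(NVCC_FLAGS)
print(sorted(targets))
# ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
print("sm_60" in targets)  # True
```

Since PyTorch's own kernels cover the P100 here, the failing kernel is more likely to come from the locally compiled extension, which is consistent with the error pointing at `sigmoid_focal_loss_cuda.cu`.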

WXinlong / SOLO

no kernel image is available for execution on the device at src/sigmoid_focal_loss_cuda.cu:128 #186
