Media-Smart / vedadet

A single stage object detection toolbox based on PyTorch
Apache License 2.0
498 stars, 128 forks

RuntimeError: cuda runtime error (98) : invalid device function #76

Open bilzard opened 2 years ago

bilzard commented 2 years ago

When I tried to train the TinaFace model, the following error occurred.

Error Message

```
$ CUDA_VISIBLE_DEVICES="0" python tools/trainval.py configs/trainval/tinaface/train_my_project_r50_fpn_gn_dcn.py
...
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
error in deformable_im2col: invalid device function
/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/nn/functional.py:3060: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn("Default upsampling behavior when mode={} is changed "
/data/working/my-project/vedadet/vedadet/criteria/iou_bbox_anchor_criterion.py:329: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at /opt/conda/conda-bld/pytorch_1607370172916/work/torch/csrc/utils/python_arg_parser.cpp:882.)
  iou_weights[(bbox_weights.sum(axis=1) > 0).nonzero()] = 1.
THCudaCheck FAIL file=/data/working/my-project/vedadet/vedadet/ops/sigmoid_focal_loss/src/cuda/sigmoid_focal_loss_cuda.cu line=131 error=98 : invalid device function
Traceback (most recent call last):
  File "tools/trainval.py", line 65, in <module>
    main()
  File "tools/trainval.py", line 61, in main
    trainval(cfg, distributed, logger)
  File "/data/working/my-project/vedadet/vedadet/assembler/trainval.py", line 86, in trainval
    looper.start(cfg.max_epochs)
  File "/data/working/my-project/vedadet/vedacore/loopers/epoch_based_looper.py", line 29, in start
    self.epoch_loop(mode)
  File "/data/working/my-project/vedadet/vedacore/loopers/epoch_based_looper.py", line 17, in epoch_loop
    self.cur_results[mode] = engine(data)
  File "/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/working/my-project/vedadet/vedacore/parallel/data_parallel.py", line 30, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/working/my-project/vedadet/vedadet/engines/train_engine.py", line 20, in forward
    return self.forward_impl(**data)
  File "/data/working/my-project/vedadet/vedadet/engines/train_engine.py", line 29, in forward_impl
    losses = self.criterion.loss(feats, img_metas, gt_labels, gt_bboxes,
  File "/data/working/my-project/vedadet/vedadet/criteria/iou_bbox_anchor_criterion.py", line 437, in loss
    losses_cls, losses_bbox, losses_iou = multi_apply(
  File "/data/working/my-project/vedadet/vedacore/misc/utils.py", line 16, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/data/working/my-project/vedadet/vedadet/criteria/iou_bbox_anchor_criterion.py", line 336, in loss_single
    loss_cls = self.loss_cls(
  File "/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/working/my-project/vedadet/vedadet/criteria/losses/focal_loss.py", line 142, in forward
    loss_cls = self.loss_weight * sigmoid_focal_loss(
  File "/data/working/my-project/vedadet/vedadet/criteria/losses/focal_loss.py", line 69, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred, target, gamma, alpha)
  File "/data/working/my-project/vedadet/vedadet/ops/sigmoid_focal_loss/sigmoid_focal_loss.py", line 20, in forward
    loss = sigmoid_focal_loss_ext.forward(input, target, num_classes,
RuntimeError: cuda runtime error (98) : invalid device function at /data/working/my-project/vedadet/vedadet/ops/sigmoid_focal_loss/src/cuda/sigmoid_focal_loss_cuda.cu:131
Segmentation fault (core dumped)
```

Environment

--------------
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.6 LTS
Release:    18.04
Codename:   bionic
--------------
GPU: Tesla T4
NVIDIA-SMI 450.142.00
Driver Version: 450.142.00
CUDA Version: 11.0
--------------
Python 3.8.5
cudatoolkit               11.0.3               h15472ef_9    conda-forge
pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
torchaudio                0.7.2                      py38    pytorch
torchvision               0.8.2           cpu_py38ha229d99_0
--------------
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
--------------
$ conda info
     active environment : vedadet
    active env location : /home/ubuntu/anaconda3/envs/vedadet
            shell level : 1
       user config file : /home/ubuntu/.condarc
 populated config files : /home/ubuntu/.condarc
          conda version : 4.8.4
    conda-build version : not installed
         python version : 3.7.10.final.0
       virtual packages : __cuda=11.0
                          __glibc=2.27
       base environment : /home/ubuntu/anaconda3  (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/ubuntu/anaconda3/pkgs
       envs directories : /home/ubuntu/anaconda3/envs
                          /home/ubuntu/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.4 requests/2.26.0 CPython/3.7.10 Linux/5.4.0-1060-aws ubuntu/18.04.5 glibc/2.27
                UID:GID : 1000:1000
             netrc file : /home/ubuntu/.netrc
           offline mode : False
---------------------------------------
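To compare the CUDA versions recorded in a package listing like the one above (in `conda list` column format), a small parser can pull out the `cudatoolkit` version and the CUDA version embedded in the PyTorch build string. This is a hypothetical helper, not part of vedadet: `parse_conda_list` and `cuda_from_build` are illustrative names, and the sketch assumes the usual `name version build [channel]` column layout.

```python
import re
from typing import Dict, Optional


def parse_conda_list(text: str) -> Dict[str, Dict[str, str]]:
    """Parse `conda list`-style rows (name, version, build[, channel]) keyed by package name."""
    packages: Dict[str, Dict[str, str]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip comments, separators and rows without at least name/version/build.
        if line.startswith("#") or len(fields) < 3:
            continue
        name, version, build = fields[:3]
        packages[name] = {"version": version, "build": build}
    return packages


def cuda_from_build(build: str) -> Optional[str]:
    """Return 'MAJOR.MINOR' from a build string such as 'py3.8_cuda11.0.221_cudnn8.0.5_0'."""
    match = re.search(r"cuda(\d+\.\d+)", build)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Rows copied from the environment listing in this issue.
    listing = """\
cudatoolkit               11.0.3               h15472ef_9    conda-forge
pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
torchvision               0.8.2           cpu_py38ha229d99_0
"""
    pkgs = parse_conda_list(listing)
    print("cudatoolkit:", pkgs["cudatoolkit"]["version"])
    print("pytorch built for CUDA:", cuda_from_build(pkgs["pytorch"]["build"]))
```

For the listing above it reports `cudatoolkit` 11.0.3 and a PyTorch build for CUDA 11.0; build strings without a `cudaX.Y` tag yield `None`.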
bilzard commented 2 years ago

Building vedadet seems to be successful.

Output

```
$ pip install -v -e .
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Requirement already satisfied: cython in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from -r requirements/build.txt (line 1)) (0.29.26)
Requirement already satisfied: numpy in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from -r requirements/build.txt (line 2)) (1.21.2)
(vedadet) ubuntu@ip-172-31-1-219:/data/working/my-project/vedadet$ pip install -v -e .
Using pip 21.3.1 from /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/pip (python 3.8)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Obtaining file:///data/working/my-project/vedadet
Running command python setup.py egg_info
running egg_info
creating /tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info
writing /tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/dependency_links.txt
writing requirements to /tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/requires.txt
writing top-level names to /tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/top_level.txt
writing manifest file '/tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file '/tmp/pip-pip-egg-info-zq53t05y/vedadet.egg-info/SOURCES.txt'
Preparing metadata (setup.py) ... done
Requirement already satisfied: addict in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (2.4.0)
Requirement already satisfied: terminaltables in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (3.1.10)
Requirement already satisfied: opencv-python in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (4.5.4.60)
Requirement already satisfied: torchvision>=0.7.0 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (0.8.0a0)
Requirement already satisfied: pyyaml in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (6.0)
Requirement already satisfied: yapf in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (0.31.0)
Requirement already satisfied: imagecorruptions in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (1.1.2)
Requirement already satisfied: mmpycocotools in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from vedadet==0.1.0) (12.0.3)
Requirement already satisfied: torch in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from torchvision>=0.7.0->vedadet==0.1.0) (1.7.1)
Requirement already satisfied: pillow>=4.1.1 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from torchvision>=0.7.0->vedadet==0.1.0) (8.4.0)
Requirement already satisfied: numpy in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from torchvision>=0.7.0->vedadet==0.1.0) (1.21.2)
Requirement already satisfied: scikit-image>=0.15 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from imagecorruptions->vedadet==0.1.0) (0.19.1)
Requirement already satisfied: scipy>=1.2.1 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from imagecorruptions->vedadet==0.1.0) (1.7.3)
Requirement already satisfied: setuptools>=18.0 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from mmpycocotools->vedadet==0.1.0) (59.6.0)
Requirement already satisfied: matplotlib>=2.1.0 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from mmpycocotools->vedadet==0.1.0) (3.5.1)
Requirement already satisfied: cython>=0.27.3 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from mmpycocotools->vedadet==0.1.0) (0.29.26)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (1.3.2)
Requirement already satisfied: packaging>=20.0 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (21.3)
Requirement already satisfied: fonttools>=4.22.0 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (4.28.4)
Requirement already satisfied: pyparsing>=2.2.1 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (3.0.6)
Requirement already satisfied: python-dateutil>=2.7 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (0.11.0)
Requirement already satisfied: imageio>=2.4.1 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from scikit-image>=0.15->imagecorruptions->vedadet==0.1.0) (2.13.3)
Requirement already satisfied: tifffile>=2019.7.26 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from scikit-image>=0.15->imagecorruptions->vedadet==0.1.0) (2021.11.2)
Requirement already satisfied: networkx>=2.2 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from scikit-image>=0.15->imagecorruptions->vedadet==0.1.0) (2.6.3)
Requirement already satisfied: PyWavelets>=1.1.1 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from scikit-image>=0.15->imagecorruptions->vedadet==0.1.0) (1.2.0)
Requirement already satisfied: typing_extensions in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from torch->torchvision>=0.7.0->vedadet==0.1.0) (4.0.1)
Requirement already satisfied: six>=1.5 in /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib>=2.1.0->mmpycocotools->vedadet==0.1.0) (1.16.0)
Installing collected packages: vedadet
Attempting uninstall: vedadet
Found existing installation: vedadet 0.1.0
Uninstalling vedadet-0.1.0:
Removing file or directory /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/vedadet.egg-link
Removing pth entries from /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/easy-install.pth:
Removing entry: /data/working/my-project/vedadet
Successfully uninstalled vedadet-0.1.0
Running setup.py develop for vedadet
Running command /home/ubuntu/anaconda3/envs/vedadet/bin/python3.8 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/data/working/my-project/vedadet/setup.py'"'"'; __file__='"'"'/data/working/my-project/vedadet/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
running develop
running egg_info
writing vedadet.egg-info/PKG-INFO
writing dependency_links to vedadet.egg-info/dependency_links.txt
writing requirements to vedadet.egg-info/requires.txt
writing top-level names to vedadet.egg-info/top_level.txt
reading manifest file 'vedadet.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'vedadet.egg-info/SOURCES.txt'
running build_ext
building 'vedadet.ops.nms.nms_ext' extension
Emitting ninja build file /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
ninja: no work to do.
g++ -pthread -shared -B /home/ubuntu/anaconda3/envs/vedadet/compiler_compat -L/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,-rpath=/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,--no-as-needed -Wl,--sysroot=/ /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/nms_ext.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cpu/nms_cpu.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_cuda.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o -L/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/vedadet/ops/nms/nms_ext.cpython-38-x86_64-linux-gnu.so
building 'vedadet.ops.sigmoid_focal_loss.sigmoid_focal_loss_ext' extension
Emitting ninja build file /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -shared -B /home/ubuntu/anaconda3/envs/vedadet/compiler_compat -L/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,-rpath=/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,--no-as-needed -Wl,--sysroot=/ /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/sigmoid_focal_loss/src/sigmoid_focal_loss_ext.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/sigmoid_focal_loss/src/cuda/sigmoid_focal_loss_cuda.o -L/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/vedadet/ops/sigmoid_focal_loss/sigmoid_focal_loss_ext.cpython-38-x86_64-linux-gnu.so
building 'vedadet.ops.dcn.deform_conv_ext' extension
Emitting ninja build file /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -shared -B /home/ubuntu/anaconda3/envs/vedadet/compiler_compat -L/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,-rpath=/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,--no-as-needed -Wl,--sysroot=/ /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/dcn/src/deform_conv_ext.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/dcn/src/cuda/deform_conv_cuda.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/dcn/src/cuda/deform_conv_cuda_kernel.o -L/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/vedadet/ops/dcn/deform_conv_ext.cpython-38-x86_64-linux-gnu.so
building 'vedadet.ops.dcn.deform_pool_ext' extension
Emitting ninja build file /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -shared -B /home/ubuntu/anaconda3/envs/vedadet/compiler_compat -L/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,-rpath=/home/ubuntu/anaconda3/envs/vedadet/lib -Wl,--no-as-needed -Wl,--sysroot=/ /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/dcn/src/deform_pool_ext.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/dcn/src/cuda/deform_pool_cuda.o /data/working/my-project/vedadet/build/temp.linux-x86_64-3.8/vedadet/ops/dcn/src/cuda/deform_pool_cuda_kernel.o -L/home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/vedadet/ops/dcn/deform_pool_ext.cpython-38-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.8/vedadet/ops/nms/nms_ext.cpython-38-x86_64-linux-gnu.so -> vedadet/ops/nms
copying build/lib.linux-x86_64-3.8/vedadet/ops/sigmoid_focal_loss/sigmoid_focal_loss_ext.cpython-38-x86_64-linux-gnu.so -> vedadet/ops/sigmoid_focal_loss
copying build/lib.linux-x86_64-3.8/vedadet/ops/dcn/deform_conv_ext.cpython-38-x86_64-linux-gnu.so -> vedadet/ops/dcn
copying build/lib.linux-x86_64-3.8/vedadet/ops/dcn/deform_pool_ext.cpython-38-x86_64-linux-gnu.so -> vedadet/ops/dcn
Creating /home/ubuntu/anaconda3/envs/vedadet/lib/python3.8/site-packages/vedadet.egg-link (link to .)
Adding vedadet 0.1.0 to easy-install.pth file
Installed /data/working/my-project/vedadet
Successfully installed vedadet-0.1.0
```
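In the build log above every extension reports `ninja: no work to do`, so the object files already under `build/temp.linux-x86_64-3.8/` were only relinked, not recompiled. If the CUDA toolchain has changed since those objects were produced, deleting them before reinstalling may be necessary for a genuine rebuild. The sketch below is a hypothetical helper (not part of vedadet) that, assuming it is run from the repository root, removes the `build/` directory and the `*.so` files that `setup.py develop` copied into `vedadet/ops/`. It defaults to a dry run that only prints what it would delete.

```python
import shutil
from pathlib import Path
from typing import List


def stale_artifacts(repo_root: Path) -> List[Path]:
    """List the build directory and the compiled extensions copied into vedadet/ops."""
    targets: List[Path] = []
    build_dir = repo_root / "build"
    if build_dir.is_dir():
        targets.append(build_dir)
    ops_dir = repo_root / "vedadet" / "ops"
    if ops_dir.is_dir():
        # e.g. vedadet/ops/nms/nms_ext.cpython-38-x86_64-linux-gnu.so
        targets.extend(sorted(ops_dir.rglob("*.so")))
    return targets


def clean(repo_root: Path, dry_run: bool = True) -> List[Path]:
    """Delete the stale artifacts, or with dry_run=True only report them."""
    targets = stale_artifacts(repo_root)
    for path in targets:
        print(("would remove " if dry_run else "removing ") + str(path))
        if dry_run:
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    return targets


if __name__ == "__main__":
    # Inspect first; call clean(Path("."), dry_run=False) to actually delete,
    # then rebuild with: pip install -v -e .
    clean(Path("."), dry_run=True)
```

Only compiled outputs are touched; the Python sources under `vedadet/ops/` are left alone.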

jasonlbx13 commented 1 year ago

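The local CUDA version (the toolkit that `nvcc` belongs to) needs to be the same as the `cudatoolkit` version in the conda environment. In the environment above, for example, `nvcc` reports release 10.0 while `cudatoolkit` is 11.0.3 and PyTorch was built for CUDA 11.0. After making them equal, rebuild with `pip install -v -e .` and it should work.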
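One way to check this condition is to compare the CUDA version PyTorch was built with (`torch.version.cuda`) against the release reported by the `nvcc` used to compile the ops. The sketch below assumes that `nvcc` lives under `torch.utils.cpp_extension.CUDA_HOME` (falling back to whatever `nvcc` is on `PATH`) and compares only the major.minor version; the helper names are hypothetical, not part of vedadet or PyTorch.

```python
import os
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'MAJOR.MINOR' from `nvcc --version` text ('... release 10.0, V10.0.130')."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def major_minor(version: str) -> str:
    """Reduce a version such as '11.0.3' to '11.0'."""
    return ".".join(version.split(".")[:2])


def cuda_versions_match(torch_cuda: str, nvcc_release: str) -> bool:
    """Compare only MAJOR.MINOR of PyTorch's CUDA build and the local nvcc."""
    return major_minor(torch_cuda) == major_minor(nvcc_release)


def find_nvcc(cuda_home: Optional[str]) -> Optional[str]:
    """Prefer <cuda_home>/bin/nvcc, then whatever nvcc is on PATH."""
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    return shutil.which("nvcc")


if __name__ == "__main__":
    import torch
    from torch.utils.cpp_extension import CUDA_HOME

    nvcc = find_nvcc(CUDA_HOME)
    if nvcc is None or torch.version.cuda is None:
        raise SystemExit("No nvcc found, or PyTorch is a CPU-only build")
    output = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
    release = parse_nvcc_release(output)
    print(f"PyTorch CUDA: {torch.version.cuda}, nvcc ({nvcc}): {release}")
    if release is None or not cuda_versions_match(torch.version.cuda, release):
        print("Versions differ: align them, then rebuild with `pip install -v -e .`")
```

With the `nvcc --version` output shown in this issue, `parse_nvcc_release` yields `"10.0"`, which does not match PyTorch's CUDA 11.0 build.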