lsj1111 commented 4 months ago

The environment is:

sys.platform linux
Python 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
numpy 1.24.3
detectron2 0.6 @/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MASKDINO/detectron2-main/detectron2
Compiler GCC 7.5
CUDA compiler CUDA 11.3
detectron2 arch flags /media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MASKDINO/detectron2-main/detectron2/_C.cpython-38-x86_64-linux-gnu.so; cannot find cuobjdump
DETECTRON2_ENV_MODULE
PyTorch 1.10.0 @/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0,1 NVIDIA GeForce RTX 4090 (arch=8.9)
Driver version 535.183.01
CUDA_HOME :/usr/local/cuda - invalid!
Pillow 10.3.0
torchvision 0.11.0 @/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torchvision
torchvision arch flags /media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torchvision/_C.so; cannot find cuobjdump
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.10.0
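Two entries in this environment dump are worth reading together: the GPU reports `NVIDIA GeForce RTX 4090 (arch=8.9)`, while the CUDA compiler is CUDA 11.3. Compute capability 8.9 (sm_89) was first added in CUDA 11.8, so a CUDA 11.3 toolkit does not list it. A minimal sketch of that check, assuming a small hand-written table of the first CUDA release supporting each architecture (the table and the helper names `parse_version` and `toolkit_knows_arch` are hypothetical, not part of PyTorch or detectron2):

```python
# Hypothetical, hand-written table: compute capability (major, minor) ->
# first CUDA toolkit release (major, minor) that supports it natively.
FIRST_CUDA_FOR_ARCH = {
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
    (8, 6): (11, 1),
    (8, 9): (11, 8),
    (9, 0): (11, 8),
}


def parse_version(text):
    """Turn a version string such as '11.3' into the tuple (11, 3)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_knows_arch(cuda_version, capability):
    """Return True if the CUDA toolkit version (e.g. '11.3') natively
    supports the given compute capability, according to the table above."""
    required = FIRST_CUDA_FOR_ARCH.get(capability)
    if required is None:
        raise ValueError(f"capability {capability} is not in the table")
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    # Values from the dump above: CUDA 11.3 and an RTX 4090 (arch=8.9).
    print(toolkit_knows_arch("11.3", (8, 9)))  # False
```

On a live system the two inputs could come from `torch.version.cuda` and `torch.cuda.get_device_capability()`. The sketch only models what the toolkit supports natively; by this table CUDA 11.6 also predates sm_89, so the CUDA 11.6 setup reported later in the thread presumably works because of how that PyTorch build handles the architecture, which is not modeled here.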

This problem arises when I execute train_net.py:

[07/01 22:30:46] d2.engine.train_loop INFO: Starting training from iteration 0
[07/01 22:30:47] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MASKDINO/detectron2-main/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MASKDINO/detectron2-main/detectron2/engine/defaults.py", line 498, in run_step
    self._trainer.run_step()
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MASKDINO/detectron2-main/detectron2/engine/train_loop.py", line 495, in run_step
    loss_dict = self.model(data)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MaskDINO-main/maskdino/maskdino.py", line 267, in forward
    losses = self.criterion(outputs, targets,mask_dict)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MaskDINO-main/maskdino/modeling/criterion.py", line 357, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MaskDINO-main/maskdino/modeling/matcher.py", line 220, in forward
    return self.memory_efficient_forward(outputs, targets, cost)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/anaconda3/envs/maskdino/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/hky/d2bed7b9-228d-4a5b-ae76-f3f34ce12c7b/hky/lsj/code/MaskDINO-main/maskdino/modeling/matcher.py", line 169, in memory_efficient_forward
    cost_mask = batch_sigmoid_ce_loss_jit(out_mask, tgt_mask)
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_neg_add(float* ttargets_1, float* aten_add) {
{
  float v = __ldg(ttargets_1 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
  aten_add[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = (0.f - v) + 1.f;
}
}
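The kernel NVRTC was asked to compile is tiny: each thread computes the flat index `threadIdx.x + 512 * blockIdx.x`, reads `v` from `ttargets_1`, and writes `(0 - v) + 1`, i.e. `1 - targets`. Judging by its name, it is presumably fused by TorchScript from a `1 - targets` expression inside `batch_sigmoid_ce_loss_jit` (an inference, not confirmed by the log). The kernel logic itself is not the problem; the error is NVRTC rejecting the `--gpu-architecture` value. A pure-Python emulation of the index arithmetic, assuming one thread per element and the 512-thread block size hard-coded in the kernel text (`fused_neg_add` here is a hypothetical Python stand-in, not a PyTorch API):

```python
# Pure-Python emulation of the generated `fused_neg_add` kernel.
# Assumption: one thread per element, 512 threads per block, matching the
# hard-coded `512ll * blockIdx.x` offset in the kernel text.
THREADS_PER_BLOCK = 512


def fused_neg_add(ttargets_1):
    """Return aten_add where aten_add[i] = (0 - v) + 1 for v = ttargets_1[i]."""
    n = len(ttargets_1)
    aten_add = [0.0] * n
    num_blocks = (n + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK  # ceil(n / 512)
    for block_idx in range(num_blocks):               # plays the role of blockIdx.x
        for thread_idx in range(THREADS_PER_BLOCK):   # plays the role of threadIdx.x
            i = thread_idx + THREADS_PER_BLOCK * block_idx
            if i >= n:
                break  # bounds check for the emulation only; the kernel text has none
            v = ttargets_1[i]
            aten_add[i] = (0.0 - v) + 1.0
    return aten_add


if __name__ == "__main__":
    print(fused_neg_add([0.0, 1.0, 0.25]))  # [1.0, 0.0, 0.75]
```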

github-actions[bot] commented 4 months ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

Programmer-RD-AI commented 4 months ago

Hi, this issue seems to stem from PyTorch itself. See PyTorch issue #87595: the problem was first reported in 2022, and an update has since been pushed. The following command should install the latest version:

pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/test/cu117/torch_test.html

If there are still any issues, please feel free to comment :)

lsj1111 commented 4 months ago


Yeah, thank you, I have solved this problem. The official detectron2 documentation seemed to suggest that only CUDA 11.3 is supported, so I used CUDA 11.3, which caused the problem above. Later I found that detectron2 also works with CUDA 11.6, and switching to it solved the problem.

Programmer-RD-AI commented 4 months ago

ah ok great :) 👍🏽

facebookresearch / detectron2

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch) #5318
