Closed (HLearning closed this issue 5 years ago)
When you compiled `maskrcnn-benchmark`, you probably didn't have a PyTorch build with CUDA enabled. How did you install `maskrcnn-benchmark`?
I have the same error as HLearning. I built maskrcnn-benchmark with CUDA 9.2 enabled via `python3 setup.py build develop`. I checked `torch.cuda.is_available()` and `CUDA_HOME`, and I get `True` and `/usr/local/cuda-9.2` as expected. But I still get the same error. Where else could the problem be? Thanks.
Are you running in docker? This might be related to https://github.com/facebookresearch/maskrcnn-benchmark/issues/167
No, I followed the Option 1: Step-by-step installation. I saw the thread and was also wondering if it is related to the cuda version (I am running on 9.2)
Can you try uninstalling and reinstalling `maskrcnn-benchmark`? For some reason CUDA was not picked up when you first installed it, I suppose.
Yeah, that's what I thought. I uninstalled it and manually removed all dependencies, but it's still not working after reinstalling. I will try `install` instead of `build develop`. Will keep you posted.
Inside the `setup.py` script, can you print `torch.cuda.is_available()` and `CUDA_HOME`?
I get `True` and `/usr/local/cuda-9.2`, respectively.
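For anyone wanting to run this check outside of `setup.py`: the CUDA toolkit discovery the build relies on can be sketched roughly like this (illustrative only; the real logic lives in `torch.utils.cpp_extension`, and `find_cuda_home` here is a hypothetical helper):

```python
import os

def find_cuda_home():
    """Rough sketch of how the build locates the CUDA toolkit.

    Order (illustrative, mirroring torch.utils.cpp_extension in spirit):
    explicit CUDA_HOME / CUDA_PATH environment variables first, then the
    conventional /usr/local/cuda path. Returning None is the case where
    only the CPU extension gets compiled.
    """
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home is None and os.path.isdir("/usr/local/cuda"):
        cuda_home = "/usr/local/cuda"
    return cuda_home
```

If this returns `None` in the environment where you build, the resulting `_C` extension will not have the GPU ops.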
A normal `install` instead of `build develop` also gives the same error.
@Nacho114 what's the code that you are trying to run? And what is the full error message?
I am running a modified version of maskrcnn-benchmark/tools/train_net.py to run on my custom data loader. (I followed the instructions to make the custom dataset).
The full error message is:
. . .
eight loaded from conv1.weight of shape (64, 3, 7, 7)
2018-11-29 13:38:21,685 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
File "relational_rxn_graphs/detector/train.py", line 227, in <module>
main()
File "relational_rxn_graphs/detector/train.py", line 220, in main
model = train(cfg, data_cfg, args.local_rank, args.distributed)
File "relational_rxn_graphs/detector/train.py", line 71, in train
arguments,
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
loss_dict = model(images, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 479, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 479, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
return self._forward_train(anchors, objectness, rpn_box_regression, targets)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 116, in _forward_train
anchors, objectness, rpn_box_regression, targets
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 479, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 118, in forward_for_single_feature_map
score_field="objectness",
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 27, in boxlist_nms
keep = _box_nms(boxes, score, nms_thresh)
RuntimeError: Not compiled with GPU support (nms at /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb0 (0x1002eb242d70 in /u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0x108 (0x10031db394d8 in /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-35m-powerpc64le-linux-gnu.so)
frame #2: <unknown function> + 0x1a10c (0x10031db4a10c in /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-35m-powerpc64le-linux-gnu.so)
frame #3: <unknown function> + 0x163e8 (0x10031db463e8 in /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-35m-powerpc64le-linux-gnu.so)
<omitting python frames>
The `. . .` is just the weights being imported and printed to stdout.
so, after all, this might not be a docker issue (#167)
Please let us know if you make any progress @Nacho114
@miguelvr Will do. Someone else will install it independently on the same cluster to see if they can get it to work. To be honest, I don't know where else to look in terms of debugging, so if you could point me somewhere else to look, that would be great!
We are running into the same error with Docker (although it works for me on our cluster).
My environment (under Anaconda):
Python: 3.7
CUDA: 9.2
cuDNN: 7.2
PyTorch: 1.0
gcc/g++: 5.5
I have tried a lot of modifications, but the error keeps occurring.
run: python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)"
Output:
RuntimeError: Not compiled with GPU support (nms at /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fd1fcde78d5 in /home/hjl/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7fd1f7dc4564 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x15d05 (0x7fd1f7dd0d05 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x15dfe (0x7fd1f7dd0dfe in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x12c3e (0x7fd1f7dcdc3e in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #63: __libc_start_main + 0xe7 (0x7fd23ded2b97 in /lib/x86_64-linux-gnu/libc.so.6)
I tried running other PyTorch code; CUDA is working there.
The problem has been solved.
If you use Anaconda, activate the env and run:
conda install -c pytorch pytorch-nightly cuda92
@Nacho114 is the solution from @HLearning the right one for you?
Currently reinstalling from scratch (torch included); if that doesn't work, I will see if I can get conda working on the cluster to try the solution proposed by HLearning. Will report back when I'm done.
@Nacho114 is the solution from @HLearning the right one for you?
yes
@Nacho114 one thing to check: verify that the python you use to run `python setup.py build develop` is the same one you use to run your scripts. `which python` should help you there, as should checking the PyTorch version/location in each of the interpreters.
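A quick sanity check along those lines: run this once in the shell where you build and once where you train; the printed paths and versions should match.

```python
import sys

# The interpreter running `python setup.py build develop` must be the same
# one launching train_net.py; otherwise the compiled _C extension lands in
# a different site-packages than the one being imported at runtime.
print(sys.executable)
print(sys.version.split()[0])
```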
I've tried to be meticulous about which python I'm using. After a clean reinstall of everything, I am now getting a new error (progress!):
2018-12-03 14:19:21,321 maskrcnn_benchmark.trainer INFO: Start training
/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "relational_rxn_graphs/detector/train.py", line 227, in <module>
main()
File "relational_rxn_graphs/detector/train.py", line 220, in main
model = train(cfg, data_cfg, args.local_rank, args.distributed)
File "relational_rxn_graphs/detector/train.py", line 71, in train
arguments,
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
loss_dict = model(images, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
x, result, detector_losses = self.roi_heads(features, proposals, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 23, in forward
x, detections, loss_box = self.box(features, proposals, targets)
File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py", line 55, in forward
[class_logits], [box_regression]
File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py", line 144, in __call__
sampled_pos_inds_subset = torch.nonzero(labels > 0).squeeze(1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
At first glance this seems to be a problem on my side, so I will report back if it works after this.
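That device-side assert (`t >= 0 && t < n_classes`) usually means a target label is outside the valid class range, which is cheap to check on the CPU before training. A minimal sketch, assuming you can iterate over your custom dataset's labels (`validate_labels` is a hypothetical helper, not part of maskrcnn-benchmark):

```python
def validate_labels(labels, num_classes):
    # The CUDA loss kernel asserts `t >= 0 && t < n_classes` per target;
    # failing early on the CPU gives a readable message instead of a
    # device-side assert buried in a traceback.
    bad = [t for t in labels if not 0 <= t < num_classes]
    if bad:
        raise ValueError(
            f"labels out of range [0, {num_classes}): {sorted(set(bad))}"
        )

validate_labels([0, 1, 2], num_classes=3)  # passes silently
```

Remember that maskrcnn-benchmark reserves label 0 for background, so `num_classes` must count background plus your object classes.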
This error is a bug in PyTorch that should be fixed in the latest available version. Which version of PyTorch are you running?
torch version = 1.0.0a0+5c89190
Hum, weird. I believe this problem should have been fixed with your version of PyTorch. Can you double check that this is indeed picking this version, and if that's the case open a new issue? The original issue seems to have been fixed.
Will do, thanks.
Hi, I encountered the same error as you did. I reinstalled everything but had no luck. Then I deleted the build/ folder under maskrcnn-benchmark on a hunch and rebuilt the project with setup.py. Now everything works.
This solved my problem. Hope it solves yours too.
Your solution also works for me! @randomwalk10
Glad to hear it! I guess `python setup.py clean` does NOT clean everything, and we have to manually delete build/ in the end, LOL.
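For reference, the manual cleanup amounts to something like this (a sketch; `clean_build` is an illustrative helper, run from the maskrcnn-benchmark checkout before re-running `python setup.py build develop`):

```python
import glob
import os
import shutil

def clean_build(repo="."):
    # `python setup.py clean` leaves these behind, so remove them by hand:
    # the build/ tree and any previously compiled _C extension, which
    # otherwise keep serving a stale CPU-only build.
    shutil.rmtree(os.path.join(repo, "build"), ignore_errors=True)
    for so in glob.glob(os.path.join(repo, "maskrcnn_benchmark", "_C.*.so")):
        os.remove(so)

clean_build()
```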
Just for anybody else creating a docker image of this that runs into this problem -- with an environment with a valid cuda setup that's not being picked up, setting the environment variable FORCE_CUDA to 1 before building/installing the project resolved this issue for me
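To make the FORCE_CUDA trick concrete: setup.py decides whether to compile the CUDA kernels roughly like this (a sketch, not the exact source; `should_build_cuda` is an illustrative name). During `docker build` no GPU is visible even though the image will later run on one, so without the override you silently get a CPU-only extension.

```python
import os

def should_build_cuda(cuda_available, cuda_home):
    # FORCE_CUDA=1 overrides the runtime check. This matters in `docker
    # build`, where torch.cuda.is_available() is False at image-build
    # time even though the container will run with a GPU.
    if os.environ.get("FORCE_CUDA", "0") == "1":
        return True
    return cuda_available and cuda_home is not None
```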
I run into the problem when using pycharm to debug remotely. And in my case the problem is caused by the file SOURCES.txt under the folder maskrcnn_benchmark.egg-info.
This is right! After reinstalling CUDA, you must rebuild!
This is right! After reinstalling CUDA, you must rebuild! Thanks; my English is not good. The problem was actually solved last year. The cause was that I used conda, and the CUDA version installed in conda did not match the one on the Ubuntu system: the project was compiled with the system CUDA but loaded the conda CUDA at runtime, so the extension could never be called. Reinstalling only fixes the problem if the two CUDA versions match.
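A quick way to catch that kind of mismatch is to compare the `nvcc` on your PATH (what the build used) with the CUDA version PyTorch was built against. A sketch, where `nvcc_release` is a hypothetical helper; on the PyTorch side you would compare its result against `torch.version.cuda`:

```python
import subprocess

def nvcc_release():
    # Parse "release X.Y" out of `nvcc --version`. Returns None when no
    # nvcc is on PATH, i.e. no toolkit is visible to the compiler.
    try:
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        ).stdout
    except FileNotFoundError:
        return None
    for line in out.splitlines():
        if "release" in line:
            return line.split("release")[-1].split(",")[0].strip()
    return None
```

If this disagrees with `torch.version.cuda`, the compiled `_C` extension and the runtime will be using different toolkits, exactly the situation described above.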
Thanks, this solved my problem too. I also encountered the "Not compiled with GPU support" problem. I had actually run the demo successfully before, but later I moved the files to another folder, and just following INSTALL.md to rebuild didn't work. At first I also thought the problem originated from CUDA, but after checking my CUDA over and over again I was pretty sure it worked exactly fine. After seeing your answer, I deleted the build folder and rebuilt, and then some magic happened and it works perfectly. So I guess deleting the build folder before rebuilding is pretty important!
❓ Questions and Help
RuntimeError: Not compiled with GPU support (nms at /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fda63bc0915 in /home/hjl/anaconda3/envs/pytorch1.0/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7fda5ee41954 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x14e1d (0x7fda5ee4de1d in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x12291 (0x7fda5ee4b291 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)