facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License

Not compiled with GPU support #230

Closed HLearning closed 5 years ago

HLearning commented 5 years ago

❓ Questions and Help

RuntimeError: Not compiled with GPU support (nms at /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fda63bc0915 in /home/hjl/anaconda3/envs/pytorch1.0/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7fda5ee41954 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: + 0x14e1d (0x7fda5ee4de1d in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: + 0x12291 (0x7fda5ee4b291 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)

frame #62: __libc_start_main + 0xe7 (0x7fdaa4e11b97 in /lib/x86_64-linux-gnu/libc.so.6)

fmassa commented 5 years ago

When you compiled maskrcnn-benchmark, you probably didn't have a PyTorch with CUDA enabled. How did you install maskrcnn-benchmark?

Nacho114 commented 5 years ago

I have the same error as HLearning. I built maskrcnn-benchmark with CUDA 9.2 enabled using python3 setup.py build develop. I checked torch.cuda.is_available() and CUDA_HOME, and I get True and /usr/local/cuda-9.2 as expected, but I still get the same error. Where else could the problem be? Thanks.

fmassa commented 5 years ago

Are you running in docker? This might be related to https://github.com/facebookresearch/maskrcnn-benchmark/issues/167

Nacho114 commented 5 years ago

No, I followed Option 1: Step-by-step installation. I saw that thread and was also wondering whether this is related to the CUDA version (I am running 9.2).

fmassa commented 5 years ago

Can you try uninstalling and reinstalling maskrcnn-benchmark? For some reason CUDA was not picked up when you first installed it, I suppose.

Nacho114 commented 5 years ago

Yeah that's what I thought, I uninstalled it and manually removed all dependencies. It's still not working after trying to reinstall.

I will try installing with python setup.py install instead of build develop. Will keep you posted.

fmassa commented 5 years ago

Inside the setup.py script, can you print torch.cuda.is_available() and CUDA_HOME?
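
A minimal sketch of that check, assuming it is run with the same interpreter that runs setup.py. `torch.cuda.is_available()` and `CUDA_HOME` (imported from `torch.utils.cpp_extension`) are the two values asked for; the helper name `cuda_build_report` is hypothetical and not part of maskrcnn-benchmark.

```python
def cuda_build_report():
    """Collect the values that decide whether a CUDA extension gets built.

    The torch-related entries stay None when PyTorch cannot be imported
    in the current interpreter.
    """
    report = {"torch_version": None, "cuda_available": None, "cuda_home": None}
    try:
        import torch
        from torch.utils.cpp_extension import CUDA_HOME
    except ImportError:
        return report
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_home"] = CUDA_HOME
    return report


if __name__ == "__main__":
    for key, value in cuda_build_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False or `cuda_home` is None at build time, the extension would presumably be compiled without its CUDA sources, which matches the "Not compiled with GPU support" message.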

Nacho114 commented 5 years ago

I get True and /usr/local/cuda-9.2, respectively.

Normal install instead of build develop also gives the same error.

fmassa commented 5 years ago

@Nacho114 what's the code that you are trying to run? And what is the full error message?

Nacho114 commented 5 years ago

I am running a modified version of maskrcnn-benchmark/tools/train_net.py to run on my custom data loader. (I followed the instructions to make the custom dataset).

The full error message is:

. . .
eight                  loaded from conv1.weight                 of shape (64, 3, 7, 7)
2018-11-29 13:38:21,685 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "relational_rxn_graphs/detector/train.py", line 227, in <module>
    main()
  File "relational_rxn_graphs/detector/train.py", line 220, in main
    model = train(cfg, data_cfg, args.local_rank, args.distributed)
  File "relational_rxn_graphs/detector/train.py", line 71, in train
    arguments,
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 116, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 118, in forward_for_single_feature_map
    score_field="objectness",
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 27, in boxlist_nms
    keep = _box_nms(boxes, score, nms_thresh)
RuntimeError: Not compiled with GPU support (nms at /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb0 (0x1002eb242d70 in /u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0x108 (0x10031db394d8 in /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-35m-powerpc64le-linux-gnu.so)
frame #2: <unknown function> + 0x1a10c (0x10031db4a10c in /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-35m-powerpc64le-linux-gnu.so)
frame #3: <unknown function> + 0x163e8 (0x10031db463e8 in /ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-35m-powerpc64le-linux-gnu.so)
<omitting python frames>
Nacho114 commented 5 years ago

The ". . ." is just the weights being loaded and printed to stdout.

miguelvr commented 5 years ago

So, after all, this might not be a Docker issue (#167).

Please let us know if you make any progress @Nacho114

Nacho114 commented 5 years ago

@miguelvr Will do, someone else will install it independently on the same cluster to see if they can get it to work. To be honest I do not know where else to look in terms of debugging. So if you could point out to me where else to look that would be great!

miguelvr commented 5 years ago

We are running into the same error with Docker (although it works for me on our cluster).

HLearning commented 5 years ago

My environment (under Anaconda):

Python: 3.7
CUDA: 9.2
cuDNN: 7.2
PyTorch: 1.0
g++, gcc: 5.5

I have tried many modifications, but the error always occurs.

Run:

python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)"

Output:

RuntimeError: Not compiled with GPU support (nms at /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fd1fcde78d5 in /home/hjl/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7fd1f7dc4564 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x15d05 (0x7fd1f7dd0d05 in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x15dfe (0x7fd1f7dd0dfe in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x12c3e (0x7fd1f7dcdc3e in /home/hjl/PyTorch_MaskRcnn/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #63: __libc_start_main + 0xe7 (0x7fd23ded2b97 in /lib/x86_64-linux-gnu/libc.so.6)

I tried running other PyTorch code, and CUDA works there.

HLearning commented 5 years ago

The problem has been solved. If you use Anaconda, activate your environment and run:

conda install -c pytorch pytorch-nightly cuda92

fmassa commented 5 years ago

@Nacho114 is the solution from @HLearning the right one for you?

Nacho114 commented 5 years ago

Currently reinstalling from scratch (torch included). If that does not work, I will see if I can get conda working on the cluster to try the solution proposed by HLearning. Will report back when I'm done.

HLearning commented 5 years ago

@Nacho114 is the solution from @HLearning the right one for you?

yes

fmassa commented 5 years ago

@Nacho114 one thing to check: verify that the python you use to run python setup.py build develop is the same one you use to run your scripts.

which python

should help you there, as well as checking the PyTorch version and location in each of the interpreters.
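
As a rough sketch of that comparison: the snippet below reports which interpreter is running and which PyTorch it would import. Running it once with the interpreter used for the build and once with the one used for training should give identical output. The helper name `interpreter_report` is hypothetical.

```python
import sys


def interpreter_report():
    """Describe the running interpreter and the PyTorch it would import."""
    info = {
        "executable": sys.executable,
        "torch_version": None,
        "torch_location": None,
    }
    try:
        import torch
    except ImportError:
        # No PyTorch visible from this interpreter at all.
        return info
    info["torch_version"] = torch.__version__
    info["torch_location"] = torch.__file__
    return info


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

A difference in `executable` or `torch_location` between the two runs would suggest that the extension was built against one PyTorch install and is being loaded by another.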

Nacho114 commented 5 years ago

I've tried to be meticulous about which python I'm using. After a clean reinstall of everything, I am getting a new error (good!):

2018-12-03 14:19:21,321 maskrcnn_benchmark.trainer INFO: Start training
/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "relational_rxn_graphs/detector/train.py", line 227, in <module>
    main()
  File "relational_rxn_graphs/detector/train.py", line 220, in main
    model = train(cfg, data_cfg, args.local_rank, args.distributed)
  File "relational_rxn_graphs/detector/train.py", line 71, in train
    arguments,
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 23, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py", line 55, in forward
    [class_logits], [box_regression]
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py", line 144, in __call__
    sampled_pos_inds_subset = torch.nonzero(labels > 0).squeeze(1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered

At first glance this seems to be a problem on my side, so I will report back if it works after this.
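
The device-side assertion `t >= 0 && t < n_classes` in the log fires when a target class index falls outside the range [0, n_classes). Whether that is the cause here is not established in this thread, but with a custom dataset it can be checked on the CPU before training. A minimal sketch, assuming the labels are available as plain Python integers; `find_out_of_range_labels` is a hypothetical helper, not a maskrcnn-benchmark function.

```python
def find_out_of_range_labels(labels, n_classes):
    """Return (index, label) pairs that violate 0 <= label < n_classes."""
    return [(i, t) for i, t in enumerate(labels) if not (0 <= t < n_classes)]


# Example: with 3 classes (0, 1, 2), the labels 3 and -1 are invalid.
bad = find_out_of_range_labels([0, 2, 3, 1, -1], n_classes=3)
print(bad)  # [(2, 3), (4, -1)]
```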

fmassa commented 5 years ago

This error is a bug in PyTorch that should already be fixed in the latest available version. Which version of PyTorch are you running?

Nacho114 commented 5 years ago

torch version = 1.0.0a0+5c89190

fmassa commented 5 years ago

Hum, weird. I believe this problem should have been fixed in your version of PyTorch. Can you double-check that it is indeed picking up this version and, if that's the case, open a new issue? The original issue seems to have been fixed.

Nacho114 commented 5 years ago

Will do, thanks.

randomwalk10 commented 5 years ago

Hi @Nacho114, I encountered the same error as you did. I reinstalled everything but had no luck. Then I just deleted the "build/" folder under maskrcnn_benchmark and rebuilt the project with setup.py. Now everything works.

This solved my problem. Hope it solves yours too.
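
A small sketch of that cleanup step, assuming the stale artifacts live in a `build` directory directly under the checkout root. The helper name `remove_build_dir` and the `repo_root` argument are hypothetical.

```python
import shutil
from pathlib import Path


def remove_build_dir(repo_root):
    """Delete <repo_root>/build if it exists.

    Returns True when a directory was removed, False when there was
    nothing to remove.
    """
    build_dir = Path(repo_root) / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        return True
    return False
```

After calling it on the checkout, rerun python setup.py build develop so the extension is compiled from scratch rather than reusing old object files.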

KleinXin commented 5 years ago

Your solution also works for me, @randomwalk10!

randomwalk10 commented 5 years ago

Glad to hear! I guess "python setup.py clean" does NOT clean everything and we have to manually delete "build/" in the end LOL.

steven-s commented 5 years ago

Just for anybody else who runs into this problem while creating a Docker image: if the environment has a valid CUDA setup that is not being picked up, setting the environment variable FORCE_CUDA to 1 before building/installing the project resolved this issue for me.
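
A sketch of that workaround driven from Python, assuming the build is launched as a subprocess. FORCE_CUDA is the variable named above; the helper name `env_with_forced_cuda` is hypothetical.

```python
import os


def env_with_forced_cuda(base_env=None):
    """Return a copy of the environment with FORCE_CUDA set to "1".

    The original mapping is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FORCE_CUDA"] = "1"
    return env


# Hypothetical usage from the repository root (not executed here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "setup.py", "build", "develop"],
#                  env=env_with_forced_cuda(), check=True)
```

In a Dockerfile the same effect would normally come from setting the variable before the build step, since no GPU is visible while the image is being built.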

yangjzx commented 5 years ago

I ran into this problem when debugging remotely with PyCharm. In my case, the problem was caused by the file SOURCES.txt in the maskrcnn_benchmark.egg-info folder.

ghost commented 4 years ago

@randomwalk10 This is right! After reinstalling CUDA, we must rebuild!

HLearning commented 4 years ago

Thanks. My English is not good, so I wrote this in Chinese. The problem was actually solved last year. The cause was that I was using conda, and the CUDA version installed in conda did not match the one installed on the Ubuntu system. The build used the system CUDA, while at run time the conda CUDA was used, so the GPU code could never be called. Reinstalling only solves the problem if the two CUDA versions match.
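
One way to spot a mismatch between the CUDA toolkit used for the build and the CUDA version PyTorch runs against is to compare the release reported by nvcc with `torch.version.cuda`. The sketch below covers only the parsing and comparison, so it runs without a GPU. The function names are hypothetical, and the sample string follows the usual `nvcc --version` output format.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def cuda_versions_match(build_version, runtime_version):
    """Compare two CUDA versions on major.minor only (e.g. '9.2' vs '9.2.148')."""
    def major_minor(version):
        return tuple(version.split(".")[:2])
    return major_minor(build_version) == major_minor(runtime_version)


sample = "Cuda compilation tools, release 9.2, V9.2.148"
print(parse_nvcc_release(sample))             # 9.2
print(cuda_versions_match("9.2", "9.0.176"))  # False
```

In practice the first argument would come from running nvcc under CUDA_HOME and the second from `torch.version.cuda` in the interpreter used for training.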

ResearcherYan commented 3 years ago

@randomwalk10 Thanks, this solved my problem too. I also ran into the "Not compiled with GPU" problem. I had actually run the demo successfully before, but later I moved the files to another folder, and just following INSTALL.MD to rebuild did not work. At first I also thought the problem came from CUDA, but after checking my CUDA setup over and over again, I was pretty sure it worked fine. After seeing your answer, I deleted the build folder and rebuilt. Then some magic happened and it works perfectly. So I guess deleting the build folder before rebuilding is pretty important!