facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License

returned non-zero exit status 1. RuntimeError: _th_or not supported on CUDAType for Bool #1172

Open · MickeyLQ opened this issue 4 years ago

MickeyLQ commented 4 years ago

Traceback (most recent call last):
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py", line 201, in <module>
    main()
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py", line 94, in train
    arguments,
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 72, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 85, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/data/datasets/coco.py", line 94, in __getitem__
    target = target.clip_to_image(remove_empty=True)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/structures/bounding_box.py", line 223, in clip_to_image
    return self[keep]
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/structures/bounding_box.py", line 208, in __getitem__
    bbox.add_field(k, v[item])
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/structures/segmentation_mask.py", line 555, in __getitem__
    selected_instances = self.instances.__getitem__(item)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/structures/segmentation_mask.py", line 464, in __getitem__
    selected_polygons.append(self.polygons[i])
IndexError: list index out of range

index created!
Traceback (most recent call last):
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py", line 201, in <module>
    main()
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py", line 94, in train
    arguments,
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 84, in do_train
    loss_dict = model(images, targets)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 26, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py", line 43, in forward
    proposals = self.loss_evaluator.subsample(proposals, targets)
  File "/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py", line 111, in subsample
    img_sampled_inds = torch.nonzero(pos_inds_img | neg_inds_img).squeeze(1)
RuntimeError: _th_or not supported on CUDAType for Bool

[The same RuntimeError traceback is repeated twice more.]

Traceback (most recent call last):
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/ai/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/ai/anaconda3/envs/maskrcnn_benchmark/bin/python', '-u', '/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/tools/train_net.py', '--local_rank=0', '--config-file', '/media/ai/fcb4c527-7fb0-41df-b900-75a0d4a92991/lq/maskrcnn-benchmark/configs/e2e_mask_rcnn_X_101_32x8d_FPN_1x.yaml']' returned non-zero exit status 1.

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti

Nvidia driver version: 390.25
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.15.0
[pip3] numpydoc==0.7.0
[pip3] torch==1.1.0
[pip3] torchfile==0.1.0
[pip3] torchnet==0.0.5.1
[pip3] torchvision==0.3.0
[conda] blas                1.0                    mkl
[conda] mkl                 2019.4                 243
[conda] mkl-service         2.3.0                  py37he904b0f_0
[conda] mkl_fft             1.0.15                 py37ha843d7b_0
[conda] mkl_random          1.1.0                  py37hd6b4f25_0
[conda] pytorch             1.1.0                  py3.7_cuda9.0.176_cudnn7.5.1_0    pytorch
[conda] pytorch-nightly     1.0.0.dev20190328      py3.7_cuda9.0.176_cudnn7.4.2_0    pytorch
[conda] torchvision         0.3.0                  py37_cu9.0.176_1                  pytorch

How can I fix this?

NuaaWill commented 4 years ago

I have the same problem. Please let me know if you manage to solve it. Thank you

lizhimll commented 4 years ago

Me too. The same problem has been troubling me for days.

lizhimll commented 4 years ago

This is my traceback:

Traceback (most recent call last):
  File "/home/ok/anaconda3/envs/pytorchm/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ok/anaconda3/envs/pytorchm/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ok/anaconda3/envs/pytorchm/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/home/ok/anaconda3/envs/pytorchm/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ok/anaconda3/envs/pytorchm/bin/python', '-u', 'tools/train_net.py', '--local_rank=3', '--config-file', 'config/e2e_mask_rcnn_R_50_C4_1x.yaml']' returned non-zero exit status 1.

NuaaWill commented 4 years ago

> subprocess.CalledProcessError: Command '['/home/ok/anaconda3/envs/pytorchm/bin/python', '-u', 'tools/train_net.py', '--local_rank=3', '--config-file', 'config/e2e_mask_rcnn_R_50_C4_1x.yaml']' returned non-zero exit status 1. This is my traceback

I've solved the problem and started training. You'll need to reinstall, upgrading PyTorch to 1.3; the corresponding CUDA version and driver also need to be updated, but that's fairly quick.

MickeyLQ commented 4 years ago

> I've solved the problem and started training. You'll need to reinstall, upgrading PyTorch to 1.3; the corresponding CUDA version and driver also need to be updated, but that's fairly quick.

But I can't update CUDA. What should I do if I have to stay on CUDA 9?

NuaaWill commented 4 years ago

> But I can't update CUDA. What should I do if I have to stay on CUDA 9?

You can try patching the offending code in the Python sources (for example, the few lines in loss.py), but in practice several CUDA versions can coexist on one machine, and you can switch between them by changing the symbolic link.
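For anyone who wants to go the patching route, here is a rough, self-contained sketch of the kind of local change being suggested for the line that appears in the traceback (loss.py line 111). The tensor values below are stand-ins for the sampler output, not the repo's actual data:

```python
import torch

# The _th_or failure only shows up on the GPU under PyTorch 1.1.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-ins for the per-image sampler masks; in the repo these come from
# fg_bg_sampler and are torch.bool tensors on recent commits.
pos_inds_img = torch.tensor([1, 0, 0, 1], dtype=torch.bool, device=device)
neg_inds_img = torch.tensor([0, 1, 0, 0], dtype=torch.bool, device=device)

# Original line from loss.py, which raises
# "RuntimeError: _th_or not supported on CUDAType for Bool" on PyTorch 1.1 + CUDA:
# img_sampled_inds = torch.nonzero(pos_inds_img | neg_inds_img).squeeze(1)

# One possible local patch: cast the masks to uint8 before combining them,
# so the OR takes the older kernel that is implemented on CUDA.
img_sampled_inds = torch.nonzero(
    pos_inds_img.to(torch.uint8) | neg_inds_img.to(torch.uint8)
).squeeze(1)
print(img_sampled_inds)  # tensor([0, 1, 3])
```

This only works around that one line; the same pattern would have to be applied anywhere else a bool mask is combined or used for indexing on the GPU.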

MickeyLQ commented 4 years ago

> You can try patching the offending code in the Python sources (for example, the few lines in loss.py), but in practice several CUDA versions can coexist on one machine, and you can switch between them by changing the symbolic link.

Thanks, but I can't change the driver (for some reasons), and that's the newest CUDA version I can use.

MickeyLQ commented 4 years ago

> I have the same problem. Please let me know if you manage to solve it. Thank you

See https://github.com/facebookresearch/maskrcnn-benchmark/issues/1172#issuecomment-562123695. He fixed it by upgrading PyTorch to 1.3. Maybe you can try that.

RadiantJeral commented 4 years ago

I hit this problem with torch_nightly==1.0, and I fixed it in #1182.

lmomoy commented 4 years ago

Has anyone solved this problem without upgrading PyTorch?

hongfz16 commented 4 years ago

> Has anyone solved this problem without upgrading PyTorch?

Just replace every torch.bool with torch.uint8 in modeling/balanced_positive_negative_sampler.py and in structures/segmentation_mask.py. That works for me (PyTorch 1.1.0 and CUDA 9.0).
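To make the suggestion concrete, here is a minimal sketch of the dtype swap, assuming PyTorch 1.1; the variable names are illustrative and not a verbatim diff of the two files:

```python
import torch

# The unsupported _th_or kernel is only hit on CUDA; on CPU both variants work.
device = "cuda" if torch.cuda.is_available() else "cpu"

matched_idxs = torch.zeros(8, dtype=torch.int64, device=device)

# Before: creating the sampling masks as torch.bool means they later get
# combined with `|`, which falls into the unimplemented Bool kernel.
# pos_mask = torch.zeros_like(matched_idxs, dtype=torch.bool)
# neg_mask = torch.zeros_like(matched_idxs, dtype=torch.bool)

# After: keeping the masks as torch.uint8 stays on the old, implemented path.
pos_mask = torch.zeros_like(matched_idxs, dtype=torch.uint8)
neg_mask = torch.zeros_like(matched_idxs, dtype=torch.uint8)
pos_mask[:2] = 1
neg_mask[5:] = 1

sampled = torch.nonzero(pos_mask | neg_mask).squeeze(1)
print(sampled)  # indices 0, 1, 5, 6, 7
```

The segmentation_mask.py change matters for the same reason: the keep mask from clip_to_image ends up indexing the polygon lists there, which is presumably where the IndexError in the first traceback comes from.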

Ruolingdeng commented 4 years ago

@hongfz16 The solution you described works for me, thank you very much.