facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License

copy_if failed to synchronize: device-side assert triggered #733

Open chengruizhe opened 5 years ago

chengruizhe commented 5 years ago

❓ Questions and Help

I was training a customized module in fbnet and encountered the following error:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [104,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [105,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [106,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [107,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "tools/train_net.py", line 186, in <module>
    main()
  File "tools/train_net.py", line 179, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 85, in train
    arguments,
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 67, in do_train
    loss_dict = model(images, targets)
  File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 159, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 175, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 140, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 115, in forward_for_single_feature_map
    boxlist = remove_small_boxes(boxlist, self.min_size)
  File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 46, in remove_small_boxes
    (ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered

Any idea on what this error is caused by? Thanks in advance!

AlenUbuntu commented 5 years ago

Same Error Here.

MelonEater commented 5 years ago

Same here.

chengruizhe commented 5 years ago

I think this could be related to the learning rate. Try using a smaller one.

AlenUbuntu commented 5 years ago

Yes, if I use a smaller learning rate, the issue disappears.
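
One plausible reading of why a smaller learning rate helps (an assumption, not something confirmed in this thread): with a large step size the losses diverge to inf or NaN, and the resulting predictions then push an indexing kernel out of bounds. A minimal sketch of a guard that reports the first non-finite loss before the device-side assert is reached. `find_non_finite` is a hypothetical helper, and `loss_dict` is assumed to map loss names to scalars that `float()` accepts (plain numbers or 0-dim tensors):

```python
import math


def find_non_finite(loss_dict):
    """Return the names of losses whose value is NaN or +/-inf, sorted."""
    bad = []
    for name, value in loss_dict.items():
        # float() accepts Python numbers and 0-dim tensors alike.
        if not math.isfinite(float(value)):
            bad.append(name)
    return sorted(bad)


# Illustrative values only, not taken from a real run.
example = {"loss_objectness": 0.31, "loss_rpn_box_reg": float("nan")}
print(find_non_finite(example))
```

In a training loop this could be called right after `loss_dict = model(images, targets)`, stopping or skipping the step when the returned list is non-empty. Note that converting a GPU tensor with `float()` forces a synchronization, so it is better suited to debugging runs than to production training.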

Jayis commented 5 years ago

Same error.

If the problem happens during training, a smaller learning rate helps, but I also hit this error once during testing. At test time it can't be fixed by lowering the learning rate, right?

I tried to debug it, but I can't even access the "boxlist": a runtime error is raised as soon as I try to print it.

I'd really like to know whether anybody has a solution other than just "using a smaller learning rate".
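
The traceback location can be misleading here. CUDA kernels are launched asynchronously, so an out-of-bounds index is typically reported at a later synchronization point (in the trace above, the mask in `remove_small_boxes`) rather than at the operation that caused it. Once the assert has fired, later CUDA calls in the same process fail too, which would explain why printing "boxlist" raises. A minimal sketch, assuming the entry script can be edited at its very top: setting the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized makes kernel launches synchronous, so the traceback should point at the offending call. `enable_blocking_launches` is a hypothetical helper name:

```python
import os


def enable_blocking_launches(env=os.environ):
    """Request synchronous CUDA kernel launches for this process.

    Must run before the first CUDA call (safest: before importing torch).
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_blocking_launches()
# import torch  # import the framework only after the variable is set
```

The same effect is available from the shell by prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1`. Where the model can run on CPU, doing so is another way to get an ordinary Python `IndexError` at the exact line.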

Sreehari-S commented 5 years ago

@Jayis

I'm encountering the same issue. Were you able to solve it?

jiushishuai88 commented 4 years ago

I encountered this issue when I set the wrong number of classes. It was solved by correcting the output num_class.
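
A wrong class count fits the assertion in the original report: a label equal to or larger than the number of classes indexes past the end of a tensor. A minimal sketch of a check to run over the dataset before training, assuming labels are integers with 0 reserved for background and that the configured count includes background (in maskrcnn-benchmark this is presumably the `MODEL.ROI_BOX_HEAD.NUM_CLASSES` setting; treat that as an assumption). `check_labels` is a hypothetical helper:

```python
def check_labels(labels, num_classes):
    """Return the distinct labels that fall outside [0, num_classes), sorted."""
    return sorted({int(l) for l in labels if not 0 <= int(l) < num_classes})


# Illustrative: 3 foreground classes plus background gives num_classes = 4,
# so label 4 is out of range.
print(check_labels([1, 2, 3, 4, 1], num_classes=4))
```

An empty result means every label fits the configured class count. Any value returned points at either a mislabeled annotation or a class count that is too small.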