donnyyou / torchcv

TorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision
https://pytorchcv.com
Apache License 2.0

cudaGetLastError() == cudaSuccess ASSERT FAILED #59

Open mqchen1993 opened 5 years ago

mqchen1993 commented 5 years ago

2019-06-12 06:42:21,014 INFO [module_helper.py, 138] Loading pretrained model:/tmp/cars_segmentation/torchcv/pretrained_models/3x3resnet101-imagenet.pth
2019-06-12 06:42:28,858 INFO [controller.py, 28] Training start...
Traceback (most recent call last):
  File "main.py", line 199, in <module>
    Controller.train(runner)
  File "/tmp/cars_segmentation/torchcv/methods/tools/controller.py", line 40, in train
    runner.train()
  File "/tmp/cars_segmentation/torchcv/methods/seg/fcn_segmentor.py", line 85, in train
    out_dict = self.seg_net(data_dict)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/cars_segmentation/torchcv/models/seg/nets/pspnet.py", line 84, in forward
    x = self.backbone(data_dict['img'])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/cars_segmentation/torchcv/models/backbones/resnet/resnet_backbone.py", line 94, in forward
    x = self.prefix(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/cars_segmentation/torchcv/extensions/ops/sync_bn/syncbn.py", line 44, in forward
    xsum, xsqsum = sum_square(input)
  File "/tmp/cars_segmentation/torchcv/extensions/ops/sync_bn/functions.py", line 19, in sum_square
    return _sum_square.apply(input)
  File "/tmp/cars_segmentation/torchcv/extensions/ops/sync_bn/functions.py", line 27, in forward
    xsum, xsqusum = gpu.sumsquare_forward(input)
RuntimeError: cudaGetLastError() == cudaSuccess ASSERT FAILED at syncbn_kernel.cu:263, please report a bug to PyTorch. (Sum_Square_Forward_CUDA at syncbn_kernel.cu:263)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6cd6bd1441 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6cd6bd0d7a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: Sum_Square_Forward_CUDA(at::Tensor) + 0x281 (0x7f6cbdb615e2 in /tmp/cars_segmentation/torchcv/extensions/ops/sync_bn/src/gpu/syncbn_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x1fc29 (0x7f6cbdb57c29 in /tmp/cars_segmentation/torchcv/extensions/ops/sync_bn/src/gpu/syncbn_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x24095 (0x7f6cbdb5c095 in /tmp/cars_segmentation/torchcv/extensions/ops/sync_bn/src/gpu/syncbn_gpu.cpython-36m-x86_64-linux-gnu.so)

I get this error as soon as training starts. Could you help me solve it? Thanks.

donnyyou commented 5 years ago

See https://github.com/donnyyou/torchcv/issues/52#issuecomment-497177002. Replace the BN type in your config with the encoding syncbn.
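
For readers who keep their settings in a JSON-style config, swapping the BN type could be sketched roughly as below. This is only an illustration, not torchcv code: the key path `network.bn_type`, the helper `set_bn_type`, and both placeholder values are assumptions. Check your own config file and the linked comment for the real key and the exact name of the encoding syncbn option.

```python
import copy
import json


def set_bn_type(config, new_bn_type, key_path=("network", "bn_type")):
    """Return a copy of `config` with the value at `key_path` replaced.

    `key_path` is a hypothetical location for the BN setting; adjust it
    to wherever your config actually stores the batch-norm type.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    node = updated
    for key in key_path[:-1]:
        # Walk down the nested dicts, creating levels that are missing.
        node = node.setdefault(key, {})
    node[key_path[-1]] = new_bn_type
    return updated


# Hypothetical config fragment; both values are placeholders, not real option names.
config = {"network": {"bn_type": "CURRENT_BN_TYPE"}}
patched = set_bn_type(config, "ENCODING_SYNCBN_TYPE")
print(json.dumps(patched, indent=2))
```

The helper returns a modified copy rather than editing in place, so the original settings stay available if the new BN type does not resolve the error.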