YiwuZhong closed this issue 4 years ago.
It is a PyTorch bug that has been fixed in https://github.com/pytorch/pytorch/issues/35202. The training has diverged and triggered this bug; a smaller learning rate would help.
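For reference, a minimal sketch of lowering the learning rate in detectron2 (SOLVER.BASE_LR, SOLVER.WARMUP_ITERS, and the SOLVER.CLIP_GRADIENTS keys are detectron2's standard config options; the config path and the concrete values below are illustrative assumptions, not a prescribed fix):

# Sketch: reduce the learning rate and add guards against divergence.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/my_config.yaml")  # hypothetical config path
cfg.SOLVER.BASE_LR = 0.0025                # e.g. 1/4 of a typical 0.01 base LR
cfg.SOLVER.WARMUP_ITERS = 2000             # a longer warmup can also reduce early divergence
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True   # gradient clipping is another common guard
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0

The same overrides can also be passed on the command line, e.g. appending SOLVER.BASE_LR 0.0025 to the train command.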
@ppwwyyxx Hi, I am running into the same problem.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=77 : an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f0eb3f60536 in /miniconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f0eb41a3fbe in /miniconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f0eb3f50abd in /miniconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x522a12 (0x7f0ef2bb6a12 in /miniconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x522ab6 (0x7f0ef2bb6ab6 in /miniconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x19dfce (0x556fd8543fce in /miniconda/bin/python)
frame #6: <unknown function> + 0x113a6b (0x556fd84b9a6b in /miniconda/bin/python)
frame #7: <unknown function> + 0x113682 (0x556fd84b9682 in /miniconda/bin/python)
frame #8: <unknown function> + 0x113bc7 (0x556fd84b9bc7 in /miniconda/bin/python)
frame #9: <unknown function> + 0x113bc7 (0x556fd84b9bc7 in /miniconda/bin/python)
frame #10: <unknown function> + 0x103948 (0x556fd84a9948 in /miniconda/bin/python)
frame #11: <unknown function> + 0x114267 (0x556fd84ba267 in /miniconda/bin/python)
frame #12: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #13: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #14: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #15: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #16: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #17: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #18: <unknown function> + 0x11427d (0x556fd84ba27d in /miniconda/bin/python)
frame #19: <unknown function> + 0xfc157 (0x556fd84a2157 in /miniconda/bin/python)
frame #20: <unknown function> + 0xfc1c3 (0x556fd84a21c3 in /miniconda/bin/python)
frame #21: <unknown function> + 0xfc146 (0x556fd84a2146 in /miniconda/bin/python)
frame #22: <unknown function> + 0x1d0f13 (0x556fd8576f13 in /miniconda/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x2a2a (0x556fd85695ea in /miniconda/bin/python)
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x556fd850668b in /miniconda/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x6a0 (0x556fd8567260 in /miniconda/bin/python)
frame #26: _PyFunction_FastCallKeywords + 0xfb (0x556fd850668b in /miniconda/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x416 (0x556fd8566fd6 in /miniconda/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x2f9 (0x556fd84c06f9 in /miniconda/bin/python)
frame #29: _PyFunction_FastCallKeywords + 0x387 (0x556fd8506917 in /miniconda/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x14e6 (0x556fd85680a6 in /miniconda/bin/python)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x556fd84c06f9 in /miniconda/bin/python)
frame #32: PyEval_EvalCodeEx + 0x44 (0x556fd84c15f4 in /miniconda/bin/python)
frame #33: PyEval_EvalCode + 0x1c (0x556fd84c161c in /miniconda/bin/python)
frame #34: <unknown function> + 0x21c974 (0x556fd85c2974 in /miniconda/bin/python)
frame #35: PyRun_StringFlags + 0x7d (0x556fd85cdbdd in /miniconda/bin/python)
frame #36: PyRun_SimpleStringFlags + 0x3f (0x556fd85cdc3f in /miniconda/bin/python)
frame #37: <unknown function> + 0x227d3d (0x556fd85cdd3d in /miniconda/bin/python)
frame #38: _Py_UnixMain + 0x3c (0x556fd85ce0bc in /miniconda/bin/python)
frame #39: __libc_start_main + 0xe7 (0x7f0ef58b8b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #40: <unknown function> + 0x1d0990 (0x556fd8576990 in /miniconda/bin/python)
NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:69, unhandled cuda error, NCCL version 2.4.8
Traceback (most recent call last):
File "tools/train_net3.py", line 263, in <module>
args=(args,),
File "/miniconda/lib/python3.7/site-packages/detectron2/engine/launch.py", line 59, in launch
daemon=False,
File "/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/miniconda/lib/python3.7/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/mnt/task_runtime/tools/train_net3.py", line 234, in main
return trainer.train()
File "/mnt/task_runtime/tools/train_net3.py", line 116, in train
self.train_loop(self.start_iter, self.max_iter)
File "/mnt/task_runtime/tools/train_net3.py", line 105, in train_loop
self.run_step()
File "/miniconda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 226, in run_step
loss_dict = self.model(data)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/task_runtime/adet/modeling/one_stage_detector.py", line 123, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/task_runtime/adet/modeling/roi_heads/text_head.py", line 163, in forward
preds, rec_loss = self.recognizer(bezier_features, targets)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/task_runtime/adet/modeling/roi_heads/attn_predictor.py", line 130, in forward
decoder_input, decoder_hidden, rois)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/task_runtime/adet/modeling/roi_heads/attn_predictor.py", line 88, in forward
output = torch.cat((embedded, attn_applied.squeeze(1)), 1)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
I also set a smaller learning rate (0.000001), but the problem still exists. How can I fix it?
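Because CUDA kernels launch asynchronously, the call site in the traceback (torch.cat here) is often not the op that actually faulted. One standard way to localize the real failing kernel is to force synchronous launches with the CUDA_LAUNCH_BLOCKING environment variable; a minimal sketch (the variable must be set before any CUDA context is created):

# Sketch: make CUDA errors synchronous so the traceback points at the failing op.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # import torch only after setting the variable

Equivalently, from the shell: CUDA_LAUNCH_BLOCKING=1 python tools/train_net3.py ... With synchronous launches, the traceback shows the kernel that really failed, which helps distinguish a genuine kernel bug from divergence-induced bad inputs.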
Instructions To Reproduce the Bug:
What changes you made (git diff) or what code you wrote: used the default code and the COCO dataset.
What exact command you run: from the PointRend folder,
python train_net.py --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco.yaml --num-gpus 2 SOLVER.IMS_PER_BATCH 2 SOLVER.CLIP_GRADIENTS.ENABLED True
What you observed (including full logs):
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1587428094786/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f494e1f9b5e in /home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7f494dfb4e30 in /home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f494e1e76ed in /home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5191fb (0x7f497b20f1fb in /home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1c10d9 (0x5602cb4630d9 in /home/user/.conda/envs/det2/bin/python)
frame #5: <unknown function> + 0x1282a6 (0x5602cb3ca2a6 in /home/user/.conda/envs/det2/bin/python)
frame #6: <unknown function> + 0x127f23 (0x5602cb3c9f23 in /home/user/.conda/envs/det2/bin/python)
frame #7: <unknown function> + 0x128696 (0x5602cb3ca696 in /home/user/.conda/envs/det2/bin/python)
frame #8: <unknown function> + 0x128696 (0x5602cb3ca696 in /home/user/.conda/envs/det2/bin/python)
frame #9: <unknown function> + 0x11c2a0 (0x5602cb3be2a0 in /home/user/.conda/envs/det2/bin/python)
frame #10: <unknown function> + 0x128a56 (0x5602cb3caa56 in /home/user/.conda/envs/det2/bin/python)
frame #11: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #12: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #13: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #14: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #15: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #16: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #17: <unknown function> + 0x128a0c (0x5602cb3caa0c in /home/user/.conda/envs/det2/bin/python)
frame #18: <unknown function> + 0x11d6e1 (0x5602cb3bf6e1 in /home/user/.conda/envs/det2/bin/python)
frame #19: <unknown function> + 0x1e422f (0x5602cb48622f in /home/user/.conda/envs/det2/bin/python)
frame #20: <unknown function> + 0x1e4633 (0x5602cb486633 in /home/user/.conda/envs/det2/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x238c (0x5602cb465cdc in /home/user/.conda/envs/det2/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x2d2 (0x5602cb42cbd2 in /home/user/.conda/envs/det2/bin/python)
frame #23: _PyFunction_Vectorcall + 0x1e3 (0x5602cb42da83 in /home/user/.conda/envs/det2/bin/python)
frame #24: <unknown function> + 0x1012f4 (0x5602cb3a32f4 in /home/user/.conda/envs/det2/bin/python)
frame #25: _PyFunction_Vectorcall + 0x10b (0x5602cb42d9ab in /home/user/.conda/envs/det2/bin/python)
frame #26: <unknown function> + 0x1010b7 (0x5602cb3a30b7 in /home/user/.conda/envs/det2/bin/python)
frame #27: _PyEval_EvalCodeWithName + 0x2d2 (0x5602cb42cbd2 in /home/user/.conda/envs/det2/bin/python)
frame #28: _PyFunction_Vectorcall + 0x1e3 (0x5602cb42da83 in /home/user/.conda/envs/det2/bin/python)
frame #29: <unknown function> + 0x1000de (0x5602cb3a20de in /home/user/.conda/envs/det2/bin/python)
frame #30: _PyEval_EvalCodeWithName + 0x2d2 (0x5602cb42cbd2 in /home/user/.conda/envs/det2/bin/python)
frame #31: PyEval_EvalCodeEx + 0x44 (0x5602cb42d894 in /home/user/.conda/envs/det2/bin/python)
frame #32: PyEval_EvalCode + 0x1c (0x5602cb4bc2dc in /home/user/.conda/envs/det2/bin/python)
frame #33: <unknown function> + 0x21a384 (0x5602cb4bc384 in /home/user/.conda/envs/det2/bin/python)
frame #34: <unknown function> + 0x24c6d4 (0x5602cb4ee6d4 in /home/user/.conda/envs/det2/bin/python)
frame #35: PyRun_StringFlags + 0x7d (0x5602cb4f0f2d in /home/user/.conda/envs/det2/bin/python)
frame #36: PyRun_SimpleStringFlags + 0x3f (0x5602cb3b74fb in /home/user/.conda/envs/det2/bin/python)
frame #37: <unknown function> + 0x1159ce (0x5602cb3b79ce in /home/user/.conda/envs/det2/bin/python)
frame #38: Py_BytesMain + 0x39 (0x5602cb4f11d9 in /home/user/.conda/envs/det2/bin/python)
frame #39: __libc_start_main + 0xf0 (0x7f4995f02830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #40: <unknown function> + 0x1ded73 (0x5602cb480d73 in /home/user/.conda/envs/det2/bin/python)
Traceback (most recent call last):
File "train_net.py", line 126, in <module>
launch(
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/engine/launch.py", line 50, in launch
mp.spawn(
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/engine/launch.py", line 89, in _distributed_worker
main_func(*args)
File "/home/user/Desktop/My-Project/detectron2_repo/projects/PointRend/train_net.py", line 120, in main
return trainer.train()
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 401, in train
super().train(self.start_iter, self.max_iter)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 215, in run_step
loss_dict = self.model(data)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 445, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 130, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 591, in forward
losses.update(self._forward_mask(features, proposals))
File "/home/user/Desktop/My-Project/detectron2_repo/projects/PointRend/point_rend/roi_heads.py", line 123, in _forward_mask
losses = {"loss_mask": mask_rcnn_loss(mask_coarse_logits, proposals)}
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/modeling/roi_heads/mask_head.py", line 55, in mask_rcnn_loss
gt_masks_per_image = instances_per_image.gt_masks.crop_and_resize(
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/structures/masks.py", line 195, in crop_and_resize
ROIAlign((mask_size, mask_size), 1.0, 0, aligned=True)
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/layers/roi_align.py", line 94, in forward
return roi_align(
File "/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2/layers/roi_align.py", line 19, in forward
output = _C.roi_align_forward(
RuntimeError: CUDA error: an illegal memory access was encountered
sys.platform linux
Python 3.8.2 (default, May 7 2020, 20:00:49) [GCC 7.3.0]
numpy 1.18.1
detectron2 0.1.2 @/home/user/.conda/envs/det2/lib/python3.8/site-packages/detectron2
detectron2 compiler GCC 5.4
detectron2 CUDA compiler 10.1
detectron2 arch flags sm_61
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.5.0 @/home/user/.conda/envs/det2/lib/python3.8/site-packages/torch
PyTorch debug build False
CUDA available True
GPU 0,1 TITAN Xp
CUDA_HOME /usr/local/cuda
NVCC Cuda compilation tools, release 10.1, V10.1.243
Pillow 7.1.2
torchvision 0.6.0a0+82fd1c8 @/home/user/.conda/envs/det2/lib/python3.8/site-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
fvcore 0.1.1
cv2 4.2.0
PyTorch built with:
If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in https://detectron2.readthedocs.io/tutorials/install.html#common-installation-issues
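Both traces above fault inside CUDA kernels that consume model-produced tensors (the attention output in the first, ROIAlign in the second), which is consistent with diverged training feeding NaN/Inf values into a kernel. A diagnostic sketch to confirm that theory (not detectron2 code; it assumes an (N, 4) tensor of XYXY box coordinates, and the function name is hypothetical):

# Diagnostic sketch: check that box coordinates are finite before they reach ROIAlign.
import torch

def assert_finite_boxes(boxes: torch.Tensor) -> None:
    # boxes: hypothetical (N, 4) tensor of XYXY coordinates about to enter a CUDA kernel
    if not torch.isfinite(boxes).all():
        bad_rows = (~torch.isfinite(boxes)).any(dim=1).nonzero().flatten().tolist()
        raise FloatingPointError(f"Non-finite box coordinates at rows {bad_rows}")

If such a check fires shortly before the crash, the illegal memory access is a symptom of divergence rather than a kernel bug, and lowering the learning rate or clipping gradients is the right lever.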