facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Illegal memory access when training with rotated boxes #4216

Open jvdgoltz opened 2 years ago

jvdgoltz commented 2 years ago

Hello all, really enjoy detectron2 and its possibilities for customization but I encountered a problem that makes me scratch my head:

  1. Full runnable code or full changes you made: I trained https://github.com/facebookresearch/detectron2/blob/main/configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml with the following rotated-box configuration, based on https://github.com/facebookresearch/detectron2/issues/21#issuecomment-595522318:

    INPUT:
      MIN_SIZE_TRAIN: (1728, 1814, 1901, 1987, 2074, 2160)
      MAX_SIZE_TRAIN: 3600
      MIN_SIZE_TEST: 2160
      MAX_SIZE_TEST: 3600
    MODEL:
      ANCHOR_GENERATOR:
        NAME: RotatedAnchorGenerator
        SIZES: [[16], [32], [64], [128], [256]]
        ASPECT_RATIOS: [[0.25, 1]]
        ANGLES: [[-72, -36, 0, 36, 72]]
      PROPOSAL_GENERATOR:
        NAME: RRPN
      RPN:
        BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0, 1.0)
      ROI_BOX_HEAD:
        POOLER_TYPE: ROIAlignRotated
        BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0, 1.0)
      ROI_HEADS:
        NAME: RROIHeads
        BATCH_SIZE_PER_IMAGE: 768
    TEST:
      DETECTIONS_PER_IMAGE: 350
  2. What exact command you run: a custom train.py script for initializing the config, very similar to train_net.py (a rough sketch of how the config gets assembled is shown after the traceback below). After the issue appeared for the first time, I also set CUDA_LAUNCH_BLOCKING=1.

  3. Full logs or other relevant observations:

After a few hundred to a few thousand iterations, my training script throws an exception.

# Ran training with CUDA_LAUNCH_BLOCKING=1

[05/06 10:36:10] d2.utils.events INFO:  eta: 7:45:45  iter: 2139  total_loss: 0.829  loss_cls: 0.2315  loss_box_reg: 0.24  loss_rpn_cls: 0.02631  loss_rpn_loc: 0.2282  time: 0.5957  data_time: 0.0151  lr: 0.002  max_mem: 16124M
[05/06 10:36:18] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 157, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 471, in forward
gt_labels, gt_boxes = self.label_and_sample_anchors(anchors, gt_instances)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rrpn.py", line 176, in label_and_sample_anchors
match_quality_matrix = retry_if_cuda_oom(pairwise_iou_rotated)(gt_boxes_i, anchors)
File "/opt/conda/lib/python3.8/site-packages/detectron2/utils/memory.py", line 70, in wrapped
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/detectron2/structures/rotated_boxes.py", line 503, in pairwise_iou
return pairwise_iou_rotated(boxes1.tensor, boxes2.tensor)
File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/rotated_boxes.py", line 22, in pairwise_iou_rotated
return _C.box_iou_rotated(boxes1, boxes2)
RuntimeError: CUDA error: an illegal memory access was encountered
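
Roughly, the config is assembled like this in the custom train.py. This is a simplified sketch; the overrides file name is a placeholder:

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Start from the stock Faster R-CNN R50-FPN 3x config ...
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
# ... then merge the rotated-box overrides from point 1 above
# ("rotated_overrides.yaml" is a placeholder file name).
cfg.merge_from_file("rotated_overrides.yaml")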

I encountered this for the first time when switching my config from axis-aligned box detection to rotated box detection. Additionally, you should know that we are aiming to train at a very high image resolution (eventually 8000x5000 pixels). In that config I was using the following (I don't have the exact environment anymore, but this should give all the info needed):

detectron2              0.6 
CUDA compiler           CUDA 10.2
PyTorch                 1.10.2
GPU available           Yes
GPU 0                   V100-16GB

At first I thought it was a memory issue, even though I still had some free memory according to nvidia-smi. So I reduced the number of proposals by reducing the number of aspect ratios and angles (see config above), and the problem disappeared. I was not too happy with that because I couldn't make full use of the GPU.

Instead, we scaled up to a 4 x V100 setup (training with DDP and 4 workers) to increase the input resolution to the values above. I tried to add back an additional aspect ratio value, and that brought back the illegal memory access. So I reverted and it worked again.

Recently, we switched to 2 x A100 GPUs. For that I had to use CUDA 11, and now the error is coming back even though I didn't increase the number of proposals or input resolution.

Environment:

This is the environment for the most recent configuration that I used when I encountered this issue.

[05/06 10:06:34] detectron2 INFO: Rank of current process: 0. World size: 2
[05/06 10:06:35] detectron2 INFO: Environment info:
----------------------  -----------------------------------------------------------------------------------
sys.platform            linux
Python                  3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
numpy                   1.21.5
detectron2              0.6 @/opt/conda/lib/python3.8/site-packages/detectron2
Compiler                GCC 7.3
CUDA compiler           CUDA 11.1
detectron2 arch flags   /opt/conda/lib/python3.8/site-packages/detectron2/_C.cpython-38-x86_64-linux-gnu.so
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.2 @/opt/conda/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1                 A100-SXM4-40GB (arch=8.0)
Driver version
CUDA_HOME               None - invalid!
Pillow                  9.0.1
torchvision             0.11.3 @/opt/conda/lib/python3.8/site-packages/torchvision
torchvision arch flags  /opt/conda/lib/python3.8/site-packages/torchvision/_C.so
fvcore                  0.1.5.post20220504
iopath                  0.1.9
cv2                     4.5.5
----------------------  -----------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

So this error happened on two different GPU platforms with two different CUDA versions, and it always happened in _C.box_iou_rotated(boxes1, boxes2). I never had this problem when training with axis-aligned boxes. It could be a CUDA bug, but since it is so specific to rotated boxes I came here first.

Help is greatly appreciated!

github-actions[bot] commented 2 years ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

jvdgoltz commented 2 years ago

I was looking more closely at this part of the stack trace:

File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rrpn.py", line 176, in label_and_sample_anchors
match_quality_matrix = retry_if_cuda_oom(pairwise_iou_rotated)(gt_boxes_i, anchors)

As a workaround, I could try to implement my own RRPN using a custom retry_if_cuda_oom:

from contextlib import contextmanager

@contextmanager
def _ignore_torch_cuda_oom():
    """
    A context which ignores CUDA OOM and illegal-memory-access exceptions from pytorch.
    """
    try:
        yield
    except RuntimeError as e:
        # NOTE: the string may change?
        if "CUDA out of memory. " in str(e):
            pass
        elif "CUDA error: an illegal memory access was encountered" in str(e):
            pass
        else:
            raise
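
A retry wrapper built on top of this context manager could then look roughly like the sketch below, modeled on detectron2's retry_if_cuda_oom (after an illegal memory access the CUDA context may already be corrupted, so the CPU fallback is best-effort only):

from functools import wraps

import torch

def retry_on_cpu_if_cuda_error(func):
    """Hypothetical variant of retry_if_cuda_oom: if the CUDA call fails,
    retry once with CPU copies of all tensor arguments."""
    @wraps(func)
    def wrapped(*args, **kwargs):
        with _ignore_torch_cuda_oom():
            return func(*args, **kwargs)
        # The CUDA attempt failed and was swallowed above; retry on CPU.
        def to_cpu(x):
            return x.cpu() if isinstance(x, torch.Tensor) else x
        cpu_args = [to_cpu(a) for a in args]
        cpu_kwargs = {k: to_cpu(v) for k, v in kwargs.items()}
        return func(*cpu_args, **cpu_kwargs)
    return wrapped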

I will also investigate whether gt_boxes_i and anchors contain weird values!

Would love to hear your opinions!

ppwwyyxx commented 2 years ago

The given information does suggest a potential bug in box_iou_rotated; especially given the large image sizes, I suspect something is going out of bounds. But we do not know how it happened.

The best way to help report the issue is to catch the error and save the input data that causes box_iou_rotated to fail.

If you can then reproduce the issue in a short script that calls box_iou_rotated with the bad data, provide the script and the data and I'm sure it can be fixed.
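
A minimal capture wrapper could look something like this (untested sketch; the wrapper name and dump path are placeholders):

import torch
from detectron2.layers.rotated_boxes import pairwise_iou_rotated

def pairwise_iou_rotated_logged(boxes1, boxes2, dump_path="/tmp/bad_rotated_boxes.pt"):
    """Save the inputs whenever the rotated IoU kernel fails, so the failing
    case can be replayed in a standalone script."""
    try:
        out = pairwise_iou_rotated(boxes1, boxes2)
        # The kernel launch is asynchronous; synchronize so the error
        # surfaces here rather than at some later CUDA call.
        torch.cuda.synchronize()
        return out
    except RuntimeError:
        # Copying tensors may itself fail once the CUDA context is corrupted;
        # in that case at least the shapes printed below are still informative.
        print("pairwise_iou_rotated failed:", boxes1.shape, boxes2.shape)
        torch.save({"boxes1": boxes1.cpu(), "boxes2": boxes2.cpu()}, dump_path)
        raise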

Mark-C-Lowell commented 2 years ago

We encountered this problem as well. We caught the boxes and pickled them, and it appears to be caused by the product of the two box counts being too large. We can replicate the illegal memory access using:

import torch
from detectron2.layers.rotated_boxes import pairwise_iou_rotated

M = 2 * 1024
N = 1024**2

boxes_1 = torch.ones((M, 5), device='cuda')
boxes_2 = torch.ones((N, 5), device='cuda')
pairwise_iou_rotated(boxes_1, boxes_2)
torch.cuda.synchronize()

If we reduce M or N by one, the illegal memory access no longer occurs.

jvdgoltz commented 2 years ago

@Mark-C-Lowell Good job debugging the issue! Saved me some hassle ;) It totally makes sense: the combination of super high resolution and rotated bounding boxes results in many more anchors, which is where our issue appears.

We are working around it by implementing a strided anchor generator and RPN head (something that was on our roadmap anyway), which reduces the number of anchors significantly.
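
@Mark-C-Lowell's threshold also suggests another possible workaround: chunk the IoU computation so that no single box_iou_rotated call sees the full M x N matrix. Untested sketch; the max_pairs value is just a conservative guess:

import torch
from detectron2.layers.rotated_boxes import pairwise_iou_rotated

def pairwise_iou_rotated_chunked(boxes1, boxes2, max_pairs=2**30):
    """Compute the rotated IoU matrix in column chunks so each kernel launch
    stays well under the ~2**31-pair size where the error appears."""
    chunk = max(1, max_pairs // max(len(boxes1), 1))
    parts = [
        pairwise_iou_rotated(boxes1, boxes2[i:i + chunk])
        for i in range(0, len(boxes2), chunk)
    ]
    return torch.cat(parts, dim=1)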

ppwwyyxx commented 2 years ago

I cannot reproduce the error using the code above; it runs correctly. This problem may be specific to certain environments.

Mark-C-Lowell commented 2 years ago

Sorry, I had a typo in my comment. It errors out when $M * N$ exceeds $2 \cdot 1024^3$, not when it equals it.
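
If the kernel computes a flat 32-bit index over the M x N IoU matrix, that is exactly the point where a signed overflow would start (just a guess, not checked against the kernel source):

INT32_MAX = 2**31 - 1     # largest value a signed 32-bit index can hold
threshold = 2 * 1024**3   # observed failure threshold for M * N, equal to 2**31
# With a 0-based flat index, the last element of an M x N matrix is M*N - 1,
# so it still fits exactly at the threshold and overflows one pair later.
print(threshold - 1 <= INT32_MAX)  # True
print(threshold <= INT32_MAX)      # False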