Encountering Out of Memory Issue when Using Specific Data Augmentation Methods

arthas1989 commented 3 years ago

I am using the original code to train a model with SOLO v2 on the COCO dataset. With Resize and/or Flip augmentation, training completes without any issue. However, with Crop or Rotate augmentation, training fails with a CUDA out-of-memory error. The error log and environment info are attached below. Thank you very much; I look forward to hearing from you.

Error Log

Traceback (most recent call last):
  File "/snap/pycharm-community/240/plugins/python-ce/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-community/240/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/dev/AdelaiDet/tools/train_net.py", line 227, in <module>
    args=(args,),
  File "/home/dev/detectron2/detectron2/engine/launch.py", line 59, in launch
    daemon=False,
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/dev/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/dev/AdelaiDet/tools/train_net.py", line 215, in main
    return trainer.train()
  File "/home/dev/AdelaiDet/tools/train_net.py", line 97, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "/home/dev/AdelaiDet/tools/train_net.py", line 86, in train_loop
    self.run_step()
  File "/home/dev/detectron2/detectron2/engine/train_loop.py", line 216, in run_step
    loss_dict = self.model(data)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/AdelaiDet/adet/modeling/condinst/condinst.py", line 124, in forward
    features = self.backbone(images_norm.tensor)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/detectron2/detectron2/modeling/backbone/resnet.py", line 435, in forward
    x = stage(x)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/detectron2/detectron2/modeling/backbone/resnet.py", line 201, in forward
    out = self.conv3(out)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/detectron2/detectron2/layers/wrappers.py", line 96, in forward
    x = self.norm(x)
  File "/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dev/detectron2/detectron2/layers/batch_norm.py", line 53, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 342.00 MiB (GPU 1; 10.76 GiB total capacity; 9.06 GiB already allocated; 292.38 MiB free; 9.58 GiB reserved in total by PyTorch)
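
The numbers in the final RuntimeError line can be read off mechanically. The sketch below parses that message and compares the requested allocation with what was free; the helper name `parse_oom` and the regular expressions are illustrative assumptions written for this one message format, not part of PyTorch.

```python
import re

# The final line of the error log above, copied verbatim.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 342.00 MiB (GPU 1; 10.76 GiB total capacity; "
    "9.06 GiB already allocated; 292.38 MiB free; 9.58 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical helper: one pattern per figure reported in the message.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved",
}


def parse_oom(message):
    """Return every size reported in the message, converted to MiB."""
    sizes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError("field not found in message: " + name)
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# How much more memory the failed request needed than the device had free.
shortfall = sizes["requested"] - sizes["free"]
# Memory held by PyTorch's caching allocator but not backing live tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]
print("shortfall: %.2f MiB" % shortfall)
print("reserved but unallocated: %.2f MiB" % cached_unused)
```

The request exceeded the free device memory, and most of the 10.76 GiB card was already allocated to tensors. That the request failed even though PyTorch held some reserved but unallocated memory may mean no single cached block was large enough, though the log alone does not establish this.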

Environment:

Paste the output of the following command:

wget -nc -nv https://github.com/facebookresearch/detectron2/raw/master/detectron2/utils/collect_env.py && python collect_env.py

sys.platform            linux
Python                  3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
numpy                   1.20.2
detectron2              0.1.3 @/home/dev/detectron2/detectron2
Compiler                GCC 7.5
CUDA compiler           CUDA 10.2
detectron2 arch flags   sm_75
DETECTRON2_ENV_MODULE
PyTorch                 1.7.0 @/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torch
PyTorch debug build     True
GPU available           True
GPU 0,1,2,3             GeForce RTX 2080 Ti
CUDA_HOME               /usr/local/cuda-10.2
Pillow                  8.1.0
torchvision             0.8.1 @/home/dev/anaconda3/envs/detectron2/lib/python3.7/site-packages/torchvision
torchvision arch flags  sm_35, sm_50, sm_60, sm_70, sm_75
fvcore                  0.1.3.post20210317
cv2                     4.1.0

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.2
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.5
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

github-actions[bot] commented 3 years ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

arthas1989 commented 3 years ago

Instructions To Reproduce the Issue:

Command Line Args: Namespace(config_file='/home/dev/AdelaiDet/configs/CondInst/MS_R_101_3x.yaml', dist_url='tcp://127.0.0.1:50153', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)

Data Augmentation: Rotate/Crop
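
One plausible reason rotation raises memory use, which this thread does not confirm, is that a rotation which expands the canvas to keep the whole rotated image produces a larger input than the original. The sketch below computes that enlarged size with plain trigonometry. The 1333x800 input size and the function names are assumptions for illustration, and whether the augmentation used here expands the canvas is also an assumption.

```python
import math


def expanded_size(width, height, angle_degrees):
    """Bounding-box size of a width x height image rotated by angle_degrees."""
    theta = math.radians(angle_degrees)
    cos_t = abs(math.cos(theta))
    sin_t = abs(math.sin(theta))
    new_width = width * cos_t + height * sin_t
    new_height = height * cos_t + width * sin_t
    return new_width, new_height


def area_ratio(width, height, angle_degrees):
    """Pixel count of the expanded canvas relative to the original image."""
    new_width, new_height = expanded_size(width, height, angle_degrees)
    return (new_width * new_height) / (width * height)


# Hypothetical training resolution; the real one depends on the config in use.
WIDTH, HEIGHT = 1333, 800
for angle in (0, 15, 30, 45, 90):
    print("angle %3d: area x%.2f" % (angle, area_ratio(WIDTH, HEIGHT, angle)))
```

Under these assumptions a 45 degree rotation a little more than doubles the pixel count, and the backbone's activation memory grows roughly in proportion to the number of input pixels.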

Expected behavior:

Training fails with the same traceback as in the Error Log above: process 1 terminates with "RuntimeError: CUDA out of memory. Tried to allocate 342.00 MiB (GPU 1; 10.76 GiB total capacity; 9.06 GiB already allocated; 292.38 MiB free; 9.58 GiB reserved in total by PyTorch)".

ppwwyyxx commented 3 years ago

I believe it just needs more memory so this issue is not unexpected.
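
If more GPU memory is not available, the usual alternative is to lower the memory the job needs. The sketch below assembles a training command that appends `KEY VALUE` config overrides after the regular flags, which is how the empty `opts=[]` in the command line args above gets filled. The keys `SOLVER.IMS_PER_BATCH` and `INPUT.MAX_SIZE_TRAIN` are standard detectron2 config keys, but the values are hypothetical, the relative paths are assumed, and `build_train_command` is an illustrative helper.

```python
import shlex


def build_train_command(config_file, num_gpus, overrides):
    """Assemble a train_net.py command with trailing KEY VALUE config overrides."""
    command = [
        "python", "tools/train_net.py",
        "--config-file", config_file,
        "--num-gpus", str(num_gpus),
    ]
    # Everything after the flags is collected into `opts` and merged into the config.
    for key, value in overrides.items():
        command += [key, str(value)]
    return command


# Hypothetical values: a smaller total batch and a smaller longest image side.
overrides = {
    "SOLVER.IMS_PER_BATCH": 8,
    "INPUT.MAX_SIZE_TRAIN": 1000,
}
command = build_train_command("configs/CondInst/MS_R_101_3x.yaml", 4, overrides)
print(" ".join(shlex.quote(part) for part in command))
```

The total batch size has to stay divisible by the number of GPUs, and a smaller batch generally calls for a proportionally lower learning rate. Neither override has been tested against this setup.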
