Update: the issue disappears if I remove the 100 images that show the highest number of instances. The top 100 images have 700 instances per image, with the top 10 having 3000 instances. Still, the official PANet implementation, which is heavily based on Detectron 1 code, is able to run on the whole dataset without any issue.
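For reference, this kind of filtering can be done offline on the COCO-style json before registering the dataset. The following is a minimal sketch under my own assumptions (it is not part of detectron2 or the iSAID tools; the file names and the instance threshold are placeholders):

import json
from collections import Counter

MAX_INSTANCES = 700  # hypothetical cut-off on annotations per image

with open("isaid_train.json") as f:  # placeholder input path
    coco = json.load(f)

# Count annotations per image and keep only the images below the threshold.
counts = Counter(a["image_id"] for a in coco["annotations"])
keep_ids = {img["id"] for img in coco["images"] if counts[img["id"]] <= MAX_INSTANCES}

coco["images"] = [img for img in coco["images"] if img["id"] in keep_ids]
coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep_ids]

with open("isaid_train_filtered.json", "w") as f:  # placeholder output path
    json.dump(coco, f)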
Thanks for the information, this is quite helpful.
I think there might be some optimization we can do to improve memory utilization for such cases.
By the way, this works in Detectron 1 because the part of the logic that triggers the GPU OOM in detectron2 runs on the CPU in Detectron 1 (which is also part of the reason why detectron2 is faster).
The memory consumption is by design of the algorithm itself, so unless we invent modifications to the algorithm it is unlikely to be reduced. However, we could run the memory-expensive operations on the CPU if needed. https://github.com/facebookresearch/detectron2/commit/baf6667e9c8799114437fd1a3e07c146f0e5338f contains such an improvement that might resolve your issue.
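To illustrate why the cost grows with the instance count, here is a back-of-envelope sketch. It assumes the dominant term is the dense proposal/anchor-to-ground-truth IoU matrix built during matching, which is my reading of anchor-based detectors in general rather than a profile of this specific OOM; the anchor count is a rough assumed figure for an 800x800 FPN input.

def iou_matrix_bytes(num_anchors, num_gt, bytes_per_elem=4):
    """Size of a dense float32 IoU matrix of shape (num_anchors, num_gt)."""
    return num_anchors * num_gt * bytes_per_elem

num_anchors = 160_000  # assumed rough anchor count for one 800x800 FPN image
for num_gt in (100, 700, 3000):
    gib = iou_matrix_bytes(num_anchors, num_gt) / 1024 ** 3
    print(f"{num_gt:>5} instances -> ~{gib:.2f} GiB for a single IoU matrix")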
Closing due to no activity. The original location where it goes OOM should not go OOM any more given the above fix.
Similar problem while using Colab and a custom dataset. Solved by tinkering with the cfg settings:
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.DATASETS.TRAIN = ("unityDF1",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 1
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")  # let training initialize from the model zoo
cfg.SOLVER.IMS_PER_BATCH = 2  # this is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
cfg.SOLVER.MAX_ITER = 300  # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = []  # do not decay the learning rate
# Note: of the BATCH_SIZE_PER_IMAGE variants below, detectron2's default config only
# defines MODEL.RPN.BATCH_SIZE_PER_IMAGE and MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE;
# the other assignments are stored in the cfg but never read by the model.
BATCH_SIZE = 32
cfg.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.FPN.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.DATALOADER.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.SOLVER.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
# Keep very few proposals and shrink the inputs to cut memory further.
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 10
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 10
cfg.DATASETS.PRECOMPUTED_PROPOSAL_TOPK_TRAIN = 10
cfg.INPUT.MAX_SIZE_TRAIN = 32
cfg.INPUT.MAX_SIZE_TEST = 32
# cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 2  # the "RoIHead batch size"; 128 is faster and good enough for this toy dataset (default: 512)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3
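As a follow-up, here is a trimmed-down sketch of the same idea that touches only keys defined in detectron2's default config; the values are illustrative assumptions, not tuned settings:

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.SOLVER.IMS_PER_BATCH = 2                   # images per batch across all GPUs
cfg.INPUT.MAX_SIZE_TRAIN = 512                 # smaller inputs mean fewer anchors
cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE = 64        # anchors sampled per image for the RPN loss
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64  # RoIs sampled per image for the box-head loss
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 2000        # RPN proposals kept before NMS
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 500        # RPN proposals kept after NMS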
I have a similar problem: when there are too many instances in an image, the GPU goes OOM. Is there any way to prevent this memory problem, such as a limit on the number of instances, or something like batching over the instances?
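One possible workaround (a sketch under my own assumptions, not a built-in detectron2 feature) is to cap the number of annotations fed to the model per image with a custom dataset mapper. Note that randomly dropping annotations changes the training labels, and the cap of 500 below is a hypothetical value:

import copy
import random

from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.engine import DefaultTrainer

MAX_INSTANCES = 500  # hypothetical cap on annotations per image

class CappedMapper(DatasetMapper):
    """DatasetMapper that keeps at most MAX_INSTANCES annotations per image."""
    def __call__(self, dataset_dict):
        annos = dataset_dict.get("annotations", [])
        if len(annos) > MAX_INSTANCES:
            dataset_dict = copy.deepcopy(dataset_dict)
            dataset_dict["annotations"] = random.sample(annos, MAX_INSTANCES)
        return super().__call__(dataset_dict)

class CappedTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        return build_detection_train_loader(cfg, mapper=CappedMapper(cfg, is_train=True))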
If you do not know the root cause of the problem / bug, and wish someone to help you, please include:
How To Reproduce the Issue
Run a simple training with any detectron2 backbone on the iSAID dataset (https://captain-whu.github.io/iSAID/). iSAID is an instance segmentation dataset with COCO-style json data, using 15 object categories and having some images with a very large number of instances (cars). iSAID is preprocessed by the authors' script, which converts labels, bounding boxes, and metadata to the COCO format while creating 800x800 patches of the high-resolution original images.
what changes you made (git diff) or what code you wrote:
I used the simple detectron2 Colab tutorial code, with the register_coco_instances function instead of defining a custom function, as iSAID is fully compatible with the COCO format. Here is a link to the code for reproducing the error: https://drive.google.com/open?id=1bo0GOhHLlvEyc6E9DOZzlszg59THOT9x
what exact command you run:
python3 training_naive.py, which runs register_coco_instances, a cfg setup, and then a simple DefaultTrainer.train(). A minimal sketch of this flow follows below.
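For context, here is a minimal sketch of the reproduction flow described above; the dataset name, paths, and the chosen model zoo config are placeholders, and the actual script is the one in the linked Drive folder:

import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the iSAID training split; the paths are placeholders.
register_coco_instances("isaid_train", {}, "path/to/iSAID_train.json", "path/to/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml")
cfg.DATASETS.TRAIN = ("isaid_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 15  # iSAID has 15 object categories

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()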
what you observed (including the full logs):
Expected behavior
If there are no obvious errors in the "what you observed" section above, please tell us the expected behavior.
If you expect the model to work better, note that we do not help you train your model. We will only help with it in one of two cases: (1) you are unable to reproduce the results in the detectron2 model zoo, or (2) it indicates a detectron2 bug.
Environment
Please paste the output of python -m detectron2.utils.collect_env. If detectron2 hasn't been successfully installed, use python detectron2/utils/collect_env.py.

(pytorch) paolo@ALCOR-TITANV-WS:~/libriaries/prove_detectron2$ python -m detectron2.utils.collect_env
sys.platform linux
Python 3.6.8 (default, Oct 9 2019, 14:04:01) [GCC 5.4.0 20160609]
Numpy 1.17.4
Detectron2 Compiler GCC 5.4
Detectron2 CUDA Compiler 10.1
DETECTRON2_ENV_MODULE
PyTorch 1.3.1
PyTorch Debug Build False
torchvision 0.4.2
CUDA available True
GPU 0,1,2,3 TITAN V
CUDA_HOME /usr/local/cuda-10.1
NVCC Cuda compilation tools, release 10.1, V10.1.105
Pillow 6.2.1
cv2 4.1.2
PyTorch built with:
(pytorch) paolo@ALCOR-TITANV-WS:~/libriaries/prove_detectron2$