Update: the issue disappears if I remove the 100 images that show the highest number of instances. The top 100 images have 700 instances per image, with the top 10 having 3000 instances. Still, the official PANet implementation, which is heavily based on Detectron 1 code, is able to run on the whole dataset without any issue.
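For reference, this kind of filtering can be done offline on the COCO-style json before registering the dataset. The following is a minimal sketch under my own assumptions (it is not part of detectron2 or the iSAID tools; the file names and the instance threshold are placeholders):

import json
from collections import Counter

MAX_INSTANCES = 700  # hypothetical cut-off on annotations per image

with open("isaid_train.json") as f:  # placeholder input path
    coco = json.load(f)

# Count annotations per image and keep only the images below the threshold.
counts = Counter(a["image_id"] for a in coco["annotations"])
keep_ids = {img["id"] for img in coco["images"] if counts[img["id"]] <= MAX_INSTANCES}

coco["images"] = [img for img in coco["images"] if img["id"] in keep_ids]
coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep_ids]

with open("isaid_train_filtered.json", "w") as f:  # placeholder output path
    json.dump(coco, f)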
Thanks for the information, this is quite helpful.
I think there might be some optimization we can do to improve memory utilization for such cases.
By the way, this works in Detectron 1 because the part of the logic that triggers the GPU OOM in detectron2 runs on the CPU in Detectron 1 (which is also part of the reason why detectron2 is faster).
The memory consumption is by design of the algorithm itself, so unless we invent modifications to the algorithm it is unlikely to be reduced. However, we could run the memory-expensive operations on the CPU if needed. https://github.com/facebookresearch/detectron2/commit/baf6667e9c8799114437fd1a3e07c146f0e5338f contains such an improvement that might resolve your issue.
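To illustrate why the cost grows with the instance count, here is a back-of-envelope sketch. It assumes the dominant term is the dense proposal/anchor-to-ground-truth IoU matrix built during matching, which is my reading of anchor-based detectors in general rather than a profile of this specific OOM; the anchor count is a rough assumed figure for an 800x800 FPN input.

def iou_matrix_bytes(num_anchors, num_gt, bytes_per_elem=4):
    """Size of a dense float32 IoU matrix of shape (num_anchors, num_gt)."""
    return num_anchors * num_gt * bytes_per_elem

num_anchors = 160_000  # assumed rough anchor count for one 800x800 FPN image
for num_gt in (100, 700, 3000):
    gib = iou_matrix_bytes(num_anchors, num_gt) / 1024 ** 3
    print(f"{num_gt:>5} instances -> ~{gib:.2f} GiB for a single IoU matrix")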
Closing due to no activity. The original location where it goes OOM should not go OOM any more given the above fix.
Similar problem while using Colab and a custom dataset. Solved by tinkering with the cfg settings:
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.DATASETS.TRAIN = ("unityDF1",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 1
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")  # let training initialize from the model zoo
cfg.SOLVER.IMS_PER_BATCH = 2  # this is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
cfg.SOLVER.MAX_ITER = 300  # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = []  # do not decay the learning rate
# Note: of the BATCH_SIZE_PER_IMAGE variants below, detectron2's default config only
# defines MODEL.RPN.BATCH_SIZE_PER_IMAGE and MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE;
# the other assignments are stored in the cfg but never read by the model.
BATCH_SIZE = 32
cfg.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.FPN.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.DATALOADER.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
cfg.SOLVER.BATCH_SIZE_PER_IMAGE = BATCH_SIZE
# Keep very few proposals and shrink the inputs to cut memory further.
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 10
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 10
cfg.DATASETS.PRECOMPUTED_PROPOSAL_TOPK_TRAIN = 10
cfg.INPUT.MAX_SIZE_TRAIN = 32
cfg.INPUT.MAX_SIZE_TEST = 32
# cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 2  # the "RoIHead batch size"; 128 is faster and good enough for this toy dataset (default: 512)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3
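As a follow-up, here is a trimmed-down sketch of the same idea that touches only keys defined in detectron2's default config; the values are illustrative assumptions, not tuned settings:

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.SOLVER.IMS_PER_BATCH = 2                   # images per batch across all GPUs
cfg.INPUT.MAX_SIZE_TRAIN = 512                 # smaller inputs mean fewer anchors
cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE = 64        # anchors sampled per image for the RPN loss
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64  # RoIs sampled per image for the box-head loss
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 2000        # RPN proposals kept before NMS
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 500        # RPN proposals kept after NMS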
I have a similar problem: when there are too many instances in an image, the GPU goes OOM. Is there any way to prevent this memory problem, such as a limit on the number of instances, or something like batching over the instances?
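One possible workaround (a sketch under my own assumptions, not a built-in detectron2 feature) is to cap the number of annotations fed to the model per image with a custom dataset mapper. Note that randomly dropping annotations changes the training labels, and the cap of 500 below is a hypothetical value:

import copy
import random

from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.engine import DefaultTrainer

MAX_INSTANCES = 500  # hypothetical cap on annotations per image

class CappedMapper(DatasetMapper):
    """DatasetMapper that keeps at most MAX_INSTANCES annotations per image."""
    def __call__(self, dataset_dict):
        annos = dataset_dict.get("annotations", [])
        if len(annos) > MAX_INSTANCES:
            dataset_dict = copy.deepcopy(dataset_dict)
            dataset_dict["annotations"] = random.sample(annos, MAX_INSTANCES)
        return super().__call__(dataset_dict)

class CappedTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        return build_detection_train_loader(cfg, mapper=CappedMapper(cfg, is_train=True))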
If you do not know the root cause of the problem / bug, and wish someone to help you, please include:
How To Reproduce the Issue
Run a simple training with any detectron2 backbone on the iSAID dataset (https://captain-whu.github.io/iSAID/). iSAID is an instance segmentation dataset with COCO-style json data, using 15 object categories and having some images with a very large number of instances (cars). iSAID is preprocessed by the authors' script, which converts labels, bounding boxes, and metadata to the COCO format while creating 800x800 patches of the high-resolution original images.
what changes you made (git diff) or what code you wrote:
I used the simple detectron2 Colab tutorial code, with the register_coco_instances function instead of defining a custom function, as iSAID is fully compatible with the COCO format. Here is a link to the code for reproducing the error: https://drive.google.com/open?id=1bo0GOhHLlvEyc6E9DOZzlszg59THOT9x
what exact command you run:
python3 training_naive.py, which runs register_coco_instances, a cfg setup, and then a simple DefaultTrainer.train(). A minimal sketch of this flow follows below.
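For context, here is a minimal sketch of the reproduction flow described above; the dataset name, paths, and the chosen model zoo config are placeholders, and the actual script is the one in the linked Drive folder:

import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the iSAID training split; the paths are placeholders.
register_coco_instances("isaid_train", {}, "path/to/iSAID_train.json", "path/to/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml")
cfg.DATASETS.TRAIN = ("isaid_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 15  # iSAID has 15 object categories

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()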
what you observed (including the full logs):
Expected behavior
If there are no obvious errors in the "what you observed" section above, please tell us the expected behavior.
If you expect the model to work better, note that we do not help you train your model. We will only help with it in one of two cases: (1) you are unable to reproduce the results in the detectron2 model zoo, or (2) it indicates a detectron2 bug.
Environment
Please paste the output of python -m detectron2.utils.collect_env. If detectron2 hasn't been successfully installed, use python detectron2/utils/collect_env.py.

(pytorch) paolo@ALCOR-TITANV-WS:~/libriaries/prove_detectron2$ python -m detectron2.utils.collect_env
sys.platform linux
Python 3.6.8 (default, Oct 9 2019, 14:04:01) [GCC 5.4.0 20160609]
Numpy 1.17.4
Detectron2 Compiler GCC 5.4
Detectron2 CUDA Compiler 10.1
DETECTRON2_ENV_MODULE
PyTorch 1.3.1
PyTorch Debug Build False
torchvision 0.4.2
CUDA available True
GPU 0,1,2,3 TITAN V
CUDA_HOME /usr/local/cuda-10.1
NVCC Cuda compilation tools, release 10.1, V10.1.105
Pillow 6.2.1
cv2 4.1.2
PyTorch built with:
(pytorch) paolo@ALCOR-TITANV-WS:~/libriaries/prove_detectron2$