TuSimple / mx-maskrcnn

An MXNet implementation of Mask R-CNN
Apache License 2.0

src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory #69

wenhe-jia opened this issue 6 years ago

wenhe-jia commented 6 years ago

When generating RPN detections after training RPN1, the process broke down. The error message is shown below:

Traceback (most recent call last):
  File "train_alternate_mask_fpn.py", line 116, in <module>
    main()
  File "train_alternate_mask_fpn.py", line 113, in main
    args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
  File "train_alternate_mask_fpn.py", line 39, in alternate_train
    vis=False, shuffle=False, thresh=0)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/tools/test_rpn.py", line 60, in test_rpn
    arg_params=arg_params, aux_params=aux_params)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/tester.py", line 22, in __init__
    self._mod.bind(provide_data, provide_label, for_training=False)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/module.py", line 141, in bind
    force_rebind=False, shared_module=None)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/module.py", line 417, in bind
    state_names=self._state_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 231, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 327, in bind_exec
    shared_group))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/symbol/symbol.py", line 1491, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
im_info: (1, 3L)
[21:01:05] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

I use 4 TITAN Xp GPUs, with 1 image per GPU. I do not know where the problem is.
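One way to narrow this down is to check whether the GPUs are genuinely out of memory or whether an earlier stage is still holding memory when the bind happens. A minimal sketch of such a check, assuming nvidia-smi is on the PATH (this helper is not part of mx-maskrcnn):

# Print per-GPU memory usage right before the failing bind.
# Illustrative helper, not part of mx-maskrcnn.
import subprocess

out = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]).decode()

for line in out.strip().splitlines():
    idx, used, free = [int(v) for v in line.split(",")]
    print("GPU %d: %d MiB used, %d MiB free" % (idx, used, free))

If the GPUs already show a lot of used memory at this point, the problem is likely leftover memory from a previous stage rather than the model of this stage itself.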

Zehaos commented 6 years ago

Hi @LeonJWH, have you tried resuming the training from this step?

wenhe-jia commented 6 years ago

@Zehaos I took your advice and tried resuming training from this step, and it is going well so far.
Did this problem also occur during your training? What causes it?

zpp13 commented 6 years ago

Hi @LeonJWH, I encountered this problem during my training. How did you resume training from this step?

chenmyzju commented 6 years ago

Hi @LeonJWH, I encountered this problem during my training. How did you resume training from this step?

chenmyzju commented 6 years ago

Hi @Zehaos, I've met the same error. I tried reducing BATCH_ROIS from 128 to 64, but I still get the error.

wenhe-jia commented 6 years ago

@zpp13 @chenmyzju I just commented out the code for training RPN1 and ran bash scripts/train_alternate.sh to resume training from the RPN detection generation step. You should also kill the leftover processes on your GPUs; sometimes the GPU memory is not released after the RPN training process finishes. A sketch for finding those processes is below.
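To find the processes still holding GPU memory, one option is to list the compute apps that nvidia-smi reports and kill the stale python ones by PID. This is only an illustrative helper, not part of mx-maskrcnn:

# List processes currently holding GPU memory so stale ones can be
# identified and killed manually (e.g. kill -9 <pid>).
# Illustrative helper, not part of mx-maskrcnn.
import subprocess

out = subprocess.check_output([
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]).decode()

for line in out.strip().splitlines():
    pid, name, used = [v.strip() for v in line.split(",")]
    print("pid %s (%s) is holding %s MiB" % (pid, name, used))

Once those processes are gone, the resumed run should have the full GPU memory available again.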

kaiyuyue commented 6 years ago

Have you guys checked whether there is another, duplicated configuration being set outside of config.py?

wenhe-jia commented 6 years ago

@KaiyuYue Yes, setting a small BATCH_ROIS can reduce GPU usage when training the RCNN, but it also leads to lower performance in the end. See the repo https://github.com/LeonJWH/mx-maskrcnn.
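For reference, the knob being discussed is TRAIN.BATCH_ROIS in rcnn/config.py. A minimal sketch of the trade-off, assuming an easydict-style config as used in similar MXNet RCNN code (the exact default and surrounding code in this repo may differ):

# Sketch of the setting discussed above; values are only for illustration.
from easydict import EasyDict as edict

config = edict()
config.TRAIN = edict()
# Fewer sampled ROIs per image -> less GPU memory during RCNN training,
# but commenters in this thread report lower final accuracy with very
# small values (e.g. 8).
config.TRAIN.BATCH_ROIS = 64   # e.g. down from the 128 used above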

zhuaa commented 6 years ago

I encountered the "out of memory" problem during "# TRAIN RCNN WITH IMAGENET INIT AND RPN DETECTION". The error message is:

DeprecationWarning: Numeric-style type codes are deprecated and will result in an error in the future.
  label.append(labels[self.label.index('rcnn_label_stride%s' % s)].asnumpy().reshape((-1,)).astype('Int32'))
Traceback (most recent call last):
  File "train_alternate_mask_fpn.py", line 163, in <module>
    main()
  File "train_alternate_mask_fpn.py", line 160, in main
    args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
  File "train_alternate_mask_fpn.py", line 93, in alternate_train
    train_shared=False, lr=rcnn_lr, lr_step=rcnn_lr_step, proposal='rpn', maskrcnn_stage='rcnn1')
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/tools/train_maskrcnn.py", line 208, in train_maskrcnn
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "./incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/core/module.py", line 210, in update_metric
    self._curr_module.update_metric(eval_metric, labels)
  File "./incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
    self._exec_group.update_metric(eval_metric, labels)
  File "./incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
    eval_metric.update_dict(labels, preds)
  File "./incubator-mxnet/python/mxnet/metric.py", line 304, in update_dict
    metric.update_dict(labels, preds)
  File "./incubator-mxnet/python/mxnet/metric.py", line 132, in update_dict
    self.update(label, pred)
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/core/metric.py", line 73, in update
    pred_label = pred.asnumpy().reshape(-1, last_dim).argmax(axis=1).astype('int32')
  File "./incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
    ctypes.c_size_t(data.size)))
  File "./incubator-mxnet/python/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:47:51] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

I changed BATCH_ROIS from 128 to 32, but it was useless. Does anybody know how to deal with it?

zhuaa commented 6 years ago

Solved the problem by killing some "stopped" python processes and changing the ROI setting to a smaller value.

zzw1123 commented 5 years ago

@zhuaa What ROI size did you use after modifying it? I have also met this problem, and changing BATCH_ROIS to 64 does not solve it.

thomasyue commented 5 years ago

@zzw1123 Did you solve the problem? I changed TRAIN.BATCH_ROIS to 8 and it still didn't work.