dmlc / gluon-cv

Gluon CV Toolkit
http://gluon-cv.mxnet.io
Apache License 2.0

Instance Segmentation: multi-GPU data and mask data end up in different GPU memories, and sharding fails with an odd number of GPUs #1575

Closed. danirisdiandita closed this issue 3 years ago

danirisdiandita commented 3 years ago

I get an out-of-memory error with 1 GPU when running:

CUDNN_AUTOTUNE_DEFAULT=0 MXNET_GPU_MEM_POOL_TYPE=Round MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=32 python3 train_mask_rcnn.py --gpus 0 --dataset coco --network resnet101_v1d --epochs 26 --lr-decay-epoch 17,23 --val-interval 2 --use-fpn

The failure is shown below:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [15:00:54] /home/centos/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:217: Check failed: e == cudaSuccess: CUDA: out of memory
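
(A hedged aside on the single-GPU failure: the Namespace logged in the 2-GPU run below shows the script default batch_size=8, so with --gpus 0 all 8 images land on a single card. Lowering the batch size, e.g. --batch-size 2, assuming that is indeed the flag behind the batch_size entry, may avoid this out-of-memory error. The multi-GPU failures below are the real subject of this report.)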

I get an error with 2 GPUs when running:

CUDNN_AUTOTUNE_DEFAULT=0 MXNET_GPU_MEM_POOL_TYPE=Round MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=32 python3 train_mask_rcnn.py --gpus 0,1 --dataset coco --network resnet101_v1d --epochs 26 --lr-decay-epoch 17,23 --val-interval 2 --use-fpn
[14:40:50] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[14:40:56] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
loading annotations into memory...
Done (t=25.54s)
creating index...
index created!
loading annotations into memory...
Done (t=0.80s)
creating index...
index created!
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py:1512: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
    data: None
  input_sym_arg_type = in_param.infer_type()[0]
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator0_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator1_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator2_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator3_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator4_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
INFO:root:Namespace(amp=False, batch_size=8, clip_gradient=-1.0, custom_model=None, dataset='coco', disable_hybridization=False, epochs=26, executor_threads=1, gpus='0,1', horovod=False, kv_store='nccl', log_interval=100, lr=0.01, lr_decay=0.1, lr_decay_epoch='17,23', lr_warmup=1000.0, lr_warmup_factor=0.3333333333333333, momentum=0.9, network='resnet101_v1d', norm_layer=None, num_workers=4, rcnn_smoothl1_rho=1.0, resume='', rpn_smoothl1_rho=0.1111111111111111, save_interval=1, save_prefix='mask_rcnn_fpn_resnet101_v1d_coco', seed=233, start_epoch=0, static_alloc=False, use_ext=False, use_fpn=True, val_interval=2, verbose=False, wd=0.0001)
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0 Iteration 0] Set learning rate to 0.003333333333333333
[14:42:52] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
[14:42:53] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/gluoncv/utils/parallel.py", line 105, in _worker
    out = parallel.forward_backward(x)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/mask_rcnn/data_parallel.py", line 48, in forward_backward
    cls_targets, box_targets, box_masks, indices = self.net(data, gt_box, gt_label)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 682, in __call__
    out = self.forward(*args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1244, in forward
    return self._call_cached_op(x, *args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1028, in _call_cached_op
    out = self._cached_op(*cargs)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/_ctypes/ndarray.py", line 148, in __call__
    check_call(_LIB.MXInvokeCachedOpEx(
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/imperative/cached_op.cc", line 777
MXNetError: Check failed: inputs[i]->ctx() == default_ctx (gpu(0) vs. gpu(1)) : CachedOp requires all inputs to live on the same context. But data0 is on gpu(1) while maskrcnn0_normalizedperclassboxcenterencoder0_means is on gpu(0)
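
For readers hitting the same context mismatch, here is a minimal, hedged sketch of the invariant the error enforces and of how to inspect it: every parameter and constant the forward pass touches must live on the same GPU as the data shard. The `net`, `batch`, and `ctx_list` names are placeholders (a toy Dense block, not the gluon-cv MaskRCNN), the calls are standard MXNet 1.x Gluon APIs, and this is not a confirmed fix for the regression reported here.

```python
import mxnet as mx
from mxnet import gluon

# Illustrative stand-ins, not the gluon-cv model or training batch:
ctx_list = [mx.gpu(0), mx.gpu(1)]
net = gluon.nn.Dense(4, in_units=16)
net.initialize(ctx=ctx_list)          # parameters get a copy on every GPU
batch = mx.nd.zeros((8, 16))

# One quick way to debug the mismatch above: print where each parameter
# (including constants) actually lives and check that every training
# context appears in the list.
for name, param in net.collect_params().items():
    print(name, param.list_ctx())

# If a constant only lives on gpu(0), copying the whole ParameterDict to all
# training contexts is a possible, unconfirmed workaround:
net.collect_params().reset_ctx(ctx_list)

# Each shard should stay on its own GPU together with the inputs it feeds.
shards = gluon.utils.split_and_load(batch, ctx_list=ctx_list)
print([x.context for x in shards])    # [gpu(0), gpu(1)]
```

Whether the current training script is missing such a replication step for maskrcnn0_normalizedperclassboxcenterencoder0_means is exactly what this issue is about; the sketch only illustrates the constraint.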

and I get an error with 3 GPUs when running

CUDNN_AUTOTUNE_DEFAULT=0 MXNET_GPU_MEM_POOL_TYPE=Round MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=32 python3 train_mask_rcnn.py --gpus 0,1,2 --dataset coco --network resnet101_v1d --epochs 26 --lr-decay-epoch 17,23 --val-interval 2 --use-fpn 

shown below:

[15:05:44] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[15:05:49] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[15:05:54] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
loading annotations into memory...
Done (t=25.47s)
creating index...
index created!
loading annotations into memory...
Done (t=0.81s)
creating index...
index created!
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py:1512: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
    data: None
  input_sym_arg_type = in_param.infer_type()[0]
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator0_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator1_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator2_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator3_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py:707: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator4_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
  warnings.warn('Constant parameter "{}" does not support '
INFO:root:Namespace(amp=False, batch_size=8, clip_gradient=-1.0, custom_model=None, dataset='coco', disable_hybridization=False, epochs=26, executor_threads=1, gpus='0,1,2', horovod=False, kv_store='nccl', log_interval=100, lr=0.01, lr_decay=0.1, lr_decay_epoch='17,23', lr_warmup=1000.0, lr_warmup_factor=0.3333333333333333, momentum=0.9, network='resnet101_v1d', norm_layer=None, num_workers=4, rcnn_smoothl1_rho=1.0, resume='', rpn_smoothl1_rho=0.1111111111111111, save_interval=1, save_prefix='mask_rcnn_fpn_resnet101_v1d_coco', seed=233, start_epoch=0, static_alloc=False, use_ext=False, use_fpn=True, val_interval=2, verbose=False, wd=0.0001)
INFO:root:Start training from [Epoch 0]
WARNING:root:Batch size cannot be evenly split. Trying to shard 8 items into 3 shards
(the line above is repeated 12 times in the log)
INFO:root:[Epoch 0 Iteration 0] Set learning rate to 0.003333333333333333
infer_shape error. Arguments:
  data0: (3, 3, 800, 1067)
  data1: (3, 5, 4)
  data2: (3, 5, 1)
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 964, in _build_cache
    i.data()
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py", line 571, in data
    return self._check_and_get(self._data, ctx)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/parameter.py", line 230, in _check_and_get
    raise DeferredInitializationError(
mxnet.gluon.parameter.DeferredInitializationError: Parameter 'P2_conv_lat_weight' has not been initialized yet because initialization was deferred. Actual initialization happens during the first forward pass. Please pass one batch of data through the network before accessing Parameters. You can also avoid deferred initialization by specifying in_units, num_features, etc., for network layers.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 987, in _deferred_infer_shape
    self.infer_shape(*args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1167, in infer_shape
    self._infer_attrs('infer_shape', 'shape', *args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1155, in _infer_attrs
    arg_attrs, _, aux_attrs = getattr(out, infer_fn)(
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/symbol/symbol.py", line 1101, in infer_shape
    res = self._infer_shape_impl(False, *args, **kwargs)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/symbol/symbol.py", line 1250, in _infer_shape_impl
    check_call(infer_func(
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: MXNetError: Error in operator maskrcnn0_multiclassencoder0_pick0: Shape inconsistent, Provided = [2,512], inferred shape=[3,512]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_mask_rcnn.py", line 733, in <module>
    train(net, train_data, val_data, eval_metric, batch_size, ctx, logger, args)
  File "train_mask_rcnn.py", line 572, in train
    executor.put(data)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/gluoncv/utils/parallel.py", line 119, in put
    out = self._parallizable.forward_backward(x)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/mask_rcnn/data_parallel.py", line 48, in forward_backward
    cls_targets, box_targets, box_masks, indices = self.net(data, gt_box, gt_label)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 682, in __call__
    out = self.forward(*args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1244, in forward
    return self._call_cached_op(x, *args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 995, in _call_cached_op
    self._build_cache(*args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 966, in _build_cache
    self._deferred_infer_shape(*args)
  File "/home/user/miniconda3/envs/gluon/lib/python3.8/site-packages/mxnet/gluon/block.py", line 991, in _deferred_infer_shape
    raise ValueError(error_msg)
ValueError: Deferred initialization failed because shape cannot be inferred. MXNetError: Error in operator maskrcnn0_multiclassencoder0_pick0: Shape inconsistent, Provided = [2,512], inferred shape=[3,512]
WARNING:root:Batch size cannot be evenly split. Trying to shard 8 items into 3 shards
(the line above is repeated 15 times in the log)
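
As an aside on the odd-GPU failure mode, here is a minimal sketch (plain Python, not the gluon-cv sharding code) of why 8 images across 3 GPUs break here: the shards come out as 3, 3 and 2 images, and the hybridized (cached) graph traced on a 3-image shard then fails shape inference on the 2-image one.

```python
def shard_sizes(num_items, num_shards):
    """Split num_items into num_shards as evenly as possible (sizes differ by at most 1)."""
    base, extra = divmod(num_items, num_shards)
    return [base + (1 if i < extra else 0) for i in range(num_shards)]

print(shard_sizes(8, 3))  # [3, 3, 2] -> per-GPU batches of unequal size
print(shard_sizes(8, 2))  # [4, 4]    -> even split, no per-shard shape change
print(shard_sizes(9, 3))  # [3, 3, 3] -> a batch size divisible by the GPU count
```

This is consistent with the Provided = [2,512] vs. inferred [3,512] shapes in the traceback, the first dimension presumably being the per-shard image count.
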
djaym7 commented 3 years ago

I am getting this error too in both FasterRCNN and MaskRCNN training: CachedOp requires all inputs to live on the same context. But data0 is on gpu(3) while maskrcnn0_normalizedperclassboxcenterencoder0_means is on gpu(2). Something is probably wrong with the parallelization introduced in the last few updates.

lgg commented 3 years ago

I got the same error:

Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/gluoncv/utils/parallel.py", line 105, in _worker
    out = parallel.forward_backward(x)
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/gluoncv/model_zoo/rcnn/mask_rcnn/data_parallel.py", line 48, in forward_backward
    cls_targets, box_targets, box_masks, indices = self.net(data, gt_box, gt_label)
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/venv/lib/python3.8/site-packages/mxnet/gluon/block.py", line 825, in __call__
    out = self.forward(*args)
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/venv/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1482, in forward
    return self._call_cached_op(x, *args)
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/venv/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1225, in _call_cached_op
    out = self._cached_op(*cargs)
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/venv/lib/python3.8/site-packages/mxnet/_ctypes/ndarray.py", line 148, in __call__
    check_call(_LIB.MXInvokeCachedOpEx(
  File "/home/user/mxnet1.8/xray_gluon_mxnet1.9.0/venv/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../src/imperative/cached_op.cc", line 777
MXNetError: Check failed: inputs[i]->ctx() == default_ctx (gpu(0) vs. gpu(1)) : CachedOp requires all inputs to live on the same context. But data0 is on gpu(1) while maskrcnn0_normalizedperclassboxcenterencoder0_means is on gpu(0)

MXNet was built from source from the v1.x branch (1.9.0).

djaym7 commented 3 years ago

@zhreshold

Zha0q1 commented 3 years ago

I got the same error too with the latest script and the MXNet 1.9 nightly wheel. By the way, the WARNING:root:Batch size cannot be evenly split. Trying to shard 8 items into 3 shards warning might be resolved by setting the batch size to a multiple of the number of GPUs.
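
Expanding on that suggestion with a hedged example: assuming the flag behind the batch_size entry in the logged Namespace is --batch-size (not verified here), picking a batch size divisible by the GPU count keeps every shard the same size, e.g. for the 3-GPU run:

CUDNN_AUTOTUNE_DEFAULT=0 MXNET_GPU_MEM_POOL_TYPE=Round MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=32 python3 train_mask_rcnn.py --gpus 0,1,2 --batch-size 9 --dataset coco --network resnet101_v1d --epochs 26 --lr-decay-epoch 17,23 --val-interval 2 --use-fpn

This would only address the uneven-shard warning and the shape-inference error; the CachedOp context mismatch seen with 2 GPUs is a separate problem.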

lgg commented 3 years ago

@zhreshold @szha please, can you take a look at this issue?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lgg commented 3 years ago

@zhreshold @szha By the way, I can still confirm this issue.

szha commented 3 years ago

@zhreshold would you like to reopen this issue?