File "src/ndarray/./../operator/tensor/.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h", line 246 MXNetError: unknown type for MKLDNN :2
If you look at this file: https://github.com/apache/incubator-mxnet/blob/64f737cdd59fe88d2c5b479f25d011c5156b6a8a/src/operator/nn/mkldnn/mkldnn_base-inl.h#L234-L247
You'll see it's running dtype checking and doesn't find a case in the switch for value 2. If you go look at what value 2 is: https://github.com/apache/incubator-mxnet/blob/64f737cdd59fe88d2c5b479f25d011c5156b6a8a/3rdparty/mshadow/mshadow/base.h#L351-L354
You'll see that it's kFloat16. So your script is somehow using FP16, and MKLDNN doesn't support that type.
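For reference, here is a minimal Python sketch (my own illustration, not the original training script; the tensors are made-up stand-ins for the anchor/bbox arrays in rpn_target.py) showing how to check which dtype actually reaches the failing box_iou call, and that casting back to float32 would keep FP16 away from the MKLDNN path:

```python
import mxnet as mx

# Hypothetical inputs standing in for the anchor/bbox tensors; shapes and
# values are made up for illustration only.
anchor = mx.nd.ones((1, 4), dtype='float16')
bbox = mx.nd.ones((1, 4), dtype='float32')

# Printing the operand dtypes just before the failing op is a quick way to
# confirm whether FP16 data is reaching the CPU/MKLDNN code path.
print(anchor.dtype, bbox.dtype)

# Workaround sketch (not the upstream fix): cast to float32 before box_iou
# so the MKLDNN-enabled CPU build never sees an FP16 input.
ious = mx.nd.contrib.box_iou(anchor.astype('float32'), bbox,
                             format='corner').asnumpy()
```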
Thank you @samskalicky. Mask RCNN does use AMP, and it casts the weights and gradients to FP16 here: https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py#L705
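For context, a hedged sketch of the usual Gluon AMP setup (the generic pattern, not the exact code at the linked line; the Dense block is just a placeholder):

```python
import mxnet as mx
from mxnet import gluon
from mxnet.contrib import amp

# amp.init() patches operators for mixed precision and must run before the
# network is constructed.
amp.init()

net = gluon.nn.Dense(10)          # stand-in for the Mask R-CNN network
net.initialize(ctx=mx.cpu())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

# Enables dynamic loss scaling on the trainer (the "decreasing loss scale"
# messages in the log below come from this machinery).
amp.init_trainer(trainer)

# Some scripts additionally cast the whole network to FP16, which is the
# kind of cast the linked line performs.
# net.cast('float16')
```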
I also tried running the Mask RCNN script on a single node using mxnet-cu101mkl==1.6.0.post0 with gluon-cv==0.8.0, and I was able to run it successfully without any issue. Below is the output from the run:
python gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py --gpus 0,1,2,3,4,5,6,7 --num-workers 4 --amp --lr-decay-epoch 8,10 --epochs 6 --log-interval 10 --val-interval 12 --batch-size 8 --use-fpn --lr 0.01 --lr-warmup-factor 0.001 --lr-warmup 1600 --static-alloc --clip-gradient 1.5 --use-ext --seed 987
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py:1389: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
data: None
input_sym_arg_type = in_param.infer_type()[0]
[23:15:28] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:31] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:34] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:36] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:39] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:41] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:44] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:46] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
loading annotations into memory...
Done (t=14.20s)
creating index...
index created!
loading annotations into memory...
Done (t=0.39s)
creating index...
index created!
creating index...
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator0_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator1_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator2_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator3_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator4_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
INFO:root:Namespace(amp=True, batch_size=8, clip_gradient=1.5, custom_model=None, dataset='coco', disable_hybridization=False, epochs=6, executor_threads=1, gpus='0,1,2,3,4,5,6,7', horovod=False, kv_store='device', log_interval=10, lr=0.01, lr_decay=0.1, lr_decay_epoch='8,10', lr_warmup='1600', lr_warmup_factor=0.001, momentum=0.9, network='resnet50_v1b', norm_layer=None, num_workers=4, rcnn_smoothl1_rho=1.0, resume='', rpn_smoothl1_rho=0.1111111111111111, save_interval=1, save_prefix='mask_rcnn_fpn_resnet50_v1b_coco', seed=987, start_epoch=0, static_alloc=True, train_datapath='/scratch/data/mask_rcnn/mxnet/', use_ext=True, use_fpn=True, val_datapath='/scratch/data/mask_rcnn/mxnet/', val_interval=12, verbose=False, wd=0.0001)
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0 Iteration 0] Set learning rate to 1e-05
[23:16:40] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:40] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:41] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:42] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:42] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:43] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:43] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:44] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
[23:16:47] src/kvstore/././comm.h:744: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[23:16:47] src/kvstore/././comm.h:753: .vvvv...
[23:16:47] src/kvstore/././comm.h:753: v.vv.v..
[23:16:47] src/kvstore/././comm.h:753: vv.v..v.
[23:16:47] src/kvstore/././comm.h:753: vvv....v
[23:16:47] src/kvstore/././comm.h:753: v....vvv
[23:16:47] src/kvstore/././comm.h:753: .v..v.vv
[23:16:47] src/kvstore/././comm.h:753: ..v.vv.v
[23:16:47] src/kvstore/././comm.h:753: ...vvvv.
INFO:root:AMP: decreasing loss scale to 32768.000000
INFO:root:AMP: decreasing loss scale to 16384.000000
INFO:root:AMP: decreasing loss scale to 8192.000000
INFO:root:AMP: decreasing loss scale to 4096.000000
INFO:root:AMP: decreasing loss scale to 2048.000000
INFO:root:AMP: decreasing loss scale to 1024.000000
INFO:root:[Epoch 0][Batch 9], Speed: 4.880 samples/sec, RPN_Conf=0.606,RPN_SmoothL1=0.156,RCNN_CrossEntropy=4.487,RCNN_SmoothL1=0.021,RCNN_Mask=1.882,RPNAcc=0.751,RPNL1Loss=1.384,RCNNAcc=0.004,RCNNL1Loss=0.947,MaskAcc=0.518,MaskFGAcc=0.522
INFO:root:[Epoch 0 Iteration 10] Set learning rate to 7.24375e-05
INFO:root:[Epoch 0][Batch 19], Speed: 15.230 samples/sec, RPN_Conf=0.577,RPN_SmoothL1=0.151,RCNN_CrossEntropy=4.018,RCNN_SmoothL1=0.019,RCNN_Mask=1.803,RPNAcc=0.797,RPNL1Loss=1.402,RCNNAcc=0.314,RCNNL1Loss=0.879,MaskAcc=0.513,MaskFGAcc=0.523
INFO:root:[Epoch 0 Iteration 20] Set learning rate to 0.000134875
INFO:root:[Epoch 0][Batch 29], Speed: 11.308 samples/sec, RPN_Conf=0.526,RPN_SmoothL1=0.143,RCNN_CrossEntropy=3.149,RCNN_SmoothL1=0.020,RCNN_Mask=1.635,RPNAcc=0.828,RPNL1Loss=1.323,RCNNAcc=0.535,RCNNL1Loss=0.927,MaskAcc=0.514,MaskFGAcc=0.525
INFO:root:[Epoch 0 Iteration 30] Set learning rate to 0.0001973125
INFO:root:[Epoch 0][Batch 39], Speed: 16.033 samples/sec, RPN_Conf=0.479,RPN_SmoothL1=0.139,RCNN_CrossEntropy=2.477,RCNN_SmoothL1=0.022,RCNN_Mask=1.504,RPNAcc=0.842,RPNL1Loss=1.277,RCNNAcc=0.645,RCNNL1Loss=0.998,MaskAcc=0.514,MaskFGAcc=0.527
INFO:root:[Epoch 0 Iteration 40] Set learning rate to 0.00025975
INFO:root:[Epoch 0][Batch 49], Speed: 13.002 samples/sec, RPN_Conf=0.435,RPN_SmoothL1=0.129,RCNN_CrossEntropy=2.047,RCNN_SmoothL1=0.026,RCNN_Mask=1.390,RPNAcc=0.856,RPNL1Loss=1.229,RCNNAcc=0.711,RCNNL1Loss=1.112,MaskAcc=0.518,MaskFGAcc=0.529
INFO:root:[Epoch 0 Iteration 50] Set learning rate to 0.0003221875
INFO:root:[Epoch 0][Batch 59], Speed: 13.902 samples/sec, RPN_Conf=0.423,RPN_SmoothL1=0.127,RCNN_CrossEntropy=1.780,RCNN_SmoothL1=0.032,RCNN_Mask=1.303,RPNAcc=0.858,RPNL1Loss=1.156,RCNNAcc=0.753,RCNNL1Loss=1.227,MaskAcc=0.516,MaskFGAcc=0.532
INFO:root:[Epoch 0 Iteration 60] Set learning rate to 0.000384625
INFO:root:[Epoch 0][Batch 69], Speed: 13.738 samples/sec, RPN_Conf=0.402,RPN_SmoothL1=0.122,RCNN_CrossEntropy=1.582,RCNN_SmoothL1=0.039,RCNN_Mask=1.230,RPNAcc=0.862,RPNL1Loss=1.104,RCNNAcc=0.782,RCNNL1Loss=1.369,MaskAcc=0.517,MaskFGAcc=0.534
INFO:root:[Epoch 0 Iteration 70] Set learning rate to 0.0004470625
INFO:root:[Epoch 0][Batch 79], Speed: 12.123 samples/sec, RPN_Conf=0.385,RPN_SmoothL1=0.116,RCNN_CrossEntropy=1.440,RCNN_SmoothL1=0.048,RCNN_Mask=1.172,RPNAcc=0.865,RPNL1Loss=1.055,RCNNAcc=0.802,RCNNL1Loss=1.537,MaskAcc=0.517,MaskFGAcc=0.536
However, running it with mxnet-cu101==1.7.0 and gluoncv==0.8.0 fails with:
Traceback (most recent call last):
File "/shared/mx_oob_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataloader.py", line 429, in _worker_fn
batch = batchify_fn([_worker_dataset[i] for i in samples])
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataloader.py", line 429, in <listcomp>
batch = batchify_fn([_worker_dataset[i] for i in samples])
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataset.py", line 219, in __getitem__
return self._fn(*item)
File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/data/transforms/presets/rcnn.py", line 407, in __call__
cls_target, box_target, box_mask = self._target_generator(
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 682, in __call__
out = self.forward(*args)
File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/rpn/rpn_target.py", line 157, in forward
ious = mx.nd.contrib.box_iou(anchor, bbox, format='corner').asnumpy()
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/ndarray/ndarray.py", line 2563, in asnumpy
check_call(_LIB.MXNDArraySyncCopyToCPU(
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "src/ndarray/./../operator/tensor/.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h", line 246
MXNetError: unknown type for MKLDNN :2
@bgawrych @bartekkuncer @grygielski FYI, it looks like something changed from 1.6.0 to 1.7.0 that is causing this issue when running on CPU with MKLDNN.
I would suspect that the merging of mkldnn as default caused some issue in the contrib operators.
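One way to confirm that suspicion (a diagnostic sketch, assuming the MXNET_MKLDNN_ENABLED environment variable behaves as documented for MXNet 1.x) is to disable MKLDNN execution and check whether the error disappears:

```python
import os

# Disable MKLDNN execution paths; the variable must be set before mxnet is
# imported for it to take effect.
os.environ['MXNET_MKLDNN_ENABLED'] = '0'

import mxnet as mx
from mxnet.runtime import Features

# The build still reports MKLDNN support; the variable above only turns off
# dispatching to MKLDNN kernels at runtime.
print(Features().is_enabled('MKLDNN'))
```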
Update: Commenting out this line of code (https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py#L705-L710) seems to work with Horovod v0.21.0, mxnet-cu101==1.7.0, and gluoncv==0.8.0. However, running the same script without Horovod fails with a different issue, shown below:
INFO:root:[Epoch 0 Iteration 0] Set learning rate to 1e-05
[00:26:11] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
[00:26:12] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
Exception in thread Thread-7:
Traceback (most recent call last):
File "/shared/mx_oob_env/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/shared/mx_oob_env/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/utils/parallel.py", line 105, in _worker
out = parallel.forward_backward(x)
File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/mask_rcnn/data_parallel.py", line 48, in forward_backward
cls_targets, box_targets, box_masks, indices = self.net(data, gt_box, gt_label)
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 747, in __call__
out = self.forward(*args)
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1309, in forward
return self._call_cached_op(x, *args)
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1093, in _call_cached_op
out = self._cached_op(*cargs)
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/_ctypes/ndarray.py", line 148, in __call__
check_call(_LIB.MXInvokeCachedOpEx(
File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "src/imperative/cached_op.cc", line 777
MXNetError: Check failed: inputs[i]->ctx() == default_ctx (gpu(0) vs. gpu(1)) : CachedOp requires all inputs to live on the same context. But data0 is on gpu(1) while maskrcnn0_normalizedperclassboxcenterencoder0_means is on gpu(0)
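For what it's worth, a minimal sketch of what that check expects (generic Gluon usage, not the gluon-cv parallel code): the parameters of a hybridized block have to live on the same context as the data passed in, so they are normally initialized (or reset) on every training GPU.

```python
import mxnet as mx
from mxnet import gluon

ctx_list = [mx.gpu(0), mx.gpu(1)]   # assumption: two visible GPUs

net = gluon.nn.Dense(10)            # stand-in for the Mask R-CNN block
net.initialize(ctx=ctx_list)        # parameters replicated on both GPUs
net.hybridize(static_alloc=True)

for ctx in ctx_list:
    x = mx.nd.ones((2, 4), ctx=ctx)
    y = net(x)                      # data and parameters share a context,
    y.wait_to_read()                # so CachedOp's check passes
```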
I will try to analyze the issue.
@anko-intel thanks for the fix!
Description
MXNetError: unknown type for MKLDNN :2 issue using mxnet-cu101==1.7.0
Error Message
GluonCV: v0.9.0 Horovod: v0.21.0
To Reproduce
Without Horovod
Full log: https://gist.github.com/karan6181/efa4ad8f61c3e21cbee9c55fea98b2f0
With Horovod
Environment
We recommend using our script for collecting the diagnostic information with the following command:
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3