apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars 6.79k forks source link

ndarray.cc:640 Check failed: !is_view #17890

Closed caishanli closed 4 years ago

caishanli commented 4 years ago

Description

I install anaconda and create two venvs: mxnet-cu102 and mxnet-cu102mkl. with mxnet-cu102mkl, code in reproduce fail, but ok in mxnet-cu102

Error Message

Traceback (most recent call last):
  File "test_mkl.py", line 34, in <module>
    mx.nd.waitall()
  File "/home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 200, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [13:47:16] src/ndarray/ndarray.cc:640: Check failed: !is_view: 
Stack trace:
  [bt] (0) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6d554b) [0x7fd69a35c54b]
  [bt] (1) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::GetMKLDNNData() const+0x13c) [0x7fd69d7565cc]
  [bt] (2) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::op::MKLDNNTransposeForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, mxnet::NDArray const&, mxnet::OpReqType const&, mxnet::NDArray const&)+0x408) [0x7fd69a411348]
  [bt] (3) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x36232d4) [0x7fd69d2aa2d4]
  [bt] (4) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x3c4) [0x7fd69d5fdf74]
  [bt] (5) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3896084) [0x7fd69d51d084]
  [bt] (6) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a4091) [0x7fd69d52b091]
  [bt] (7) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a74e4) [0x7fd69d52e4e4]
  [bt] (8) /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a27e4) [0x7fd69d5297e4]

To Reproduce

The test code is a mini version from gluoncv's train_yolo_v3.py, as below:

import warnings
import mxnet as mx
from mxnet import gluon
import gluoncv as gcv
gcv.utils.check_version('0.6.0')
from gluoncv import data as gdata
from gluoncv.data.transforms.presets.yolo import YOLO3DefaultTrainTransform
from gluoncv.model_zoo.yolo import get_yolov3
from gluoncv.model_zoo.mobilenet import get_mobilenet

classes = [str(i) for i in range(67)]

ctx = mx.cpu()
batch_size = 2
width, height = 320, 320

base_net = get_mobilenet(multiplier=1, pretrained=True)
stages = [base_net.features[:33], base_net.features[33:69], base_net.features[69:-2]]
anchors = [[10, 13, 16, 30, 33, 23],
            [30, 61, 62, 45, 59, 119],
            [116, 90, 156, 198, 373, 326]]
strides = [8, 16, 32]
net = get_yolov3('mobilenet1.0', stages, [512, 256, 128], anchors, strides, classes, 'voc', pretrained=False)
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    net.initialize(ctx=ctx)`

train_dataset = gdata.VOCDetection()
train_loader = gluon.data.DataLoader(
    train_dataset.transform(YOLO3DefaultTrainTransform(width, height, net)), batch_size, True)
for epoch in range(10):
    mx.nd.waitall()
    for i, data in enumerate(train_loader):
        print('ok')

Steps to reproduce

conda venv: mxnet-cu102

1.set classes to 67 ok 2.set classes to 65, 66, 68, 69 ok

conda venv: mxnet-cu102mkl

1.set classes to 67 fail 2.set classes to 65, 66, 68, 69 ok

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.7.6
Compiler     : GCC 7.3.0
Build        : ('default', 'Jan  8 2020 19:59:22')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/caishanli/.conda/envs/mxnet_mkl/lib/python3.7/site-packages/mxnet
Num GPUs     : 2
Commit Hash   : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
----------System Info----------
Platform     : Linux-5.5.9-arch1-2-x86_64-with-arch-Arch-Linux
system       : Linux
node         : archlinux
release      : 5.5.9-arch1-2
version      : #1 SMP PREEMPT Thu, 12 Mar 2020 23:01:33 +0000
----------Hardware Info----------
machine      : x86_64
processor    : 
架构:                           x86_64
CPU 运行模式:                   32-bit, 64-bit
字节序:                         Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU:                             8
在线 CPU 列表:                  0-7
每个核的线程数:                 2
每个座的核数:                   4
座:                             1
NUMA 节点:                      1
厂商 ID:                        GenuineIntel
CPU 系列:                       6
型号:                           60
型号名称:                       Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
步进:                           3
CPU MHz:                        3479.679
CPU 最大 MHz:                   3900.0000
CPU 最小 MHz:                   800.0000
BogoMIPS:                       6787.05
虚拟化:                         VT-x
L1d 缓存:                       128 KiB
L1i 缓存:                       128 KiB
L2 缓存:                        1 MiB
L3 缓存:                        8 MiB
NUMA 节点0 CPU:                 0-7
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
标记:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
                                  syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf p
                                 ni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
                                  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shado
                                 w vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln p
                                 ts md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0069 sec, LOAD: 4.6050 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0013 sec, LOAD: 2.4630 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0437 sec, LOAD: 2.5802 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0399 sec, LOAD: 1.0459 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0392 sec, LOAD: 0.9735 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.3429 sec, LOAD: 2.1664 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.5987 sec, LOAD: 7.1233 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0376 sec, LOAD: 1.3676 sec.
TaoLv commented 4 years ago

@caishanli Could you please try the latest master branch to see if the error is still there? Thanks.

caishanli commented 4 years ago

@caishanli Could you please try the latest master branch to see if the error is still there? Thanks.

do you mean clone master and build with mkl and test?

TaoLv commented 4 years ago

Yes. You can also install the nightly build from here: https://repo.mxnet.io/dist/index.html. I think the cpu variant should have MKL-DNN enabled.

caishanli commented 4 years ago

I find the lastest linux build in https://dist.mxnet.io/python/mkl is mxnet_mkl-1.6.0b20200215-py2.py3-none-manylinux1_x86_64.whl is it ok?

TaoLv commented 4 years ago

We don't release mxnet-mkl nightly build anymore. MKL-DNN has been enabled in the linux cpu build. You can try the latest one by:

pip install --pre mxnet -f https://dist.mxnet.io/python/cpu
szha commented 4 years ago

We don't release mxnet-mkl nightly build anymore

I'm guessing this is also the culprit for mxnet-cu102mkl. @caishanli could you confirm the versions for your mxnet-cu102 and mxnet-cu102mkl setup?

caishanli commented 4 years ago

We don't release mxnet-mkl nightly build anymore. MKL-DNN has been enabled in the linux cpu build. You can try the latest one by:

pip install --pre mxnet -f https://dist.mxnet.io/python/cpu

I tested with it and the result is ok

caishanli commented 4 years ago

We don't release mxnet-mkl nightly build anymore

I'm guessing this is also the culprit for mxnet-cu102mkl. @caishanli could you confirm the versions for your mxnet-cu102 and mxnet-cu102mkl setup?

mxnet-cu102 1.6.0 mxnet-cu102mkl 1.6.0

TaoLv commented 4 years ago

I tested with it and the result is ok

Nice! Thank you for confirming that. Hope a nightly build is okay for your work.

djaym7 commented 4 years ago

Issue still exist on cu101 1.6.0 mkl