apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars 6.79k forks source link

MXNet serialization format depends on endianness #18667

Closed Masquerade0097 closed 3 years ago

Masquerade0097 commented 4 years ago

Description

We are working on extending MXNet support to IBM Z (s390x architecture) machines. We were able to successfully install MXNet on s390x but we encountered the following error while trying to run it.

Error Message

Traceback (most recent call last):
  File "scr.py", line 76, in <module>
    finetune_net = resnet50_v2(pretrained=True, ctx=ctx)
  File "/root/mxnet-1.5.0/mxnet/python/mxnet/gluon/model_zoo/vision/resnet.py", line 512, in resnet50_v2
    return get_resnet(2, 50, **kwargs)
  File "/root/mxnet-1.5.0/mxnet/python/mxnet/gluon/model_zoo/vision/resnet.py", line 391, in get_resnet
    root=root), ctx=ctx)
  File "/root/mxnet-1.5.0/mxnet/python/mxnet/gluon/block.py", line 384, in load_parameters
    loaded = ndarray.load(filename)
  File "/root/mxnet-1.5.0/mxnet/python/mxnet/ndarray/utils.py", line 175, in load
    ctypes.byref(names)))
  File "/root/mxnet-1.5.0/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:50:04] include/mxnet/./tuple.h:354: Check failed: ndim >= -1 (-906324999 vs. -1) : ndim cannot be less than -1, received -906324999
Stack trace:
  [bt] (0) /opt/mxnet/lib/libmxnet.so(mxnet::Tuple<long>::SetDim(int)+0x670) [0x3ff841f6888]
  [bt] (1) /opt/mxnet/lib/libmxnet.so(mxnet::TShape::TShape(int, long)+0x2a) [0x3ff841f6d12]
  [bt] (2) /opt/mxnet/lib/libmxnet.so(mxnet::LegacyTShapeLoad(dmlc::Stream*, mxnet::TShape*, unsigned int)+0x4e) [0x3ff8629f0be]
  [bt] (3) /opt/mxnet/lib/libmxnet.so(mxnet::NDArray::LegacyLoad(dmlc::Stream*, unsigned int)+0x68) [0x3ff862a4578]
  [bt] (4) /opt/mxnet/lib/libmxnet.so(mxnet::NDArray::Load(dmlc::Stream*)+0x12e) [0x3ff862a91de]
  [bt] (5) /opt/mxnet/lib/libmxnet.so(mxnet::NDArray::Load(dmlc::Stream*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0x8b8) [0x3ff862ac4d8]
  [bt] (6) /opt/mxnet/lib/libmxnet.so(MXNDArrayLoad+0xf4) [0x3ff8692eaa4]
  [bt] (7) /usr/lib/s390x-linux-gnu/libffi.so.6(ffi_call_SYSV+0x98) [0x3ff88385760]
  [bt] (8) /usr/lib/s390x-linux-gnu/libffi.so.6(ffi_call+0x72) [0x3ff88385e12]

To Reproduce

  1. Install MXNet from source with the flag USE_SSE=0.
  2. Follow the tutorial to test the library - https://mxnet.apache.org/api/python/docs/tutorials/getting-started/gluon_from_experiment_to_deployment.html

Environment

Using MXNet 1.5.1

----------Python Info----------
Version      : 3.6.9
Compiler     : GCC 8.4.0
Build        : ('default', 'Apr 18 2020 01:56:04')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /usr/lib/python3/dist-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /root/mxnet-1.5.0/mxnet/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.15.0-99-generic-s390x-with-Ubuntu-18.04-bionic
system       : Linux
node         : e0e720e9a63c
release      : 4.15.0-99-generic
version      : #100-Ubuntu SMP Wed Apr 22 20:31:47 UTC 2020
----------Hardware Info----------
machine      : s390x
processor    : s390x
Architecture:        s390x
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Big Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s) per book:  1
Book(s) per drawer:  1
Drawer(s):           4
NUMA node(s):        1
Vendor ID:           IBM/S390
Machine type:        3906
CPU dynamic MHz:     5208
CPU static MHz:      5208
BogoMIPS:            21881.00
Hypervisor:          z/VM 7.1.0
Hypervisor vendor:   IBM
Virtualization type: full
Dispatching mode:    horizontal
L1d cache:           128K
L1i cache:           128K
L2d cache:           4096K
L2i cache:           2048K
L3 cache:            131072K
L4 cache:            688128K
NUMA node0 CPU(s):   0-3
Flags:               esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs sie
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0044 sec, LOAD: 0.0622 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0037 sec, LOAD: 0.0612 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0039 sec, LOAD: 0.1054 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0040 sec, LOAD: 0.0172 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0037 sec, LOAD: 0.2370 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0038 sec, LOAD: 0.0423 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0038 sec, LOAD: 0.4736 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0038928985595703125 sec.
leezu commented 4 years ago

The Intel x86 and AMD64 / x86-64 series of processors use the little-endian format whereas s390x architecture uses big-endian format. Current MXNet parameter serialization format is just a memory dump and can't be loaded on a system with different endianness.

For MXNet 2, we're switching to using the numpy serialization format which will prevent such issues (https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)

For now, you can't use the pretrained models directly on s390x architecture but you'd need some workaround where you load the parameter on a x86 machine, call asnumpy, save via numpy and finally load the parameters on s390x machine using numpy and convert them to mxnet. Would that work for you?

Masquerade0097 commented 4 years ago

@leezu Thanks for your reply.

For now, you can't use the pretrained models directly on s390x architecture but you'd need some workaround where you load the parameter on a x86 machine, call asnumpy, save via numpy and finally load the parameters on s390x machine using numpy and convert them to mxnet. Would that work for you?

I'll give it a try.

leezu commented 4 years ago

Are you sure the build succeed? Above logs are only related to the build configuration and not the actual build