apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars 6.79k forks source link

Python 3.8 wheel appears to be slightly corrupted #18754

Closed davidhewitt closed 4 years ago

davidhewitt commented 4 years ago

Description

I'm attempting to debug a crash observed by a user of PyO3 (https://github.com/PyO3/pyo3/issues/1044) which occurs when mxnet is imported.

Attempting to use gdb (via rust-gdb wrapper script) suggests that mxnet.so is partially corrupted. readelf -a also emits some warnings. Both are pasted below.

Error Message

Errors seen from readelf -a path/to/libmxnet.so | grep -i warning:

readelf: Warning: Section 0 has an out of range sh_link value of 1179403647
readelf: Warning: Section 1 has an out of range sh_link value of 2381354254
readelf: Warning: Section 2 has an out of range sh_link value of 3825592164
readelf: Warning: Section 3 has an out of range sh_link value of 149717832
readelf: Warning: Section 3 has an out of range sh_info value of 3171257160
readelf: Warning: Section 4 has an out of range sh_link value of 2781517102
readelf: Warning: Section 4 has an out of range sh_info value of 3221398224
readelf: Warning: Section 5 has an out of range sh_link value of 222961676
readelf: Warning: Section 5 has an out of range sh_info value of 2840782096
readelf: Warning: Section 6 has an out of range sh_link value of 536469743
readelf: Warning: Section 7 has an out of range sh_link value of 548555395
readelf: Warning: Section 7 has an out of range sh_info value of 564797439
readelf: Warning: [ 0]: Unexpected value (65794) in info field.
readelf: Warning: Size of section 1 is larger than the entire file!
readelf: Warning: [ 3]: Expected link to another section in info fieldreadelf: Warning: [ 4]: Expected link to another section in info fieldreadelf: Warning: [
 5]: Expected link to another section in info fieldreadelf: Warning: Size of section 6 is larger than the entire file!
readelf: Warning: [ 7]: Expected link to another section in info fieldreadelf: Error: no .dynamic section in the dynamic segment

Errors seen from gdb:

BFD: warning: /home/david/dev/pyo3-scratch/.direnv/python-3.8.2/lib/python3.8/site-packages/mxnet/libmxnet.so has a corrupt section with a size (ff20fb37000000
08) larger than the file size
BFD: warning: /home/david/dev/pyo3-scratch/.direnv/python-3.8.2/lib/python3.8/site-packages/mxnet/libmxnet.so has a corrupt section with a size (ff20fb37000000
08) larger than the file size
Error while mapping shared library sections:
`/home/david/dev/pyo3-scratch/.direnv/python-3.8.2/lib/python3.8/site-packages/mxnet/libmxnet.so': not in executable format: file format not recognized

To Reproduce

Run readelf -a path/to/libmxnet.so | grep -i warning.

Alternatively request and I can write tutorial how to install & run the linked Rust code under rust-gdb.

Environment

I'm using Ubuntu 20.04 on WSL2. According to pip, mxnet was installed via the wheel mxnet-1.6.0-py2.py3-none-any.whl.

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.8.2
Compiler     : GCC 9.3.0
Build        : ('default', 'Apr 27 2020 15:53:34')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /home/david/dev/pyo3-scratch/.direnv/python-3.8.2/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/david/dev/pyo3-scratch/.direnv/python-3.8.2/lib/python3.8/site-packages/mxnet
Num GPUs     : 0
Commit Hash   : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
----------System Info----------
Platform     : Linux-4.19.104-microsoft-standard-x86_64-with-glibc2.29
system       : Linux
node         : david-laptop
release      : 4.19.104-microsoft-standard
version      : #1 SMP Wed Feb 19 06:37:35 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
Stepping:                        2
CPU MHz:                         2592.007
BogoMIPS:                        5184.01
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        12 MiB
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdt
                                 scp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f
                                 16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2
                                  erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0236 sec, LOAD: 0.7427 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0020 sec, LOAD: 0.6102 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0815 sec, LOAD: 0.5328 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0247 sec, LOAD: 0.0889 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0254 sec, LOAD: 0.2696 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0943 sec, LOAD: 0.7244 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0236 sec, LOAD: 0.5804 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.02408432960510254 sec.
leezu commented 4 years ago

@davidhewitt can you follow https://mxnet.apache.org/get_started/build_from_source and verify if the same warnings apply to the libmxnet.so generated when following the guide?

davidhewitt commented 4 years ago

When following the guide, I was able to build a libmxnet.so which did not have these warnings. Using Ubuntu 20.04, WSL2. I had to disable CUDA support.

leezu commented 4 years ago

Thank you for trying that @davidhewitt. The difference between the libmxnet.so in the pip wheel and the one obtained via the build_from_source guide is that in the pip wheel a number of dependencies are statically linked to enable portability across various linux distributions and versions thereof. It's built on a CentOS 7 host for compliance with https://www.python.org/dev/peps/pep-0599/

Downgrading to a slightly older version of mxnet seems to work:

$ pip install mxnet==1.6.0b20200127

The libmxnet.so file in 1.6.0 is very broken. Not sure why the python3.8 interpreter doesn't crash. It also calls Py_FinalizeEx.

https://github.com/PyO3/pyo3/issues/1044#issuecomment-660469745

@m-ou-se can you provide more details on why the file is broken? Specifically, I'm not sure if the issue at hand is really a build issue or rather a bug in the MXNet backend. Does your statement apply to the latest versions at https://dist.mxnet.io/python/cpu , for example https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200720-py2.py3-none-manylinux2014_x86_64.whl

leezu commented 4 years ago

cc @eric-haibin-lin as the issue may be related engine shutdown

Debugging leads us to this line: https://github.com/apache/incubator-mxnet/blob/a0e67353fe81ed97fc7aef2d8429a93dc035a394/src/c_api/c_api.cc#L1318.

Running import mxnet directly in python (and exiting the interpreter) exits normally.

Strangely, on Python 3.8 we get a corrupted double-linked list instead.

I realize that mxnet is absolutely huge, but thought you guys might be interested in knowing about a specific edge-case which causes problems for pyo3. Of course, if you guys are able to provide a bit of insight that'd be super helpful. :) Thank you!

Via https://github.com/PyO3/pyo3/issues/1044

leezu commented 4 years ago

The crash is fixed by https://github.com/apache/incubator-mxnet/pull/18768

I'm not sure about the readelf warnings, but currently I don't see any evidence for them being harmful (and we're just using cmake to statically link in a bunch of libraries into the libmxnet.so). So I'll close this issue. Please comment and reopen if any action needs to be taken from the build perspective.