apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Out of memory error is not propagated as an error to python environment #18760

Closed adamsvystun closed 3 years ago

adamsvystun commented 4 years ago

Description

An out-of-memory error is not propagated to the Python environment as an error, so it cannot be handled properly.

Why is this important?

When using the model in a production environment, I can't catch the error. As a result, the service cannot be restarted because it just hangs (it stops processing subsequent requests). It also means that I am not notified when the error occurs and cannot respond to it properly.

Error Message

terminate called after throwing an instance of 'dmlc::Error'
  what():  [09:36:49] src/storage/./pooled_storage_manager.h:157: cudaMalloc failed: out of memory
Stack trace:
  [bt] (0) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x4b09db) [0x7fb97fd5d9db]
  [bt] (1) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2e656c9) [0x7fb9827126c9]
  [bt] (2) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2e6b1ef) [0x7fb9827181ef]
  [bt] (3) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::CheckAndAlloc() const+0x1cc) [0x7fb97fdd5fbc]
  [bt] (4) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x26685fd) [0x7fb981f155fd]
  [bt] (5) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x20f) [0x7fb981f15acf]
  [bt] (6) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25b5b89) [0x7fb981e62b89]
  [bt] (7) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25c2301) [0x7fb981e6f301]
  [bt] (8) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25c5810) [0x7fb981e72810]

Segmentation fault: 11

Stack trace:
  [bt] (0) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2e6b420) [0x7fb982718420]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3efd0) [0x7fba6a88bfd0]
  [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(abort+0x230) [0x7fba6a88d9a0]
  [bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957) [0x7fba66b23957]
  [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6) [0x7fba66b29ae6]
  [bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21) [0x7fba66b29b21]
  [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ada) [0x7fba66b29ada]
  [bt] (7) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25be930) [0x7fb981e6b930]
  [bt] (8) /root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25c79ba) [0x7fb981e749ba]

To Reproduce

from mxnet import np, npx
a = np.ones((2000, 2000, 330), ctx=npx.gpu())
b = np.ones((2000, 2000, 330), ctx=npx.gpu())
c = np.ones((2000, 2000, 330), ctx=npx.gpu())
d = np.ones((2000, 2000, 330), ctx=npx.gpu())
e = np.ones((2000, 2000, 330), ctx=npx.gpu())
f = np.ones((2000, 2000, 330), ctx=npx.gpu())
g = np.ones((2000, 2000, 330), ctx=npx.gpu())
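For scale, the memory requested by the snippet above can be estimated with simple arithmetic. The sketch below assumes 4-byte (float32) elements, which is an assumption and is not stated in the report.

```python
# Rough estimate of the GPU memory requested by the reproduction snippet.
shape = (2000, 2000, 330)
bytes_per_element = 4  # assumption: float32 elements
num_arrays = 7         # arrays a through g

elements = shape[0] * shape[1] * shape[2]
bytes_per_array = elements * bytes_per_element
total_bytes = bytes_per_array * num_arrays

print(f"{bytes_per_array / 1e9:.2f} GB per array")  # 5.28 GB per array
print(f"{total_bytes / 1e9:.2f} GB in total")       # 36.96 GB in total
```

Under that assumption, the allocations add up to far more than a typical single GPU provides, so one of the later `np.ones` calls is expected to fail.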

Steps to reproduce

  1. Run the code provided on a GPU

What have you tried to solve it?

  1. Tried catching it with signals, unsuccessfully.

    import signal

    def sig_handler(signum, frame):
        print("segfault")

    signal.signal(signal.SIGSEGV, sig_handler)

But even if I could catch it, handling it separately from the code execution flow is inconvenient and prevents proper error management.
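One way to bring such a crash back into the normal control flow is to run the risky work in a child process and turn an abnormal exit into an ordinary Python exception in the parent. The sketch below is a minimal illustration, not an MXNet feature: `WorkerCrashed` and `run_isolated` are hypothetical names, and the negative-return-code convention it relies on applies to POSIX systems.

```python
import signal
import subprocess
import sys


class WorkerCrashed(RuntimeError):
    """Raised in the parent when the child process exits abnormally."""


def run_isolated(code, timeout=None):
    """Run a Python snippet in a child interpreter and return its stdout.

    Raises WorkerCrashed if the child is killed by a signal or exits
    with a non-zero status.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        try:
            name = signal.Signals(-result.returncode).name
        except ValueError:
            name = f"signal {-result.returncode}"
        raise WorkerCrashed(f"child killed by {name}")
    if result.returncode != 0:
        raise WorkerCrashed(f"child exited with code {result.returncode}")
    return result.stdout
```

A caller can then wrap the call in `try: ... except WorkerCrashed:` and decide whether to retry, log, or alert. For example, `run_isolated("import os; os.abort()")` raises `WorkerCrashed` in the parent instead of taking the parent down.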

Environment

----------Python Info----------
Version : 3.6.9
Compiler : GCC 8.4.0
Build : ('default', 'Apr 18 2020 01:56:04')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.3.1
Directory : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
Version : 1.6.0
Directory : /usr/local/lib/python3.6/dist-packages/mxnet
Num GPUs : 1
Commit Hash : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
----------System Info----------
Platform : Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
system : Linux
node : 62b2ed3365af
release : 4.19.104+
version : #1 SMP Wed Feb 19 05:26:34 PST 2020
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2300.000
BogoMIPS: 4600.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0043 sec, LOAD: 0.4843 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0010 sec, LOAD: 0.4399 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0787 sec, LOAD: 0.2977 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0259 sec, LOAD: 0.1943 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0309 sec, LOAD: 0.2189 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0542 sec, LOAD: 0.3888 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0027 sec, LOAD: 0.2930 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.04941534996032715 sec.

leezu commented 4 years ago

> When using the model in production environment I can't catch the error. This means that the service cannot be restarted as it just hangs (so the service does not process subsequent requests).

The process in which MXNet is running is terminated because of the memory error. It seems that your service uses multiple processes and that your main process does not correctly handle the case where a child process is terminated. As a workaround, I suggest monitoring whether your child processes are still alive and restarting them if needed.
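The monitoring workaround described above could look roughly like the following sketch, which uses only the Python standard library. The names `supervise` and `CRASHING_WORKER` are hypothetical, and the worker command is a stand-in that simulates the crash with `abort()` rather than loading a real MXNet model.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a worker that hosts the model; it simulates
# the crash by calling abort() instead of running out of GPU memory.
CRASHING_WORKER = [sys.executable, "-c", "import os; os.abort()"]


def supervise(cmd, max_restarts=3, poll_interval=0.1):
    """Start cmd, watch it until it exits, and restart it on abnormal exit.

    Returns the exit codes observed, one per attempt.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        worker = subprocess.Popen(cmd)
        # poll() returns None for as long as the child is still alive.
        while worker.poll() is None:
            time.sleep(poll_interval)
        exit_codes.append(worker.returncode)
        if worker.returncode == 0:
            break  # clean shutdown, nothing to restart
    return exit_codes
```

Calling `supervise(CRASHING_WORKER, max_restarts=1)` starts the worker twice and returns two non-zero exit codes. A real service would typically also log each restart and raise an alert once `max_restarts` is exhausted.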

adamsvystun commented 3 years ago

I used to be able to reproduce this in a Google Colab notebook without any processes. The notebook used to crash without any error. Was that expected behaviour? Can't reproduce as of now, so I am closing the issue.