apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

mx.nd.Custom conflicts with memory management #14522

Open YutingZhang opened 5 years ago

YutingZhang commented 5 years ago

When training/running a large neural network with a CustomOp, MXNet can get stuck. My speculation is that if memory management (e.g., releasing/reallocating GPU memory, or raising an "out of memory" error) is needed while a CustomOp is running, MXNet can get stuck.

A minimal piece of code to show that a CustomOp can deadlock with memory management ("out of memory" in this case): as expected, main(1) works; main(100) should give an "out of GPU memory" error, but it just gets stuck.

The real-world problem I met is not just the inability to raise an "out of memory" error. MXNet seems able to release/reallocate memory dynamically, but this probably also deadlocks with a CustomOp, so my program, which could fit into GPU memory, can also get stuck.

import mxnet as mx

class MyMulMax(mx.operator.CustomOp):

    def __init__(self):
        super().__init__()

    def forward(self, is_train, req, in_data, out_data, aux):
        a, b = in_data[0:2]
        c = mx.nd.batch_dot(a, b)
        d = mx.nd.max(c, axis=-1, keepdims=True)
        self.assign(out_data[0], req[0], d)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], 0)
        self.assign(in_grad[1], req[1], 0)

@mx.operator.register("MyMulMax")
class MyMulMaxProp(mx.operator.CustomOpProp):

    def __init__(self):
        super().__init__()

    def list_arguments(self):
        return ['a', 'b']

    def list_outputs(self):
        return ['d']

    def infer_shape(self, in_shape):
        return in_shape, [list(in_shape[0][:-1] + [1])]

    def create_operator(self, ctx, shapes, dtypes):
        return MyMulMax()

def main(n):
    with mx.Context(mx.gpu(0)):
        a = mx.nd.random.uniform(shape=(n, 6000, 1))
        b = mx.nd.random.uniform(shape=(n, 1, 7000))
        # batch_dot inside the op creates an (n, 6000, 7000) intermediate,
        # roughly 17 GB of float32 when n=100, which should trigger an out-of-memory error
        d = mx.nd.Custom(a, b, op_type="MyMulMax")
        d_np = d.asnumpy()

if __name__ == "__main__":
    main(1)
    print("DONE -- 1")
    main(100)
    print("DONE -- 2")

Tested with a nightly build of mxnet-cu90mkl on Python 3.6.

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Bug

YutingZhang commented 5 years ago

@arcadiaphy @wkcn @anirudh2290 Interested in looking into this?

wkcn commented 5 years ago

Does the bug reproduce in an old version of MXNet?

YutingZhang commented 5 years ago

@wkcn "the bug in old version": Do you mean the deadlock in subprocess? I thought https://github.com/apache/incubator-mxnet/pull/14451 solved this problem (I have not tried it out myself).

This thread is about a different problem. It happens in the main process on a single GPU.

wkcn commented 5 years ago

I mean: does this issue exist before PR https://github.com/apache/incubator-mxnet/pull/14363 ?

wkcn commented 5 years ago

I reproduced the bug in a CPU environment; it gets stuck too.

I used GDB to print the backtrace:

#0  0x00007ffff7f6aafc in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0                                      
#1  0x00007fffe28cacd1 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>)                              
    at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:864         
#2  0x00007fffe28cacd1 in std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...)    
    at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53                                                    
#3  0x00007fffead27065 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) ()                                  
    at /home/wkcn/proj/incubator-mxnet/python/mxnet/../../lib/libmxnet.so                                                    
#4  0x00007fffea739052 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const ()                                       
    at /home/wkcn/proj/incubator-mxnet/python/mxnet/../../lib/libmxnet.so                                                    
#5  0x00007fffead95697 in MXNDArraySyncCopyToCPU () at /home/wkcn/proj/incubator-mxnet/python/mxnet/../../lib/libmxnet.so    
#6  0x00007ffff6b386d0 in ffi_call_unix64 () at /usr/lib/libffi.so.6                                                         
#7  0x00007ffff6b380a0 in ffi_call () at /usr/lib/libffi.so.6
#8  0x00007ffff6b9e615 in _ctypes_callproc () at /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so      
#9  0x00007ffff6b9efa0 in  () at /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so
#10 0x00007ffff7ba163c in _PyObject_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#11 0x00007ffff7beccca in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#12 0x00007ffff7b752eb in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#13 0x00007ffff7be7c42 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#14 0x00007ffff7b752eb in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#15 0x00007ffff7be7dfa in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#16 0x00007ffff7b2eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#17 0x00007ffff7b2fab4 in PyEval_EvalCodeEx () at /usr/lib/libpython3.7m.so.1.0
#18 0x00007ffff7b2fadc in PyEval_EvalCode () at /usr/lib/libpython3.7m.so.1.0
#19 0x00007ffff7c59c94 in  () at /usr/lib/libpython3.7m.so.1.0
#20 0x00007ffff7c5b8be in PyRun_FileExFlags () at /usr/lib/libpython3.7m.so.1.0
#21 0x00007ffff7c5cc75 in PyRun_SimpleFileExFlags () at /usr/lib/libpython3.7m.so.1.0
#22 0x00007ffff7c5eeb7 in  () at /usr/lib/libpython3.7m.so.1.0
#23 0x00007ffff7c5f0fc in _Py_UnixMain () at /usr/lib/libpython3.7m.so.1.0
#24 0x00007ffff7dbd223 in __libc_start_main () at /usr/lib/libc.so.6
#25 0x000055555555505e in _start ()
wkcn commented 5 years ago

I see. In python/mxnet/ndarray/ndarray.py, the function def copyto(self, other): calls _internal._copyto(self, out=other), which gets stuck.

YutingZhang commented 5 years ago

I mean: does this issue exist before PR #14363 ?

Both before and after. I believe it is a different issue.

arcadiaphy commented 5 years ago
MXNET_ENGINE_TYPE=ThreadedEngine gdb python -ex 'b src/storage/pooled_storage_manager.h:145' -ex 'r reproduce.py'

[screenshot: gdb stopped at the breakpoint in src/storage/pooled_storage_manager.h:145]

Pretty interesting bug; it seems the CUDA out-of-memory error gets stuck and cannot be raised from inside a custom op.

wkcn commented 5 years ago

OOM on CPU gets stuck too.

arcadiaphy commented 5 years ago

@wkcn I think it's not related to OOM; it looks more like LOG(FATAL) gets stuck.

YutingZhang commented 5 years ago

@arcadiaphy If LOG(FATAL) gets stuck, then there might be more things that can get stuck, which would be harder to figure out.

anirudh2290 commented 5 years ago

I think it would be worth checking whether it still gets stuck when cudaGetErrorString is not called. Since the custom op uses its own threads, it may not handle exceptions very well, but std::terminate should still be called for an unhandled exception. I am surprised that is not happening here.

larroy commented 5 years ago

Hi

I'm investigating this one. With the provided example, I believe the cause of the error is the following, at least in a CPU context (changing gpu to cpu in the example above).

The current example crashes due to an integer overflow in GEMM, in the following code in dot_engine-inl.h:

    for (int i = 0; i < batch_count; ++i) {
      gemm(stream, transa, transb, m, n, k, alpha,
           A + i * m * k, lda, B + i * k * n, ldb,
           beta, C + i * m * n, ldc);
    }

With a batched GEMM of 6000 * 7000 elements per batch and 100 batches, the output offset i * m * n is computed with 32-bit ints and exceeds 2^31 - 1, so it overflows, which leads to the crash.
I would suggest checking matrix dimensions for overflow before dispatching to GEMM.
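To make the magnitude concrete, here is a rough sketch in plain Python (not MXNet code) of the offset arithmetic, plus the kind of pre-dispatch check I have in mind; treating the loop variables as signed 32-bit ints is an assumption taken from the snippet above.

m, n = 6000, 7000          # per-batch output tile of batch_dot((100, 6000, 1), (100, 1, 7000))
i = 99                     # last batch index in the repro (100 batches)

offset = i * m * n         # element offset into C for the last batch
int32_max = 2**31 - 1
print(offset)              # 4158000000
print(offset > int32_max)  # True: not representable as a signed 32-bit int

# What a wrapping 32-bit computation would produce (two's-complement wrap-around):
wrapped = (offset + 2**31) % 2**32 - 2**31
print(wrapped)             # -136967296 -> a large negative offset, hence the bad pointer

def fits_int32(*dims):
    # Sketch of the suggested check: reject shapes whose flattened size
    # cannot be addressed with a signed 32-bit offset.
    total = 1
    for d in dims:
        total *= d
    return total <= int32_max

print(fits_int32(100, 6000, 7000))  # False -> raise or switch to 64-bit indexing before calling GEMM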
arcadiaphy commented 5 years ago

@anirudh2290 @wkcn @YutingZhang Finally figured out the reason:

Normally, when an exception is thrown in a spawned thread, it should be caught in a std::exception_ptr and re-thrown in the main thread to ensure proper exception handling and avoid std::terminate. This mechanism was introduced in #9681 to handle exceptions in operators.

But there are still two problems in the exception handling of custom ops:

  1. An exception thrown in the custom-op thread cannot be caught in the main thread, causing a program crash.
  2. When an exception happens in a custom op, it should be caught and re-thrown in the sync function WaitForVar. But WaitForVar deadlocks, because the pushed CustomOperator operation is skipped and the write dependency of the waited-on NDArray is therefore never completed: https://github.com/apache/incubator-mxnet/blob/master/src/operator/custom/custom-inl.h#L128 Adding c.wait_to_read() forces the exception to be re-thrown, but the program still crashes due to problem 1.
import mxnet as mx

class MyMulMax(mx.operator.CustomOp):
    def __init__(self):
        super(MyMulMax, self).__init__()

    def forward(self, is_train, req, in_data, out_data, aux):
        a, b = in_data[0:2]
        c = mx.nd.batch_dot(a, b)
        # force the exception to be re-thrown; the program still crashes due to the uncaught exception
        # c.wait_to_read()
        d = mx.nd.max(c, axis=-1, keepdims=True)
        self.assign(out_data[0], req[0], d)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], 0)
        self.assign(in_grad[1], req[1], 0)

@mx.operator.register("MyMulMax")
class MyMulMaxProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MyMulMaxProp, self).__init__()

    def list_arguments(self):
        return ['a', 'b']

    def list_outputs(self):
        return ['d']

    def infer_shape(self, in_shape):
        return in_shape, [list(in_shape[0][:-1] + [1])]

    def create_operator(self, ctx, shapes, dtypes):
        return MyMulMax()

def custom(n):
    with mx.Context(mx.gpu(0)):
        a = mx.nd.random.uniform(shape=(n, 6000, 1))
        b = mx.nd.random.uniform(shape=(n, 1, 7000))
        d = mx.nd.Custom(a, b, op_type="MyMulMax")
        d.wait_to_read()

def direct(n):
    with mx.Context(mx.gpu(0)):
        a = mx.nd.random.uniform(shape=(n, 6000, 1))
        b = mx.nd.random.uniform(shape=(n, 1, 7000))
        c = mx.nd.batch_dot(a, b)
        d = mx.nd.max(c, axis=-1, keepdims=True)
        # deadlock due to the skipped CustomOperator callback
        d.wait_to_read()

if __name__ == "__main__":
    custom(100)
arcadiaphy commented 5 years ago

@wkcn @anirudh2290 As I've mentioned before, adding the on_complete callback in CustomOperator is not a good design, since the call may be skipped when an exception happens: https://github.com/apache/incubator-mxnet/blob/master/src/operator/custom/custom-inl.h#L128

Any suggestions on how to avoid it?
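For reference, the general pattern being discussed (capture the worker-thread exception, always fire the completion callback, and re-raise where the caller waits) looks roughly like this. It is a plain-Python sketch of the std::exception_ptr idea from #9681, not the actual CustomOperator code; all names below are made up.

import threading

def run_on_worker(fn, on_complete):
    # Run fn on a worker thread; capture any exception instead of losing it there.
    state = {"error": None, "done": threading.Event()}

    def worker():
        try:
            fn()
        except Exception as exc:
            state["error"] = exc      # captured, analogous to std::current_exception()
        finally:
            on_complete()             # the completion callback is never skipped
            state["done"].set()

    threading.Thread(target=worker).start()
    return state

def wait(state):
    state["done"].wait()              # analogous to WaitForVar: the dependency does complete
    if state["error"] is not None:
        raise state["error"]          # re-thrown in the waiting thread, not the worker

if __name__ == "__main__":
    s = run_on_worker(lambda: 1 / 0, on_complete=lambda: print("on_complete fired"))
    try:
        wait(s)
    except ZeroDivisionError as e:
        print("caught in the waiting thread:", e)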

larroy commented 5 years ago

On CPU I see a crash in sgemm. I don't see the relation to exceptions on CPU.

(lldb) r repro2.py
Process 6491 launched: '/Users/pllarroy/devel/mxnet/py3_venv/bin/python' (x86_64)
Process 6491 stopped
* thread #4, stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff326c4800)
    frame #0: 0x0000000113fb3b4d libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL + 107
libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL:
->  0x113fb3b4d <+107>: movl   $0x0, (%rdx,%rax,4)
    0x113fb3b54 <+114>: incq   %rax
    0x113fb3b57 <+117>: jmp    0x113fb3b42               ; <+96>
    0x113fb3b59 <+119>: decq   %rsi
  thread #10, stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff326c6c78)
    frame #0: 0x0000000113fb3b4d libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL + 107
libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL:
->  0x113fb3b4d <+107>: movl   $0x0, (%rdx,%rax,4)
    0x113fb3b54 <+114>: incq   %rax
    0x113fb3b57 <+117>: jmp    0x113fb3b42               ; <+96>
    0x113fb3b59 <+119>: decq   %rsi
  thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff326c90ec)
    frame #0: 0x0000000113fb3b4d libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL + 107
libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL:
->  0x113fb3b4d <+107>: movl   $0x0, (%rdx,%rax,4)
    0x113fb3b54 <+114>: incq   %rax
    0x113fb3b57 <+117>: jmp    0x113fb3b42               ; <+96>
    0x113fb3b59 <+119>: decq   %rsi
Target 0: (python) stopped.
(lldb)
larroy commented 5 years ago

On GPU this is the segmentation fault that I get. Are we talking about the same problem?

Thread 297 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffea9371700 (LWP 3786)]
0x00007fff31cdb240 in sgemm_beta_HASWELL () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
(gdb) bt 3
#0  0x00007fff31cdb240 in sgemm_beta_HASWELL () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
#1  0x00007fff30821d95 in ?? () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
#2  0x00007fff30969b52 in ?? () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
(More stack frames follow...)
(gdb)
larroy commented 5 years ago

I talked to @anirudh2290 and I understand we are talking about two separate issues here:

@arcadiaphy do you plan to send a fix for the handling of exceptions in custom ops?

I can work on the integer overflow problem.

larroy commented 5 years ago

This repro in imperative mode isolates the problem from the exception-handling issue in custom ops:

import mxnet as mx
n = 100
a = mx.nd.random.uniform(shape=(n, 6000, 1))
b = mx.nd.random.uniform(shape=(n, 1, 7000))
c = mx.nd.batch_dot(a, b)
c.wait_to_read()
print(c.shape)
arcadiaphy commented 5 years ago

@larroy Yes, two different issues here.

Reproduced the gemm bug too, but I'm focusing on the exception handling issue.

anirudh2290 commented 5 years ago

Thanks @arcadiaphy ! Nice work! We can do this incrementally. Would you be willing to start with a PR for 1 ?

  1. An exception thrown in the custom-op thread cannot be caught in the main thread, causing a program crash.
arcadiaphy commented 5 years ago

@anirudh2290 I have started a PR for problem 1, but the really tricky one is problem 2.

wkcn commented 5 years ago

Great! Thanks for your analysis! I will read the comments and review the PR :)

larroy commented 5 years ago

Nice! I will work on a fix for tensor shapes in the BLAS engine, unless somebody else has a strong desire to help there.

larroy commented 5 years ago

@mxnet-label-bot add [C++, Backend, Bug, Exception Handling]