apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Deadlock with ThreadedEngine #18090

Open RuRo opened 4 years ago

RuRo commented 4 years ago

Description

There is currently some weird behaviour with unix-gpu CI jobs, where the build sometimes gets aborted and other times completes fine. I've seen this multiple times on different PRs.

Until today, I thought that this was caused by the limited number of available GPU executors and that the jobs were getting aborted manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).

However, I've noticed a few consistently weird things about these aborted jobs, so I wanted to make sure that the current behaviour is intentional.

1) In the cases of unix-gpu getting aborted that I've seen, it's almost always a situation where all the build steps and all the other tests completed, but a single Python 3: GPU or Python 3: GPU (TVM_OP OFF) test step was aborted.

(two screenshots of aborted unix-gpu pipeline runs; links to the runs are in the Occurrences section below)
2) Normally, these steps seem to take around 1 hour to complete, but in the cases where they were aborted, it happened after 3 hours. Additionally, there is a weird jump in the logs between the time of the last log message from the test and the first message about shutting down due to the interrupt signal:
```
[2020-04-16T14:57:56.542Z] test_operator_gpu.test_np_diag ... ok (2.9642s)
[2020-04-16T14:57:56.797Z] test_operator_gpu.test_np_diag_indices_from ... ok (0.2669s)
[2020-04-16T14:58:00.957Z] test_operator_gpu.test_np_diagflat ... ok (3.5951s)
[2020-04-16T14:58:01.882Z] test_operator_gpu.test_np_diagonal ... ok (1.4132s)
[2020-04-16T14:58:04.397Z] test_operator_gpu.test_np_diff ... ok (2.0127s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dot ... ok (1.8446s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dsplit ... ok (0.0832s)
[2020-04-16T14:58:06.013Z] test_operator_gpu.test_np_dstack ... ok (0.0664s)
[2020-04-16T14:58:15.936Z] test_operator_gpu.test_np_ediff1d ... ok (8.6182s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)
[2020-04-16T17:33:48.723Z] Sending interrupt signal to process
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,991 - root - WARNING - Cleaning up containers
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,155 - root - INFO - ☠: stopped container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - 🚽: removed container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - Cleaning up containers finished.
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - WARNING - done. Exiting with error.
[2020-04-16T17:33:57.549Z] script returned exit code 1
```
These are consecutive log messages, but you can see a huge time skip between test_operator_gpu.test_np_empty ... ok at 14:58 and Sending interrupt signal to process at 17:33.

Occurrences

Here are the 2 examples from the screenshots:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline

and a random example, not from my PRs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline

I am aware that we can just restart the job via mxnet-bot, but this is annoying since the job takes a long time to complete even without this issue. Can somebody clarify: 1) is aborting unix-gpu CI jobs intentional (and what is the current policy on aborting CI jobs), and 2) if it is intentional, is there something we can do to at least abort the tests faster, or maybe not fail these jobs at all but automatically reschedule them (preferably rescheduling just the aborted step, not the whole pipeline)?

leezu commented 4 years ago

@RuRo this is an unfortunate issue. The working hypothesis is that there is a bug in one of the numpy operators, causing some later test to hang. @szha is helping to switch CI to run all unittests via the NaiveEngine instead of the ThreadedEngine, which may help prevent a buggy operator A from causing a hang in some later test. There is also the idea of timing out the job if there is no new output for 30 minutes, but nobody owns implementing this yet AFAIK. Any help would be appreciated.
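For reference, the "abort if there is no new output for 30 minutes" idea could look roughly like the sketch below. This is not the actual CI change; the test command and path at the bottom are placeholders.

```python
import subprocess
import sys
import threading

IDLE_TIMEOUT = 30 * 60  # seconds of silence before giving up


def run_with_idle_timeout(cmd):
    """Run cmd, forwarding its output; kill it if it prints nothing for IDLE_TIMEOUT seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    timer = None

    def kill():
        print("No output for %d seconds, killing process" % IDLE_TIMEOUT, flush=True)
        proc.kill()

    def reset_timer():
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(IDLE_TIMEOUT, kill)
        timer.daemon = True
        timer.start()

    reset_timer()
    for line in proc.stdout:      # blocks until new output or EOF (EOF also follows a kill)
        sys.stdout.write(line)
        reset_timer()
    timer.cancel()
    return proc.wait()


if __name__ == "__main__":
    # placeholder command; the real CI job would run the GPU unittest suite here
    sys.exit(run_with_idle_timeout([sys.executable, "-m", "pytest", "tests/python/gpu"]))
```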

szha commented 4 years ago

@josephevans is looking into a 30m auto timeout.

marcoabreu commented 4 years ago

The 3 hours you see is the global timeout we are setting for jobs, so that abort is intentional.

RuRo commented 4 years ago

Okay! I think I was able to track down the source of this problem. I noticed that the last 3 times the pipeline froze for me, the last output was test_operator_gpu.test_np_true_divide. The next test that is supposed to run after that is test_operator_gpu.test_np_unary_bool_funcs.

After digging around in test_operator_gpu.test_np_unary_bool_funcs a bit, I was able to reproduce the hanging behaviour locally. I just ran the test in a loop, printing something in between runs, and after a while the process got stuck and stopped printing.
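A minimal sketch of that loop (the import path is an assumption; point it at wherever test_np_unary_bool_funcs is defined in your checkout):

```python
# Hedged sketch of the reproduction loop described above. The module name is an
# assumption -- adjust it to the actual test module in your source tree.
from test_numpy_op import test_np_unary_bool_funcs

i = 0
while True:
    test_np_unary_bool_funcs()
    i += 1
    # once the engine deadlocks, this line stops being printed
    print("finished iteration", i, flush=True)
```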

It didn't happen every time, and setting PRNG seeds didn't make it fail deterministically, so the issue is probably thread-related. Also, it definitely doesn't happen with MXNET_ENGINE_TYPE=NaiveEngine.
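For anyone checking the engine dependence themselves: MXNET_ENGINE_TYPE has to be set before MXNet creates its engine, so the simplest option is to set it on the command line or before importing mxnet, e.g.:

```python
import os

# Select the naive (synchronous) engine before mxnet creates its engine;
# equivalent to running `MXNET_ENGINE_TYPE=NaiveEngine python test.py`.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx
```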

After making a reproducible setup, I spent some time bisecting the test_operator_gpu.test_np_unary_bool_funcs test. In the end, I was able to narrow it down to the np.empty_like operator:

```python
import mxnet as mx

with mx.Context(mx.gpu()):
    while True:
        for _ in range(100):
            mx_data = mx.np.array(0)
            mx_x = mx.np.empty_like(mx_data)

        print(end='.', flush=True)
```

The above piece of code executes the empty_like operator and prints a dot once every 100 iterations; after ~20 seconds it gets stuck. I took a quick look at how empty_like is implemented but wasn't able to figure out what exactly is wrong with it. It looks like a really thin wrapper around CustomOp, which is used from C++, so I am giving up for now.

leezu commented 4 years ago

@RuRo Thank you for investigating the issue! Unfortunately I can't repro based on your script. Can you provide more details on your environment (mxnet version, cuda version, gpu architecture, etc)?

leezu commented 4 years ago

However, interrupting (Ctrl+C) the repro script that you shared would sometimes print Segmentation fault: 11, but would sometimes also provide a meaningful backtrace:

```
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    mx_x = mx.np.empty_like(mx_data)
  File "/home/ubuntu/src/mxnet-master/python/mxnet/numpy/multiarray.py", line 2579, in empty_like
    return _mx_nd_np.empty_like(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "/home/ubuntu/src/mxnet-master/python/mxnet/ndarray/numpy/_op.py", line 512, in empty_like
    return _npi.empty_like_fallback(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "<string>", line 40, in Custom
  File "/home/ubuntu/src/mxnet-master/python/mxnet/_ctypes/ndarray.py", line 82, in _imperative_invoke
    check_call(_LIB.MXImperativeInvokeEx(
  File "/home/ubuntu/src/mxnet-master/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mxnet/./tuple.h", line 141
MXNetError: Check failed: ndim() >= 0 (-1 vs. 0) :
```

You need some lucky timing to get this backtrace, so try a few times if you can't see it.

RuRo commented 4 years ago

Here is my environment:

```
----------Python Info----------
Version      : 3.8.2
Compiler     : GCC 9.2.1 20200130
Build        : ('default', 'Feb 26 2020 22:21:03')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /home/ruro/.local/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /usr/lib/python3.8/site-packages/mxnet
Num GPUs     : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-5.4.31-1-MANJARO-x86_64-with-glibc2.2.5
system       : Linux
node         : ruro-laptop
release      : 5.4.31-1-MANJARO
version      : #1 SMP PREEMPT Wed Apr 8 10:25:32 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    :
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:                        9
CPU MHz:                         3300.060
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5602.18
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```

My mxnet install is built from source from the latest master as of a couple of days ago. The CUDA version is 10.2. The GPU is a GTX 1050.

The time it takes for the lockup to happen can vary quite a lot. I'd recommend starting a couple of copies of this script and leaving them running for a minute or so to be absolutely sure.

Interrupting the script before it hangs indeed sometimes results in a Segmentation Fault. After the script hangs, it seems to ignore SIGINT completely and has to be killed with SIGTERM or something like that.
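(One stdlib-only trick that may help inspect the hung process, added here as a hedged suggestion rather than something from the thread: register a faulthandler dump on a spare signal before starting the loop, so the Python-level stacks can still be dumped even once SIGINT is ignored.)

```python
import faulthandler
import signal

# Dump the Python stack of every thread to stderr when the process receives
# SIGUSR1; faulthandler installs a C-level handler, so this still works when
# the interpreter threads are blocked on locks.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... run the repro loop; once it hangs, trigger the dump from another shell:
#     kill -USR1 <pid>
```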

Also, here is the feature list for my mxnet install

```
[✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✔ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✔ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✔ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]
```

Update: some further testing revealed that: 1) It's not a GPU-specific problem; I removed with mx.Context(mx.gpu()): and the problem persists. 2) Running the test with MXNET_ENABLE_CYTHON=0 python test.py fixes the issue. (Both the Python 3: GPU and Python 3: GPU TVM_OP OFF test steps run the python3_gpu_ut_cython pipeline, which uses cython.)

Also, interestingly enough, all the CPU pipelines set export MXNET_ENABLE_CYTHON=0.

RuRo commented 4 years ago

@leezu can you try to reproduce with a recent mxnet version that is built with cython support and run the script with MXNET_ENFORCE_CYTHON=1 python test.py? Unfortunately, I wasn't able to find any prebuilt wheels/artifacts that were built with cython, so you might have to build from source yourself.

RuRo commented 4 years ago

While we are waiting for somebody to reproduce, I've tried playing with the stuck process in gdb. Here is the backtrace of all the threads after they get stuck.

```c Thread 16 (Thread 0x7ffef27fc700 (LWP 1388714)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc64fa7d1 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d5d620) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 15 (Thread 0x7ffef2ffd700 (LWP 1388687)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc64fa7d1 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d6a8a0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 14 (Thread 0x7ffef37fe700 (LWP 1388625)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) 
at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc64fa7d1 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d3f910) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 13 (Thread 0x7ffef3fff700 (LWP 1388624)): #0 0x00007ffff79c401a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007ffff7b020c3 in PyEval_RestoreThread () at /usr/lib/libpython3.8.so.1.0 #2 0x00007ffff7aa57d7 in () at /usr/lib/libpython3.8.so.1.0 #3 0x00007ffff72ff168 in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so #4 0x00007ffff727d8c2 in () at /usr/lib/libffi.so.7 #5 0x00007ffff727dc20 in () at /usr/lib/libffi.so.7 #6 0x00007fffc6bf3782 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #7 0x00007fffc64840b7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #8 0x00007fffc6bf433e in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #9 0x00007fffc6bf8eca in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #10 0x00007fffc64fa794 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #11 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557c4f8b0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #12 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #13 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 12 (Thread 0x7fff10ff9700 (LWP 1388623)): #0 0x00007ffff79c401a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007ffff7b020c3 in PyEval_RestoreThread () at /usr/lib/libpython3.8.so.1.0 #2 0x00007ffff7300f24 in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so #3 0x00007ffff730570d in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so #4 0x00007ffff7b0c3d2 in _PyObject_MakeTpCall () at /usr/lib/libpython3.8.so.1.0 #5 0x00007ffff7bc9c51 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #6 0x00007ffff7bb6a9d in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #7 0x00007ffff7bc558e in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #8 0x00007ffff7bb6a9d in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #9 0x00007ffff7b5acda in () at /usr/lib/libpython3.8.so.1.0 #10 0x00007ffff7b5bc78 in () at /usr/lib/libpython3.8.so.1.0 #11 0x00007ffff7bc5e1c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #12 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #13 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #14 0x00007ffff7bc5f5a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #15 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #16 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #17 0x00007ffff7b12508 in PyObject_Call () at /usr/lib/libpython3.8.so.1.0 #18 0x00007ffff7bc70c4 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #19 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #20 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #21 
0x00007ffff7b12508 in PyObject_Call () at /usr/lib/libpython3.8.so.1.0 #22 0x00007ffff7bc70c4 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #23 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #24 0x00007ffff7bb72c2 in () at /usr/lib/libpython3.8.so.1.0 #25 0x00007ffff7bc5f5a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #26 0x00007ffff7bb6154 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #27 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #28 0x00007ffff7b123fd in PyObject_Call () at /usr/lib/libpython3.8.so.1.0 #29 0x00007ffff72ff2a0 in () at /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so #30 0x00007ffff727d8c2 in () at /usr/lib/libffi.so.7 #31 0x00007ffff727dc20 in () at /usr/lib/libffi.so.7 #32 0x00007fffc6bf3c74 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #33 0x00007fffc6bfa4d6 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #34 0x00007fffc64fa748 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #35 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557bf09e0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #36 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #37 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 11 (Thread 0x7fff117fa700 (LWP 1388622)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc64fa7d1 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d0b310) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 10 (Thread 0x7fff11ffb700 (LWP 1388621)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc64fa7d1 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557b63970) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #5 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #6 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 9 (Thread 0x7fff127fc700 (LWP 1388620)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) 
at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc65d9181 in std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc65d76ba in std::thread::_State_impl)>, std::shared_ptr > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d093c0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #6 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #7 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 8 (Thread 0x7fff12ffd700 (LWP 1388619)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc65dbe99 in std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fffc65d76ba in std::thread::_State_impl)>, std::shared_ptr > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555556ca9940) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 7 (Thread 0x7fff137fe700 (LWP 1388618)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) 
at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc65dbe99 in std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fffc65d76ba in std::thread::_State_impl)>, std::shared_ptr > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555556ca9a50) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 6 (Thread 0x7fff13fff700 (LWP 1388617)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc65dbe99 in std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fffc65d76ba in std::thread::_State_impl)>, std::shared_ptr > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557b9fea0) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 5 (Thread 0x7fff188cf700 (LWP 1388616)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007fff7e14ce71 in __gthread_cond_wait (__mutex=, __cond=) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865 #2 std::condition_variable::wait(std::unique_lock&) (this=, __lock=...) 
at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007fffc65db716 in dmlc::ConcurrentBlockingQueue::Pop(mxnet::engine::OprBlock**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc65dbe99 in std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::Start()::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fffc65d76ba in std::thread::_State_impl)>, std::shared_ptr > > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #6 0x00007fff7e152b24 in std::execute_native_thread_routine(void*) (__p=0x555557d34010) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 4 (Thread 0x7fff7a60c880 (LWP 1388611)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007ffff14570e1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #2 0x00007ffff138f9a1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #3 0x00007ffff139223d in () at /opt/intel/mkl/lib/intel64/libiomp5.so #4 0x00007ffff139926b in () at /opt/intel/mkl/lib/intel64/libiomp5.so #5 0x00007ffff13d5170 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #6 0x00007ffff145119c in () at /opt/intel/mkl/lib/intel64/libiomp5.so #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 3 (Thread 0x7fff7ae0e800 (LWP 1388610)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007ffff14570e1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #2 0x00007ffff138f9a1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #3 0x00007ffff139223d in () at /opt/intel/mkl/lib/intel64/libiomp5.so #4 0x00007ffff139926b in () at /opt/intel/mkl/lib/intel64/libiomp5.so #5 0x00007ffff13d5170 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #6 0x00007ffff145119c in () at /opt/intel/mkl/lib/intel64/libiomp5.so #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 2 (Thread 0x7fff7b610780 (LWP 1388609)): #0 0x00007ffff79c3cf5 in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007ffff14570e1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #2 0x00007ffff138f9a1 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #3 0x00007ffff139223d in () at /opt/intel/mkl/lib/intel64/libiomp5.so #4 0x00007ffff139926b in () at /opt/intel/mkl/lib/intel64/libiomp5.so #5 0x00007ffff13d5170 in () at /opt/intel/mkl/lib/intel64/libiomp5.so #6 0x00007ffff145119c in () at /opt/intel/mkl/lib/intel64/libiomp5.so #7 0x00007ffff79bd46f in start_thread () at /usr/lib/libpthread.so.0 #8 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 Thread 1 (Thread 0x7ffff785f740 (LWP 1388524)): #0 0x00007ffff79c74cf in __lll_lock_wait () at /usr/lib/libpthread.so.0 #1 0x00007ffff79bfe03 in pthread_mutex_lock () at /usr/lib/libpthread.so.0 #2 0x00007fffc6bf9b4e in mxnet::op::custom::AttrParser(nnvm::NodeAttrs*) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #3 0x00007fffc65053db in MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc65061da in MXImperativeInvokeEx () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fff79dc143b in 
__pyx_pf_5mxnet_4_cy3_7ndarray_2_imperative_invoke (__pyx_self=, __pyx_v_output_is_list=, __pyx_v_is_np_op=, __pyx_v_out=, __pyx_v_vals=, __pyx_v_keys=, __pyx_v_ndargs=, __pyx_v_handle=) at /usr/include/c++/9.3.0/bits/stl_vector.h:915 #6 __pyx_pw_5mxnet_4_cy3_7ndarray_3_imperative_invoke(PyObject*, PyObject*, PyObject*) (__pyx_self=, __pyx_args=, __pyx_kwds=) at mxnet/cython/ndarray.cpp:5334 #7 0x00007ffff7b18ff6 in PyCFunction_Call () at /usr/lib/libpython3.8.so.1.0 #8 0x00007ffff7b0c3d2 in _PyObject_MakeTpCall () at /usr/lib/libpython3.8.so.1.0 #9 0x00007ffff7bc979c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #10 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #11 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #12 0x00007ffff7b0ef60 in _PyObject_FastCallDict () at /usr/lib/libpython3.8.so.1.0 #13 0x00007ffff7c2e63d in () at /usr/lib/libpython3.8.so.1.0 #14 0x00007ffff7b0c3d2 in _PyObject_MakeTpCall () at /usr/lib/libpython3.8.so.1.0 #15 0x00007ffff7bc9e8b in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #16 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #17 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #18 0x00007ffff7bc5f5a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #19 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #20 0x00007ffff7bb6c7b in _PyFunction_Vectorcall () at /usr/lib/libpython3.8.so.1.0 #21 0x00007ffff7bc980a in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.8.so.1.0 #22 0x00007ffff7bb58f4 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.8.so.1.0 #23 0x00007ffff7c3cd73 in PyEval_EvalCode () at /usr/lib/libpython3.8.so.1.0 #24 0x00007ffff7c3cdc8 in () at /usr/lib/libpython3.8.so.1.0 #25 0x00007ffff7c41063 in () at /usr/lib/libpython3.8.so.1.0 #26 0x00007ffff7adbdf0 in PyRun_FileExFlags () at /usr/lib/libpython3.8.so.1.0 #27 0x00007ffff7ae5aa4 in PyRun_SimpleFileExFlags () at /usr/lib/libpython3.8.so.1.0 #28 0x00007ffff7c4d81e in Py_RunMain () at /usr/lib/libpython3.8.so.1.0 #29 0x00007ffff7c4d909 in Py_BytesMain () at /usr/lib/libpython3.8.so.1.0 #30 0x00007ffff7dbd023 in __libc_start_main () at /usr/lib/libc.so.6 #31 0x000055555555505e in _start () ```

As you can see, all the threads seem to be either in pthread_cond_wait, pthread_cond_timedwait or pthread_mutex_lock, so this is indeed probably a deadlock situation.

leezu commented 4 years ago

https://github.com/apache/incubator-mxnet/issues/18014 appears to be due to wrong mutex handling and may be related.

leezu commented 4 years ago

@RuRo w.r.t. reproducing: we should first fix the Check failed: ndim() >= 0 (-1 vs. 0) error triggered by the CustomOp. It appears that CustomOp swallows the error, which may cause the subsequent hang.

RuRo commented 4 years ago

Are you sure that the backtrace you got after interrupting is in any way related to this issue? It doesn't seem likely to me.

1) The backtrace you got is in _ctypes/ndarray.py. The issue I was able to reproduce is in the cython bindings. If I understand correctly, when cython is used, _ctypes isn't even imported; _cy3.ndarray is used instead. I mentioned that I also sometimes got Segmentation Fault: 11 after interrupting the script, but I don't get the Check failed: ndim() >= 0 (-1 vs. 0) error.

2) I don't see anything too incriminating in that backtrace. In my experience, mxnet (or any other package with C bindings) often spews random errors if interrupted. For example, you can also get this backtrace:

```
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 237, in 'calling callback function'
  File "/usr/lib/python3.8/site-packages/mxnet/operator.py", line 973, in declare_backward_dependency_entry
    rdeps = cast(c_array_buf(c_int, array('i', rdeps)), c_int_p)
  File "/usr/lib/python3.8/site-packages/mxnet/base.py", line 474, in c_array_buf
    return (ctype * len(buf)).from_buffer(buf)
KeyboardInterrupt
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    mx_x = mx.np.empty_like(mx_data)
  File "/usr/lib/python3.8/site-packages/mxnet/numpy/multiarray.py", line 2582, in empty_like
    return _mx_nd_np.empty_like(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "/usr/lib/python3.8/site-packages/mxnet/ndarray/numpy/_op.py", line 512, in empty_like
    return _npi.empty_like_fallback(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "<string>", line 40, in Custom
  File "mxnet/cython/ndarray.pyx", line 219, in mxnet._cy3.ndarray._imperative_invoke
  File "mxnet/cython/./base.pyi", line 41, in mxnet._cy3.ndarray.CALL
mxnet.base.MXNetError: Traceback (most recent call last):
  File "/home/custompkgs/PKGBUILDS/mxnet-ruro-git/src/mxnet-ruro-git/src/operator/custom/custom.cc", line 121
MXNetError: Check failed: reinterpret_cast<CustomOpBwdDepFunc>( params.info->callbacks[kCustomOpPropDeclareBackwardDependency])( out_grad.data(), in_data.data(), out_data.data(), &num_dep, &rdeps, params.info->contexts[kCustomOpPropDeclareBackwardDependency]): 
```

or this backtrace with a segfault:

```
    Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 237, in 'calling callback function'
  File "/usr/lib/python3.8/site-packages/mxnet/operator.py", line 935, in list_arguments_entry
    try:
KeyboardInterrupt

Segmentation fault: 11

terminate called without an active exception
[1]    1437249 abort (core dumped)  python test.py
```


3) I grepped around a little bit, and it seems to me that every call to cffi/cython is properly wrapped with check_call or the equivalent, and I don't see any suspicious try/except clauses in the python part of the CustomOp implementation.

Also, even if there are unhandled errors somewhere in CustomOp, I still don't see how that would cause a deadlock.

leezu commented 4 years ago

@RuRo, @reminisce helped point out an error in your reproducer script. You use a zero-shape (scalar) tensor, but didn't enable numpy shape semantics. It should be:

```python
import mxnet as mx

mx.npx.set_np()

with mx.Context(mx.gpu()):
    while True:
        for _ in range(100):
            mx_data = mx.np.array(0)
            mx_x = mx.np.empty_like(mx_data)

        mx.nd.waitall()
        print(end='.', flush=True)
```

This explains the error in https://github.com/apache/incubator-mxnet/issues/18090#issuecomment-616025939

You're right, it shouldn't cause a deadlock.

leezu commented 4 years ago

> @leezu can you try to reproduce with a recent mxnet version that is built with cython support and run the script with MXNET_ENFORCE_CYTHON=1 python test.py? Unfortunately, I wasn't able to find any prebuilt wheels/artifacts that were built with cython, so you might have to build from source yourself.

I can reproduce the hang with the cmake -DBUILD_CYTHON_MODULES=1 build.

RuRo commented 4 years ago

> @RuRo, @reminisce helped point out an error in your reproducer script. You use a zero-shape (scalar) tensor, but didn't enable numpy shape semantics. It should be

Ah, I see. I had the npx.set_np() call when I started reproducing this issue, but I removed it while trying to simplify the bug, since it made no difference for the hanging behaviour. I was under the impression that scalar arrays are enabled by default for the mx.np kind of array. My bad.

Also, the hang happens with or without the mx.nd.waitall() call, and as previously mentioned, you don't need the GPU context. So the short version is:

```python
import mxnet as mx
mx.npx.set_np()
while True:
    for _ in range(100):
        mx_data = mx.np.array(0)
        mx_x = mx.np.empty_like(mx_data)
    print(end='.', flush=True)
```

zhreshold commented 4 years ago

Any action item to unblock the CI now?

leezu commented 4 years ago

With respect to the numpy op triggering the issue: we can replace the ones based on CustomOp with native operators. There are only 4 in this file: https://github.com/apache/incubator-mxnet/blob/f619c52dfc3787e3097b37ba28f7065bca910b97/python/mxnet/numpy_op_fallback.py

For CustomOp itself, we may want to rewrite it based on PackedFunc or drop it.

RuRo commented 4 years ago

Are we sure that this is 100% a CustomOp-specific problem and not a deeper issue with the current cython implementation that CustomOp just happened to bring to light? After all, the issue doesn't occur with MXNET_ENABLE_CYTHON=0.

Also, dropping CustomOp doesn't seem to be a good idea, since it's the current public API for implementing custom operators in higher-level languages.
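(For context, this is roughly what user code built on that API looks like; a minimal pass-through operator following the standard mx.operator CustomOp pattern, not code from this issue.)

```python
import mxnet as mx


class Identity(mx.operator.CustomOp):
    """Trivial pass-through operator implemented in Python."""

    def forward(self, is_train, req, in_data, out_data, aux):
        self.assign(out_data[0], req[0], in_data[0])

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0])


@mx.operator.register("my_identity")
class IdentityProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(IdentityProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # output shape == input shape, no auxiliary states
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, in_shapes, in_dtypes):
        return Identity()


x = mx.nd.ones((2, 3))
y = mx.nd.Custom(x, op_type='my_identity')
print(y.asnumpy())
```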

Also, @zhreshold, the CI is technically not blocked, since the test is just flaky rather than failing every time. This issue has been around for a while, and until now the solution was just to restart unix-gpu with the mxnet-bot until it passes. Not a pretty solution, but if you are in a hurry, it's an option.

Also, @leezu, if fixing CI is a priority, we could temporarily disable all the CI tests that use the CustomOp-based operators while a proper fix is being worked on.

By the way, the official tutorial for CustomOp claims that you must wait for all CustomOps to complete before forking, or else "the program will be blocked because of python GIL". I am pretty sure I saw new threads being created while the main loop was already running when I was debugging with GDB. Could this be the problem?

RuRo commented 4 years ago

A little further investigation in GDB with some debug symbols revealed the following information:

Thread 1 (the main thread) is in CustomOperator::Find (it shows up as mxnet::op::custom::AttrParser in the backtrace because of inlining). Thread 1 is waiting for CustomOperator::mutex_, which is currently held by Thread 13.

Thread 13 is in _CallPythonObject (python internals called from libffi), trying to call register.do_register.creator.create_operator_entry.delete_entry, which is a callback for CustomOp::del. To do that, it tries to acquire the Global Interpreter Lock, which is currently held by Thread 1.

I am really not sure how Thread 13 could arrive at the CustomOp deleter while holding CustomOperator::mutex_.
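If that is what's happening, it is the classic lock-order inversion. A standalone illustration (not MXNet code), with lock_a standing in for the GIL and lock_b for CustomOperator::mutex_:

```python
import threading
import time

lock_a = threading.Lock()  # stands in for the Python GIL
lock_b = threading.Lock()  # stands in for CustomOperator::mutex_


def main_thread():
    # like Thread 1: holds the "GIL", then wants the "mutex_"
    with lock_a:
        time.sleep(0.1)
        with lock_b:
            pass


def callback_thread():
    # like Thread 13: holds the "mutex_", then wants the "GIL"
    with lock_b:
        time.sleep(0.1)
        with lock_a:
            pass


t1 = threading.Thread(target=main_thread)
t2 = threading.Thread(target=callback_thread)
t1.start()
t2.start()
# With the sleeps above, each thread acquires its first lock before trying the
# second one, so both block on each other forever -- the same shape of deadlock
# as described above.
t1.join()
t2.join()
```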

GDB logs with evidence ```c (gdb) run test.py (gdb) backtrace #0 0x00007ffff79a14cf in __lll_lock_wait () at /usr/lib/libpthread.so.0 #1 0x00007ffff7999e03 in pthread_mutex_lock () at /usr/lib/libpthread.so.0 #2 0x00007fffc69c0b4e in mxnet::op::custom::AttrParser(nnvm::NodeAttrs*) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #3 0x00007fffc62cc3db in MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**) () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #4 0x00007fffc62cd1da in MXImperativeInvokeEx () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #5 0x00007fff79b8f43b in __pyx_pf_5mxnet_4_cy3_7ndarray_2_imperative_invoke (__pyx_self=, __pyx_v_output_is_list=, __pyx_v_is_np_op=, __pyx_v_out=, __pyx_v_vals=, __pyx_v_keys=, __pyx_v_ndargs=, __pyx_v_handle=) at /usr/include/c++/9.3.0/bits/stl_vector.h:915 #6 __pyx_pw_5mxnet_4_cy3_7ndarray_3_imperative_invoke(PyObject*, PyObject*, PyObject*) (__pyx_self=, __pyx_args=, __pyx_kwds=) at mxnet/cython/ndarray.cpp:5334 #7 0x00007ffff7b59968 in cfunction_call_varargs (kwargs=0x0, args=(93825001767984, [], ['op_type', 'dtype', 'order', 'subok', 'shape'], ['empty_like_fallback', 'None', 'C', False, None], None, True, False), func=) at Objects/call.c:742 #8 PyCFunction_Call (func=, args=(93825001767984, [], ['op_type', 'dtype', 'order', 'subok', 'shape'], ['empty_like_fallback', 'None', 'C', False, None], None, True, False), kwargs=0x0) at Objects/call.c:772 #9 0x00007ffff7b1325f in _PyObject_MakeTpCall (callable=, args=, nargs=, keywords=) at Objects/call.c:168 #10 0x00007ffff7b7e4a8 in _PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775815, args=0x555557e486f0, callable=) at Python/ceval.c:3493 #11 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987 #12 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3500 #13 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x7fff142e60e8, kwcount=, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name='Custom', qualname='Custom') at Python/ceval.c:4298 #14 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=, stack=0x7fff142e60e0, nargsf=, kwnames=) at Objects/call.c:435 #15 0x00007ffff7b476da in _PyObject_FastCallDict (callable=, args=0x7ffef55f16f8, nargsf=, kwargs=) at Objects/call.c:104 #16 0x00007ffff7bf2fbe in partial_fastcall (pto=, pto=, kwargs=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , nargs=1, args=) at ./Modules/_functoolsmodule.c:169 #17 partial_call (pto=0x7fff17ec30c0, args=, kwargs=) at ./Modules/_functoolsmodule.c:224 #18 0x00007ffff7b1325f in _PyObject_MakeTpCall (callable=, args=, nargs=, keywords=) at Objects/call.c:168 #19 0x00007ffff7b83429 in _PyObject_Vectorcall (kwnames=, nargsf=, args=0x555557e4ce90, callable=) at Python/ceval.c:1555 #20 call_function (kwnames=, oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987 #21 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3515 #22 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x7fff17ed41f0, kwcount=, kwstep=1, defs=0x7fff78fb2488, defcount=4, kwdefs=0x0, closure=0x0, name='empty_like', qualname='empty_like') at Python/ceval.c:4298 #23 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=, stack=0x7fff17ed41e8, nargsf=, kwnames=) at Objects/call.c:435 #24 0x00007ffff7b7ecb9 in _PyObject_Vectorcall (kwnames=('dtype', 'order', 
'subok', 'shape'), nargsf=, args=0x7fff17ed41e8, callable=) at ./Include/cpython/abstract.h:92 #25 call_function (kwnames=('dtype', 'order', 'subok', 'shape'), oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987 #26 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3515 #27 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x5555555e49e8, kwcount=, kwstep=1, defs=0x7fff78be8a88, defcount=4, kwdefs=0x0, closure=0x0, name='empty_like', qualname='empty_like') at Python/ceval.c:4298 #28 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=, stack=0x5555555e49e0, nargsf=, kwnames=) at Objects/call.c:435 #29 0x00007ffff7b7e2ea in _PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775809, args=0x5555555e49e0, callable=) at ./Include/cpython/abstract.h:92 #30 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987 #31 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3469 #32 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x0, kwcount=, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4298 #33 0x00007ffff7be52da in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=, kws=, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4327 #34 0x00007ffff7be52fc in PyEval_EvalCode (co=co@entry=, globals=globals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , locals=locals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: ) at Python/ceval.c:718 #35 0x00007ffff7be53aa in run_eval_code_obj (co=0x7ffff73952b0, globals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , locals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: ) at Python/pythonrun.c:1125 #36 0x00007ffff7c2a134 in run_mod (mod=, filename=, globals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , locals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , flags=, arena=) at Python/pythonrun.c:1147 #37 0x00007ffff7af1a95 in PyRun_FileExFlags (fp=fp@entry=0x555555621020, filename_str=filename_str@entry=0x7ffff742aee0 "test.py", start=start@entry=257, globals=globals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , locals=locals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.: , closeit=closeit@entry=1, flags=0x7ffffffe0578) at Python/pythonrun.c:1063 #38 0x00007ffff7af4600 in PyRun_SimpleFileExFlags (fp=fp@entry=0x555555621020, filename=, closeit=closeit@entry=1, flags=flags@entry=0x7ffffffe0578) at Python/pythonrun.c:428 #39 0x00007ffff7af4995 in PyRun_AnyFileExFlags (fp=fp@entry=0x555555621020, filename=0x0, closeit=closeit@entry=1, flags=flags@entry=0x7ffffffe0578) at Python/pythonrun.c:86 #40 0x00007ffff7c2c796 in pymain_run_file (cf=0x7ffffffe0578, config=0x55555557d8e0) at Modules/main.c:381 #41 pymain_run_python (exitcode=0x7ffffffe0570) at Modules/main.c:565 #42 Py_RunMain () at Modules/main.c:644 #43 0x00007ffff7c2c889 in Py_BytesMain (argc=, argv=) at Modules/main.c:698 #44 0x00007ffff7dbd023 in __libc_start_main () at /usr/lib/libc.so.6 #45 0x000055555555505e in _start () (gdb) print 
(**(pthread_mutex_t **)($sp+0x10)).__data.__owner 2252813 (gdb) info thread Id Target Id Frame ... 13 Thread 0x7ffef77fe700 (LWP 2252813) "python" 0x00007ffff799e01a in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0 (gdb) thread 13 (gdb) backtrace #0 0x00007ffff799e01a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0 #1 0x00007ffff7b0a9fb in PyCOND_TIMEDWAIT (us=, mut=0x7ffff7d759b0 <_PyRuntime+1232>, cond=0x7ffff7d75980 <_PyRuntime+1184>) at Python/condvar.h:73 #2 take_gil (tstate=0x7ffee80022b0, ceval=0x7ffff7d75728 <_PyRuntime+584>) at Python/ceval_gil.h:206 #3 PyEval_RestoreThread (tstate=0x7ffee80022b0) at Python/ceval.c:399 #4 0x00007ffff7a30b8d in PyGILState_Ensure () at Python/pystate.c:1298 #5 0x00007ffff7217d4c in _CallPythonObject (pArgs=0x7ffef77fdc50, flags=, converters=(<_ctypes.PyCSimpleType at remote 0x5555556405f0>,), callable=, setfunc=, restype=0x7ffff725b638, mem=0x7ffef77fddd0) at /home/custompkgs/PKGBUILDS/python-dbg/src/Python-3.8.2/Modules/_ctypes/callbacks.c:145 #6 closure_fcn (cif=, resp=0x7ffef77fddd0, args=0x7ffef77fdc50, userdata=) at /home/custompkgs/PKGBUILDS/python-dbg/src/Python-3.8.2/Modules/_ctypes/callbacks.c:297 #7 0x00007ffff71938c2 in () at /usr/lib/libffi.so.7 #8 0x00007ffff7193c20 in () at /usr/lib/libffi.so.7 #9 0x00007fffc69ba782 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #10 0x00007fffc624b0b7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #11 0x00007fffc69bb33e in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #12 0x00007fffc69bfeca in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #13 0x00007fffc62c1794 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so #14 0x00007fff7df19b24 in std::execute_native_thread_routine(void*) (__p=0x555557d8b890) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #15 0x00007ffff799746f in start_thread () at /usr/lib/libpthread.so.0 #16 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6 (gdb) print *(PyFunctionObject *)0x7ffef568e050 { ... func_doc = 'C Callback for CustomOp::del', func_name = 'delete_entry', func_dict = 0x0, func_weakreflist = 0x0, func_module = 'mxnet.operator', func_annotations = 0x0, func_qualname = 'register..do_register..creator..create_operator_entry..delete_entry', vectorcall = 0x7ffff7b4c750 <_PyFunction_Vectorcall> } (gdb) print/x ((PyThreadState*)_PyRuntime.ceval.gil.last_holder)->thread_id 0x7ffff7839740 (gdb) info thread Id Target Id Frame 1 Thread 0x7ffff7839740 (LWP 2252671) "python" 0x00007ffff79a14cf in __lll_lock_wait () from /usr/lib/libpthread.so.0 ... ```
RuRo commented 4 years ago

Just FYI, I am currently not working on a fix for this. Unfortunately, I am currently very busy and will probably be so for quite a while, and I am not very familiar with cffi/cython, so I won't be able to help much apart from maybe a little bit of testing/discussion.

Just wanted to point that out, in case I created a false expectation that I am working on this issue and that it therefore doesn't need somebody else to come in and fix it.