apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

test_sparse_operator.py::test_elemwise_binary_ops #18740

Open leezu opened 3 years ago

leezu commented 3 years ago

Description

Test crashes are affecting multiple PRs:
https://github.com/apache/incubator-mxnet/pull/18711
https://github.com/apache/incubator-mxnet/pull/18694
https://github.com/apache/incubator-mxnet/pull/18722
https://github.com/apache/incubator-mxnet/pull/18733

[2020-07-15T23:41:55.453Z] Fatal Python error: Aborted
[2020-07-15T23:41:55.453Z] 
[2020-07-15T23:41:55.453Z] Thread 0x00007f6de68a6700 (most recent call first):
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 400 in read
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 432 in from_io
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 967 in _thread_receiver
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 220 in run
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 285 in _perform_spawn
[2020-07-15T23:41:55.453Z] 
[2020-07-15T23:41:55.453Z] Current thread 0x00007f6de857a740 (most recent call first):
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/python/mxnet/_ctypes/ndarray.py", line 178 in __call__
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/python/mxnet/executor.py", line 184 in forward
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 937 in numeric_grad
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 1088 in check_numeric_gradient
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/tests/python/unittest/test_sparse_operator.py", line 312 in test_elemwise_binary_op
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/tests/python/unittest/test_sparse_operator.py", line 417 in check_elemwise_binary_ops
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/tests/python/unittest/test_sparse_operator.py", line 520 in test_elemwise_binary_ops
[2020-07-15T23:41:55.453Z]   File "/work/mxnet/tests/python/unittest/common.py", line 223 in test_new
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/python.py", line 1445 in runtest
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/runner.py", line 134 in pytest_runtest_call
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/runner.py", line 210 in <lambda>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/runner.py", line 237 in from_call
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/runner.py", line 210 in call_runtest_hook
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/flaky/flaky_pytest_plugin.py", line 129 in call_and_report
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/runner.py", line 99 in runtestprotocol
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/flaky/flaky_pytest_plugin.py", line 92 in pytest_runtest_protocol
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/xdist/remote.py", line 87 in run_one_test
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/xdist/remote.py", line 70 in pytest_runtestloop
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/main.py", line 247 in _main
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/main.py", line 197 in wrap_session
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/xdist/remote.py", line 258 in <module>
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1084 in executetask
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 220 in run
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 285 in _perform_spawn
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 267 in integrate_as_primary_thread
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1060 in serve
[2020-07-15T23:41:55.453Z]   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1554 in serve
[2020-07-15T23:41:55.453Z]   File "<string>", line 8 in <module>
[2020-07-15T23:41:55.453Z]   File "<string>", line 1 in <module>
[2020-07-15T23:41:55.453Z] [gw0] [ 95%] PASSED tests/python/unittest/test_sparse_ndarray.py::test_sparse_getnnz 
[2020-07-15T23:41:55.707Z] tests/python/unittest/test_sparse_operator.py::test_elemwise_binary_ops 
[2020-07-15T23:41:55.707Z] [gw0] node down: Not properly terminated
[2020-07-15T23:41:55.707Z] [gw0] [ 95%] FAILED tests/python/unittest/test_sparse_operator.py::test_elemwise_binary_ops 
[2020-07-15T23:41:55.707Z] 
[2020-07-15T23:41:55.707Z] replacing crashed worker gw0
[2020-07-15T23:41:56.266Z] 
[gw4] linux Python 3.6.9 cwd: /work/mxnet
[2020-07-15T23:41:58.149Z] 
DickJC123 commented 3 years ago

I had some success marking this test with @pytest.mark.serial, though I don't understand the underlying issue or why this fixed it. Could someone enlighten me: what is it about the serial-marked tests that forces them to be run serially? Are the pytest workers all in the same process?
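For readers unfamiliar with the marker, the workaround described above looks roughly like the sketch below. The test body is a hypothetical stand-in, not the real code from test_sparse_operator.py, and the marker by itself only tags the test; it is the CI invocation that decides how tagged tests are scheduled.

```python
import pytest


# Hypothetical stand-in for the real test in
# tests/python/unittest/test_sparse_operator.py; only the decorator matters here.
@pytest.mark.serial
def test_elemwise_binary_ops():
    # The real test checks sparse elementwise binary operators and their gradients.
    assert 1 + 1 == 2
```

The decorator attaches a mark named `serial` to the function, which a runner can then select or exclude with pytest's `-m` marker expressions.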

leezu commented 3 years ago

The pytest workers are all in separate processes. The only difference I'm aware of is that OMP_NUM_THREADS=$(expr $(nproc) / 4) is exported before the parallel pytest processes are started for the non-serial tests. Serial tests are run in a separate process, without the OMP_NUM_THREADS variable, after all non-serial tests have finished.
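The environment difference between the two phases can be sketched as follows. This is a minimal illustration, not the actual CI script: the helper names `parallel_phase_env` and `serial_phase_env` are hypothetical, and the only facts taken from the comment above are the `nproc / 4` formula and the absence of OMP_NUM_THREADS in the serial phase.

```python
import os


def parallel_phase_env(base_env, cpu_count):
    """Environment for the non-serial tests (hypothetical helper).

    Mirrors OMP_NUM_THREADS=$(expr $(nproc) / 4). expr does integer
    division, so fewer than 4 CPUs yields 0.
    """
    env = dict(base_env)
    env["OMP_NUM_THREADS"] = str(cpu_count // 4)
    return env


def serial_phase_env(base_env):
    """Environment for the serial tests (hypothetical helper).

    The variable is left unset, so the OpenMP runtime picks its own default.
    """
    env = dict(base_env)
    env.pop("OMP_NUM_THREADS", None)
    return env


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print("parallel phase:", parallel_phase_env(os.environ, cpus)["OMP_NUM_THREADS"])
    print("serial phase sets OMP_NUM_THREADS:",
          "OMP_NUM_THREADS" in serial_phase_env(os.environ))
```

On a 16-core machine this gives each parallel worker an OpenMP budget of 4 threads, while a serial-marked test runs with whatever the OpenMP default is. That difference in thread count is one plausible reason the crash disappears under the serial marker, though the thread does not confirm it.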