apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

test_conv2d_16c[224-256] fails for cu101, cu110, cu112 #20978

Open barry-jin opened 2 years ago

barry-jin commented 2 years ago

Description

tests/python/gpu/test_gluon_gpu.py::test_conv2d_16c[224-256] crashes the pytest worker on cu101, cu110, and cu112, but passes on cu102.

https://jenkins.mxnet-ci.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/4064/pipeline/340

Error message:

[2022-03-20T11:07:01.372Z] Thread 0x00007fed6143a700 (most recent call first):
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 400 in read
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 432 in from_io
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 967 in _thread_receiver
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 220 in run
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 285 in _perform_spawn
[2022-03-20T11:07:01.372Z] 
[2022-03-20T11:07:01.372Z] Thread 0x00007fed64bca740 (most recent call first):
[2022-03-20T11:07:01.372Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2640 in asnumpy
[2022-03-20T11:07:01.372Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1808 in check_layer_forward_withinput
[2022-03-20T11:07:01.372Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1830 in test_conv2d_16c
[2022-03-20T11:07:01.372Z]   File "/work/mxnet/python/mxnet/util.py", line 486 in _with_np_array
[2022-03-20T11:07:01.372Z]   File "/work/mxnet/python/mxnet/util.py", line 304 in _with_np_shape
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/python.py", line 1627 in runtest
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/runner.py", line 256 in <lambda>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/runner.py", line 310 in from_call
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/flaky/flaky_pytest_plugin.py", line 138 in call_and_report
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/runner.py", line 127 in runtestprotocol
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/flaky/flaky_pytest_plugin.py", line 94 in pytest_runtest_protocol
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/xdist/remote.py", line 87 in run_one_test
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/xdist/remote.py", line 70 in pytest_runtestloop
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/main.py", line 313 in _main
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/main.py", line 257 in wrap_session
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/xdist/remote.py", line 237 in <module>
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1084 in executetask
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 220 in run
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 285 in _perform_spawn
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 267 in integrate_as_primary_thread
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1060 in serve
[2022-03-20T11:07:01.372Z]   File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/execnet/gateway_base.py", line 1554 in serve
[2022-03-20T11:07:01.372Z]   File "<string>", line 8 in <module>
[2022-03-20T11:07:01.372Z]   File "<string>", line 1 in <module>
[2022-03-20T11:07:01.372Z] 
[2022-03-20T11:07:01.372Z] [gw0] [ 31%] PASSED tests/python/gpu/test_gluon_gpu.py::test_export 
[2022-03-20T11:07:01.372Z] tests/python/gpu/test_gluon_gpu.py::test_import 
[2022-03-20T11:07:01.372Z] [gw2] node down: Not properly terminated
[2022-03-20T11:07:01.372Z] [gw2] [ 31%] FAILED tests/python/gpu/test_gluon_gpu.py::test_conv2d_16c[224-256] 
[2022-03-20T11:07:01.372Z] 
[2022-03-20T11:07:01.372Z] replacing crashed worker gw2

bartekkuncer commented 2 years ago

@barry-jin any progress on this? It now seems to fail on almost every run and completely blocks CI on the master branch.

barry-jin commented 2 years ago

I think we may need to temporarily skip this test and investigate further.
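One way to skip only the crashing case is a per-parameter skip mark, so the other parametrizations keep running. The sketch below is a minimal illustration, not the actual MXNet test: the parameter names `kernel` and `chn_num`, the parameter grid, and the placeholder test body are all assumptions. Only the failing id `224-256` and the issue number come from this thread.

```python
import pytest

# Reason shown in the pytest report; the wording is illustrative.
SKIP_REASON = "Crashes the worker on cu101, cu110, cu112; see issue #20978"

# Hypothetical parameter grid. Only the (224, 256) case, which matches the
# failing test id [224-256], carries the skip mark.
CASES = [
    pytest.param(1, 16),
    pytest.param(3, 16),
    pytest.param(224, 256, marks=pytest.mark.skip(reason=SKIP_REASON)),
]


@pytest.mark.parametrize("kernel, chn_num", CASES)
def test_conv2d_16c_sketch(kernel, chn_num):
    # Placeholder body; the real test builds a Conv2D layer and checks its
    # forward pass, which is omitted here.
    assert kernel > 0 and chn_num > 0
```

With this pattern pytest reports the `224-256` case as skipped and still runs the remaining cases. Removing the `marks=` argument re-enables the case once the crash is understood.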