apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, JavaScript and more
https://mxnet.apache.org
Apache License 2.0

[Flaky] flaky test in test_operator_gpu.test_convolution_multiple_streams #14329

Open · szha opened this issue 5 years ago

szha commented 5 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14324/3/pipeline#step-682-log-1293

@DickJC123

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Test, Flaky

piyushghai commented 5 years ago

@mxnet-label-bot Add [Test, Flaky]

DickJC123 commented 5 years ago

Looking at this now. The problem occurs during shutdown when MXNET_ENGINE_TYPE=ThreadedEngine (i.e. the ThreadedEnginePooled). I'll first try to determine whether it's related to the dual-stream work itself, or whether the dual-stream PR, by expanding the meager testing of the ThreadedEnginePooled, merely uncovered a pre-existing issue.

DickJC123 commented 5 years ago

I believe I've isolated the problem. I started by checking out the commit that introduced the dual-stream facility and test (5f32f32e8). I could run the test 20 times without failure. I then examined the behavior under the other 19 commits to master made since then. The problem behavior first appears with:

053ffc7f8  2019-03-03  Wang Jiajun            fix memory-related issues to enable ASAN tests (#14223)

I ran the test_convolution_multiple_streams test with the engine list pruned down to just 'ThreadedEngine' (so ThreadedEnginePooled) to speed up testing. It failed at the 6th iteration with a "double free or corruption (out)" abort from /usr/bin/python3. I then backed up my testing to the previous commit:

0e23a18da  2019-03-02  Yuxi Hu                add symbolic link to mkldnn header files in include (#14300)

...and saw no failures in 40 repetitions of the full test_convolution_multiple_streams test. For that run I even kept the mshadow and dmlc-core submodules at the versions used by the 053ffc7f8 commit (which advances them), to rule the submodule updates out as the cause.

Inspecting the change introduced by 053ffc7f8, I see modifications to threaded_engine.cc as well as many added free/delete operations. From this I conclude that 053ffc7f8 is the likely culprit and that it is not compatible with the ThreadedEnginePooled. @szha @eric-haibin-lin @ptrendx
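
For anyone wanting to repeat these stress runs, a loop of roughly this shape works (a minimal sketch, not a script from the repo; it assumes MXNet has already been rebuilt at the commit under test and that any engine-list pruning has been done by hand in the test source):

import subprocess
import sys

# Hypothetical stress harness: run the flaky test repeatedly and stop at the
# first non-zero exit code.
CMD = ["nosetests", "--verbose", "-s",
       "tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams"]

for i in range(1, 41):
    print("=== iteration %d ===" % i)
    if subprocess.call(CMD) != 0:
        sys.exit("First failure at iteration %d" % i)
print("No failures in 40 iterations")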

junrushao commented 5 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-14192/runs/5/nodes/273/log/?start=0

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 570, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 540, in _test_in_separate_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code 255 from _conv_with_num_streams().

-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=997087021 to reproduce.
--------------------- >> end captured logging << ---------------------

szha commented 5 years ago

Thanks, @DickJC123. Does it reveal an actual problem, or is it specific to the CI environment? I'd prefer to see this fixed rather than reverting the ASAN enforcement, since that change fixed a number of other problems.

cc @arcadiaphy

DickJC123 commented 5 years ago

To debug this further, I checked out the ASAN PR commit, reverted my dual-stream PR, and then reinserted the test_operator_gpu.py version from the dual-stream PR. Running the test_convolution_multiple_streams test multiple times then produced segmentation violations and Python 'double free' aborts (see below), but only with the ThreadedEngine. The test passes solidly when the ThreadedEngine is omitted. I conclude that the dual-stream facility itself is not part of the issue; rather, the PR came with a ThreadedEngine test that is perhaps more stressful than the others in the CI.

@szha, if your preference is not to revert the ASAN commit, I could PR a change to test_convolution_multiple_streams so that it tests only the NaiveEngine and ThreadedEnginePerDevice and prints a message that the ThreadedEngine is skipped (a rough sketch of what I have in mind follows the log below). I could then file an issue, for @arcadiaphy to respond to, about correcting the ASAN PR for the ThreadedEngine. The ThreadedEngine should be considered 'broken' until that is done.

OK?

[INFO] 3 of 10: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=87934937 to reproduce.
FAIL

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/test_operator_gpu.py", line 570, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/opt/mxnet/tests/python/gpu/test_operator_gpu.py", line 540, in _test_in_separate_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code -11 from _conv_with_num_streams().
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=124093393 to reproduce.
common: INFO: 3 of 10: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=87934937 to reproduce.
--------------------- >> end captured logging << ---------------------

After pruning the engine list to only include ThreadedEngine:

MXNET_TEST_COUNT=10 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams           
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=549472273 to reproduce.
test_operator_gpu.test_convolution_multiple_streams ... [22:27:47] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:27:50] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (se
tting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:27:52] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:27:55] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (se
tting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:27:56] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:27:59] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (se
tting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:28:01] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:28:04] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (se
tting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:28:05] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:28:08] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (se
tting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:28:10] src/engine/engine.cc:55: MXNet start using engine: ThreadedEngine
[22:28:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (se
tting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
*** Error in `/usr/bin/python': double free or corruption (out): 0x00007f80d04e77d0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f817c0897e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f817c09237a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f817c09653c]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager10ReleaseAllEv+0x2f1)[0x7f816b78cef1]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManagerD0Ev+0x1a)[0x7f816b79507a]
/opt/mxnet/lib/libmxnet.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x47)[0x7f81684d0f47]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet11StorageImpl10DirectFreeENS_7Storage6HandleE+0x84)[0x7f816b790ab4]
/opt/mxnet/lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextEEZNS0_8resource19ResourceManagerImpl17ResourceTempSpaceILNS0_15ResourceRequest4Ty
peE1EED4EvEUlS1_E_E9_M_invokeERKSt9_Any_dataOS1_+0x10f)[0x7f816b8ca51f]
/opt/mxnet/lib/libmxnet.so(+0x40da8ba)[0x7f816b7768ba]
/opt/mxnet/lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x956)[0x7f816b770c06]
/opt/mxnet/lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN4dmlc11ManualEventEEEZN5mxnet6engine20ThreadedEnginePooled5StartEvEUlS3_E_E9_M_invoke
ERKSt9_Any_dataOS3_+0x17d)[0x7f816b77173d]
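
For illustration, roughly the pruning I have in mind (a hypothetical sketch, not the eventual PR diff; run_conv_with_engine is a made-up stand-in for the real test plumbing):

def run_conv_with_engine(engine, num_streams):
    # Stand-in for the real helper that spawns a fresh MXNet process with the
    # given env settings and asserts a zero exit code.
    print('Would test engine %s with %d stream(s)' % (engine, num_streams))

# Skip the ThreadedEngine (ThreadedEnginePooled) case until the ASAN-related
# shutdown crash is fixed, and say so in the test output.
for engine in ['NaiveEngine', 'ThreadedEngine', 'ThreadedEnginePerDevice']:
    if engine == 'ThreadedEngine':
        print('Skipping engine %s, see issue #14329' % engine)
        continue
    for num_streams in [1, 2]:
        run_conv_with_engine(engine, num_streams)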

szha commented 5 years ago

@DickJC123 this plan sounds good. Thanks.

DickJC123 commented 5 years ago

Rather than opening a new issue to track any follow-up PR from @arcadiaphy, I thought it best to continue this one to keep the history in one place. My PR https://github.com/apache/incubator-mxnet/pull/14338 removes the ThreadedEngine from the engine list of test_convolution_multiple_streams, but it also makes it easy to test that problem case. After the PR is merged, I would recommend that @arcadiaphy test with:

MXNET_TEST_COUNT=50 MXNET_ENGINE_TYPE=ThreadedEngine nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams

...to prove that the fixes he comes up with cure the present segfaults. Also, to ensure that the remaining engines still work, I recommend:

MXNET_TEST_COUNT=50 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams

arcadiaphy commented 5 years ago

@DickJC123 OK. Let me find out the problem behind the threaded engine crash.

DickJC123 commented 5 years ago

I need to point out that a recently merged "op bulking" PR saw a segfault in either the NaiveEngine or the ThreadedEnginePerDevice, which I could not reproduce on my internal machines even after hundreds of trials. I'm hoping that what @arcadiaphy learns in his follow-up work will have relevance beyond the ThreadedEngine problem. See: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14200/9/pipeline

aaronmarkham commented 5 years ago

Failed for me today: http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-gpu/job/PR-14855/5/display/redirect

nickguletskii commented 5 years ago

Failed for me as well: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14638/6/

aaronmarkham commented 4 years ago

Failed again! http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15883/40/pipeline/302

kshitij12345 commented 4 years ago

#16238

ChaiBapchya commented 4 years ago

@kshitij12345 Good find. Since it is a duplicate (and this one was opened first), should we close the other one, @szha? Or let both remain?

TaoLv commented 4 years ago

Happened again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17313/6/pipeline

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 215, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 770, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 372, in run_in_spawned_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code 1 from _conv_with_num_streams().
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1131332049 to reproduce.
--------------------- >> end captured logging << ---------------------

szha commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-17453/1/pipeline#step-100-log-1526

kshitij12345 commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-17325/14/pipeline#step-100-log-1524

kshitij12345 commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-17325/15/pipeline#step-100-log-1523

kshitij12345 commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-17325/17/pipeline#step-100-log-1526

ChaiBapchya commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17444/3/pipeline

#17444

ChaiBapchya commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17482/1/pipeline

#17482

ChaiBapchya commented 4 years ago

Another one http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17407/5/pipeline

#17407

ChaiBapchya commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-17494/2/pipeline #17494

kshitij12345 commented 4 years ago

Unable to recreate the failure locally with

MXNET_TEST_COUNT=10 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams

Can anyone try to see if they can reproduce it locally?

@ChaiBapchya Could you also try it once?

kshitij12345 commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17325/23/pipeline#step-875-log-1043

ChaiBapchya commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17508/2/pipeline/

#17508

ChaiBapchya commented 4 years ago

> Unable to recreate the failure locally with
>
> MXNET_TEST_COUNT=10 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams
>
> Can anyone try to see if they can reproduce it locally?
>
> @ChaiBapchya Could you also try it once?

@kshitij12345 Ran the above command. Test passed

Finished engine ThreadedEnginePerDevice with 2 streams.
ok

----------------------------------------------------------------------
Ran 1 test in 958.201s

ptrendx commented 4 years ago

I don't think you will be able to repro this failure on its own, unfortunately. I believe the problem with this test is that it is affected by the other tests. What I believe happens is that the other tests use a lot of GPU memory, which stays cached in the caching memory allocator. Then this test (in order to exercise env variables that are read only during MXNet init) launches a new MXNet process. That process tries to allocate GPU memory and hits an OoM condition. This is not a problem for the other tests, because they can simply free the memory held in the caching allocator, but the spawned process does not have that option, so it just fails, and that is then reported as a test failure.

I believe the proper way of fixing this would be to move this test to its own command during testing (so that no memory has been allocated by other tests when it runs).
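
To make the failure mode concrete, here is a minimal sketch of the spawn-and-check pattern the test uses (assumptions only; this is not the actual common.py helper). The child has to import MXNet fresh so that the init-time env variables take effect, but it also has to allocate all of its GPU memory from scratch while the parent nosetests process may still be holding a large cached pool:

import multiprocessing as mp
import os

def _child(env):
    # The env must be set before importing mxnet, because variables such as
    # MXNET_GPU_WORKER_NSTREAMS and MXNET_ENGINE_TYPE are read only at init.
    os.environ.update(env)
    import mxnet as mx
    data = mx.nd.ones((8, 3, 64, 64), ctx=mx.gpu(0))
    weight = mx.nd.ones((4, 3, 3, 3), ctx=mx.gpu(0))
    out = mx.nd.Convolution(data=data, weight=weight, kernel=(3, 3),
                            num_filter=4, no_bias=True)
    out.wait_to_read()

def run_in_spawned_process_sketch(env):
    # A fresh 'spawn' process cannot reuse GPU memory cached by the parent's
    # allocator, so a nearly full GPU can make the child hit OoM and exit
    # non-zero, which the assert then reports as a test failure.
    p = mp.get_context('spawn').Process(target=_child, args=(env,))
    p.start()
    p.join()
    assert p.exitcode == 0, "Non-zero exit code %d" % p.exitcode

if __name__ == '__main__':
    run_in_spawned_process_sketch({'MXNET_GPU_WORKER_NSTREAMS': '2',
                                   'MXNET_ENGINE_TYPE': 'ThreadedEnginePerDevice'})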

kshitij12345 commented 4 years ago

@ChaiBapchya Thanks for confirming. @ptrendx That makes sense. On digging a bit deeper, I found that test_bulking had similar failures. https://github.com/apache/incubator-mxnet/issues/14970

Both test_bulking and test_convolution_multiple_streams spawn a new process for the test.

Do you think it is a good idea to move both of them into a new test file?

ptrendx commented 4 years ago

They should be fine, yes, but I don't think a new test file is enough; it actually needs to be its own nosetests invocation, so that these tests are the only thing launched from that process.
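
For example, the isolated step could be a separate invocation along these lines (a hypothetical sketch, not an actual CI change; the exact test paths would need checking):

import subprocess
import sys

# Run the two process-spawning tests in their own nosetests invocation, so that
# no other test has filled the GPU caching allocator before they start.
ISOLATED_TESTS = [
    "tests/python/gpu/test_operator_gpu.py:test_convolution_multiple_streams",
    "tests/python/gpu/test_operator_gpu.py:test_bulking",
]
sys.exit(subprocess.call(["nosetests", "--verbose", "-s"] + ISOLATED_TESTS))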

ChaiBapchya commented 4 years ago

@kshitij12345 Since the fix has been identified, do you mind leading it? I can take it up otherwise. :)

kshitij12345 commented 4 years ago

@ChaiBapchya I'll be busy in the coming weeks, so it would be great if you could fix this one.

ChaiBapchya commented 4 years ago

Another one - http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17475/8/pipeline #17475