apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

flaky test: check leak ndarray #18400

Open eric-haibin-lin opened 4 years ago

eric-haibin-lin commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18394/1/pipeline

[2020-05-24T09:53:03.464Z] ==================================== ERRORS ====================================
[2020-05-24T09:53:03.464Z] _____________________ ERROR at teardown of test_function1 ______________________
[2020-05-24T09:53:03.464Z] 
[2020-05-24T09:53:03.464Z] request = <SubRequest 'check_leak_ndarray' for <Function test_function1>>
[2020-05-24T09:53:03.464Z] 
[2020-05-24T09:53:03.464Z]     @pytest.fixture(autouse=True)
[2020-05-24T09:53:03.464Z]     def check_leak_ndarray(request):
[2020-05-24T09:53:03.464Z]         garbage_expected = request.node.get_closest_marker('garbage_expected')
[2020-05-24T09:53:03.464Z]         if garbage_expected:  # Some tests leak references. They should be fixed.
[2020-05-24T09:53:03.464Z]             yield  # run test
[2020-05-24T09:53:03.464Z]             return
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         if 'centos' in platform.platform():
[2020-05-24T09:53:03.464Z]             # Multiple tests are failing due to reference leaks on CentOS. It's not
[2020-05-24T09:53:03.464Z]             # yet known why there are more memory leaks in the Python 3.6.9 version
[2020-05-24T09:53:03.464Z]             # shipped on CentOS compared to the Python 3.6.9 version shipped in
[2020-05-24T09:53:03.464Z]             # Ubuntu.
[2020-05-24T09:53:03.464Z]             yield
[2020-05-24T09:53:03.464Z]             return
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         del gc.garbage[:]
[2020-05-24T09:53:03.464Z]         # Collect garbage prior to running the next test
[2020-05-24T09:53:03.464Z]         gc.collect()
[2020-05-24T09:53:03.464Z]         # Enable gc debug mode to check if the test leaks any arrays
[2020-05-24T09:53:03.464Z]         gc_flags = gc.get_debug()
[2020-05-24T09:53:03.464Z]         gc.set_debug(gc.DEBUG_SAVEALL)
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         # Run the test
[2020-05-24T09:53:03.464Z]         yield
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         # Check for leaked NDArrays
[2020-05-24T09:53:03.464Z]         gc.collect()
[2020-05-24T09:53:03.464Z]         gc.set_debug(gc_flags)  # reset gc flags
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]         seen = set()
[2020-05-24T09:53:03.464Z]         def has_array(element):
[2020-05-24T09:53:03.464Z]             try:
[2020-05-24T09:53:03.464Z]                 if element in seen:
[2020-05-24T09:53:03.464Z]                     return False
[2020-05-24T09:53:03.464Z]                 seen.add(element)
[2020-05-24T09:53:03.464Z]             except (TypeError, ValueError):  # unhashable
[2020-05-24T09:53:03.464Z]                 pass
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z]             if isinstance(element, mx.nd._internal.NDArrayBase):
[2020-05-24T09:53:03.464Z]                 return True
[2020-05-24T09:53:03.464Z]             elif isinstance(element, mx.sym._internal.SymbolBase):
[2020-05-24T09:53:03.464Z]                 return False
[2020-05-24T09:53:03.464Z]             elif hasattr(element, '__dict__'):
[2020-05-24T09:53:03.464Z]                 return any(has_array(x) for x in vars(element))
[2020-05-24T09:53:03.464Z]             elif isinstance(element, dict):
[2020-05-24T09:53:03.464Z]                 return any(has_array(x) for x in element.items())
[2020-05-24T09:53:03.464Z]             else:
[2020-05-24T09:53:03.464Z]                 try:
[2020-05-24T09:53:03.464Z]                     return any(has_array(x) for x in element)
[2020-05-24T09:53:03.464Z]                 except (TypeError, KeyError, RecursionError):
[2020-05-24T09:53:03.464Z]                     return False
[2020-05-24T09:53:03.464Z]     
[2020-05-24T09:53:03.464Z] >       assert not any(has_array(x) for x in gc.garbage), 'Found leaked NDArrays due to reference cycles'
[2020-05-24T09:53:03.464Z] E       AssertionError: Found leaked NDArrays due to reference cycles
[2020-05-24T09:53:03.464Z] E       assert not True
[2020-05-24T09:53:03.464Z] E        +  where True = any(<generator object check_leak_ndarray.<locals>.<genexpr> at 0x7f96c07802b0>)
[2020-05-24T09:53:03.464Z] 
[2020-05-24T09:53:03.464Z] tests/python/conftest.py:78: AssertionError
[2020-05-24T09:53:03.464Z] ---------------------------- Captured stderr setup -----------------------------
[2020-05-24T09:53:03.464Z] DEBUG:root:np/mx/python random seeds are set to 135663639, use MXNET_TEST_SEED=135663639 to reproduce.
[2020-05-24T09:53:03.464Z] ------------------------------ Captured log setup ------------------------------
[2020-05-24T09:53:03.464Z] DEBUG    root:conftest.py:193 np/mx/python random seeds are set to 135663639, use MXNET_TEST_SEED=135663639 to reproduce.
[2020-05-24T09:53:03.464Z] ----------------------------- Captured stderr call -----------------------------
[2020-05-24T09:53:03.464Z] [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1404816900 to reproduce.
[2020-05-24T09:53:03.465Z] DEBUG:common:Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1404816900 to reproduce.
[2020-05-24T09:53:03.465Z] ------------------------------ Captured log call -------------------------------
[2020-05-24T09:53:03.465Z] DEBUG    common:common.py:221 Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1404816900 to reproduce.

@leezu

leezu commented 4 years ago

As the flakiness occurs with mx.autograd.Function, which is "known to leak" (cf. the test_function in the same file), I suggest marking the flaky test_function1 as "known to leak" as well. I'm not yet sure why test_function1 leaks only sometimes.
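For readers unfamiliar with the mechanism, the fixture in the log relies on `gc.DEBUG_SAVEALL`, which makes the collector keep every unreachable object in `gc.garbage` instead of freeing it. Below is a minimal standard-library sketch of that idea. `Node`, `make_cycle`, `no_cycle`, and `leaked_nodes` are hypothetical names for illustration and are not part of MXNet or its test suite.

```python
import gc


class Node:
    """Hypothetical stand-in for an object caught in a reference cycle."""

    def __init__(self):
        self.ref = None


def make_cycle():
    # a and b reference each other, so reference counting alone cannot free them.
    a, b = Node(), Node()
    a.ref, b.ref = b, a


def no_cycle():
    # No cycle: the object is freed by reference counting when the function returns.
    Node()


def leaked_nodes(fn):
    """Run fn and return the Node instances that only the cycle collector could reclaim."""
    del gc.garbage[:]
    gc.collect()  # start from a clean state
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
    try:
        fn()
        gc.collect()
    finally:
        gc.set_debug(old_flags)  # restore the previous debug flags
    found = [obj for obj in gc.garbage if isinstance(obj, Node)]
    del gc.garbage[:]
    return found


if __name__ == "__main__":
    print(len(leaked_nodes(make_cycle)), len(leaked_nodes(no_cycle)))
```

The fixture applies the same pattern around each test and then searches `gc.garbage` for NDArray instances rather than for `Node`.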

leezu commented 4 years ago

Happened also in http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18408/runs/4/nodes/364/steps/758/log/?start=0

leezu commented 4 years ago
 _____________________ ERROR at teardown of test_get_symbol _____________________
[2020-06-04T22:33:35.745Z] 
[2020-06-04T22:33:35.745Z] request = <SubRequest 'check_leak_ndarray' for <Function test_get_symbol>>
[2020-06-04T22:33:35.745Z] 
[2020-06-04T22:33:35.745Z]     @pytest.fixture(autouse=True)
[2020-06-04T22:33:35.745Z]     def check_leak_ndarray(request):
[2020-06-04T22:33:35.745Z]         garbage_expected = request.node.get_closest_marker('garbage_expected')
[2020-06-04T22:33:35.745Z]         if garbage_expected:  # Some tests leak references. They should be fixed.
[2020-06-04T22:33:35.745Z]             yield  # run test
[2020-06-04T22:33:35.745Z]             return
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         if 'centos' in platform.platform():
[2020-06-04T22:33:35.745Z]             # Multiple tests are failing due to reference leaks on CentOS. It's not
[2020-06-04T22:33:35.745Z]             # yet known why there are more memory leaks in the Python 3.6.9 version
[2020-06-04T22:33:35.745Z]             # shipped on CentOS compared to the Python 3.6.9 version shipped in
[2020-06-04T22:33:35.745Z]             # Ubuntu.
[2020-06-04T22:33:35.745Z]             yield
[2020-06-04T22:33:35.745Z]             return
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         del gc.garbage[:]
[2020-06-04T22:33:35.745Z]         # Collect garbage prior to running the next test
[2020-06-04T22:33:35.745Z]         gc.collect()
[2020-06-04T22:33:35.745Z]         # Enable gc debug mode to check if the test leaks any arrays
[2020-06-04T22:33:35.745Z]         gc_flags = gc.get_debug()
[2020-06-04T22:33:35.745Z]         gc.set_debug(gc.DEBUG_SAVEALL)
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         # Run the test
[2020-06-04T22:33:35.745Z]         yield
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         # Check for leaked NDArrays
[2020-06-04T22:33:35.745Z]         gc.collect()
[2020-06-04T22:33:35.745Z]         gc.set_debug(gc_flags)  # reset gc flags
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]         seen = set()
[2020-06-04T22:33:35.745Z]         def has_array(element):
[2020-06-04T22:33:35.745Z]             try:
[2020-06-04T22:33:35.745Z]                 if element in seen:
[2020-06-04T22:33:35.745Z]                     return False
[2020-06-04T22:33:35.745Z]                 seen.add(element)
[2020-06-04T22:33:35.745Z]             except (TypeError, ValueError):  # unhashable
[2020-06-04T22:33:35.745Z]                 pass
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z]             if isinstance(element, mx.nd._internal.NDArrayBase):
[2020-06-04T22:33:35.745Z]                 return True
[2020-06-04T22:33:35.745Z]             elif isinstance(element, mx.sym._internal.SymbolBase):
[2020-06-04T22:33:35.745Z]                 return False
[2020-06-04T22:33:35.745Z]             elif hasattr(element, '__dict__'):
[2020-06-04T22:33:35.745Z]                 return any(has_array(x) for x in vars(element))
[2020-06-04T22:33:35.745Z]             elif isinstance(element, dict):
[2020-06-04T22:33:35.745Z]                 return any(has_array(x) for x in element.items())
[2020-06-04T22:33:35.745Z]             else:
[2020-06-04T22:33:35.745Z]                 try:
[2020-06-04T22:33:35.745Z]                     return any(has_array(x) for x in element)
[2020-06-04T22:33:35.745Z]                 except (TypeError, KeyError, RecursionError):
[2020-06-04T22:33:35.745Z]                     return False
[2020-06-04T22:33:35.745Z]     
[2020-06-04T22:33:35.745Z] >       assert not any(has_array(x) for x in gc.garbage), 'Found leaked NDArrays due to reference cycles'
[2020-06-04T22:33:35.745Z] E       AssertionError: Found leaked NDArrays due to reference cycles
[2020-06-04T22:33:35.745Z] E       assert not True
[2020-06-04T22:33:35.745Z] E        +  where True = any(<generator object check_leak_ndarray.<locals>.<genexpr> at 0x7f8a046fb0a0>)
[2020-06-04T22:33:35.745Z] 

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18485/runs/2/nodes/365/steps/570/log/?start=0

eric-haibin-lin commented 4 years ago

Happened again for test_get_symbol: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18525/8/pipeline

leezu commented 4 years ago

And a third time. I'm not sure why this happens from time to time or why it only affects test_get_symbol, but let's disable the check for test_get_symbol in favor of CI stability: https://github.com/apache/incubator-mxnet/pull/18595

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18589/runs/1/nodes/364/steps/755/log/?start=0
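A rough sketch of what disabling the check for a single test could look like, assuming it is done through the `garbage_expected` marker that the `check_leak_ndarray` fixture in the log looks up via `request.node.get_closest_marker`. This is not taken from the linked PR, and the test body here is a placeholder rather than the real test_get_symbol.

```python
import pytest


# Assumption: `garbage_expected` is the marker name the check_leak_ndarray
# fixture checks; a test carrying it skips the leak assertion at teardown.
@pytest.mark.garbage_expected
def test_get_symbol():
    # Placeholder body for illustration; the real test lives in the MXNet test suite.
    pass
```

With the marker applied, the fixture takes its early `yield`/`return` branch, so the test still runs but the `gc.garbage` assertion is skipped.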

szha commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18562/runs/5/nodes/364/steps/758/log/?start=0

szha commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-18562/runs/11/nodes/354/steps/501/log/?start=0

ERROR at teardown of test_grad_with_stype

leezu commented 3 years ago

ERROR at teardown of test_foreach

https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-19336/runs/4/nodes/284/steps/420/log/?start=0