Open eric-haibin-lin opened 4 years ago
As the flakiness occurs with mx.autograd.Function, which is "known to leak" (cf. test_function in the same file), I suggest marking the flaky test_function1 as "known to leak" as well. I'm not yet sure why test_function1 leaks only sometimes.
_____________________ ERROR at teardown of test_get_symbol _____________________
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z] request = <SubRequest 'check_leak_ndarray' for <Function test_get_symbol>>
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]     @pytest.fixture(autouse=True)
[2020-06-04T22:33:35.745Z]     def check_leak_ndarray(request):
[2020-06-04T22:33:35.745Z]         garbage_expected = request.node.get_closest_marker('garbage_expected')
[2020-06-04T22:33:35.745Z]         if garbage_expected:  # Some tests leak references. They should be fixed.
[2020-06-04T22:33:35.745Z]             yield  # run test
[2020-06-04T22:33:35.745Z]             return
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]         if 'centos' in platform.platform():
[2020-06-04T22:33:35.745Z]             # Multiple tests are failing due to reference leaks on CentOS. It's not
[2020-06-04T22:33:35.745Z]             # yet known why there are more memory leaks in the Python 3.6.9 version
[2020-06-04T22:33:35.745Z]             # shipped on CentOS compared to the Python 3.6.9 version shipped in
[2020-06-04T22:33:35.745Z]             # Ubuntu.
[2020-06-04T22:33:35.745Z]             yield
[2020-06-04T22:33:35.745Z]             return
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]         del gc.garbage[:]
[2020-06-04T22:33:35.745Z]         # Collect garbage prior to running the next test
[2020-06-04T22:33:35.745Z]         gc.collect()
[2020-06-04T22:33:35.745Z]         # Enable gc debug mode to check if the test leaks any arrays
[2020-06-04T22:33:35.745Z]         gc_flags = gc.get_debug()
[2020-06-04T22:33:35.745Z]         gc.set_debug(gc.DEBUG_SAVEALL)
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]         # Run the test
[2020-06-04T22:33:35.745Z]         yield
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]         # Check for leaked NDArrays
[2020-06-04T22:33:35.745Z]         gc.collect()
[2020-06-04T22:33:35.745Z]         gc.set_debug(gc_flags)  # reset gc flags
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]         seen = set()
[2020-06-04T22:33:35.745Z]         def has_array(element):
[2020-06-04T22:33:35.745Z]             try:
[2020-06-04T22:33:35.745Z]                 if element in seen:
[2020-06-04T22:33:35.745Z]                     return False
[2020-06-04T22:33:35.745Z]                 seen.add(element)
[2020-06-04T22:33:35.745Z]             except (TypeError, ValueError):  # unhashable
[2020-06-04T22:33:35.745Z]                 pass
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z]             if isinstance(element, mx.nd._internal.NDArrayBase):
[2020-06-04T22:33:35.745Z]                 return True
[2020-06-04T22:33:35.745Z]             elif isinstance(element, mx.sym._internal.SymbolBase):
[2020-06-04T22:33:35.745Z]                 return False
[2020-06-04T22:33:35.745Z]             elif hasattr(element, '__dict__'):
[2020-06-04T22:33:35.745Z]                 return any(has_array(x) for x in vars(element))
[2020-06-04T22:33:35.745Z]             elif isinstance(element, dict):
[2020-06-04T22:33:35.745Z]                 return any(has_array(x) for x in element.items())
[2020-06-04T22:33:35.745Z]             else:
[2020-06-04T22:33:35.745Z]                 try:
[2020-06-04T22:33:35.745Z]                     return any(has_array(x) for x in element)
[2020-06-04T22:33:35.745Z]                 except (TypeError, KeyError, RecursionError):
[2020-06-04T22:33:35.745Z]                     return False
[2020-06-04T22:33:35.745Z]
[2020-06-04T22:33:35.745Z] >       assert not any(has_array(x) for x in gc.garbage), 'Found leaked NDArrays due to reference cycles'
[2020-06-04T22:33:35.745Z] E       AssertionError: Found leaked NDArrays due to reference cycles
[2020-06-04T22:33:35.745Z] E       assert not True
[2020-06-04T22:33:35.745Z] E        +  where True = any(<generator object check_leak_ndarray.<locals>.<genexpr> at 0x7f8a046fb0a0>)
[2020-06-04T22:33:35.745Z]
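For anyone trying to reproduce this locally, here is a minimal sketch of the mechanism the fixture in the traceback relies on: with `gc.DEBUG_SAVEALL` set, objects that only the cycle collector can free are appended to `gc.garbage` instead of being released. `FakeArray`, `Holder`, and `find_cyclic_garbage` are hypothetical names used for illustration only, and `FakeArray` merely stands in for an NDArray. This is not the MXNet fixture itself.

```python
import gc


class FakeArray:
    """Hypothetical stand-in for an NDArray; not part of MXNet."""


class Holder:
    """Owns a FakeArray and refers to itself, forming a reference cycle."""

    def __init__(self):
        self.array = FakeArray()
        self.me = self  # the cycle: refcounting alone can never free this


def find_cyclic_garbage(func):
    """Run func and return the objects only the cycle collector could free."""
    gc.collect()            # clear out garbage left by earlier code
    del gc.garbage[:]
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
    try:
        func()
        gc.collect()
    finally:
        gc.set_debug(old_flags)     # always restore the previous flags
    found = list(gc.garbage)
    del gc.garbage[:]
    return found


def leaky():
    Holder()      # dropped immediately, but kept alive by its self-reference


def clean():
    FakeArray()   # no cycle, freed by reference counting right away


def leaks_array(func):
    return any(isinstance(obj, FakeArray) for obj in find_cyclic_garbage(func))


print("leaky leaks an array:", leaks_array(leaky))
print("clean leaks an array:", leaks_array(clean))
```

If a test fails this kind of check only intermittently, one thing worth ruling out is a cycle that is created on some runs but not others (for example, depending on caching or execution order). That is a guess, not a diagnosis of `test_get_symbol`.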
Happened again for test_get_symbol: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18525/8/pipeline
And a third time. I'm not sure why this happens from time to time or why it only affects test_get_symbol, but let's disable the check for test_get_symbol in favor of CI stability: https://github.com/apache/incubator-mxnet/pull/18595
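For reference, a sketch of what disabling the check for a single test could look like. The fixture in the traceback skips the leak check when the test carries a `garbage_expected` marker, so marking the test should be enough. The test body below is a placeholder, and the `pytest_configure` hook is an assumption (it would normally live in `conftest.py`, and the project may already register the marker elsewhere). I have not checked this against the linked PR.

```python
import pytest


def pytest_configure(config):
    # Assumed conftest.py hook: registers the marker so pytest does not
    # warn about an unknown mark. The project may already do this elsewhere.
    config.addinivalue_line(
        "markers",
        "garbage_expected: test is known to leave reference cycles",
    )


# Marker name taken from the check_leak_ndarray fixture in the traceback:
# request.node.get_closest_marker('garbage_expected')
@pytest.mark.garbage_expected
def test_get_symbol():
    # Placeholder body; the real test lives in the MXNet test suite.
    assert True
```

The marker only silences the check for this one test, so leaks in other tests would still be reported.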
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18394/1/pipeline
@leezu