cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) #17782

Open leezu opened 4 years ago

leezu commented 4 years ago

Description

On OSX, test_gluon_data.test_multi_worker_dataloader_release_pool frequently fails with ... libc++abi.dylib: terminating with uncaught exception of type dmlc::Error: [06:16:36] ../src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) :

https://github.com/apache/incubator-mxnet/blob/db2295b0f39852c3af2878932f48a672982f42d4/src/storage/cpu_shared_storage_manager.h#L216-L242
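
For context, the pattern this test exercises looks roughly like the sketch below (a minimal illustration using only the public Gluon API, not the test itself); batches returned by a multi-worker DataLoader are backed by shared-memory NDArrays whose usage count is tracked in cpu_shared_storage_manager.h:

import gc
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# A small in-memory dataset; num_workers > 0 forks worker processes that
# hand batches back through shared-memory NDArrays.
dataset = ArrayDataset(mx.nd.random.uniform(shape=(64, 8)), mx.nd.arange(64))
loader = DataLoader(dataset, batch_size=8, num_workers=2)

for data, label in loader:
    data.wait_to_read()

# Releasing the loader tears down the worker pool; the shared-memory usage
# counts are decremented as the batches are garbage collected.
del loader
gc.collect()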

CC: @eric-haibin-lin

eric-haibin-lin commented 4 years ago

@zhreshold have you seen this before?

zhreshold commented 4 years ago

I don't think I had seen such an error log before Leo sent me the log.

leezu commented 4 years ago

Need to revert https://github.com/apache/incubator-mxnet/pull/17797/ when fixing this issue

zixuanweeei commented 4 years ago

We also hit this problem with a from-source build on Linux, using the following command:

make -j DEBUG=1 ENABLE_TESTCOVERAGE=0 USE_CPP_PACKAGE=0 USE_MKLDNN=1 USE_BLAS=blas USE_SIGNAL_HANDLER=1
leezu commented 4 years ago

@zixuanweeei can you share further instructions on how to reproduce this on Linux? I was unable to reproduce the failure with the cmake build on Linux.

zixuanweeei commented 4 years ago

Tested at 1efb747ad5cf695f181c64bfb16daccd6ca4c456 using the following command:

MXNET_SUBGRAPH_VERBOSE=0 nosetests -s -v --logging-level=DEBUG tests/python/unittest/test_gluon_data.py

It outputs:

[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1673400754 to reproduce.
test_gluon_data.test_array_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1615701232 to reproduce.
ok
test_gluon_data.test_recordimage_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=187503193 to reproduce.
ok
test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1225456503 to reproduce.
ok
test_gluon_data.test_sampler ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1061742723 to reproduce.
ok
test_gluon_data.test_datasets ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1762025483 to reproduce.
ok
test_gluon_data.test_image_folder_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731150274 to reproduce.
ok
test_gluon_data.test_list_dataset ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1768153160 to reproduce.
ok
test_gluon_data.test_multi_worker ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1350251679 to reproduce.
ok
test_gluon_data.test_multi_worker_shape ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1650045375 to reproduce.
ok
test_gluon_data.test_multi_worker_forked_data_loader ... [DEBUG] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=589614560 to reproduce.
terminate called after throwing an instance of 'dmlc::Error'
  what():  [08:54:22] src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) :

[1]    63713 abort      MXNET_SUBGRAPH_VERBOSE=0 nosetests -s -v --logging-level=DEBUG
ptrendx commented 3 years ago

We encountered this issue when moving to Ubuntu 20.04 and Python 3.8. The issue does not happen during any single test, but running multiple data-loader tests in a row triggers it. The repro we used for the investigation was:

nosetests --verbose -s tests/python/unittest/test_gluon_data.py:test_multi_worker{,_shape,_forked_data_loader,_dataloader_release_pool}

The root cause seems to be how Python memory management interacts with forking. If some shared-memory NDArrays are still alive when the new dataloader from the subsequent test is created (because of garbage collection or a still-running operation), the child processes get copies of those NDArrays without the usage count on them actually being incremented, since the usage counter, like the actual data, lives in the shared memory region rather than in the memory space of the process calling fork. After the fork, when Python garbage collection kicks in, all of the processes (the parent actually holding the NDArray in the first place as well as its children) try to destroy the NDArray. Now two scenarios can happen:
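
Whichever ordering occurs, the net effect is more decrements than increments on the shared count. A rough illustration in plain Python (a hypothetical sketch using os.fork and multiprocessing.Value, not MXNet internals):

import os
from multiprocessing import Value

# The usage count lives in shared memory, just like the NDArray data, and
# records a single owner before the fork.
count = Value("i", 1)

pid = os.fork()

# Both the parent and the forked child believe they own the buffer and
# decrement on cleanup, so the shared count drops below zero -- the state
# that the "count >= 0" check in cpu_shared_storage_manager.h catches.
with count.get_lock():
    count.value -= 1
    print(f"pid {os.getpid()} sees count = {count.value}")

if pid == 0:
    os._exit(0)  # child exits without running any further cleanup
os.waitpid(pid, 0)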

We intend to work around this issue for our upcoming release by inserting

mx.nd.waitall()
import gc
gc.collect()

in the Gluon DataLoader constructor (which made the error disappear in our tests), but a more robust solution should be devised (maybe increment the reference counts of all shared-memory arrays during fork?).
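
For reference, the same workaround can be applied from user code without patching MXNet, e.g. with a small helper (hypothetical name make_dataloader; a sketch of the idea, not the actual patch):

import gc
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

def make_dataloader(dataset, **kwargs):
    # Drain pending asynchronous work so no shared-memory NDArrays are still
    # in flight, then force a collection so unreachable NDArrays are freed
    # before the DataLoader forks its workers.
    mx.nd.waitall()
    gc.collect()
    return DataLoader(dataset, **kwargs)

dataset = ArrayDataset(mx.nd.random.uniform(shape=(32, 4)), mx.nd.arange(32))
loader = make_dataloader(dataset, batch_size=8, num_workers=2)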