Closed: nichannah closed this issue 4 years ago
The crash does not happen when I disable all oasis puts/gets. There are still warnings related to unreturned memory.
This was a problem with the test code, which had unbalanced oasis put/get calls. Older versions of openmpi tolerated this, but it causes a crash on the newer version (4.0.1). The fix has been merged into the 29-use-submodules branch.
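For anyone chasing a similar crash, here is a minimal sketch of how an unbalanced put/get pattern can be spotted before finalize. It is an illustration only: it does not call OASIS or MPI, and `CouplingLedger`, its methods, and the field name `swflx` are hypothetical names made up for this example. In a real test the counts would have to come from instrumentation around the actual coupler calls.

```python
from collections import Counter


class CouplingLedger:
    """Hypothetical helper: counts puts and gets per coupling field."""

    def __init__(self):
        self.puts = Counter()
        self.gets = Counter()

    def record_put(self, field):
        self.puts[field] += 1

    def record_get(self, field):
        self.gets[field] += 1

    def unbalanced(self):
        """Return {field: puts - gets} for every field whose counts differ."""
        fields = set(self.puts) | set(self.gets)
        return {
            f: self.puts[f] - self.gets[f]
            for f in sorted(fields)
            if self.puts[f] != self.gets[f]
        }


# Simulated run: the sender puts on every step, the receiver skips the last one.
ledger = CouplingLedger()
for step in range(3):
    ledger.record_put("swflx")  # hypothetical field name
    if step < 2:
        ledger.record_get("swflx")

print(ledger.unbalanced())  # {'swflx': 1}
```

A non-empty result means at least one put has no matching get (or the reverse), which is the kind of mismatch that was present in the test code here.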
The JRA55_IAF test case in libaccessom2 is crashing on termination on the new machine (Gadi) with new openmpi libraries.
The error message (pasted below) suggests that not all MPI resources are being cleaned up properly. Since this is a very self-contained test case, it should hopefully be possible to find the problem by code review.
[1579482456.666229] [gadi-cpu-clx-0455:61857:0] mpool.c:38 UCX WARN object 0x11bf980 was not returned to mpool ucp_requests
[1579482456.666232] [gadi-cpu-clx-0455:61857:0] mpool.c:38 UCX WARN object 0x11bfb40 was not returned to mpool ucp_requests
[1579482456.680676] [gadi-cpu-clx-0455:61859:0] rcache.c:360 UCX WARN knem rcache device: destroying inuse region 0x2c7ac10 [0x2d4bc40..0x2e1eb40] g- rw ref 1 cookie 716100775085951232 addr 0x2d4bc40
[gadi-cpu-clx-0455:61859:0:61859] rcache.c:200 Assertion `region->refcount == 0' failed
==== backtrace (tid: 61859) ====
0 0x0000000000051959 ucs_fatal_error_message() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:36
1 0x0000000000051a36 ucs_fatal_error_format() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:52
2 0x00000000000562f0 ucs_mem_region_destroy_internal() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:200
3 0x000000000005c6c6 ucs_class_call_cleanup_chain() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/type/class.c:52
4 0x0000000000056f38 ucs_rcache_destroy() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:729
5 0x00000000000030f2 uct_knem_md_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/uct/sm/knem/../../../../../src/uct/sm/knem/knem_md.c:91
6 0x000000000000f1c9 ucp_free_resources() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:710
7 0x000000000000f1c9 ucp_cleanup() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:1266
8 0x0000000000005bcc mca_pml_ucx_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:247
9 0x0000000000007909 mca_pml_ucx_component_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:82
10 0x00000000000582b9 mca_base_component_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:53
11 0x0000000000058345 mca_base_components_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:85
12 0x0000000000058345 mca_base_components_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:86
13 0x00000000000621da mca_base_framework_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_framework.c:216
14 0x000000000004f479 ompi_mpi_finalize() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/../../ompi/runtime/ompi_mpi_finalize.c:363
15 0x000000000004ac29 ompi_finalize_f() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/intel-opt/ompi/mpi/fortran/mpif-h/profile/pfinalize_f.c:71
16 0x000000000041a5b8 accessom2_mod_mp_accessom2deinit() ???:0
17 0x000000000040e768 MAIN.a() ocean.F90:0
18 0x000000000040c9e2 main() ???:0
19 0x0000000000023813 libc_start_main() ???:0
20 0x000000000040c8ee _start() ???:0