It looks like all_gpu_id_array is not freed when KMT is unloaded, so every additional initialization of KMT in the same process leaks the array again. hsakmt_fmm_destroy_process_apertures frees the other global (gpu_mem) but does not free all_gpu_id_array, which it should.
From ASAN:
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x5ff5b2387bcf in malloc (/home/nod/src/iree-build/runtime/src/iree/hal/drivers/amdgpu/cts/amdgpu_all_driver_test+0x223bcf) (BuildId: 1530ccada4eb72df)
#1 0x74e567024f56 in hsakmt_fmm_init_process_apertures /home/nod/src/ROCR-Runtime/libhsakmt/src/fmm.c:2642:22
#2 0x74e567034da9 in hsaKmtAcquireSystemProperties /home/nod/src/ROCR-Runtime/libhsakmt/src/topology.c:2190:8
#3 0x74e566ea3a10 in rocr::AMD::BuildTopology() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/amd_topology.cpp:306:36
#4 0x74e566ea420e in rocr::AMD::Load() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/amd_topology.cpp:433:18
#5 0x74e566ee96c2 in rocr::core::Runtime::Load() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/runtime.cpp:1995:17
#6 0x74e566ee0945 in rocr::core::Runtime::Acquire() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/runtime.cpp:140:51
#7 0x74e566eaaf83 in rocr::HSA::hsa_init() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/hsa.cpp:206:52
#8 0x74e566f567f5 in hsa_init /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/common/hsa_table_interface.cpp:70:35
#9 0x5ff5b243eeed in iree_hsa_init /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/util/libhsa_tables.h:11:1
#10 0x5ff5b243e426 in iree_hal_amdgpu_libhsa_initialize /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/util/libhsa.c:498:14
#11 0x5ff5b2400e80 in iree_hal_amdgpu_driver_load_libhsa /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/driver.c:231:26
#12 0x5ff5b2400b63 in iree_hal_amdgpu_driver_create /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/driver.c:270:26
#13 0x5ff5b23d4222 in iree_hal_amdgpu_driver_factory_try_create /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/registration/driver_module.c:40:26
#14 0x5ff5b23fffaa in iree_hal_driver_registry_try_create /home/nod/src/iree/runtime/src/iree/hal/driver_registry.c:314:14
#15 0x5ff5b23c94f9 in iree::hal::cts::TryGetDriver(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, iree_hal_driver_t**) /home/nod/src/iree/runtime/src/iree/hal/cts/cts_test_base.h:73:26
#16 0x5ff5b23ca866 in iree::hal::cts::DriverTest::CreateDriver() /home/nod/src/iree/runtime/src/iree/hal/cts/driver_test.h:38:14
#17 0x5ff5b23c81ee in iree::hal::cts::DriverTest_QueryAndCreateAvailableDevicesByOrdinal_Test::TestBody() /home/nod/src/iree/runtime/src/iree/hal/cts/driver_test.h:103:17
#18 0x5ff5b2525ce8 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2635:10
#19 0x5ff5b24e5491 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2671:14
#20 0x5ff5b2498e23 in testing::Test::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2710:5
#21 0x5ff5b249a796 in testing::TestInfo::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2856:11
#22 0x5ff5b249bde6 in testing::TestSuite::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:3034:30
#23 0x5ff5b24bef9e in testing::internal::UnitTestImpl::RunAllTests() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:5964:44
#24 0x5ff5b252f928 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2635:10
#25 0x5ff5b24ea6b6 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2671:14
#26 0x5ff5b24be225 in testing::UnitTest::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:5543:10
#27 0x5ff5b2400690 in RUN_ALL_TESTS() /home/nod/src/iree/third_party/googletest/googletest/include/gtest/gtest.h:2334:73
#28 0x5ff5b24005b3 in main /home/nod/src/iree/runtime/src/iree/testing/gtest_main.cc:20:13
#29 0x74e575c29d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
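For illustration, here is a minimal sketch of the init/destroy pairing that would avoid the leak. It is not the actual fmm.c code: every `sketch_`-prefixed name, the element types, and the sizes are hypothetical stand-ins for the real globals and for hsakmt_fmm_init_process_apertures / hsakmt_fmm_destroy_process_apertures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the two globals named in the report.
 * The real types and layout in fmm.c may differ. */
static void *sketch_gpu_mem;               /* already freed on destroy, per the report */
static uint32_t *sketch_all_gpu_id_array;  /* the array ASAN reports as leaked */
static uint32_t sketch_all_gpu_id_array_size;

/* Models the init path: both globals are heap-allocated here. */
static int sketch_init_process_apertures(uint32_t num_gpus) {
    /* Init must be paired with a prior destroy, otherwise pointers are overwritten. */
    assert(sketch_gpu_mem == NULL && sketch_all_gpu_id_array == NULL);
    if (num_gpus == 0) return -1;

    sketch_gpu_mem = calloc(num_gpus, 64); /* 64 is an arbitrary placeholder size */
    if (!sketch_gpu_mem) return -1;

    sketch_all_gpu_id_array = malloc(num_gpus * sizeof(uint32_t));
    if (!sketch_all_gpu_id_array) {
        free(sketch_gpu_mem);
        sketch_gpu_mem = NULL;
        return -1;
    }
    sketch_all_gpu_id_array_size = num_gpus;
    return 0;
}

/* Models the destroy path with the missing cleanup added. */
static void sketch_destroy_process_apertures(void) {
    free(sketch_gpu_mem);
    sketch_gpu_mem = NULL;

    /* The step the report says is missing: release the id array as well
     * and reset its bookkeeping so a later init starts from a clean state. */
    free(sketch_all_gpu_id_array);
    sketch_all_gpu_id_array = NULL;
    sketch_all_gpu_id_array_size = 0;
}
```

With both globals released and reset in the destroy path, calling the init/destroy pair repeatedly in one process (as the CTS driver tests do via hsa_init) should leave nothing for ASAN to report from this allocation site.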