ROCm / ROCR-Runtime

ROCm Platform Runtime: ROCr, an HPC-market-enhanced, HSA-based runtime
https://rocm.docs.amd.com/projects/ROCR-Runtime/en/latest/

[Issue]: leak of `all_gpu_id_array` global in KMT. #255

Open benvanik opened 1 day ago

benvanik commented 1 day ago

It looks like `all_gpu_id_array` is not freed when KMT is unloaded. If KMT is initialized multiple times in the same process, the array leaks once per initialization. `hsakmt_fmm_destroy_process_apertures` seems to clean up the other global (`gpu_mem`) but does not free `all_gpu_id_array`, which it should as well.
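To illustrate the pattern, here is a minimal sketch of an init/destroy pair where the destroy path frees the global array and resets it, so repeated load/unload cycles do not leak. This is not the actual libhsakmt code: every name (`sketch_gpu_ids`, `sketch_init`, `sketch_destroy`, `sketch_live_allocations`) is hypothetical, and the allocation counter exists only to make the behavior checkable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for library globals; NOT the real libhsakmt code. */
static uint32_t *sketch_gpu_ids = NULL; /* plays the role of all_gpu_id_array */
static size_t sketch_gpu_id_count = 0;
static int sketch_live_allocs = 0;      /* outstanding allocations, for checking only */

/* Init path: allocates the global array. Returns 0 on success, -1 on failure
 * or if the array is already allocated. */
int sketch_init(size_t num_gpus) {
  if (sketch_gpu_ids != NULL) return -1; /* already initialized */
  /* Allocate at least one element so a zero GPU count still succeeds. */
  sketch_gpu_ids = calloc(num_gpus ? num_gpus : 1, sizeof(*sketch_gpu_ids));
  if (sketch_gpu_ids == NULL) return -1;
  sketch_gpu_id_count = num_gpus;
  sketch_live_allocs++;
  return 0;
}

/* Destroy path: frees the array and resets the globals so that a later
 * init starts from a clean state. Omitting the free() here is the kind of
 * omission that would leak one array per init/destroy cycle. */
void sketch_destroy(void) {
  if (sketch_gpu_ids != NULL) {
    free(sketch_gpu_ids);
    sketch_live_allocs--;
  }
  sketch_gpu_ids = NULL;
  sketch_gpu_id_count = 0;
}

/* Number of arrays currently allocated and not yet freed. */
int sketch_live_allocations(void) { return sketch_live_allocs; }
```

Calling `sketch_init` and `sketch_destroy` in a loop should leave `sketch_live_allocations()` at 0 after each cycle. Resetting the pointer to `NULL` after `free` also keeps a second destroy call harmless.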

From ASAN:

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x5ff5b2387bcf in malloc (/home/nod/src/iree-build/runtime/src/iree/hal/drivers/amdgpu/cts/amdgpu_all_driver_test+0x223bcf) (BuildId: 1530ccada4eb72df)
    #1 0x74e567024f56 in hsakmt_fmm_init_process_apertures /home/nod/src/ROCR-Runtime/libhsakmt/src/fmm.c:2642:22
    #2 0x74e567034da9 in hsaKmtAcquireSystemProperties /home/nod/src/ROCR-Runtime/libhsakmt/src/topology.c:2190:8
    #3 0x74e566ea3a10 in rocr::AMD::BuildTopology() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/amd_topology.cpp:306:36
    #4 0x74e566ea420e in rocr::AMD::Load() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/amd_topology.cpp:433:18
    #5 0x74e566ee96c2 in rocr::core::Runtime::Load() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/runtime.cpp:1995:17
    #6 0x74e566ee0945 in rocr::core::Runtime::Acquire() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/runtime.cpp:140:51
    #7 0x74e566eaaf83 in rocr::HSA::hsa_init() /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/runtime/hsa.cpp:206:52
    #8 0x74e566f567f5 in hsa_init /home/nod/src/ROCR-Runtime/runtime/hsa-runtime/core/common/hsa_table_interface.cpp:70:35
    #9 0x5ff5b243eeed in iree_hsa_init /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/util/libhsa_tables.h:11:1
    #10 0x5ff5b243e426 in iree_hal_amdgpu_libhsa_initialize /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/util/libhsa.c:498:14
    #11 0x5ff5b2400e80 in iree_hal_amdgpu_driver_load_libhsa /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/driver.c:231:26
    #12 0x5ff5b2400b63 in iree_hal_amdgpu_driver_create /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/driver.c:270:26
    #13 0x5ff5b23d4222 in iree_hal_amdgpu_driver_factory_try_create /home/nod/src/iree/runtime/src/iree/hal/drivers/amdgpu/registration/driver_module.c:40:26
    #14 0x5ff5b23fffaa in iree_hal_driver_registry_try_create /home/nod/src/iree/runtime/src/iree/hal/driver_registry.c:314:14
    #15 0x5ff5b23c94f9 in iree::hal::cts::TryGetDriver(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, iree_hal_driver_t**) /home/nod/src/iree/runtime/src/iree/hal/cts/cts_test_base.h:73:26
    #16 0x5ff5b23ca866 in iree::hal::cts::DriverTest::CreateDriver() /home/nod/src/iree/runtime/src/iree/hal/cts/driver_test.h:38:14
    #17 0x5ff5b23c81ee in iree::hal::cts::DriverTest_QueryAndCreateAvailableDevicesByOrdinal_Test::TestBody() /home/nod/src/iree/runtime/src/iree/hal/cts/driver_test.h:103:17
    #18 0x5ff5b2525ce8 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2635:10
    #19 0x5ff5b24e5491 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2671:14
    #20 0x5ff5b2498e23 in testing::Test::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2710:5
    #21 0x5ff5b249a796 in testing::TestInfo::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2856:11
    #22 0x5ff5b249bde6 in testing::TestSuite::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:3034:30
    #23 0x5ff5b24bef9e in testing::internal::UnitTestImpl::RunAllTests() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:5964:44
    #24 0x5ff5b252f928 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2635:10
    #25 0x5ff5b24ea6b6 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:2671:14
    #26 0x5ff5b24be225 in testing::UnitTest::Run() /home/nod/src/iree/third_party/googletest/googletest/src/gtest.cc:5543:10
    #27 0x5ff5b2400690 in RUN_ALL_TESTS() /home/nod/src/iree/third_party/googletest/googletest/include/gtest/gtest.h:2334:73
    #28 0x5ff5b24005b3 in main /home/nod/src/iree/runtime/src/iree/testing/gtest_main.cc:20:13
    #29 0x74e575c29d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
ppanchad-amd commented 1 day ago

Hi @benvanik. An internal ticket has been created to investigate your issue. Thanks!