ROCm / clr

MIT License
85 stars 35 forks source link

hipamd: SIGSEGV when compiled with -march=znver4 #18

Open AngryLoki opened 8 months ago

AngryLoki commented 8 months ago

Due to unaligned allocations, library crashes in nontemporalMemcpy in _mm512_stream_si512 (which requires 64-aligned allocations, but used to copy default-aligned objects) in https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/rocclr/device/rocm/rocvirtual.cpp#L2793

Originally reported to https://bugs.gentoo.org/915969 as a part of rocBLAS and miopen update (failure in hipamd module loader causes crash in dependent libraries).

iassiour commented 8 months ago

I will create an internal PR to fix this. The PARAMETERS_MIN_ALIGNMENT https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/utils/flags.hpp#L55 should be set to the native alignment.

AngryLoki commented 8 months ago

Hi, unfortunately, setting PARAMETERS_MIN_ALIGNMENT seems to be not enough. Code still crashes in nontemporalMemcpy in in rocThrust tests (and supposedly in other libraries that pass unaligned pointers to hipMemcpyAsync).

The traceback is:

libamdhip64.so.5!roc::nontemporalMemcpy(uint16_t size, const void * __restrict__ src, void * __restrict__ dst) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:2799)
libamdhip64.so.5!roc::VirtualGPU::submitKernelInternal(roc::VirtualGPU * const this, const amd::NDRangeContainer & sizes,  kernel, const_address parameters, void * eventHandle, uint32_t sharedMemBytes, amd::NDRangeKernelCommand * vcmd, hsa_kernel_dispatch_packet_t * aql_packet) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:3099)
libamdhip64.so.5!roc::KernelBlitManager::copyBuffer(const roc::KernelBlitManager * const this,  srcMemory,  dstMemory,  srcOrigin,  dstOrigin,  sizeIn, bool entire, amd::CopyMetadata copyMetadata) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocblit.hpp:217)
libamdhip64.so.5!roc::VirtualGPU::copyMemory(roc::VirtualGPU * const this, cl_command_type type, amd::Memory & srcMem, amd::Memory & dstMem, bool entire, const amd::Coord3D & srcOrigin, const amd::Coord3D & dstOrigin, const amd::Coord3D & size, const amd::BufferRect & srcRect, const amd::BufferRect & dstRect, amd::CopyMetadata copyMetadata) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:1832)
libamdhip64.so.5!roc::VirtualGPU::submitCopyMemory(roc::VirtualGPU * const this, amd::CopyMemoryCommand & cmd) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:1881)
libamdhip64.so.5!amd::Command::enqueue(amd::Command * const this) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\platform\command.cpp:393)
libamdhip64.so.5!ihipMemcpy(void * dst, const void * src, size_t sizeBytes, hipMemcpyKind kind, hip::Stream & stream, bool isHostAsync, bool isGPUAsync) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\hipamd\src\hip_memory.cpp:502)
libamdhip64.so.5!hipMemcpyAsync_common(void * dst, const void * src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\hipamd\src\hip_memory.cpp:1422)
libamdhip64.so.5!hipMemcpyAsync(void * dst, const void * src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\hipamd\src\hip_memory.cpp:1428)
thrust::system::hip::detail::async_copy_n<thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag>, thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag>, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, long>(thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & from_exec, thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & to_exec, thrust::detail::normal_iterator<thrust::device_ptr<short> > first, long n, thrust::detail::normal_iterator<thrust::device_ptr<short> > output) (\src\rocThrust\thrust\system\hip\detail\async\copy.h:132)
thrust::hip_rocprim::async_copy<thrust::hip_rocprim::tag, thrust::hip_rocprim::tag, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >(thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & from_exec, thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & to_exec, thrust::detail::normal_iterator<thrust::device_ptr<short> > first, thrust::detail::normal_iterator<thrust::device_ptr<short> > last, thrust::detail::normal_iterator<thrust::device_ptr<short> > output) (\src\rocThrust\thrust\system\hip\detail\async\copy.h:533)
thrust::async::copy_detail::copy_fn::call<thrust::hip_rocprim::tag, thrust::hip_rocprim::tag, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >(const thrust::detail::execution_policy_base<thrust::hip_rocprim::tag> & from_exec, const thrust::detail::execution_policy_base<thrust::hip_rocprim::tag> & to_exec) (\src\rocThrust\thrust\async\copy.h:86)
thrust::async::copy_detail::copy_fn::call<thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >() (\src\rocThrust\thrust\async\copy.h:121)
thrust::async::copy_detail::copy_fn::operator()<thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >() (\src\rocThrust\thrust\async\copy.h:137)
invoke_async_copy_fn::operator()<thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >() (\src\rocThrust\test\test_async_copy.cpp:30)
AsyncCopyDeviceToHost<short, invoke_async_copy_fn>() (\src\rocThrust\test\test_async_copy.cpp:116)
AsyncCopyTests_TestAsyncCopyTriviallyRelocatableDeviceToHost_Test<Params<short, thrust::hip_rocprim::par_t> >::TestBody(AsyncCopyTests_TestAsyncCopyTriviallyRelocatableDeviceToHost_Test<Params<short, thrust::hip_rocprim::par_t> > * this) (\src\rocThrust\test\test_async_copy.cpp:132)
libgtest.so.1.13.0!void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (Unknown Source:0)
libgtest.so.1.13.0!testing::Test::Run() (Unknown Source:0)
libgtest.so.1.13.0!testing::TestInfo::Run() (Unknown Source:0)

It looks like thrust::host_vector and thrust::device_vector are using std::allocator, then it eventually appears in nontemporalMemcpy.

Should I report this to rocThrust repo?

Upd: according to https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#gab388b755a0ee2c86ca5e0c29391a584c , there are no alignment requirements for hipMemcpyAsync pointers, so I think it should be fixed in clr.

iassiour commented 8 months ago

Thank you @AngryLoki let me raise the issue internally and I will get back shortly.

iassiour commented 8 months ago

Hi @AngryLoki the nontemporal copy that is shown here is irrelevant to the passed pointers in hipMemcpyAsync. This is a copy of the kernel argument stack instead and this is managed internally.

I see some offsets are still calculated based on the 16 alignment.

https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/platform/kernel.hpp#L160 https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/platform/kernel.hpp#L162 https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/platform/kernel.hpp#L223

Could you please try to change these lines to the native alignment and let me know if it fixes the issue. I will followup with a permanent fix in that case.

AngryLoki commented 8 months ago

Hi @iassiour ,

sorry for delay, I think I don't fully understand what exactly I need to change in the first link (wrong line number?). I applied this:

--- a/rocclr/platform/kernel.hpp
+++ b/rocclr/platform/kernel.hpp
@@ -159,7 +159,7 @@ class KernelParameters : protected HeapObject {
         deviceKernelArgs_(false) {
     totalSize_ = signature.paramsSize() + (signature.numMemories() +
         signature.numSamplers() + signature.numQueues()) * sizeof(void*);
-    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), 16);
+    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), NATIVE_ALIGNEMENT_SIZE);
     memoryObjOffset_ = signature_.paramsSize();
     memoryObjects_ = reinterpret_cast<amd::Memory**>(values_ + memoryObjOffset_);
     samplerObjOffset_ = memoryObjOffset_ + signature_.numMemories() * sizeof(amd::Memory*);
@@ -183,7 +183,7 @@ class KernelParameters : protected HeapObject {
         execNewVcop_(rhs.execNewVcop_),
         execPfpaVcop_(rhs.execPfpaVcop_),
         deviceKernelArgs_(false) {
-    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), 16);
+    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), NATIVE_ALIGNEMENT_SIZE);
     memoryObjOffset_ = signature_.paramsSize();
     memoryObjects_ = reinterpret_cast<amd::Memory**>(values_ + memoryObjOffset_);
     samplerObjOffset_ = memoryObjOffset_ + signature_.numMemories() * sizeof(amd::Memory*);
@@ -220,7 +220,7 @@ class KernelParameters : protected HeapObject {
   //! Allocate memory for this instance as well as the required storage for
   //  the values_, defined_, and rawPointer_ arrays.
   void* operator new(size_t size, const KernelSignature& signature) {
-    size_t requiredSize = alignUp(size, 16) + signature.paramsSize() +
+    size_t requiredSize = alignUp(size, NATIVE_ALIGNEMENT_SIZE) + signature.paramsSize() +
       (signature.numMemories() + signature.numSamplers() + signature.numQueues()) *
        sizeof(void*);
     return AlignedMemory::allocate(requiredSize, PARAMETERS_MIN_ALIGNMENT);

it does not crash anymore with this change in nontemporalMemcpy, but it makes one subtest (TestBinarySearchDevice) in one rocThrust test fail.

Summary: 1) after recompiling clr without -march test succeeds 2) after recompiling clr with -march=znver4 and memcpy (initial idea) test succeeds 3) after recompiling clr with NATIVE_ALIGNEMENT_SIZE one subtest fails in https://github.com/ROCmSoftwarePlatform/rocThrust/blob/rocm-5.7.1/test/test_binary_search.cpp#L731 with

[ RUN      ] BinarySearchTests.TestBinarySearchDevice
/src/rocThrust/test/test_binary_search.cpp:731: Failure
Expected equality of these values:
  h_result
    Which is: { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }
  d_result
    Which is: { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }
Google Test trace:
/src/rocThrust/test/test_binary_search.cpp:708: with seed= 1
/src/rocThrust/test/test_binary_search.cpp:704: with size= 1048453
/src/rocThrust/test/test_binary_search.cpp:701: with device_id= 0
[  FAILED  ] BinarySearchTests.TestBinarySearchDevice (15571 ms)

Note the large size; test succeeds with smaller vectors and with size=1048453 ((1 << 20) - 123) it always fails, so I guess my NATIVE_ALIGNEMENT_SIZE patch is not complete.

AngryLoki commented 8 months ago

By the way, here is another idea to fix this issue once and forever, which I learned recently.

I described it in https://github.com/ROCmSoftwarePlatform/rocRAND/issues/403: if you require C++17 in CMakeLists.txt and add __attribute__((ext_vector_type(4)))-family hints, you can get rid of manual alignUp's and new/delete operators. Maybe you'll find it useful.

iassiour commented 7 months ago

Thank you for the pointers @AngryLoki I have not managed to reproduce the issue in rocThrust TestBinarySearchDevice yet but the changes you made above in operator new and the constructors in kernel.hpp are correct and in any case are required in order to make the current implementation that uses nonTemmporalCopy copy work with avx/avx512. I will create a PR internally to add these, it should appear in the next release.

LtdJorge commented 7 months ago

Same happens to me on Gentoo (awell) running the tests from the Orochi project (from gpuopen), using -march=znver2. It segfaults in the memset. I tried building everything with Clang at first, thinking it was an issue with GCC LTO, but when I discovered this issue, I tried disabling the AVX/2 options with: COMMON_FLAGS="-O3 -march=znver2 -mno-avx -mno-avx2 -g" Now it runs correctly, as it was doing before I updated my system and newer HIP libraries got installed.

I'll subscribe here and to the Gentoo PR to get notified when it's fixed and I can enable AVX. Thank you @AngryLoki for the deep dive and finding the root cause.

gotzl commented 7 months ago

I just want to add that this issue also applies to amdocl, as described in #31. The bug is inrocclr, which is common to hip and opencl build targets.