StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

[BUG] HTR failure on lassen #1755

Closed: seemamirch closed this issue 2 months ago

seemamirch commented 2 months ago

CUDA version: 11.8

Starting with Legion commit ea1eaee73 ("realm: Add ATS/HMM support for DMA paths"), HTR fails with the assert below:

averageTest.exec: /usr/WS1/mirchandaney1/legion_latest/runtime/realm/cuda/cuda_internal.cc:88: virtual int Realm::Cuda::AddressInfoCudaArray::set_rect(const Realm::RegionInstanceImpl*, const Realm::InstanceLayoutPieceBase*, size_t, size_t, int, const int64_t*, const int64_t*, const int*): Assertion `ms' failed.

seemamirch commented 2 months ago

More debugging on this @muraj

(gdb) bt
#0  0x000020000605fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x000020000606200c in __GI_abort () at abort.c:90
#2  0x00002000060557d4 in __assert_fail_base (fmt=0x2000061bb7d0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x20000499c2b8 "ms",
    file=0x20000499c248 "/usr/WS1/mirchandaney1/legion_latest/runtime/realm/cuda/cuda_internal.cc", line=<optimized out>, function=<optimized out>) at assert.c:92
#3  0x00002000060558c4 in __GI___assert_fail (assertion=0x20000499c2b8 "ms", file=0x20000499c248 "/usr/WS1/mirchandaney1/legion_latest/runtime/realm/cuda/cuda_internal.cc", line=<optimized out>,
    function=0x20000499d648 <Realm::Cuda::AddressInfoCudaArray::set_rect(Realm::RegionInstanceImpl const*, Realm::InstanceLayoutPieceBase const*, unsigned long, unsigned long, int, long const*, long const*, int const*)::__PRETTY_FUNCTION__> "virtual int Realm::Cuda::AddressInfoCudaArray::set_rect(const Realm::RegionInstanceImpl*, const Realm::InstanceLayoutPieceBase*, size_t, size_t, int, const int64_t*, const int64_t*, const int*)") at assert.c:101
#4  0x0000200003d50320 in Realm::Cuda::AddressInfoCudaArray::set_rect (this=0x200027f1d730, inst=0x20004f24adb0, piece=0x20004f4f08e0, field_size=8, field_offset=0, ndims=2, lo=0x200027f1a8f8,
    hi=0x200027f1a908, order=0x20004f571304) at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/cuda/cuda_internal.cc:88
#5  0x00002000037a00c8 in Realm::TransferIteratorBase<2, long long>::step_custom (this=0x20004f571280, max_bytes=4032, info=..., tentative=false)
    at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/transfer/transfer.cc:435
#6  0x0000200003d52e4c in Realm::Cuda::GPUXferDes::progress_xd (this=0x20004f57e660, channel=0x14a9c220, work_until=...)
    at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/cuda/cuda_internal.cc:683
#7  0x0000200003d65990 in Realm::XDQueue<Realm::Cuda::GPUChannel, Realm::Cuda::GPUXferDes>::do_work (this=0x14a9c258, work_until=...)
    at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/transfer/channel.inl:166
#8  0x0000200003692004 in Realm::BackgroundWorkManager::Worker::do_work (this=0x200027f1e928, max_time_in_ns=-1, interrupt_flag=0x0)
    at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/bgwork.cc:600
#9  0x000020000368f3ac in Realm::BackgroundWorkThread::main_loop (this=0x12572bf0) at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/bgwork.cc:103
#10 0x00002000036949c4 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x12572bf0)
    at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/threads.inl:97
#11 0x000020000382e224 in Realm::KernelThread::pthread_entry (data=0x12573b60) at /usr/WS1/mirchandaney1/legion_latest/runtime/realm/threads.cc:854
#12 0x0000200006248cd4 in start_thread (arg=0x200027f1f8b0) at pthread_create.c:309
#13 0x0000200006147f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

muraj commented 2 months ago

@seemamirch Can you show me how you're registering this CUDA array with an HDF_MEM? And what platform are you running this on where pageable_access_supported is true? Is this a P9 system?

muraj commented 2 months ago

Hmmmm... I think I know what might be going on... The HDF_MEM instance is a non-affine addressed instance, right? I think Realm currently assumes that if the GPU is involved in a non-affine transfer, it is done through a CUarray. Without pageable memory access, transferring from HDF_MEM (non-affine) -> GPU_FB_MEM (affine) means going through a CPU memcpy into an ibmem staging buffer first, and then copying to GPU_FB_MEM via an affine->affine transfer. With pageable memory access, that ibmem staging is removed, and we hit the corner case where the HDF_MEM instance is non-affine but has no CUarray associated with it, so the assert fires.
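To make that corner case concrete, here is a minimal standalone sketch of that path selection; it is not Realm's actual code, and every name in it (Instance, Path, choose_path, gpu_copy) is an illustrative stand-in rather than a Realm API. It just shows how reporting pageable access as supported removes the host staging step and sends a non-affine instance with no CUarray straight down the GPU path, where the check fires.

```cpp
// Standalone sketch (not the actual Realm code) of the path selection that
// leads to the assert. All types and names here are illustrative only.
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for an instance's layout properties.
struct Instance {
  bool affine_layout;   // true for affine (strided) layouts
  bool has_cuda_array;  // true only if a CUarray was created for the instance
};

enum class Path { StageThroughHostBuffer, DirectGpuCopy };

// Without pageable access, a non-affine source is first copied by the CPU
// into an intermediate (ibmem) buffer, then moved to GPU framebuffer memory
// as an affine->affine transfer. With pageable access, that staging step is
// skipped and the GPU copy path runs directly on the non-affine instance.
Path choose_path(const Instance &src, bool pageable_access_supported) {
  if (!src.affine_layout && !pageable_access_supported)
    return Path::StageThroughHostBuffer;
  return Path::DirectGpuCopy;
}

void gpu_copy(const Instance &src) {
  // The GPU path assumes non-affine instances are backed by a CUarray; an
  // HDF_MEM instance is non-affine but has no CUarray, so this check fires.
  if (!src.affine_layout)
    assert(src.has_cuda_array && "non-affine instance without a CUarray");
  std::printf("copy issued\n");
}

int main() {
  Instance hdf5_inst{/*affine_layout=*/false, /*has_cuda_array=*/false};

  // P9 / ATS-HMM case: pageable access reported as supported -> direct path.
  Path p = choose_path(hdf5_inst, /*pageable_access_supported=*/true);
  if (p == Path::DirectGpuCopy)
    gpu_copy(hdf5_inst);  // trips the assert, mirroring the reported failure
  return 0;
}
```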

This only happens on P9 and other HMM/ATS-enabled systems (I believe the Grace+Hopper ARM systems are supposed to have this support as well). We can delay this support for the short term while I implement a fallback path that transfers each affine section of the non-affine instance, and have that ready for the next release. Is that okay?
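For reference, a minimal sketch of what the fallback idea amounts to, under the same caveat as above: AffinePiece, copy_affine, and copy_piecewise are invented names, not Realm APIs. The point is simply to issue one ordinary affine copy per affine piece of the non-affine instance instead of requiring a single CUarray-backed transfer.

```cpp
// Standalone sketch (not Realm's implementation) of the proposed fallback:
// walk the affine pieces of a non-affine instance and copy each one with an
// ordinary affine transfer instead of requiring a CUarray.
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative stand-in for one affine piece of a piecewise-affine layout.
struct AffinePiece {
  std::size_t offset;  // byte offset of the piece within the source instance
  std::size_t bytes;   // contiguous bytes covered by this piece
};

// Hypothetical affine copy primitive (in Realm this would end up as some
// flavor of cuMemcpy issued on the GPU channel).
void copy_affine(std::size_t src_offset, std::size_t dst_offset,
                 std::size_t bytes) {
  std::printf("copy %zu bytes: src+%zu -> dst+%zu\n", bytes, src_offset,
              dst_offset);
}

// Fallback: no single CUarray transfer; instead, one affine copy per piece.
void copy_piecewise(const std::vector<AffinePiece> &pieces) {
  std::size_t dst_offset = 0;
  for (const AffinePiece &p : pieces) {
    copy_affine(p.offset, dst_offset, p.bytes);
    dst_offset += p.bytes;
  }
}

int main() {
  std::vector<AffinePiece> pieces = {{0, 4096}, {8192, 4096}, {16384, 2048}};
  copy_piecewise(pieces);
  return 0;
}
```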

seemamirch commented 2 months ago

@muraj - a fix that disables this feature for the September release is OK until a fallback path is implemented. HTR needs to work on Lassen.

muraj commented 2 months ago

A workaround for this issue is in cperry/ats-disable and is out for review.

muraj commented 2 months ago

Set this to disabled by default and added a flag "-cuda:pageable_access" to enable it; merged in master. Moving to close.