elliottslaughter opened 1 year ago
Documenting what I already told @elliottslaughter. This is a memory corruption of an active message payload. The most likely causes are realm cleaning up or stomping on a buffer before an active message send is complete, gasnet corrupting the payload in flight, or a hardware bug. None of them are likely to be very easy to track down.
I suppose it could be a random memory corruption too, but seems unlikely that would happen anywhere near the pinned memory used for sending GASNet active message payloads.
@elliottslaughter: if you run with -gex:cksum 0, do things run correctly or do you get hangs/bad results/other things indicative of realm packets getting corrupted?

This is on Piz Daint.
I am currently trying a run with REALM_NETWORKS=gasnet1. After that I'll try the things you suggest.
Using REALM_NETWORKS=gasnet1 is sufficient to work around this for now.
-gex:immediate 0 does not fix the issue.
When I run with -gex:cksum 0, I get (probably unsurprisingly):
[44 - 15550717e840] 5.733568 {6}{realm}: invalid subgraph handle: id=7a30027000
pennant: /users/eslaught/resilience/legion/runtime/realm/runtime_impl.cc:2593: Realm::SubgraphImpl* Realm::RuntimeImpl::get_subgraph_impl(Realm::ID): Assertion `0 && "invalid subgraph handle"' failed.
Running with a debug GASNet produces:
[6 - 155504942880] 5.862988 {6}{gexxpair}: medium payload too large! src=6/0 tgt=11/0 max=4072 act=6224
Do you have a backtrace to go with that last one? And does it fail reliably in that way?
Yes, this failure mode appears to be deterministic.
Backtrace:
#4 0x000015554cb59360 in raise () from /lib64/libc.so.6
#5 0x000015554cb5a941 in abort () from /lib64/libc.so.6
#6 0x000015554efd2286 in Realm::XmitSrcDestPair::reserve_pbuf_inline (this=0x1838d70, hdr_bytes=20, payload_bytes=6224, overflow_ok=true, pktbuf=@0x1849698: 0x0, pktidx=@0x18496a0: -1,
hdr_base=@0x155504c4be08: 0x155504c4be80, payload_base=@0x155504c4be10: 0x0) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1189
#7 0x000015554efdb594 in Realm::GASNetEXInternal::prepare_message (this=0x7fb6c0, target=13, target_ep_index=0, msgid=99, header_base=@0x155504c4be08: 0x155504c4be80, header_size=20,
payload_base=@0x155504c4be10: 0x0, payload_size=6224, dest_payload_addr=0) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3853
#8 0x000015554efc036a in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x155504c4be00, _internal=0x7fb6c0, _target=13, _msgid=99, _header_size=20, _max_payload_size=6224, _src_payload_addr=0x0,
_src_payload_lines=0, _src_payload_line_stride=0, _dest_payload_addr=0, _dest_ep_index=0) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_module.cc:234
#9 0x000015554efc1fcc in Realm::GASNetEXModule::create_active_message_impl (this=0x7bdff0, target=13, msgid=99, header_size=20, max_payload_size=6224, src_payload_addr=0x0, src_payload_lines=0,
src_payload_line_stride=0, storage_base=0x155504c4be00, storage_size=256) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_module.cc:676
#10 0x000015554eb16819 in Realm::Network::create_active_message_impl (target=13, msgid=99, header_size=16, max_payload_size=6224, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0,
storage_base=0x155504c4be00, storage_size=256) at /users/eslaught/resilience/legion/runtime/realm/network.inl:100
#11 0x000015554f0a885f in Realm::ActiveMessage<Realm::RemoteMicroOpMessage<Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> > >, 256ul>::init (this=0x155504c4bde0, _target=13,
_max_payload_size=6224) at /users/eslaught/resilience/legion/runtime/realm/activemsg.inl:53
#12 0x000015554f09fece in Realm::ActiveMessage<Realm::RemoteMicroOpMessage<Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> > >, 256ul>::ActiveMessage (this=0x155504c4bde0, _target=13,
_max_payload_size=6224) at /users/eslaught/resilience/legion/runtime/realm/activemsg.inl:44
#13 0x000015554f099bd8 in Realm::PartitioningMicroOp::forward_microop<Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> > > (target=13, op=0x1545604027c0, microop=0x15457c172180)
at /users/eslaught/resilience/legion/runtime/realm/deppart/partitions.inl:46
#14 0x000015554f092d18 in Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> >::dispatch (this=0x15457c172180, op=0x1545604027c0, inline_ok=true)
at /users/eslaught/resilience/legion/runtime/realm/deppart/./byfield.cc:208
#15 0x000015554f0932aa in Realm::ByFieldOperation<1, long long, Realm::Point<1, long long> >::execute (this=0x1545604027c0) at /users/eslaught/resilience/legion/runtime/realm/deppart/./byfield.cc:320
#16 0x000015554eca3d7f in Realm::PartitioningOpQueue::do_work (this=0x16a7280, work_until=...) at /users/eslaught/resilience/legion/runtime/realm/deppart/partitions.cc:978
#17 0x000015554edd92b1 in Realm::BackgroundWorkManager::Worker::do_work (this=0x155504c4d0e0, max_time_in_ns=-1, interrupt_flag=0x0) at /users/eslaught/resilience/legion/runtime/realm/bgwork.cc:621
#18 0x000015554edd6fe7 in Realm::BackgroundWorkThread::main_loop (this=0x144a770) at /users/eslaught/resilience/legion/runtime/realm/bgwork.cc:125
#19 0x000015554edda6a6 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x144a770)
at /users/eslaught/resilience/legion/runtime/realm/threads.inl:97
#20 0x000015554ef5b331 in Realm::KernelThread::pthread_entry (data=0x8e3920) at /users/eslaught/resilience/legion/runtime/realm/threads.cc:781
I've still got my job for a little bit if you want me to test anything else.
@streichler suggested that this may be a duplicate of #1229. As a workaround I am testing building GASNet with --with-aries-max-medium=8192.

With this, running DEBUG=1 with a debug GASNet, I'm able to get to the end of execution and then fail with the error message at https://github.com/StanfordLegion/legion/issues/1415#issuecomment-1450541920 (which is expected and an unrelated issue).
I guess my one question is whether this is really a solution, since presumably the required max medium size will scale as $\mathcal{O}(N)$.
As I suspected, when I go to 512 nodes I start hitting:
[6 - 155501ae1880] 11.069872 {6}{gexxpair}: medium payload too large! src=6/0 tgt=10/0 max=8168 act=12368
So I'd need a max medium size of 16K for 512 nodes, 32K for 1024 nodes, etc. I can work around this for now, but it doesn't seem sustainable.
I agree this isn't sustainable, but I can renew the work on #1229 for a proper fix.
I'm also hitting this issue for a 128-node run of a modified version of Stencil on Perlmutter:
[9 - 7fb002856000] 50.335033 {6}{gex}: CRC MISMATCH: batch_size=16 payload_size=16 exp=815730 act=7020ff50
The backtraces don't appear to have anything useful. I'm on commit 92afef of Legion. Some GASNet info:
*** Details for bug reporting (proc 5): config=RELEASE=2023.9.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/12.3.0 sys=x86_64-pc-linux-gnu
Can you try increasing the maximum size of GASNet's medium message as directed above and see if that fixes the problem?
I increased it to 2**15, quadruple the default value, but I'm still hitting the same error.
I was able to run this without errors with a debug GASNet and debug Legion, though. I'm not sure if this was random, or enabling debug mode somehow fixed the issue.
I think what we're looking for with Rupanshu's issue is a way to either confirm the root cause in a release build of Legion/GASNet, or to force the issue to reproduce somehow in debug mode GASNet so we can see what's really going on.
Any ideas?
Well you definitely don't need to do anything with a debug version of Legion as there should be nothing that Legion can do to cause a CRC mismatch. Realm assertions are on by default even in release mode so there's no benefit in building a debug Realm. I'm not sure if there are any checks in GASNet or Realm that confirm that all active messages that are sent are within the maximum size for that kind of active message. I would bet heavily on Realm sending an active message beyond the maximum allowed size for a particular kind. I have no good suggestions for you other than to go in and annotate all the active message sends in Realm so that they check that they are within the bounds of the appropriate size.
I believe that GASNet refuses to compile when it thinks it's being included into a non-debug application (if it is itself built in debug mode).
I told @rupanshusoi to try commenting that check out. In theory it should just be a #error line somewhere in one of the GASNet header files.
Is there a reason why Realm doesn't check maximum lengths? In debug mode in particular, that seems like an eminently reasonable thing to do....
Ok, I got an assertion failure with a debug GASNet and release Legion:
*** FATAL ERROR: Assertion failure (proc 103): in gasnetc_ofi_handle_am() at anguage/gasnet/GASNet-2023.9.0/ofi-conduit/gasnet_ofi.c:1725: isreq == header->isreq
op1 : 1 (0x00000001) == isreq
op2 : 0 (0x00000000) == header->isreq
I'm moving discussion of @rupanshusoi's issue to #1660 because this appears to be a different underlying root cause, even though the initial symptoms looked similar.
I'm seeing the following failure in C++ Pennant starting at 256 nodes:
It happens about 80% of the time, so it's going to be pretty painful to work around (if it's possible at all—considering I need longer runs for the replay test).
The backtrace looks useless, but here it is:
If there are workarounds or solutions to this, they'd be appreciated.