StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

CRC mismatch at 256 nodes in C++ Pennant #1449

Open · elliottslaughter opened 1 year ago

elliottslaughter commented 1 year ago

I'm seeing the following failure in C++ Pennant starting at 256 nodes:

[6 - 155507163840]    5.450854 {6}{gex}: CRC MISMATCH: arg0=1610612737 header_size=36 payload_size=16 exp=300a8000 act=68e03e37

It happens about 80% of the time, so it's going to be pretty painful to work around (if it's possible at all—considering I need longer runs for the replay test).

The backtrace looks useless, but here it is:

  [0] = /lib64/libpthread.so.0(+0x13310) [0x155550cfd310]
  [1] = /lib64/libc.so.6(gsignal+0x110) [0x15554f786360]
  [2] = /lib64/libc.so.6(abort+0x151) [0x15554f787941]
  [3] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x437336) [0x155551a17336]
  [4] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x43a045) [0x155551a1a045]
  [5] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x44b7b0) [0x155551a2b7b0]
  [6] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x44bbc0) [0x155551a2bbc0]
  [7] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(gasnetc_recv_am_unlocked+0x830) [0x155551f14440]
  [8] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x93555d) [0x155551f1555d]
  [9] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(gasnetc_poll+0x9) [0x155551f15649]
  [10] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(gasnetc_AMPoll+0x10) [0x155551f0c2f0]
  [11] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x442c64) [0x155551a22c64]
  [12] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x311d91) [0x1555518f1d91]
  [13] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x3125b9) [0x1555518f25b9]
  [14] = /users/eslaught/resilience/cpp_pennant.run22_exit_crash/librealm.so.1(+0x3f639f) [0x1555519d639f]
  [15] = /lib64/libpthread.so.0(+0x8539) [0x155550cf2539]
  [16] = /lib64/libc.so.6(clone+0x3f) [0x15554f848e0f]

If there are workarounds or solutions to this, they'd be appreciated.

lightsighter commented 1 year ago

Documenting what I already told @elliottslaughter. This is memory corruption of an active message payload. The most likely causes are Realm cleaning up or stomping on a buffer before an active message send is complete, GASNet corrupting the payload in flight, or a hardware bug. None of them is likely to be easy to track down.

lightsighter commented 1 year ago

I suppose it could be random memory corruption too, but it seems unlikely that it would happen anywhere near the pinned memory used for sending GASNet active message payloads.

streichler commented 1 year ago

@elliottslaughter :

  1. Which system is this happening on? Is it possible to try it on another system, ideally one using another GASNet conduit?
  2. Can you try running with a debug build of GASNet and see if GASNet's checks flag anything?
  3. If you run with -gex:cksum 0, do things run correctly, or do you get hangs/bad results/other things indicative of Realm packets getting corrupted? (A sketch of this run follows below.)
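
A minimal sketch of the run in suggestion 3, assuming a typical srun launch (the launcher, node count, and input deck are placeholders, not taken from the actual job scripts):

  # Disable Realm's GASNet-EX payload checksums so that any real packet
  # corruption shows up as downstream misbehavior rather than a CRC abort.
  srun -N 256 ./pennant pennant.pnt -gex:cksum 0
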
elliottslaughter commented 1 year ago

This is on Piz Daint.

I am currently trying a run with REALM_NETWORKS=gasnet1. After that I'll try the things you suggest.

elliottslaughter commented 1 year ago

Using REALM_NETWORKS=gasnet1 is sufficient to work around this for now.
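
For reference, with the Make-based Legion build this is just a rebuild with the network module switched; a minimal sketch, with all other build variables omitted:

  make clean && make REALM_NETWORKS=gasnet1 -j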

elliottslaughter commented 1 year ago

-gex:immediate 0 does not fix the issue.

elliottslaughter commented 1 year ago

When I run with -gex:cksum 0, I get (probably unsurprisingly):

[44 - 15550717e840]    5.733568 {6}{realm}: invalid subgraph handle: id=7a30027000
pennant: /users/eslaught/resilience/legion/runtime/realm/runtime_impl.cc:2593: Realm::SubgraphImpl* Realm::RuntimeImpl::get_subgraph_impl(Realm::ID): Assertion `0 && "invalid subgraph handle"' failed.
elliottslaughter commented 1 year ago

Running with a debug GASNet produces:

[6 - 155504942880]    5.862988 {6}{gexxpair}: medium payload too large!  src=6/0 tgt=11/0 max=4072 act=6224
streichler commented 1 year ago

Do you have a backtrace to go with that last one? And does it fail reliably in that way?

elliottslaughter commented 1 year ago

Yes, this failure mode appears to be deterministic.

Backtrace:

#4  0x000015554cb59360 in raise () from /lib64/libc.so.6
#5  0x000015554cb5a941 in abort () from /lib64/libc.so.6
#6  0x000015554efd2286 in Realm::XmitSrcDestPair::reserve_pbuf_inline (this=0x1838d70, hdr_bytes=20, payload_bytes=6224, overflow_ok=true, pktbuf=@0x1849698: 0x0, pktidx=@0x18496a0: -1, 
    hdr_base=@0x155504c4be08: 0x155504c4be80, payload_base=@0x155504c4be10: 0x0) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1189
#7  0x000015554efdb594 in Realm::GASNetEXInternal::prepare_message (this=0x7fb6c0, target=13, target_ep_index=0, msgid=99, header_base=@0x155504c4be08: 0x155504c4be80, header_size=20, 
    payload_base=@0x155504c4be10: 0x0, payload_size=6224, dest_payload_addr=0) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3853
#8  0x000015554efc036a in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x155504c4be00, _internal=0x7fb6c0, _target=13, _msgid=99, _header_size=20, _max_payload_size=6224, _src_payload_addr=0x0, 
    _src_payload_lines=0, _src_payload_line_stride=0, _dest_payload_addr=0, _dest_ep_index=0) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_module.cc:234
#9  0x000015554efc1fcc in Realm::GASNetEXModule::create_active_message_impl (this=0x7bdff0, target=13, msgid=99, header_size=20, max_payload_size=6224, src_payload_addr=0x0, src_payload_lines=0, 
    src_payload_line_stride=0, storage_base=0x155504c4be00, storage_size=256) at /users/eslaught/resilience/legion/runtime/realm/gasnetex/gasnetex_module.cc:676
#10 0x000015554eb16819 in Realm::Network::create_active_message_impl (target=13, msgid=99, header_size=16, max_payload_size=6224, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, 
    storage_base=0x155504c4be00, storage_size=256) at /users/eslaught/resilience/legion/runtime/realm/network.inl:100
#11 0x000015554f0a885f in Realm::ActiveMessage<Realm::RemoteMicroOpMessage<Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> > >, 256ul>::init (this=0x155504c4bde0, _target=13, 
    _max_payload_size=6224) at /users/eslaught/resilience/legion/runtime/realm/activemsg.inl:53
#12 0x000015554f09fece in Realm::ActiveMessage<Realm::RemoteMicroOpMessage<Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> > >, 256ul>::ActiveMessage (this=0x155504c4bde0, _target=13, 
    _max_payload_size=6224) at /users/eslaught/resilience/legion/runtime/realm/activemsg.inl:44
#13 0x000015554f099bd8 in Realm::PartitioningMicroOp::forward_microop<Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> > > (target=13, op=0x1545604027c0, microop=0x15457c172180)
    at /users/eslaught/resilience/legion/runtime/realm/deppart/partitions.inl:46
#14 0x000015554f092d18 in Realm::ByFieldMicroOp<1, long long, Realm::Point<1, long long> >::dispatch (this=0x15457c172180, op=0x1545604027c0, inline_ok=true)
    at /users/eslaught/resilience/legion/runtime/realm/deppart/./byfield.cc:208
#15 0x000015554f0932aa in Realm::ByFieldOperation<1, long long, Realm::Point<1, long long> >::execute (this=0x1545604027c0) at /users/eslaught/resilience/legion/runtime/realm/deppart/./byfield.cc:320
#16 0x000015554eca3d7f in Realm::PartitioningOpQueue::do_work (this=0x16a7280, work_until=...) at /users/eslaught/resilience/legion/runtime/realm/deppart/partitions.cc:978
#17 0x000015554edd92b1 in Realm::BackgroundWorkManager::Worker::do_work (this=0x155504c4d0e0, max_time_in_ns=-1, interrupt_flag=0x0) at /users/eslaught/resilience/legion/runtime/realm/bgwork.cc:621
#18 0x000015554edd6fe7 in Realm::BackgroundWorkThread::main_loop (this=0x144a770) at /users/eslaught/resilience/legion/runtime/realm/bgwork.cc:125
#19 0x000015554edda6a6 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x144a770)
    at /users/eslaught/resilience/legion/runtime/realm/threads.inl:97
#20 0x000015554ef5b331 in Realm::KernelThread::pthread_entry (data=0x8e3920) at /users/eslaught/resilience/legion/runtime/realm/threads.cc:781

I've still got my job allocation for a little while, if you want me to test anything else.

elliottslaughter commented 1 year ago

@streichler suggested that this may be a duplicate of #1229. As a workaround I am testing building GASNet with:

--with-aries-max-medium=8192
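
For context, the corresponding configure invocation would look something like this sketch (unrelated options elided; --enable-debug gives the debug GASNet mentioned below):

  ./configure --enable-debug --with-aries-max-medium=8192 ...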

With this, running DEBUG=1 with a debug GASNet, I'm able to get to the end of execution and then fail with the error message at https://github.com/StanfordLegion/legion/issues/1415#issuecomment-1450541920 (which is expected and an unrelated issue).

I guess my one question is whether this is really a solution, since presumably the required max medium size will scale as $\mathcal{O}(N)$.

elliottslaughter commented 1 year ago

As I suspected, when I go to 512 nodes I start hitting:

[6 - 155501ae1880]   11.069872 {6}{gexxpair}: medium payload too large!  src=6/0 tgt=10/0 max=8168 act=12368

So I'd need a max medium size of 16K for 512 nodes, 32K for 1024 nodes, etc. I can work around this for now, but it doesn't seem sustainable.
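
For what it's worth, the two oversized payloads observed so far are consistent with linear growth at roughly 24 bytes per node:

$$\frac{6224}{256} \approx \frac{12368}{512} \approx 24 \ \text{bytes/node},$$

which matches the $\mathcal{O}(N)$ scaling suspected above.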

streichler commented 1 year ago

I agree this isn't sustainable, but I can renew the work on #1229 for a proper fix.

rupanshusoi commented 7 months ago

I'm also hitting this issue for a 128-node run of a modified version of Stencil on Perlmutter:

[9 - 7fb002856000]   50.335033 {6}{gex}: CRC MISMATCH: batch_size=16 payload_size=16 exp=815730 act=7020ff50

The backtraces don't appear to have anything useful. I'm on commit 92afef of Legion. Some GASNet info:

*** Details for bug reporting (proc 5): config=RELEASE=2023.9.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/12.3.0 sys=x86_64-pc-linux-gnu
lightsighter commented 7 months ago

Can you try increasing the maximum size of GASNet's medium message as directed above and see if that fixes the problem?

rupanshusoi commented 7 months ago

I increased it to $2^{15} = 32768$, quadruple the default value, but I'm still hitting the same error.

rupanshusoi commented 7 months ago

I was able to run this without errors with a debug GASNet and debug Legion, though. I'm not sure whether this was random chance or whether enabling debug mode somehow fixed the issue.

elliottslaughter commented 7 months ago

I think what we're looking for with Rupanshu's issue is a way to either confirm the root cause in a release build of Legion/GASNet, or to force the issue to reproduce somehow in debug mode GASNet so we can see what's really going on.

Any ideas?

lightsighter commented 7 months ago

Well, you definitely don't need to do anything with a debug version of Legion, as there should be nothing Legion can do to cause a CRC mismatch. Realm assertions are on by default even in release mode, so there's no benefit in building a debug Realm either. I'm not sure whether there are any checks in GASNet or Realm confirming that every active message sent is within the maximum size for that kind of active message. I would bet heavily on Realm sending an active message beyond the maximum allowed size for a particular kind. I have no good suggestions for you other than to go in and annotate all the active message sends in Realm so that they check that they are within the appropriate size bounds.
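
A minimal sketch of such an annotation, assuming a checked wrapper at each send site (checked_init and the explicit limit parameter are invented for illustration; the two-argument init mirrors the signature visible in the backtrace above, but this is not an existing Realm API):

  // Illustrative only: fail loudly at the send site when a payload exceeds
  // the transport's medium-message limit, instead of letting an oversized
  // send corrupt the wire payload downstream.
  #include <cassert>
  #include <cstddef>

  template <typename AM>
  void checked_init(AM &msg, int target, std::size_t payload_bytes,
                    std::size_t max_medium_payload /* e.g. 4072 on aries */) {
    assert(payload_bytes <= max_medium_payload &&
           "active message payload exceeds medium-message limit");
    msg.init(target, payload_bytes);  // matches init(target, max_payload_size)
  }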

elliottslaughter commented 7 months ago

I believe that GASNet refuses to compile when it thinks it's being included into a non-debug application (if it is itself built in debug mode).

I told @rupanshusoi to try commenting that check out. In theory it should just be a #error line somewhere in one of the GASNet header files.
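
For reference, such a guard typically looks something like the following sketch (illustrative only, not GASNet's literal source; the actual macro names and wording will differ):

  /* Hypothetical debug-consistency guard of the kind described above. */
  #if defined(GASNET_DEBUG) && defined(NDEBUG)
  #error "GASNet was built with debugging enabled, but this compilation unit defines NDEBUG"
  #endif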

elliottslaughter commented 7 months ago

Is there a reason why Realm doesn't check maximum lengths? In debug mode in particular, that seems like an eminently reasonable thing to do....

rupanshusoi commented 7 months ago

Ok, I got an assertion failure with a debug GASNet and release Legion:

*** FATAL ERROR: Assertion failure (proc 103): in gasnetc_ofi_handle_am() at anguage/gasnet/GASNet-2023.9.0/ofi-conduit/gasnet_ofi.c:1725: isreq == header->isreq
   op1 :           1 (0x00000001) == isreq
   op2 :           0 (0x00000000) == header->isreq

Full log here.

elliottslaughter commented 7 months ago

I'm moving discussion of @rupanshusoi's issue to #1660 because this appears to be a different underlying root cause, even though the initial symptoms looked similar.