StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: medium payload too large! #1094

Closed by syamajala 2 years ago

syamajala commented 3 years ago

Running on Spock with the GASNet UCX conduit and multiple ranks, I see the following:

[0 - 7fb3609b4c00]    5.468286 {6}{gexxpair}: medium payload too large!  src=0/0 tgt=1/0 max=4040 act=7752
Legion process received signal 6: Aborted
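
To make the numbers in the log line above easier to compare, here is a minimal sketch that pulls out the `max` and `act` fields and reports the difference. It assumes `max` is the largest medium payload the conduit accepts and `act` is the requested payload size, both in bytes; that reading comes from the message text and the `payload_bytes=7752` argument in the stack trace, not from Realm documentation. The helper name `parse_payload_sizes` is hypothetical.

```python
import re


def parse_payload_sizes(log_line):
    """Return (max_bytes, actual_bytes) parsed from a 'medium payload too large!' line.

    Hypothetical helper: the field meanings are an interpretation of the log text.
    """
    match = re.search(r"max=(\d+)\s+act=(\d+)", log_line)
    if match is None:
        raise ValueError("no max=/act= fields found in log line")
    return int(match.group(1)), int(match.group(2))


# The log line reported in this issue.
line = (
    "[0 - 7fb3609b4c00]    5.468286 {6}{gexxpair}: medium payload too large!  "
    "src=0/0 tgt=1/0 max=4040 act=7752"
)

max_bytes, actual_bytes = parse_payload_sizes(line)
excess = actual_bytes - max_bytes
print(f"requested {actual_bytes} bytes, limit {max_bytes} bytes, over by {excess} bytes")
```

For this run the requested payload is 3712 bytes over the reported limit.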

Here is a stack trace:

#0  0x00007f61c2fbb5a0 in nanosleep () from /lib64/libc.so.6
#1  0x00007f61c2fbb4aa in sleep () from /lib64/libc.so.6
#2  0x00007f61be9c0018 in Realm::realm_freeze (signal=6)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/runtime_impl.cc:177
#3  <signal handler called>
#4  0x00007f61c2f2b520 in raise () from /lib64/libc.so.6
#5  0x00007f61c2f2cb01 in abort () from /lib64/libc.so.6
#6  0x00007f61bea3b9b2 in Realm::XmitSrcDestPair::reserve_pbuf_inline (this=0x483b2f0, hdr_bytes=12, 
    payload_bytes=7752, overflow_ok=true, pktbuf=@0x4ebae08: 0x0, pktidx=@0x4ebae10: -1, 
    hdr_base=@0x7f61877cb998: 0x7f61877cba10, payload_base=@0x7f61877cb9a0: 0x0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1163
#7  0x00007f61bea43b3c in Realm::GASNetEXInternal::prepare_message (this=0x482d300, target=1, target_ep_index=0, 
    msgid=65, header_base=@0x7f61877cb998: 0x7f61877cba10, header_size=12, payload_base=@0x7f61877cb9a0: 0x0, 
    payload_size=7752, dest_payload_addr=0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3169
#8  0x00007f61bea35d8e in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x7f61877cb990, _internal=0x482d300, 
    _target=1, _msgid=65, _header_size=12, _max_payload_size=7752, _src_payload_addr=0x0, _src_payload_lines=0, 
    _src_payload_line_stride=0, _dest_payload_addr=0, _dest_ep_index=0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_module.cc:233
#9  0x00007f61bea37764 in Realm::GASNetEXModule::create_active_message_impl (this=0x482d250, target=1, msgid=65, 
    header_size=12, max_payload_size=7752, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, 
    storage_base=0x7f61877cb990, storage_size=256)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_module.cc:651
#10 0x00007f61be5eebd9 in Realm::Network::create_active_message_impl (target=1, msgid=65, header_size=8, 
    max_payload_size=7752, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, 
    storage_base=0x7f61877cb990, storage_size=256)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/network.inl:110
#11 0x00007f61be985a03 in Realm::ActiveMessage<Realm::MetadataResponseMessage, 256ul>::init (this=0x7f61877cb970, 
    _target=1, _max_payload_size=7752)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.inl:53
#12 0x00007f61be9851c4 in Realm::ActiveMessage<Realm::MetadataResponseMessage, 256ul>::ActiveMessage (
    this=0x7f61877cb970, _target=1, _max_payload_size=7752)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.inl:44
#13 0x00007f61be984777 in Realm::MetadataRequestMessage::handle_message (sender=1, args=..., data=0x0, datalen=0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/metadata.cc:239
#14 0x00007f61be984e43 in Realm::HandlerWrappers::wrap_handler_notimeout<Realm::MetadataRequestMessage, Realm::MetadataRequestMessage::handle_message> (sender=1, header=0x4f5bd60, payload=0x0, payload_size=0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.inl:596
#15 0x00007f61bea5621c in Realm::IncomingMessageManager::do_work (this=0x4be7010, work_until=...)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.cc:747
#16 0x00007f61be89fffb in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f61877cc0f0, max_time_in_ns=-1, 
    interrupt_flag=0x0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/bgwork.cc:610
#17 0x00007f61be89dc32 in Realm::BackgroundWorkThread::main_loop (this=0x4eb49e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/bgwork.cc:158
#18 0x00007f61be8a1102 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x4eb49e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/threads.inl:97
#19 0x00007f61bea1d6e5 in Realm::KernelThread::pthread_entry (data=0x4180940)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/threads.cc:774
#20 0x00007f61be0764f9 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f61c2fedf2f in clone () from /lib64/libc.so.6

syamajala commented 3 years ago

@streichler Perlmutter early access opens in mid-July, so we need this by then.

syamajala commented 2 years ago

I think this is no longer an issue. I am able to run on Perlmutter.