syamajala closed 2 years ago
Running on Spock using the GASNet UCX conduit with multiple ranks, I see the following:
[0 - 7fb3609b4c00] 5.468286 {6}{gexxpair}: medium payload too large! src=0/0 tgt=1/0 max=4040 act=7752
Legion process received signal 6: Aborted
Here is a stack trace:
#0 0x00007f61c2fbb5a0 in nanosleep () from /lib64/libc.so.6
#1 0x00007f61c2fbb4aa in sleep () from /lib64/libc.so.6
#2 0x00007f61be9c0018 in Realm::realm_freeze (signal=6) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/runtime_impl.cc:177
#3 <signal handler called>
#4 0x00007f61c2f2b520 in raise () from /lib64/libc.so.6
#5 0x00007f61c2f2cb01 in abort () from /lib64/libc.so.6
#6 0x00007f61bea3b9b2 in Realm::XmitSrcDestPair::reserve_pbuf_inline (this=0x483b2f0, hdr_bytes=12, payload_bytes=7752, overflow_ok=true, pktbuf=@0x4ebae08: 0x0, pktidx=@0x4ebae10: -1, hdr_base=@0x7f61877cb998: 0x7f61877cba10, payload_base=@0x7f61877cb9a0: 0x0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1163
#7 0x00007f61bea43b3c in Realm::GASNetEXInternal::prepare_message (this=0x482d300, target=1, target_ep_index=0, msgid=65, header_base=@0x7f61877cb998: 0x7f61877cba10, header_size=12, payload_base=@0x7f61877cb9a0: 0x0, payload_size=7752, dest_payload_addr=0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3169
#8 0x00007f61bea35d8e in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x7f61877cb990, _internal=0x482d300, _target=1, _msgid=65, _header_size=12, _max_payload_size=7752, _src_payload_addr=0x0, _src_payload_lines=0, _src_payload_line_stride=0, _dest_payload_addr=0, _dest_ep_index=0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_module.cc:233
#9 0x00007f61bea37764 in Realm::GASNetEXModule::create_active_message_impl (this=0x482d250, target=1, msgid=65, header_size=12, max_payload_size=7752, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7f61877cb990, storage_size=256) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/gasnetex/gasnetex_module.cc:651
#10 0x00007f61be5eebd9 in Realm::Network::create_active_message_impl (target=1, msgid=65, header_size=8, max_payload_size=7752, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7f61877cb990, storage_size=256) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/network.inl:110
#11 0x00007f61be985a03 in Realm::ActiveMessage<Realm::MetadataResponseMessage, 256ul>::init (this=0x7f61877cb970, _target=1, _max_payload_size=7752) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.inl:53
#12 0x00007f61be9851c4 in Realm::ActiveMessage<Realm::MetadataResponseMessage, 256ul>::ActiveMessage (this=0x7f61877cb970, _target=1, _max_payload_size=7752) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.inl:44
#13 0x00007f61be984777 in Realm::MetadataRequestMessage::handle_message (sender=1, args=..., data=0x0, datalen=0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/metadata.cc:239
#14 0x00007f61be984e43 in Realm::HandlerWrappers::wrap_handler_notimeout<Realm::MetadataRequestMessage, Realm::MetadataRequestMessage::handle_message> (sender=1, header=0x4f5bd60, payload=0x0, payload_size=0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.inl:596
#15 0x00007f61bea5621c in Realm::IncomingMessageManager::do_work (this=0x4be7010, work_until=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/activemsg.cc:747
#16 0x00007f61be89fffb in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f61877cc0f0, max_time_in_ns=-1, interrupt_flag=0x0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/bgwork.cc:610
#17 0x00007f61be89dc32 in Realm::BackgroundWorkThread::main_loop (this=0x4eb49e0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/bgwork.cc:158
#18 0x00007f61be8a1102 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x4eb49e0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/threads.inl:97
#19 0x00007f61bea1d6e5 in Realm::KernelThread::pthread_entry (data=0x4180940) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_spock/legion/runtime/realm/threads.cc:774
#20 0x00007f61be0764f9 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f61c2fedf2f in clone () from /lib64/libc.so.6
@streichler Perlmutter early access opens in mid-July, so we need this by then.
I think this is no longer an issue. I am able to run on Perlmutter.