StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: gex CRC MISMATCH #1779

Open syamajala opened 1 month ago

syamajala commented 1 month ago

I'm seeing the following error at shutdown when running cunumeric on Perlmutter:

[1 - 7f3eef258740]  194.063976 {6}{gex}: CRC MISMATCH: arg0=54 header_size=36 payload_size=10976 exp=fc88a3b5 act=d1f0646a

eddy16112 commented 1 month ago

Could you please get a backtrace?

syamajala commented 1 month ago

I can't seem to get backtraces with a debug build and GASNET_BACKTRACE=1; it just says "GASNet abnormal exit".

eddy16112 commented 1 month ago

Could you please try a RelWithDebInfo build? I would like to see which message triggers the error.

syamajala commented 1 month ago

Here is a stacktrace in debug: http://sapling2.stanford.edu/~seshu/xcsl1028423/backtrace.txt

qldnfox commented 1 month ago

It says 404 Forbidden for me. Can I get access?

syamajala commented 1 month ago

Try it now.

syamajala commented 1 month ago

This is probably the relevant stack trace:

[5] Thread 70 (Thread 0x7fa2f0ebb740 (LWP 1499908) "python"):
[5] #0  0x00007fa44f15dbbf in wait4 () from /lib64/libc.so.6
[5] #1  0x00007fa44f0d4c37 in do_system () from /lib64/libc.so.6
[5] #2  0x00007fa3a7426964 in gasneti_system_redirected () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #3  0x00007fa3a7426fb7 in gasneti_bt_gdb () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #4  0x00007fa3a742a95e in gasneti_print_backtrace () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #5  0x00007fa3a63efe45 in gasneti_defaultSignalHandler () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #6  <signal handler called>
[5] #7  0x00007fa44f0c6d2b in raise () from /lib64/libc.so.6
[5] #8  0x00007fa44f0c83e5 in abort () from /lib64/libc.so.6
[5] #9  0x00007fa3a6a20bc3 in Realm::XmitSrcDestPair::reserve_pbuf_inline (this=0x559a6ba88f00, hdr_bytes=36, payload_bytes=10976, overflow_ok=true, pktbuf=@0x559a6bac94f8: 0x0, pktidx=@0x559a6bac9500: -1, hdr_base=@0x7fa2f0eb8548: 0x7fa2f0eb85c0, payload_base=@0x7fa2f0eb8550: 0x0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1258
[5] #10 0x00007fa3a6a2adf7 in Realm::GASNetEXInternal::prepare_message (this=0x559a66cefff0, target=0, target_ep_index=0, msgid=54, header_base=@0x7fa2f0eb8548: 0x7fa2f0eb85c0, header_size=36, payload_base=@0x7fa2f0eb8550: 0x0, payload_size=10976, dest_payload_addr=0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3927
[5] #11 0x00007fa3a6a1a647 in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x7fa2f0eb8540, _internal=0x559a66cefff0, _target=0, _msgid=54, _header_size=36, _max_payload_size=10976, _src_payload_addr=0x0, _src_payload_lines=0, _src_payload_line_stride=0, _dest_payload_addr=0, _dest_ep_index=0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_module.cc:221
[5] #12 0x00007fa3a6a1c547 in Realm::GASNetEXModule::create_active_message_impl (this=0x559a6662e6f0, target=0, msgid=54, header_size=36, max_payload_size=10976, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7fa2f0eb8540, storage_size=256) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_module.cc:670
[5] #13 0x00007fa3a6418fb5 in Realm::Network::create_active_message_impl (target=0, msgid=54, header_size=32, max_payload_size=10976, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7fa2f0eb8540, storage_size=256) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/network.inl:100
[5] #14 0x00007fa3a67f9986 in Realm::ActiveMessage<Realm::BarrierTriggerMessage, 256ul>::init (this=0x7fa2f0eb8520, _target=0, _max_payload_size=10976) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.inl:53
[5] #15 0x00007fa3a67f7aec in Realm::ActiveMessage<Realm::BarrierTriggerMessage, 256ul>::ActiveMessage (this=0x7fa2f0eb8520, _target=0, _max_payload_size=10976) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.inl:44
[5] #16 0x00007fa3a67f2eba in Realm::BarrierTriggerMessage::send_request (target=0, barrier_id=2305930970166984704, trigger_gen=382, previous_gen=39, first_generation=0, redop_id=1048576, migration_target=-1, base_arrival_count=6, data=0x7f7fe45bc760, datalen=10976) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/barrier_impl.cc:288
[5] #17 0x00007fa3a67f4825 in Realm::BarrierImpl::adjust_arrival (this=0x7f81ac043dd0, barrier_gen=40, delta=-1, timestamp=0, wait_on=..., sender=0, forwarded=false, reduce_value=0x7f824a237d80, reduce_value_size=32, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/barrier_impl.cc:688
[5] #18 0x00007fa3a67f2c09 in Realm::BarrierAdjustMessage::handle_message (sender=0, args=..., data=0x7f824a237d80, datalen=32, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/barrier_impl.cc:249
[5] #19 0x00007fa3a67f694c in Realm::HandlerWrappers::wrap_handler<Realm::BarrierAdjustMessage, Realm::BarrierAdjustMessage::handle_message> (sender=0, header=0x7f824a237d50, payload=0x7f824a237d80, payload_size=32, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.inl:620
[5] #20 0x00007fa3a6a38dc7 in Realm::IncomingMessageManager::do_work (this=0x559a67f48a20, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.cc:740
[5] #21 0x00007fa3a67d6663 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7fa2f0eba0d0, max_time_in_ns=-1, interrupt_flag=0x0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/bgwork.cc:600
[5] #22 0x00007fa3a67d4301 in Realm::BackgroundWorkThread::main_loop (this=0x559a6b84b910) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/bgwork.cc:103
[5] #23 0x00007fa3a67d7902 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x559a6b84b910) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/threads.inl:97
[5] #24 0x00007fa3a6986a21 in Realm::KernelThread::pthread_entry (data=0x559a664d9dc0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/threads.cc:854
[5] #25 0x00007fa44f3d46ea in start_thread () from /lib64/libpthread.so.0
[5] #26 0x00007fa44f19449f in clone () from /lib64/libc.so.6

qldnfox commented 1 month ago

Worked! Thanks, Seshu.

lightsighter commented 1 month ago

The backtrace doesn't look like it is from a CRC check. It looks like it is coming from this error message:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L1252-1259

That's probably a bug in Realm where it is trying to send a medium active message when it needs to switch to sending a long active message. It probably also explains the CRC check failure on the far side, because only some of the payload makes it across in release mode.
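The hypothesis above can be illustrated with a small sketch. It is not Realm's code: `pick_am_kind` is a hypothetical helper, the payload bytes are made up, and `zlib.crc32` is only a stand-in for whatever checksum Realm computes. The sizes are the ones reported in this thread (`max=8192`, `payload_size=10976`). The sketch assumes that an oversized medium message would arrive truncated to the medium limit, which is a guess about the release-mode behavior rather than something confirmed here.

```python
import zlib

# Sizes taken from the log lines in this thread.
MAX_MEDIUM = 8192      # "max=8192" in the gexxpair message
PAYLOAD_SIZE = 10976   # "payload_size=10976" / "act=10976"


def pick_am_kind(payload_size: int, max_medium: int) -> str:
    """Hypothetical helper: pick the active message category by payload size."""
    return "medium" if payload_size <= max_medium else "long"


# Made-up payload contents; only the length matters for the illustration.
payload = bytes(i % 251 for i in range(PAYLOAD_SIZE))

# Checksum the sender would compute over the full payload.
expected_crc = zlib.crc32(payload)

# Assumption: only the first MAX_MEDIUM bytes make it across.
received = payload[:MAX_MEDIUM]
actual_crc = zlib.crc32(received)

print("message kind needed:", pick_am_kind(PAYLOAD_SIZE, MAX_MEDIUM))
print("bytes received:", len(received), "of", PAYLOAD_SIZE)
print("checksums match:", expected_crc == actual_crc)
```

Under those assumptions, the receiver checksums fewer bytes than the sender did, so a mismatch like the reported `exp=... act=...` pair would be the expected symptom.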

eddy16112 commented 1 month ago

I think it is the GASNet version of issue https://github.com/StanfordLegion/legion/issues/1769. I am surprised that the limit is only 8K.

[5] [5 - 7fa2f0ebb740]  543.639675 {6}{gexxpair}: medium payload too large!  src=5/0 tgt=0/0 max=8192 act=10976

lightsighter commented 1 month ago

I agree that is a likely cause of the problem.
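For anyone triaging similar logs, the `key=value` fields in the `gexxpair` line can be pulled out mechanically. This is a minimal sketch; the helper name `parse_fields` is hypothetical, and the regular expression assumes the fields keep the `name=value` shape seen in the line quoted above.

```python
import re

# Log line copied from this thread.
LINE = (
    "[5] [5 - 7fa2f0ebb740]  543.639675 {6}{gexxpair}: "
    "medium payload too large!  src=5/0 tgt=0/0 max=8192 act=10976"
)


def parse_fields(line: str) -> dict:
    """Hypothetical helper: collect name=value pairs from a log line."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


fields = parse_fields(LINE)
max_medium = int(fields["max"])   # configured medium payload limit
actual = int(fields["act"])       # payload size Realm tried to send
overshoot = actual - max_medium   # bytes over the limit

print(fields)
print("payload exceeds the medium limit by", overshoot, "bytes")
```

The difference between `act` and `max` shows how far the barrier trigger payload is over the medium limit in this run.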

elliottslaughter commented 1 month ago

The medium AM size can be set with GASNET_OFI_MAX_MEDIUM if you are on the ofi conduit: https://gasnet.lbl.gov/dist-ex/ofi-conduit/README

You don't even need to rebuild; it's just an environment variable.

We've seen this issue before: https://github.com/StanfordLegion/legion/issues/1449 and https://github.com/StanfordLegion/legion/issues/1229 are both variations on this same issue.
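As a rough sketch of the workaround, the variable can be placed in the environment passed to the job launcher. Only the name `GASNET_OFI_MAX_MEDIUM` comes from this thread; rounding the payload size up to a power of two is an assumption made here for illustration, the helper names are hypothetical, and the ofi-conduit README linked above is the place to check which values are actually accepted.

```python
import os


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def env_with_max_medium(payload_size: int, base_env=None) -> dict:
    """Hypothetical helper: copy an environment and set GASNET_OFI_MAX_MEDIUM.

    Assumption: a plain byte count, rounded up to a power of two, is an
    acceptable value. Check the ofi-conduit README before relying on this.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GASNET_OFI_MAX_MEDIUM"] = str(next_pow2(payload_size))
    return env


# 10976 is the payload size reported in the logs in this thread.
env = env_with_max_medium(10976, base_env={})
print(env["GASNET_OFI_MAX_MEDIUM"])
```

The resulting dictionary could then be handed to something like `subprocess.run(cmd, env=env)`, where `cmd` is whatever launch command is already in use. Exporting the variable in the job script before launching would have the same effect.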

syamajala commented 1 month ago

The CRC error goes away in release when I set GASNET_OFI_MAX_MEDIUM, but it's still not shutting down cleanly.

I just see "GASNet abnormal exit". Here is a stack trace:

[13] Thread 1 (Thread 0x7f5ad3659740 (LWP 1843020) "python"):
[13] #0  0x00007f5ad373dbbf in wait4 () from /lib64/libc.so.6
[13] #1  0x00007f5ad36b4c37 in do_system () from /lib64/libc.so.6
[13] #2  0x00007f5a2f0ba654 in gasneti_system_redirected () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #3  0x00007f5a2f0baca7 in gasneti_bt_gdb () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #4  0x00007f5a2f0be64e in gasneti_print_backtrace () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #5  0x00007f5a2e2cca96 in gasneti_error_abort () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #6  0x00007f5a2e2ccb87 in _gasneti_fatalerror () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #7  0x00007f5a2f0b49b2 in gasnetc_ofi_tx_poll () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #8  0x00007f5a2f0b4aec in gasnetc_ofi_poll () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #9  0x00007f5a2f0aa8e0 in gasnetc_AMPoll () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #10 0x00007f5a2f0aadbf in gasnetc_exit () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #11 0x00007f5a2e2d70f1 in gasneti_defaultSignalHandler () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #12 <signal handler called>
[13] #13 0x00007f5ad39c076b in raise () from /lib64/libpthread.so.0
[13] #14 <signal handler called>
[13] #15 0x00007f5ad376d759 in syscall () from /lib64/libc.so.6
[13] #16 0x00007f5a2e04c493 in ofi_intercept_munmap (start=0x7f34c8000000, length=51539607552) at prov/util/src/util_mem_hooks.c:547
[13] #17 0x00007f5a2e6c2d46 in Realm::SharedMemoryInfo::~SharedMemoryInfo() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #18 0x00007f5a2e678011 in Realm::RuntimeImpl::~RuntimeImpl() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #19 0x00007f5a2e67d061 in Realm::Runtime::wait_for_shutdown() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #20 0x00007f5a30279aed in Legion::Internal::Runtime::wait_for_shutdown() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegion.so.1
[13] #21 0x00007f5a317379be in legate::detail::Runtime::finish() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegate.so.24.09.00
[13] #22 0x00007f5a3171358d in legate::finish() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegate.so.24.09.00
[13] #23 0x00007f5a16befd62 in __pyx_f_6legate_4_lib_7runtime_7runtime_7Runtime_finish(__pyx_obj_6legate_4_lib_7runtime_7runtime_Runtime*, int) () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
[13] #24 0x00007f5a16bf663e in __pyx_f_6legate_4_lib_7runtime_7runtime__cleanup_legate_runtime() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
[13] #25 0x00007f5a16bf0e9b in __pyx_pw_11cfunc_dot_to_py_71__Pyx_CFunc_6legate_4_lib_7runtime_7runtime_void__lParen__rParen_to_py__1wrap(_object*, _object*) () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
[13] #26 0x000055fc811e3f9c in atexit_callfuncs (state=0x55fc815261a8 <_PyRuntime+79656>) at /usr/local/src/conda/python-3.12.7/Modules/atexitmodule.c:137
[13] #27 0x000055fc811d18c3 in _PyAtExit_Call (interp=<optimized out>) at /usr/local/src/conda/python-3.12.7/Modules/atexitmodule.c:157
[13] #28 Py_FinalizeEx () at /usr/local/src/conda/python-3.12.7/Python/pylifecycle.c:1918
[13] #29 0x000055fc811dfd40 in Py_RunMain () at /usr/local/src/conda/python-3.12.7/Modules/main.c:715
[13] #30 0x000055fc8119a067 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.12.7/Modules/main.c:767
[13] #31 0x00007f5ad369124d in __libc_start_main () from /lib64/libc.so.6
[13] #32 0x000055fc81199f11 in _start ()
[13] [Inferior 1 (process 1843020) detached]

I will try to get a stack trace from a debug build.