StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: gasnet1 issues on Perlmutter #1181

Closed syamajala closed 2 years ago

syamajala commented 2 years ago

With the gasnet1 network layer in Realm on Perlmutter, I am able to do some multinode runs, but at 24 nodes I start seeing this:

[70 - 7feae62af700]  365.765712 {6}{realm}: invalid event handle: id=c2ab7f8
s3d.x: /global/u1/s/seshu/legion_s3d/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
lightsighter commented 2 years ago

At a minimum you need to provide a full backtrace.

syamajala commented 2 years ago

Here is the backtrace:

[54] Thread 9 (Thread 0x7efc400df700 (LWP 127107) "s3d.x"):
[54] #0  0x00007efc601f5217 in waitpid () from /lib64/libc.so.6
[54] #1  0x00007efc6017276f in do_system () from /lib64/libc.so.6
[54] #2  0x00007efc59bbdd90 in gasneti_system_redirected () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[54] #3  0x00007efc59bbe48e in gasneti_bt_gdb () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[54] #4  0x00007efc59bc1e25 in gasneti_print_backtrace () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[54] #5  0x00007efc593b6b3f in gasneti_defaultSignalHandler () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[54] #6  <signal handler called>
[54] #7  0x00007efc60165420 in raise () from /lib64/libc.so.6
[54] #8  0x00007efc60166a01 in abort () from /lib64/libc.so.6
[54] #9  0x00007efc6015da1a in __assert_fail_base () from /lib64/libc.so.6
[54] #10 0x00007efc6015da92 in __assert_fail () from /lib64/libc.so.6
[54] #11 0x00007efc59662e1a in Realm::RuntimeImpl::get_event_impl (this=<optimized out>, e=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/runtime_impl.cc:2458
[54] #12 0x00007efc59644f8d in Realm::ProcessorImpl::enqueue_or_defer_task (this=0xd47e0f8, task=0x7ef1386fff40, start_event=..., cache=0xd47e290) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/proc_impl.cc:515
[54] #13 0x00007efc59647125 in Realm::ProcessorGroupImpl::spawn_task (this=0xd47e0f8, func_id=4, args=0x7ef128fff740, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0xb6e6028, finish_gen=47, priority=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/proc_impl.cc:787
[54] #14 0x00007efc59643028 in Realm::Processor::spawn (this=this@entry=0x7ef128fff678, func_id=func_id@entry=4, args=args@entry=0x7ef128fff740, arglen=arglen@entry=112, reqs=..., wait_on=..., wait_on@entry=..., priority=6) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/id.h:73
[54] #15 0x00007efc5a861031 in Legion::Internal::Runtime::issue_runtime_meta_task<Legion::Internal::ShardedPhysicalTemplate::DeferTraceUpdateArgs> (this=0xb1ce370, args=..., priority=priority@entry=Legion::Internal::LG_LATENCY_MESSAGE_PRIORITY, precondition=precondition@entry=..., target=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.h:4675
[54] #16 0x00007efc5a85d345 in Legion::Internal::ShardedPhysicalTemplate::handle_update_view_user (this=0x7ef348d4a1c0, view=<optimized out>, user_expr=<optimized out>, derez=..., applied=..., done=..., dargs=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/legion_trace.cc:7942
[54] #17 0x00007efc5a85e27b in Legion::Internal::ShardedPhysicalTemplate::handle_trace_update (this=0x7ef348d4a1c0, derez=..., source=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/legion_trace.cc:7481
[54] #18 0x00007efc5a9ae632 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ef359ad9240, num_messages=num_messages@entry=1, runtime=runtime@entry=0xb1ce370, remote_address_space=remote_address_space@entry=102, args=0x7ef1716ad1c0 "\300", args@entry=0x7ef1716ad1ac "r", arglen=<optimized out>, arglen@entry=608) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:12535
[54] #19 0x00007efc5a9aef79 in Legion::Internal::VirtualChannel::process_message (this=0x7ef359ad9240, args=0x7ef1716ad1a4, arglen=608, runtime=0xb1ce370, remote_address_space=102) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:11765
[54] #20 0x00007efc5a9af511 in Legion::Internal::Runtime::legion_runtime_task (args=0x7ef1716ad190, arglen=624, userdata=<optimized out>, userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31181
[54] #21 0x00007efc5968dd19 in Realm::Task::execute_on_processor (this=0x7ef1711bef70, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:302
[54] #22 0x00007efc5968ddb6 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1594
[54] #23 0x00007efc596907ba in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x492ca30) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1075
[54] #24 0x00007efc596951f7 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
[54] #25 0x00007efc6017aca0 in ?? () from /lib64/libc.so.6
[54] #26 0x0000000000000000 in ?? ()
streichler commented 2 years ago

The value doesn't appear in the backtrace, but the assertion is saying that the Event value passed to wait_on in frame 14 is ill-formed. @lightsighter can we sanity-check the contents of the message being decoded in frames 16-19?
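
To illustrate what an "ill-formed" handle means in a check like this, here is a minimal sketch. The bit layout, tag value, and all names (`kTagShift`, `kEventTag`, `make_event_id`, `is_event_handle`, `event_payload`) are hypothetical and are not Realm's actual ID encoding; the sketch only assumes that a handle carries a type tag in its high bits and that a lookup asserts when the tag does not match.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout, for illustration only (NOT Realm's real encoding):
// the top 4 bits of a 64-bit handle hold a type tag, the rest is payload.
constexpr int kTagShift = 60;
constexpr std::uint64_t kEventTag = 0x8;  // assumed tag value for events
constexpr std::uint64_t kPayloadMask = (std::uint64_t{1} << kTagShift) - 1;

// Builds a well-formed event handle from a payload under the assumed layout.
constexpr std::uint64_t make_event_id(std::uint64_t payload) {
  return (kEventTag << kTagShift) | (payload & kPayloadMask);
}

// True only if the tag bits identify the handle as an event.
constexpr bool is_event_handle(std::uint64_t id) {
  return (id >> kTagShift) == kEventTag;
}

// Mirrors the shape of a lookup that asserts on a handle with the wrong tag.
inline std::uint64_t event_payload(std::uint64_t id) {
  assert(is_event_handle(id) && "invalid event handle");
  return id & kPayloadMask;
}
```

Under this assumed layout, a small value such as 0xc2ab7f8 has no tag bits set, so `is_event_handle` returns false and `event_payload` would trip its assertion. That is the kind of failure a garbage or never-written event field would produce.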

lightsighter commented 2 years ago

I'm pretty sure this is just an uninitialized std::atomic. Pull and try again.
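
For context on this class of bug: before C++20, a default-initialized `std::atomic<T>` holds an indeterminate value, so a reader that loads it before any store can see garbage. The sketch below is not the actual Legion change. `PendingUpdate`, `publish`, and `observe` are hypothetical names, and using 0 as a "no event" sentinel is an assumption made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an object that carries an event ID between
// threads; the names here are illustrative, not taken from Legion or Realm.
struct PendingUpdate {
  // Writing `std::atomic<std::uint64_t> ready_event;` with no initializer
  // would leave the value indeterminate before C++20. The brace initializer
  // gives every new object a defined starting value (0 = "no event" here).
  std::atomic<std::uint64_t> ready_event{0};
};

// Publishes an event ID; 0 is reserved as the sentinel in this sketch.
inline void publish(PendingUpdate& u, std::uint64_t id) {
  assert(id != 0 && "0 is reserved for 'no event'");
  u.ready_event.store(id, std::memory_order_release);
}

// Reads the current ID; returns 0 if nothing has been published yet.
inline std::uint64_t observe(const PendingUpdate& u) {
  return u.ready_event.load(std::memory_order_acquire);
}
```

With the initializer in place, a consumer that runs before the producer sees the sentinel instead of whatever bytes happened to be in memory, so it can defer or skip the wait instead of passing a garbage handle downstream.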

syamajala commented 2 years ago

I think this fixes the issue.