StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: hang at 256 nodes on Frontier #1626

Closed: syamajala closed this issue 8 months ago

syamajala commented 9 months ago

I'm seeing a hang at 256 nodes on Frontier with S3D running the latest control replication branch. It happens after the first 10 timesteps. I am also using the new GASNet release, 2023.9.0.

There are some stack traces here: http://sapling2.stanford.edu/~seshu/s3d_subrank/stacktraces/

lightsighter commented 9 months ago

Are you sure these backtraces are not changing? If they are not changing, then here is a message send that is getting stuck in an mprotect syscall while trying to pin the buffer:

#0  0x00007fffe4b22177 in mprotect () from /lib64/libc.so.6
#1  0x00007fffe4aac247 in sysmalloc () from /lib64/libc.so.6
#2  0x00007fffe4aad360 in _int_malloc () from /lib64/libc.so.6
#3  0x00007fffe4aae618 in malloc () from /lib64/libc.so.6
#4  0x00007fffe03c7527 in Realm::Serialization::DynamicBufferSerializer::DynamicBufferSerializer (this=0x7f744cfff358, initial_size=4160) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/serialize.inl:158
#5  Realm::ProcessorGroupImpl::spawn_task (this=0x7ffbc61a7fe0, func_id=4, args=0x7ffbc622d8c0, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x7f4486361d08, finish_gen=1, priority=2) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:791
#6  0x00007fffe03c415a in Realm::Processor::spawn (this=<optimized out>, func_id=4096, args=0x3, arglen=140737030267255, reqs=..., wait_on=..., priority=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:102
#7  0x00007fffe1e83159 in Legion::Internal::VirtualChannel::send_message (this=this@entry=0x7ffbc62241d0, complete=<optimized out>, runtime=<optimized out>, target=..., kind=Legion::Internal::DISTRIBUTED_DOWNGRADE_REQUEST, kind@entry=Legion::Internal::SEND_STARTUP_BARRIER, response=false, shutdown=<optimized out>, send_precondition=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11584
...
syamajala commented 9 months ago

I will do another run and take some stack traces over time, but 256, 512, and 1024 nodes all did not make it past 10 timesteps on Frontier.

syamajala commented 9 months ago

I reran and the stack traces seemed to have a lot of profiling stuff in them. It did eventually make progress but it took about 10 minutes to finish 10 timesteps. I ran without profiling and it took 3 minutes.

The stacktraces from running with the profiler are here: http://sapling2.stanford.edu/~seshu/s3d_subrank/stacktraces/

lightsighter commented 9 months ago

I don't think this is a hang. This is a result of the new profiling changes that log all the mapper API calls and your filesystem not being able to keep up with all the logging that we do now. Try filtering the API call logging with -lg:prof_call_threshold.
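
Conceptually, the flag filters like the sketch below. This is illustrative only, not the actual runtime code; the helper name is made up, and based on the discussion below the threshold appears to be in microseconds.

// Illustrative sketch of what -lg:prof_call_threshold does, not Legion's code:
// mapper calls shorter than the threshold are dropped from the profiling log.
inline bool should_log_mapper_call(long long start_us, long long stop_us,
                                   long long threshold_us /* -lg:prof_call_threshold */) {
  return (stop_us - start_us) >= threshold_us;  // a threshold of 0 logs every call
}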

syamajala commented 9 months ago

That was what I was thinking too, but I couldn't remember the details of what we talked about in the Legion meeting a few weeks ago and wasn't sure if those changes got merged yet. I will try that.

lightsighter commented 9 months ago

Right now the default is 0. The "old" default, which was hard-coded, was 100us.

syamajala commented 9 months ago

I tried -lg:prof_call_threshold 1000 but did not have any luck with that.

lightsighter commented 9 months ago

How large are the profiling logs that are produced now compared to before?

syamajala commented 9 months ago

It does not actually look like there is that much of a difference. For the runs I did on Nov 4th, the profile logs are around 75 MB/node at 128 nodes. The current runs are around 90 MB/node at 128 nodes, but some nodes are significantly larger for some reason, like 223 MB or 165 MB.

Also, the logs at 256 nodes are all only 33 KB, so I think it's not even writing anything during the run? Does the runtime only write the profile logs at shutdown?

lightsighter commented 9 months ago

Does the runtime only write the profile logs at shutdown?

The runtime does buffer profiling data in memory until either we exceed the buffer size (default 512 MB) or we reach the end of the runtime. You can force eager logging by setting the profiling buffer size to 0 with -lg:prof_footprint 0.
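
Roughly, the behavior is like the following sketch. This is illustrative only, not the actual Realm/Legion implementation; the class and member names here are made up.

#include <cstddef>
#include <vector>

// Illustrative sketch of the buffering described above, not Legion's actual code.
struct ProfRecord { /* opaque profiling event */ };

class ProfilingBuffer {
  std::vector<ProfRecord> records;
  size_t bytes = 0;
  size_t limit_bytes;                  // from -lg:prof_footprint (in MB)
public:
  explicit ProfilingBuffer(size_t limit_mb) : limit_bytes(limit_mb << 20) {}
  void record(const ProfRecord &r) {
    records.push_back(r);
    bytes += sizeof(ProfRecord);
    if (bytes > limit_bytes)           // a limit of 0 flushes on every record
      flush();
  }
  void shutdown() { flush(); }         // whatever remains is written at runtime exit
private:
  void flush() { /* append records to the per-node log file */ records.clear(); bytes = 0; }
};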

I think it would be good to try to make profiles of the slow-running case with profiling on so we can see what is going on.

How does the performance time with profiling OFF compare between the old version and the new version?

syamajala commented 9 months ago

I tried some more experiments: -lg:prof_footprint 0, -lg:prof_call_threshold 1000, and writing the logs to the local NVMe disks on the compute nodes instead of the network drive. None of them seemed to make a difference.

It does not seem to be possible to generate profiles from partial logs with -lg:prof_footprint 0. I could try to profile with VTune on sapling if that would be useful.

I never did runs without profiling with the old version so I don't have comparisons. I could compare the runs that did finish up to 128 nodes with profiling.

syamajala commented 9 months ago

I tried some more runs without profiling, but they seem to hang after 10 timesteps. They did work with -ll:force_kthreads, though.

lightsighter commented 9 months ago

What do the profiles with -ll:force_kthreads look like?

lightsighter commented 9 months ago

Do you have an update on this?

Can you also try running without tracing to see whether it hangs then, and whether there is any discrepancy between runs with and without profiling when tracing is off?

lightsighter commented 9 months ago

I'd like to figure this out before we merge the control replication branch.

syamajala commented 9 months ago

I will do some more runs.

I compared the runs that did work against the old runs with profiling, and they look very similar up to 128 nodes. I'm not able to produce any profiles at 256 nodes with the latest runtime.

syamajala commented 9 months ago

It seg faults after 10 timesteps if I turn off tracing:

[104] #3  0x00007fffe0d2072b in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[104] #4  0x00007fffe0d16dff in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[104] #5  0x00007fffe0e3451a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[104] #6  <signal handler called>
[104] #7  0x00007fffe4a5bcbb in raise () from /lib64/libc.so.6
[104] #8  0x00007fffe4a5d355 in abort () from /lib64/libc.so.6
[104] #9  0x00007fffdbaff5b9 in __gnu_cxx::__verbose_terminate_handler () at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/vterminate.cc:95
[104] #10 0x00007fffdbb0abea in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_terminate.cc:48
[104] #11 0x00007fffdbb0ac55 in std::terminate () at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_terminate.cc:58
[104] #12 0x00007fffdbb0aea7 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7fffdbe6ce58 <typeinfo for std::bad_alloc>, dest=0x7fffdbb09260 <std::bad_alloc::~bad_alloc()>) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_throw.cc:98
[104] #13 0x00007fffdbaff1c4 in operator new (sz=sz@entry=560854725137600) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/new_op.cc:54
[104] #14 0x00007fffe183b9ba in __gnu_cxx::new_allocator<unsigned int>::allocate (this=0x7f77a81c5f68, __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/ext/new_allocator.h:111
[104] #15 std::allocator_traits<std::allocator<unsigned int> >::allocate (__a=..., __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/alloc_traits.h:436
[104] #16 std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_M_allocate (this=0x7f77a81c5f68, __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/stl_vector.h:172
[104] #17 std::vector<unsigned int, std::allocator<unsigned int> >::_M_default_append (this=0x7f77a81c5f68, __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/vector.tcc:571
[104] #18 0x00007fffe1c7415b in std::vector<unsigned int, std::allocator<unsigned int> >::resize (this=0x7f77a81c5f68, __new_size=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/stl_vector.h:692
[104] #19 Legion::Internal::ProjectionTreeExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:16609
[104] #20 0x00007fffe1c77c55 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7fb711ec8a40, stage=stage@entry=0, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12490
[104] #21 0x00007fffe1c77b13 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7fb711ec8a40, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12232
[104] #22 0x00007fffe1e84108 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffa802e6160, num_messages=num_messages@entry=1, runtime=0x7f926dfff4b0, runtime@entry=0x8cf8e40, remote_address_space=<optimized out>, remote_address_space@entry=111, args=0x7f76c40e53f0 "", args@entry=0x7f76c40e53dc "\020\001", arglen=<optimized out>, arglen@entry=48) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13433
[104] #23 0x00007fffe1e8385f in Legion::Internal::VirtualChannel::process_message (this=0x7ffa802e6160, args=args@entry=0x0, arglen=48, arglen@entry=147820096, runtime=0x8cf8e40, remote_address_space=111) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11795
[104] #24 0x00007fffe1ec0760 in Legion::Internal::MessageManager::receive_message (this=0x0, args=<optimized out>, arglen=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13581
[104] #25 0x00007fffe1ed7d4e in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=64, userdata=<optimized out>, userlen=<optimized out>, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:32159
[104] #26 0x00007fffe03c982d in Realm::LocalTaskProcessor::execute_task (this=0x72c1580, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
[104] #27 0x00007fffe0409a2c in Realm::Task::execute_on_processor (this=0x7f77a83d0240, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
[104] #28 0x00007fffe0410043 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x2) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
[104] #29 0x00007fffe040d32f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x751d600) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
[104] #30 0x00007fffe041849d in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
[104] #31 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[104] #32 0x0000000000000000 in ?? ()
lightsighter commented 9 months ago

That looks like a control replication violation. Run with -lg:safe_ctrlrepl 1.

lightsighter commented 9 months ago

If the safe control replication checks don't reveal anything then make a reproducer on sapling.

syamajala commented 9 months ago

It says this point task launch is a control replication violation: https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/s3d.rg?ref_type=heads#L1459

https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/mpi_tasks.rg?ref_type=heads#L127

elliottslaughter commented 9 months ago

That error looks correct to me. You are specifying a different task to launch on every shard.

Shouldn't you launch get_timestep_info(lp_q_rank[{0, 0, 0}])? That will run one point task, and then Legion will broadcast the result.
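
In C++ runtime terms the distinction is roughly the sketch below (the task ID, lr_q, and lp_q_rank names are hypothetical stand-ins for the S3D Regent code; the Regent launch compiles down to something along these lines). Under control replication every shard must issue an identical launch, so naming a fixed subregion color keeps the launch the same everywhere, and Legion runs a single point task and broadcasts the returned future.

#include "legion.h"
using namespace Legion;

// Sketch only: under control replication every shard must issue the same launch.
// Using the fixed color {0,0,0} (rather than a shard-dependent color) makes the
// launch identical on all shards, so one point task runs and its Future is
// broadcast to every shard.
Future launch_get_timestep_info(Runtime *rt, Context ctx,
                                TaskID GET_TIMESTEP_INFO_TASK_ID,   // hypothetical
                                LogicalRegion lr_q, LogicalPartition lp_q_rank)
{
  const DomainPoint color = Point<3>(0, 0, 0);
  LogicalRegion sub = rt->get_logical_subregion_by_color(ctx, lp_q_rank, color);
  TaskLauncher launcher(GET_TIMESTEP_INFO_TASK_ID, TaskArgument(NULL, 0));
  launcher.add_region_requirement(
      RegionRequirement(sub, READ_ONLY, EXCLUSIVE, lr_q));
  return rt->execute_task(ctx, launcher);
}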

syamajala commented 9 months ago

I had originally tried get_timestep_info(lp_q_rank[{0, 0, 0}]) back when I first moved the MPI tasks to Regent, and from what I can remember the performance was significantly worse because it was doing a copy from node 0 to every other node.

I have some other ideas I can try though.

lightsighter commented 9 months ago

I had originally tried get_timestep_info(lp_q_rank[{0, 0, 0}]) back when I first moved the MPI tasks to Regent, and from what I can remember the performance was significantly worse because it was doing a copy from node 0 to every other node.

In that case you should be using the functionality in the mapper to replicate that individual task launch on every node and produce multiple copies of the data in different places. Replication is replication, regardless of whether it is control replication or not. The mapper interface is fully general now when it comes to replicating tasks.
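
Roughly something like the sketch below in your mapper. This is only a sketch: GET_TIMESTEP_INFO_TASK_ID is a placeholder for however the task is registered, and exact field names may differ between Legion versions.

#include "legion.h"
#include "default_mapper.h"
using namespace Legion;
using namespace Legion::Mapping;

// Sketch: ask the runtime to replicate a single task launch so every shard/node
// gets its own copy of the task (and of the data it produces).
class S3DMapper : public DefaultMapper {
public:
  static constexpr TaskID GET_TIMESTEP_INFO_TASK_ID = 1;   // placeholder
  using DefaultMapper::DefaultMapper;

  void select_task_options(const MapperContext ctx, const Task &task,
                           TaskOptions &output) override {
    DefaultMapper::select_task_options(ctx, task, output);
    if (task.task_id == GET_TIMESTEP_INFO_TASK_ID)
      output.replicate = true;   // request replication of this individual launch
  }
};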

lightsighter commented 9 months ago

We have to fix the control replication violation though. That could be responsible for any number of crashes and hangs. Once you fix it we should try it again both with and without tracing.

syamajala commented 9 months ago

I am testing a fix now.

lightsighter commented 9 months ago

That will run one point task, and then Legion will broadcast the result.

If you're broadcasting regions, then the collective broadcast copy logic should kick in. If this is a future, you should try using local function tasks.

elliottslaughter commented 9 months ago

I'm pretty sure this is logic that amounts to:

cur_time += dt

There really is no reason to launch a task around this, except that we have to dig the logic out of Fortran to figure out what it is. But once we know the logic there is no reason to use fancy mechanisms to compute this.
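
That is, once the logic is dug out (sketch with hypothetical names below): if the update is just arithmetic on values every shard already has, it can be computed inline in the control-replicated parent task, and each shard performs the identical deterministic update.

// Sketch with hypothetical names: compute the timestep update inline in the
// (control-replicated) parent task instead of launching a task for it. Every
// shard evaluates the same deterministic expression, so no launch, broadcast,
// or other special mechanism is needed.
struct TimestepInfo {
  double cur_time;
  double dt;
};

inline void advance_timestep(TimestepInfo &ts) {
  ts.cur_time += ts.dt;   // identical result on every shard
}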

lightsighter commented 9 months ago

Seems like a really good use case for local function tasks (which are control replication safe).

syamajala commented 9 months ago

The control replication violation is fixed. I will try another run on Frontier.

syamajala commented 9 months ago

Seems like profiling is working on Frontier now at 256 nodes as well. Will continue to try scaling.

lightsighter commented 9 months ago

Seems like profiling is working on Frontier now at 256 nodes as well. Will continue to try scaling.

Is that both with and without tracing?

syamajala commented 9 months ago

With tracing. I will try without. It looks like there are issues at 512 nodes, though.

lightsighter commented 9 months ago

What kind of issues at 512 nodes? With tracing or without tracing?

syamajala commented 8 months ago

I don't see 512 nodes making progress after 10 timesteps with profiling on or off.

Turning off tracing at 256 nodes, I see a crash after 10 timesteps:

[198] Thread 6 (Thread 0x7fff9247ef80 (LWP 51226) "s3d.x"):
[198] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6
[198] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6
[198] #2  0x00007fffe0d25d16 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #3  0x00007fffe0d256bb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #4  0x00007fffe0d1beef in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #5  0x00007fffe0e3934a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #6  <signal handler called>
[198] #7  Legion::Deserializer::deserialize<unsigned int> (this=0x7fb71efff920, element=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_utilities.h:1119
[198] #8  Legion::Internal::ProjectionTreeExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:16621
[198] #9  0x00007fffe1c7c1a5 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f76e025d120, stage=stage@entry=0, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12500
[198] #10 0x00007fffe1c7c063 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f76e025d120, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12242
[198] #11 0x00007fffe1e88bc8 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffbc53f0d90, num_messages=num_messages@entry=1, runtime=0x7f767ebf0a70, runtime@entry=0x8ceb7e0, remote_address_space=<optimized out>, remote_address_space@entry=195, args=0x7f767ebf0a70 "", args@entry=0x7f767ebf0a5c "\020\001", arglen=<optimized out>, arglen@entry=48) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13427
[198] #12 0x00007fffe1e8831f in Legion::Internal::VirtualChannel::process_message (this=0x7ffbc53f0d90, args=args@entry=0x0, arglen=48, arglen@entry=147765216, runtime=0x8ceb7e0, remote_address_space=195) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11789
[198] #13 0x00007fffe1ec5240 in Legion::Internal::MessageManager::receive_message (this=0x410590 <wallpressbc+6112>, args=<optimized out>, arglen=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13575
[198] #14 0x00007fffe1edc82e in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=64, userdata=<optimized out>, userlen=<optimized out>, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:32153
[198] #15 0x00007fffe03cec3d in Realm::LocalTaskProcessor::execute_task (this=0x7513120, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
[198] #16 0x00007fffe040ef7c in Realm::Task::execute_on_processor (this=0x7f7548acad80, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
[198] #17 0x00007fffe0415593 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x7f7547fff010) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
[198] #18 0x00007fffe041287f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x72016e0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
[198] #19 0x00007fffe041d9ed in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
[198] #20 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[198] #21 0x0000000000000000 in ?? ()
lightsighter commented 8 months ago

The backtrace still looks like a control replication violation. Are you running with the checks when you get that assertion?

Get backtraces from all the nodes with -lg:inorder -ll:force_kthreads with tracing off on 512 nodes. Make sure that they are not changing.

syamajala commented 8 months ago

I ran with -lg:safe_ctrlrepl 1 and it did not give any errors like it did before. I will do a run on 512 nodes with -lg:inorder -ll:force_kthreads and tracing off.

syamajala commented 8 months ago

512 nodes worked with -lg:inorder -ll:force_kthreads and profiling turned on without tracing.

lightsighter commented 8 months ago

Ok, and what happens if you run without -lg:inorder?

lightsighter commented 8 months ago

Please capture backtraces if it does hang.

syamajala commented 8 months ago

It crashed when running with just -lg:inorder. Here is a stacktrace:

#0  0x00007fffe4aeccc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fffe4af29c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007fffe4af28da in sleep () from /lib64/libc.so.6
#3  0x00007fffe0d266d3 in gasneti_freezeForDebuggerNow () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
#4  0x00007fffe0e3933c in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
#5  <signal handler called>
#6  Legion::Deserializer::deserialize<long long> (this=0x7faad50c53a0, element=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_utilities.h:1119
#7  Legion::Internal::ProjectionNode::IntervalTree::deserialize (this=0x7f8630cfa998, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_analysis.cc:2400
#8  0x00007fffe1c78489 in Legion::Internal::ProjectionTreeExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:16612
#9  0x00007fffe1c7c1a5 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f85a1a974a0, stage=stage@entry=0, derez=...)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12500
#10 0x00007fffe1c7c063 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f85a1a974a0, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12242
#11 0x00007fffe1e88bc8 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffbc5089800, num_messages=num_messages@entry=1, runtime=0x0, runtime@entry=0xa83eab0, remote_address_space=<optimized out>, 
    remote_address_space@entry=19, args=0x7f74dded8af0 "", args@entry=0x7f74dded8adc "\020\001", arglen=<optimized out>, arglen@entry=48) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13427
#12 0x00007fffe1e8831f in Legion::Internal::VirtualChannel::process_message (this=0x7ffbc5089800, args=args@entry=0x0, arglen=48, arglen@entry=176417456, runtime=0xa83eab0, remote_address_space=19)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11789
#13 0x00007fffe1ec5240 in Legion::Internal::MessageManager::receive_message (this=0x7f74dded8af0, args=<optimized out>, arglen=<optimized out>)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13575
#14 0x00007fffe1edc82e in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=64, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:32153
#15 0x00007fffe03cec3d in Realm::LocalTaskProcessor::execute_task (this=0x6f14880, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
#16 0x00007fffe040ef7c in Realm::Task::execute_on_processor (this=0x7f74ddedf020, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
#17 0x00007fffe0415593 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x7f8630cfa998) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
#18 0x00007fffe041287f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x79e1480) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
#19 0x00007fffe041d9ed in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
#20 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
#21 0x0000000000000000 in ?? ()
syamajala commented 8 months ago

Here are stack traces from every rank: http://sapling2.stanford.edu/~seshu/s3d_subrank/dump/

I also saved a core file from the rank that I get that stack trace above from.

lightsighter commented 8 months ago

Rebuild with -DDEBUG_LEGION_COLLECTIVES and try again.

syamajala commented 8 months ago

With -DDEBUG_LEGION_COLLECTIVES it crashes at startup. Here's what I see:

[326] Thread 5 (Thread 0x7ffe4b7d7f80 (LWP 85312) "s3d.x"):                                                                           
[326] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6                                                                        
[326] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6                                                                    
[326] #2  0x00007fffe0d21d16 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #3  0x00007fffe0d216bb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #4  0x00007fffe0d17eef in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #5  0x00007fffe0e3534a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #6  <signal handler called> 
[326] #7  0x00007fffdea56c0e in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #8  0x00007fffdea5b9c0 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #9  0x00007fffdea5d9e3 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #10 0x00007fffdea36bd6 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #11 0x00007fffde9125ff in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #12 0x00007fffde912780 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #13 0x00007fffde912898 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #14 0x00007fffde955c52 in hipMemcpy2DAsync () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #15 0x00007fffe04387ad in Realm::Hip::GPUXferDes::progress_xd (this=<optimized out>, this@entry=0x7fc344162a00, channel=<optimized out>, work_until=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/hip/hip_internal.cc:346
[326] #16 0x00007fffe043e4bd in Realm::XDQueue<Realm::Hip::GPUChannel, Realm::Hip::GPUXferDes>::do_work (this=0x6f888a8, work_until=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/transfer/channel.inl:157
[326] #17 0x00007fffe032b138 in Realm::BackgroundWorkManager::Worker::do_work (this=this@entry=0x7ffe4b7d5c58, max_time_in_ns=<optimized out>, max_time_in_ns@entry=-1, interrupt_flag=interrupt_flag@entry=0x0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/bgwork.cc:599 
[326] #18 0x00007fffe032aae8 in Realm::BackgroundWorkThread::main_loop (this=0x7572600) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/bgwork.cc:103
[326] #19 0x00007fffe0417cce in Realm::KernelThread::pthread_entry (data=0x7d71da0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:831
[326] #20 0x00007fffe88936ea in start_thread () from /lib64/libpthread.so.0
[326] #21 0x00007fffe4b28a6f in clone () from /lib64/libc.so.6
lightsighter commented 8 months ago

Does that happen repeatedly? It either suggests an issue with Realm or with HIP.

syamajala commented 8 months ago

It does not appear to be repeatable. The second run did start up, but it looks like it's either progressing very slowly or it's hung after 1 timestep. I will let it run a little longer and dump stack traces over time so we can see.

syamajala commented 8 months ago

There are some logs from the second run here: http://sapling2.stanford.edu/~seshu/s3d_subrank/debug_collective/

syamajala commented 8 months ago

The run with -DDEBUG_LEGION and tracing off is crashing at 256 nodes with this error:

[56 - 7fff96237f80]   16.819969 {5}{runtime}: [error 93] LEGION ERROR: Unable to find entry for color 9 in index space 200. (from file /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:8057)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_93

It looks like this happens at smaller node counts too, so I'd guess you'd want a reproducer on sapling for this?

lightsighter commented 8 months ago

Yes, if possible. Let's start with a backtrace. Does it happen at start-up or during the same phase of the execution?

syamajala commented 8 months ago

It is at startup. The smallest node count it seems to happen at is 8 nodes. 1, 2, and 4 nodes work, so it may not fit on sapling, but I can try.

Here is a stack trace:

[6] Thread 9 (Thread 0x7fff92447f80 (LWP 120137) "s3d.x"):
[6] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6
[6] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6
[6] #2  0x00007fffe0af5d16 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #3  0x00007fffe0af56bb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #4  0x00007fffe0aebeef in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #5  0x00007fffe0c0934a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #6  <signal handler called>
[6] #7  0x00007fffe4a5bcbb in raise () from /lib64/libc.so.6
[6] #8  0x00007fffe4a5d355 in abort () from /lib64/libc.so.6
[6] #9  0x00007fffe1d91ea3 in Legion::Internal::Runtime::report_error_message (id=<optimized out>, file_name=<optimized out>, line=<optimized out>, message=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:31930
[6] #10 0x00007fffe1d17508 in Legion::Internal::IndexSpaceNode::get_child (this=0x7ffbc4300b40, c=c@entry=9, defer=defer@entry=0x0, can_fail=false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:8055
[6] #11 0x00007fffe1d17d21 in Legion::Internal::RegionTreeForest::get_index_partition (this=<optimized out>, parent=..., color=9) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:923
[6] #12 0x00007fffe1e077a5 in Legion::Internal::Runtime::get_index_partition (this=<optimized out>, parent=..., color=9) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:18057
[6] #13 0x00007fffe2a70e13 in legion_terra_index_cross_product_get_subpartition_by_color_domain_point (runtime_=..., prod=..., color_=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/bindings/regent/regent_partitions.cc:645
[6] #14 0x00007fffe496fbf6 in $<main> () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
[6] #15 0x00007fffe491399d in $__regent_task_main_primary () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
[6] #16 0x00007fffe019ec3d in Realm::LocalTaskProcessor::execute_task (this=0x6d1f310, func_id=1390, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
[6] #17 0x00007fffe01def7c in Realm::Task::execute_on_processor (this=0x7ffbc4009a90, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
[6] #18 0x00007fffe01e5593 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x2) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
[6] #19 0x00007fffe01e287f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x6817e40) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
[6] #20 0x00007fffe01ed9ed in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
[6] #21 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[6] #22 0x0000000000000000 in ?? ()
syamajala commented 8 months ago

This seems suspicious to me; maybe I should poke around a bit first.