Freeze on Crusher on 16 nodes

elliottslaughter commented 2 years ago

I'm seeing a freeze on Crusher with Stencil. Some information about the configuration:

-lg:no_tracing -lg:inorder -lg:safe_ctrlrepl 2 -ll:force_kthreads
freezes in both CPU-only and GPU configurations
8 nodes, 8 ranks per node (= a total of 64 ranks) (note: for GPU configuration, needs 16 nodes to freeze)
same flags/configuration passes on all node counts <= 4 (8 for GPU) (weak scaling)
does not matter whether I use -gex:immediate 0 or not (I mention this because it's a workaround @syamajala previously was recommended by @streichler)
DEBUG=0 or DEBUG=1 doesn't make a difference
REALM_NETWORKS=gasnetex
Legion control_replication 1092389c61bb5d365f740f74ba21c4b6ed56cdd7
GASNet 2022.3.0
ROCm 4.5.0 for GPU runs

Here are backtraces of the freezes (captured after the application has been frozen for 10 minutes, typical running time < 30 seconds):

CPU, 8 nodes: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_backtrace_cpu/
same, with -gex:immediate 0: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_geximm0_backtrace_cpu/
GPU, 16 nodes: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_backtrace/
same, with -gex:immediate 0: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_geximm0_backtrace/

lightsighter commented 2 years ago

Try the fix in #1262

elliottslaughter commented 2 years ago

Based on the suggestion in https://github.com/StanfordLegion/legion/issues/1262#issuecomment-1136091192, I ran from -gex:obcount 256 up to -gex:obcount 2048 by powers of 2 (with the CPU-only configuration from my original post). All my runs still freeze at 8 nodes and above.

If there was something else I was supposed to get out of the discussion at #1262, please let me know.

eddy16112 commented 2 years ago

Just want to confirm that you only see the hang with gasnetex? If so, can you try to run with -gex:bindhip 0?

elliottslaughter commented 2 years ago

Yes, it is gasnetex, but note that it freezes in a CPU-only configuration. I didn't recompile with USE_HIP=0, but I did run without any -ll:gpu, -ll:fsize or similar.

Do you think I need to rebuild with USE_HIP=0 to prove that HIP is not related?

eddy16112 commented 2 years ago

I think if you did not run with -ll:gpu, then the HIP is not related, but it is better to give a try with USE_HIP=0

elliottslaughter commented 2 years ago

The freeze still occurs after a clean rebuild with USE_HIP=0. So that confirms this is unrelated to HIP/AMD, and in principle ought to replicate on any machine with Slingshot 11.

syamajala commented 2 years ago

Might be worth testing on the CPU only nodes of Perlmutter. They are available and use slingshot-11.

elliottslaughter commented 2 years ago

I can do that, but would appreciate advice from @lightsighter or @streichler on what direction to look next. Is replication on another machine helpful, or is there something else to check first?

lightsighter commented 2 years ago

Run with -level dma=2 and dump the log files.

elliottslaughter commented 2 years ago

Log files:

-level dma=2: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_logdma.tar.xz
-level task=2,dma=2,xplan=1,new_dma=2,xd=2: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_logmore.tar.xz

lightsighter commented 2 years ago

Can you filter out the backtraces with filter_backtraces.py and list all the interesting ones here? I'm not going to click on each of them individually to download them.

lightsighter commented 2 years ago

There aren't any interesting backtraces. Did you run with -ll:force_kthreads? Also, how does this relate to #1250? FWIW: most cases where the application hangs with -lg:inorder are almost always either Realm or network bugs. Given the previous instability observed on Crusher, in this case I'm inclined to suspect the network as the primary culprit here as there's no evidence of any hanging Realm DMA operations.

elliottslaughter commented 2 years ago

Derp, I missed -ll:force_kthreads in the CPU backtraces. (The GPU backtraces did have -ll:force_kthreads.)

New CPU backtraces with -ll:force_kthreads: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_backtrace_new_cpu.tar.xz

I'm not sure I understand filter_backtraces.py, but running python3 ~/legion/tools/filter_backtraces.py bt_8x1_r0_121566.0_* does not print anything when I run it. A quick visual scan of the first three ranks didn't show anything noteworthy either.

@streichler also requested a new set of logs with -level task=2,dma=2,event=2,barrier=2: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/neweqcr_dcr_logmore2.tar.xz

This is likely the same freeze as #1250, but for a simpler code that passes -lg:safe_ctrlrepl 2 (by virtue of avoiding features like MPI interop). We can simplify debugging by cutting out features and complexity we don't need, and by using a smaller, standard application we can have more assurance that this is not an application issue (e.g., no application memory corruption). If we manage to fix this here, then we can go back to #1250 and check whether it fixes that freeze as well.

lightsighter commented 2 years ago

The filter_backtraces.py script is working correctly with these backtraces. There's not really anything interesting going on anywhere. I guess I'm surprised though because I would have expected to see something interesting given that you're running with -lg:inorder. The only way that we would hang without at least some interesting backtraces with -lg:inorder would be if we hung before the application started or after it finished. Is there any indication that either of those scenarios are occurring?

I'll try to make a pass at poking at the Realm logs a bit later.

Have we tried running this same code with the MPI network module? I'm just curious if that hangs as well.

elliottslaughter commented 2 years ago

For what it's worth, it seems to be hanging in the middle of the run. My estimate is that I should be seeing 28160 task invocations each for stencil and increment. What I'm getting (after adding printf/fflush calls) is 26373 for stencil and 26208 for increment. It seems to be freezing 94% of the way through the run. So it is neither a startup nor shutdown hang.

elliottslaughter commented 2 years ago

With REALM_NETWORKS=mpi I do not get a freeze at 8 nodes. I'm running more configurations now to see if it just freezes at a higher node count or whether the freeze is truly gone.

lightsighter commented 2 years ago

Are you running with physical tracing? If so that might explain why we don't see any interesting backtraces as the runtime won't block operation execution inside of a trace (for obvious reasons).

elliottslaughter commented 2 years ago

No, I had tracing shut off entirely.

To avoid mistakes, I ran with set -x to capture the precise command line in the backtrace files. For posterity, here's one command line from bt_8x1_r0_121566.0_0.out. You can see the -lg:no_tracing on the command line.

+ /ccs/home/eslaught/crusher/amd-gpu-test/legion/language/stencil.run1_cpu_nohip/stencil.idx -nx 32000 -ny 32000 -ntx 16 -nty 16 -tsteps 50 -tprune 30 -hl:sched 1024 -ll:cpu 4 -ll:util 1 -ll:bgwork 2 -ll:csize 15000 -ll:rsize 512 -ll:gsize 0 -level 5 -dm:replicate 1 -dm:same_address_space -lg:no_tracing -lg:inorder -lg:safe_ctrlrepl 2 -ll:force_kthreads

elliottslaughter commented 2 years ago

Update on scaling REALM_NETWORKS=mpi: either the performance drops off dramatically, or I am still seeing a freeze at higher node counts.

At 16 nodes, my first run passes, but the second timed out. In the timed out run, I see completion messages for 33 out of 512 (about 6%) of point tasks.

At 32 nodes, I was not able to get any runs to complete in the 30 minute time out. I got completion messages for 82 out of 1024 (about 8%) of point tasks.

I guess I can try a more generous time out to see if the code is merely slow.

elliottslaughter commented 2 years ago

Update on REALM_NETWORKS=mpi: with the 2 hour timeout (vs 30 minutes) my results are almost exactly the same. Exactly one run at 16 nodes succeeded. The second run at 16 nodes still had exactly 33 completion messages (out of 512). Zero runs at 32 nodes succeeded, and the one that froze had 81 completion messages (vs 82 with the 30 minute timeout, out of 1024).

So it seems genuinely frozen but with some possible nondeterminism. Though it is striking how similar the results are.

lightsighter commented 2 years ago

I still don't understand how we are freezing in the middle of a run with tracing and user-level threading off along with in-order execution on and not seeing any unusual backtraces. At the bare minimum, there should be one thread in one process running the top-level task which has not exited and is waiting on a subtask to finish. Can you print out when the top-level task exits and confirm that at least one hasn't finished? I just can't reconcile the backtraces with the kind of hang you're describing and I'm worried we're missing a vital clue because of that. It's likely, there is nothing interesting in the backtraces, but we really need to rule that out before doing anything else, but I have no faith in the backtraces that we are looking at right now.

lightsighter commented 2 years ago

How are you printing the backtraces?

elliottslaughter commented 2 years ago

I updated to the latest control_replication and the behavior seems to have changed (at least in some configurations).

With REALM_NETWORKS=gasnetex, I get a shutdown hang. This may be consistent with what I observed before, but I never actually ran the version of the app with printf debug statements with GASNet. Anyway, that explains the uninteresting backtraces.

With REALM_NETWORKS=mpi, I seem to be unable to make the application freeze at 16 nodes. Previously, I was seeing freezes in about 50% of my runs. It may be that it's nondeterministic and that I'm just getting (un)lucky. Anyway, it seems I need at least 32 nodes to reliably freeze now.

~Note that the failure mode in MPI at 32 nodes seems to be a startup freeze. This is different from what I observed before, where I always got at least partial output. I am now unable to get a printf to show up, even at the very beginning of the top-level task.~ (Edit: this is incorrect, please see my comment below.)

With MPI I also noticed this suspicious message being printed in every run at 32 nodes (and never at 16 nodes, at least in my runs):

PE 0: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to receive queue exhaustion. Setting environment variable FI_OFI_RXM_RX_SIZE to a higher number might circumvent this scenario. OFI retry continuing...

So this maybe points to network issues? I suppose it may be time to contact OLCF support.

I've got backtraces, though I don't expect them to be helpful/relevant. But let me know if you want them.

Footnotes:

While I don't think it affects my results, I was also seeing some very screwy behavior with stdout. Though I was using fflush(stdout) after every printf, as well as passing --unbuffered to srun, none of my output seemed to appear until the end of the run (which means, for runs that froze, the point where I attached to the process with gdb to collect the backtraces). I also got output from only odd-numbered ranks in GASNet. This is frankly some of the strangest behavior I've ever seen, but the output seems to be normal in passing runs, so right now I'm chalking it up to SLURM weirdness.

For posterity I've also created a branch to track my exact replication environment. You can see the relevant files below.

Script to collect backtraces: https://gitlab.com/StanfordLegion/legion/-/blob/crusher-freeze-repro/tools/run_with_backtrace.sh
Application (with debugging changes): https://gitlab.com/StanfordLegion/legion/-/blob/crusher-freeze-repro/language/examples/stencil_fast.rg
Mapper: https://gitlab.com/StanfordLegion/legion/-/blob/crusher-freeze-repro/language/examples/stencil.cc

elliottslaughter commented 2 years ago

Ok, I have some updates and corrections to my last post.

I seem to have finally fixed the issue collecting output. This had obscured what was going on.
In particular, both REALM_NETWORKS=gasnetex and mpi experience shutdown hangs. I misdiagnosed the latter in my last comment because the output was getting messed up.
However, there is still some reason to believe these might be distinct hangs. In gasnetex, the backtraces I get out are uninteresting, and filter_backtraces.py does not report anything. On the other hand, mpi has multiple interesting backtraces that go through PMPI_Send(), indicating that there may still be a network issue going on there. (And note: this is even with workarounds recommended by OLCF.)

About that warning in MPI from my last comment: I contacted OLCF, and they confirmed this is a known issue. They recommended the following workaround. As best I can tell, the only impact of this workaround is to silence the warning in my last comment. The application behavior does not seem to change either way (and specifically, I get similar backtraces whether I use the workaround or not).

export FI_MR_CACHE_MAX_COUNT=0
export FI_OFI_RXM_RX_SIZE=8192
export FI_OFI_RXM_TX_SIZE=8192

The more interesting aspect of my REALM_NETWORKS=mpi run is that I get non-trivial backtraces. Here's a sample of what I see out of filter_backtraces.py:

Thread 15 (Thread 0x7fab959f9d40 (LWP 41499) "stencil.idx"):
#0  0x00007fabda5ba5f4 in PMPI_Send () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#1  0x00007fabe0545f9c in Realm::MPI::AMSend (tgt=145, msgid=42, header_size=40, payload_size=120, header=0x7fab959f6118 "", payload=0x7fab9c9830a0 "\001", payload_lines=0, payload_line_stride=0, has_dest=0, dest=-1, remote_comp=0x0) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/mpi/am_mpi.cc:281 
#2  0x00007fabe054207e in enqueue_message (target=145, msgid=42, args=0x7fab959f6118, arg_size=40, payload=0x7fab9c9830a0, payload_size=120, payload_lines=0, payload_line_stride=0, dst_offset=-1, remote_comp=0x0) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/mpi/mpi_module.cc:34 
#3  0x00007fabe054365a in Realm::MPIMessageImpl::commit (this=0x7fab959f6090, act_payload_size=120) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/mpi/mpi_module.cc:513
#4  0x00007fabe0482d97 in Realm::ActiveMessage<Realm::SpawnTaskMessage, 256ul>::commit (this=0x7fab959f6070) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/activemsg.inl:448#5  0x00007fabe047e4bd in Realm::RemoteProcessor::spawn_task (this=0x7fab9c1dbc30, func_id=4, args=0x7faa70018090, arglen=88, reqs=..., start_event=..., finish_event=0x7fab962f2360, finish_gen=1, priority=6) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/proc_impl.cc:966 
#6  0x00007fabe047a3ca in Realm::Processor::spawn (this=0x7fab959f6520, func_id=4, args=0x7faa70018090, arglen=88, wait_on=..., priority=6) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/proc_impl.cc:88 
#7  0x00007fabdfbabc13 in Legion::Internal::VirtualChannel::send_message (this=0x7faa7ffe6150, complete=true, runtime=0xa8f230, target=..., kind=Legion::Internal::DISTRIBUTED_VALID_UPDATE, response=false, shutdown=false, send_precondition=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:11280 
#8  0x00007fabdfbab7fc in Legion::Internal::VirtualChannel::package_message (this=0x7faa7ffe6150, rez=..., k=Legion::Internal::DISTRIBUTED_VALID_UPDATE, flush=true, flush_precondition=..., runtime=0xa8f230, target=..., response=false, shutdown=false) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:11199 
#9  0x00007fabdfc1c902 in Legion::Internal::MessageManager::send_message<(Legion::Internal::MessageKind)65> (this=0x7faa7fd3b870, rez=..., flush=true, response=false, shutdown=false, flush_precondition=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:12897 
#10 0x00007fabdfbd9559 in Legion::Internal::Runtime::send_did_remote_valid_update (this=0xa8f230, target=145, rez=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:21888 
#11 0x00007fabdfd2049e in Legion::Internal::DistributedCollectable::send_remote_valid_decrement (this=0x7faa7800aff0, target=145, mutator=0x7fab959f6900, precondition=..., count=1) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/garbage_collection.cc:2028 
#12 0x00007fabdfaeccb0 in Legion::Internal::IndexSpaceNode::InvalidateRootFunctor::apply (this=0x7fab959f68d0, target=145) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/region_tree.cc:9575 
#13 0x00007fabdfb54841 in AVXTLBitMask<1024u>::map<Legion::Internal::IndexSpaceNode::InvalidateRootFunctor> (this=0x7faa61180f60, functor=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/bitmask.h:5779 
#14 0x00007fabdfb415ce in CompoundBitMask<AVXTLBitMask<1024u>, 1u, true>::map<Legion::Internal::IndexSpaceNode::InvalidateRootFunctor> (this=0x7faa7800b148, functor=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/bitmask.h:8221 
#15 0x00007fabdfb2dc80 in Legion::Internal::DistributedCollectable::map_over_remote_instances<Legion::Internal::IndexSpaceNode::InvalidateRootFunctor> (this=0x7faa7800aff0, functor=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/garbage_collection.h:631 
#16 0x00007fabdfaece2c in Legion::Internal::IndexSpaceNode::invalidate_root (this=0x7faa7800aff0, source=0, applied=std::set with 5008 elements = {...}, total_sharding_functor=false) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/region_tree.cc:9598 
#17 0x00007fabdfac6122 in Legion::Internal::RegionTreeForest::destroy_index_space (this=0xa969f0, handle=..., source=0, applied=std::set with 5008 elements = {...}, total_sharding_collective=false) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/region_tree.cc:783 
#18 0x00007fabdf72e380 in Legion::Internal::TaskContext::report_leaks_and_duplicates (this=0x7fab900238e0, preconditions=std::set with 5008 elements = {...}) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion_context.cc:1168 
#19 0x00007fabdf758658 in Legion::Internal::InnerContext::report_leaks_and_duplicates (this=0x7fab900238e0, preconds=std::set with 5008 elements = {...}) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion_context.cc:10052 
#20 0x00007fabdf6db35d in Legion::Internal::ShardTask::report_leaks_and_duplicates (this=0x7fab9001d710, preconditions=std::set with 5008 elements = {...}) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion_tasks.cc:7996 
#21 0x00007fabdfa5f7e7 in Legion::Internal::ShardManager::trigger_task_complete (this=0x7fab90017060, local=false, effects=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion_replication.cc:9495 
#22 0x00007fabdfa61e8c in Legion::Internal::ShardManager::handle_trigger_complete (derez=..., runtime=0xa8f230) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion_replication.cc:10170 
#23 0x00007fabdfbdfb03 in Legion::Internal::Runtime::handle_replicate_trigger_complete (this=0xa8f230, derez=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:24959 
#24 0x00007fabdfbadd3c in Legion::Internal::VirtualChannel::handle_messages (this=0x7faa71d2efc0, num_messages=1, runtime=0xa8f230, remote_address_space=211, args=0x7fab98c6a2a0 "", arglen=20) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:12636
#25 0x00007fabdfbac326 in Legion::Internal::VirtualChannel::process_message (this=0x7faa71d2efc0, args=0x7fab98c6a284, arglen=40, runtime=0xa8f230, remote_address_space=211) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:11469
#26 0x00007fabdfbae4d1 in Legion::Internal::MessageManager::receive_message (this=0x7faa71a138b0, args=0x7fab98c6a280, arglen=48) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/le
gion/runtime.cc:12910 
#27 0x00007fabdfbe2356 in Legion::Internal::Runtime::process_message_task (this=0xa8f230, args=0x7fab98c6a27c, arglen=52) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:25634 
#28 0x00007fabdfbf6e68 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fab98c6a270, arglen=56, userdata=0xa71480, userlen=8, p=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/runtime.cc:31112
#29 0x00007fabe047f301 in Realm::LocalTaskProcessor::execute_task (this=0x9eee60, func_id=4, task_args=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/proc_impl.cc:1135
#30 0x00007fabe02ae006 in Realm::Task::execute_on_processor (this=0x7fab982ec7e0, p=...) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/tasks.cc:302
#31 0x00007fabe02b1e7a in Realm::KernelThreadTaskScheduler::execute_task (this=0x9ef090, task=0x7fab982ec7e0) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/tasks.cc:1366
#32 0x00007fabe02b0cb9 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x9ef090) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/tasks.cc:1105
#33 0x00007fabe02b12f6 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x9ef090) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/tasks.cc:1217 
#34 0x00007fabe02b89e0 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x9ef090) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/threads.inl:97
#35 0x00007fabe0289ec7 in Realm::KernelThread::pthread_entry (data=0x7faa6418ac80) at /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/realm/threads.cc:774
#36 0x00007fabdce5d6ea in start_thread () from /lib64/libpthread.so.0
#37 0x00007fabddd0ca8f in clone () from /lib64/libc.so.6

I'd be curious if this indicates a Legion or a network issue. Unless I'm mistaken, even with this trace it's not a slam dunk in favor of being a network issue because, even though this is frozen in PMPI_Send, a hypothetical Realm bug could still result in a missing receive. I'm not saying there necessarily is a missing receive, but just that the backtrace doesn't conclusively prove that this is MPI's fault.

And finally, here are the backtraces themselves:

gasnetex: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/cpu_nohip_gex_printmore_backtrace3.tar.xz (note: these are uninteresting)
mpi: http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/cpu_nohip_mpi_printmore_backtrace_dev995_try2.tar.xz (note: I still need 32 nodes to make this freeze. Also includes the workarounds suggested by OLCF above.)

lightsighter commented 2 years ago

I'd be curious if this indicates a Legion or a network issue.

This is definitely a networking bug. MPI_ISend should always return in finite time regardless of what the rest of the network or other threads are doing. That's part of MPI's forward progress guarantee.

a hypothetical Realm bug could still result in a missing receive

Sure, and you would get a different kind of hang, but not one with a backtrace like this. (See comment above about MPI_ISend always needing to return in finite time regardless of anything else.) If this never returns then Legion can't make forward progress and I guarantee you that we will never finish. Legion needs Realm's spawn calls to always return in finite time to guarantee forward progress and similarly Realm requires MPI_ISend to always return in finite time to make it's own forward progress.

For the GASNet shutdown hangs, what does -level shutdown=2 report? Try building with -DDEBUG_SHUTDOWN_HANG too and see what the output looks like. If you aren't getting continuous output from the logger then the shutdown hang is almost certainly a networking issue since all we do when trying to establish quiescence is to send messages over and over again.

elliottslaughter commented 2 years ago

I'd just like to point out that the call in backtrace is not an MPI_Isend but an MPI_Send:

https://gitlab.com/StanfordLegion/legion/-/blob/crusher-freeze-repro/runtime/realm/mpi/am_mpi.cc#L281

I believe that it is valid MPI semantics for MPI_Send to block until a matching receive has been posted. So if we're depending on this to not block, we really shouldn't be using MPI_Send...

Edit: Sean believes that every send should have a matching receive. If that's true, then even MPI_Send should not block. However, I am not sure how to prove this is true---and if it's not, it would be easy to make the send hang.

I'll check on the rest and get back to you.

elliottslaughter commented 2 years ago

For the gasnetex hang: I built with -DDEBUG_SHUTDOWN_HANG (sample compile line [1]) and ran with -level shutdown=2 -logfile shutdown_"$n"x1_r"$r"_%.out plus my usual flags. I see the correct number of main: end of task messages in my output, but the shutdown log files are empty.

$ wc -l neweqcr_dcr_shutdownhang/shutdown_8x1_r0_* | tail -n1
0 total

This is 8 minutes into a run where the main task was done in seconds.

Am I missing something?

[1]: Sample build command:

CC -MMD -o /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion.cc.o -c /autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/legion/legion.cc  -fPIC -DDEBUG_SHUTDOWN_HANG  -march=native -DGASNETI_BUG1389_WORKAROUND=1 -O0 -ggdb  -Wall -std=c++11  -I/autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/bindings/regent -I/autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime -I/autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/runtime/mappers -I/autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/language/gasnet/release/include -I/autofs/nccs-svm1_home1/eslaught/crusher/amd-gpu-test/legion/language/gasnet/release/include/ofi-conduit

elliottslaughter commented 2 years ago

Updates for REALM_NETWORKS=mpi:

@streichler suggested doing a run with two backtraces in a row on the same process, in order to confirm that it's really frozen. Indeed, as best I can tell it appears to be:

http://sapling.stanford.edu/~eslaught/crusher_freeze_debugging/cpu_nohip_mpi_printmore_backtrace_dev995_twice.tar.xz

In rank 0, I see threads 15, 5, 4, 3 in some sort of MPI call. The same threads are stuck in the same MPI calls with the same stack traces exactly 2 minutes later.

@streichler also pointed out that some of the calls are in MPI_Irecv in an earlier version of this backtrace. Here, we see this behavior in rank 0 thread 5. MPI_Irecv should definitely not freeze, so this looks like a smoking gun.

lightsighter commented 2 years ago

I'd just like to point out that the call in backtrace is not an MPI_Isend but an MPI_Send:

That is true, so that should be matching with a receive, but the way the MPI network module works is that it should have a thread running around continually posting MPI_Irecv over and over again until the end of time, so we should never have any trouble matching with a sender. https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/mpi/am_mpi.cc#L195 The result is we should see some threads on a remote node stuck in an MPI_Irecv and not making forward progress.

@streichler also pointed out that some of the calls are in MPI_Irecv in an earlier version of this backtrace. Here, we see this behavior in rank 0 thread 5. MPI_Irecv should definitely not freeze, so this looks like a smoking gun.

Indeed. :) I bet that both GASNet and MPI are making the same assumptions about the behavior of verbs and the network seems to be violating it in some way.

elliottslaughter commented 2 years ago

Just recording updates on a couple of experiments I did for REALM_NETWORKS=mpi.

At @streichler 's suggestion I tried running with the following diff:

diff --git a/runtime/realm/mpi/am_mpi.cc b/runtime/realm/mpi/am_mpi.cc
index 12ec68aa6..88b36fcc9 100644
--- a/runtime/realm/mpi/am_mpi.cc
+++ b/runtime/realm/mpi/am_mpi.cc
@@ -19,7 +19,7 @@
 #define AM_BUF_SIZE_W_HEADER  (AM_BUF_SIZE + AM_MSG_HEADER_SIZE)

 // if defined, replaces use of MPI_Put with two-sided send/recv
-//define REALM_MPI_USE_TWO_SIDED_ONLY
+#define REALM_MPI_USE_TWO_SIDED_ONLY

 static MPI_Win g_am_win = MPI_WIN_NULL;
 static void *g_am_base = NULL;

Unfortunately this does not alter the behavior at all.

I also tried running with:

export FI_MR_CACHE_MAX_COUNT=0
export FI_OFI_RXM_RX_SIZE=16384
export FI_OFI_RXM_TX_SIZE=16384

and:

export FI_MR_CACHE_MAX_COUNT=0
export FI_OFI_RXM_RX_SIZE=32768
export FI_OFI_RXM_TX_SIZE=32768

under the theory that the magnitude of the size parameter might matter. Doesn't do anything either.

elliottslaughter commented 2 years ago

I got an update from OLCF with the following suggestion. With this setting, I am able to run with REALM_NETWORKS=mpi up to 32 nodes without freezing; though 64 nodes still freezes.

export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=memhooks

Note: I do NOT include the previous set of variables when I do this. If I do, it freezes even at only 8 nodes. I am following up to see if there is a setting that would allow 64 nodes to run.

lightsighter commented 2 years ago

Do we know what the default values for some of those settings are? That looks like a monstrously large completion queue size to me. It seems like their starting to poke at the problem, but just throwing more resources at it allows us just to get farther, not actually fix the original problem.

elliottslaughter commented 2 years ago

My understanding is that the default value for FI_CXI_DEFAULT_CQ_SIZE is 32768, so this is 4× that. I'm doing a test at 262144 to see if I can get another factor 2 before I freeze again.

Continuing to scale this won't be an ultimate solution, though if we work around the issue with queue sizes like this it does seem to suggest a lower-level network error.

lightsighter commented 2 years ago

I agree, it's not quite a smoking gun, but it definitely is corroborating evidence that the problem is in the networking stack.

elliottslaughter commented 2 years ago

I tried some higher FI_CXI_DEFAULT_CQ_SIZE values (262144 and 524288), and so far have not observed any change in symptoms. The program still freezes at 64 nodes in all configurations.

I also obtained a new set of flags from OLCF (shown below); but this does not change the symptoms either.

export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_DEFAULT_CQ_SIZE=262144
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_REQ_BUF_MIN_POSTED=10
export FI_CXI_REQ_BUF_SIZE=25165824

lightsighter commented 2 years ago

Do you have any problems if you have just one rank per node? There is some evidence from #1250 that the problem might be with the PSHM which would explain problems in both GASNet and MPI concurrently.

lightsighter commented 2 years ago

Alternatively try GASNet with --disable-pshm.

elliottslaughter commented 2 years ago

There is no GASNet here. This is REALM_NETWORKS=mpi.

lightsighter commented 2 years ago

Right, but GASNet is the only one I know how to build without pshm.

lightsighter commented 2 years ago

Might still be interesting to see if it hangs with one process per physical node.

eddy16112 commented 2 years ago

If REALM_NETWORKS=mpi, can we try different MPIs?

PHHargrove commented 2 years ago

Right, but GASNet is the only one I know how to build without pshm.

If the goal is to run HPE Cray MPI without single-copy intra-node communication via cross-mapping of buffers, then I believe you want to set MPICH_SMP_SINGLE_COPY_MODE=NONE.

If REALM_NETWORKS=mpi, can we try different MPIs?

I would be very surprised if the Open MPI team didn't have something installed on Crusher, but you'd need to contact them for help in locating it.

elliottslaughter commented 2 years ago

I tried the following sets of configurations, but none of them change the observed symptoms in a meaningful way:

Configuration E:

export MPICH_SMP_SINGLE_COPY_MODE=NONE

Configuration F:

export MPICH_SMP_SINGLE_COPY_MODE=CMA

Configuration G:

export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=memhooks

export MPICH_SMP_SINGLE_COPY_MODE=NONE

Configuration H:

export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=memhooks

export MPICH_SMP_SINGLE_COPY_MODE=CMA

And the results:

Configuration E: froze at 16 nodes and above
Configuration F: froze at 16 nodes and above
Configuration G: froze at 64 nodes
Configuration H: froze at 64 nodes

elliottslaughter commented 2 years ago

I was able to run successfully on Crusher on up to 128 nodes with both REALM_NETWORKS=mpi and REALM_NETWORKS=gasnetex with the following configuration variable (and no other relevant variables):

export FI_CXI_DEFAULT_CQ_SIZE=1310720 # default value: 131072

This is a variable read by libfabric which controls the queue size for the Slingshot 11 provider, which seems to confirm pretty strongly that this is a libfabric or network issue.

HPE is still investigating a root cause so that we can get a proper solution.

Unfortunately, this workaround does not appear to resolve #1250, so that may either be a separate issue.

elliottslaughter commented 7 months ago

We got the fix from HPE on this last year.

StanfordLegion / legion

Freeze on Crusher on 16 nodes #1270