rohany opened this issue 3 years ago
Run with -lg:safe_ctrlrepl 1 -lg:inorder and report back what happens. If it still hangs then make me a reproducer on sapling.
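As a reference, a minimal sketch of such a launch; the launcher, host list, and binary name here are placeholders rather than anything taken from this issue:
# Hypothetical launch: the Legion flags go on the application's command line,
# after the binary, alongside its own arguments.
mpirun -npernode 1 -H <hosts> ./<appname> <app-args> -lg:safe_ctrlrepl 1 -lg:inorder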
It appears to not be stuck with those arguments (though I imagine they have performance impacts?).
If it still hangs then make me a reproducer on sapling.
As I mentioned in the issue, I haven't been able to reproduce the bug on sapling.
I take that back -- it appears to get through all of my program and then is stuck on shutdown with -lg:safe_ctrlrepl 1 -lg:inorder. There doesn't appear to be anything interesting in any of the stacks of the nodes I randomly sampled either. This might be an unrelated problem.
I added a missing check to the safe control replication checks related to this. Pull, rebuild, and try again with the safe control replication checks enabled.
If they still pass, then attach a debugger to each of the shards and print out the value of collective_index on this line of frame 11 in the interesting threads:
https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/runtime.cc?expanded=true#L4040
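In case it helps, a minimal sketch of doing that with gdb in batch mode on one node; the PID and thread number are placeholders, and only the frame number comes from the instruction above:
# Attach to the shard's process; <pid> and the interesting thread's number <N>
# are placeholders, the frame number is the one requested above.
gdb -p <pid> -batch \
    -ex 'thread <N>' \
    -ex 'frame 11' \
    -ex 'print collective_index'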
The new checks didn't fire, and all nodes have the same collective_index = 1140. I think I found some interesting details though: the program gets through with -lg:inorder and hangs without it. Here is a backtrace from one of the nodes:
#0 0x00002000000fe92c in __pthread_cond_wait (cond=0x20009efba1a8, mutex=0x170197d8) at pthread_cond_wait.c:153
#1 0x0000000010e482c0 in Realm::CondVar::wait (this=<error reading variable: value has been optimized out>) at /g/g15/yadav2/taco/legion/legion/runtime/realm/mutex.cc:231
#2 0x0000000010dfc0b8 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x170197d0, switch_to=0x2034d0001dd0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1428
#3 0x0000000010dfd758 in Realm::ThreadedTaskScheduler::thread_blocking (this=0x170197d0, thread=<optimized out>) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:941
#4 0x0000000010d32874 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x20009efbaad8: false)
at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:168
#5 0x0000000010d20b0c in Realm::Event::wait_faultaware (this=0x2034d0075fe0, poisoned=@0x20009efbaad8: false) at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:266
#6 0x0000000010d20d0c in Realm::Event::wait (this=<optimized out>) at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:214
#7 0x000000001047d4e4 in Legion::Internal::LgEvent::wait (this=<optimized out>) at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_types.h:2757
#8 0x00000000106a40cc in Legion::Internal::ReplFutureMapImpl::get_all_futures (this=0x2034d0075d40, others=std::map with 0 elements)
at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:4025
#9 0x000000001064c964 in Legion::Internal::ReplFutureMapImpl::wait_all_results (this=0x2034d0075d40, silence_warnings=<optimized out>, warning_string=<optimized out>)
at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:4061
#10 0x00000000103d3bd8 in Legion::FutureMap::wait_all_results (this=<optimized out>, silence_warnings=<optimized out>, warning_string=<optimized out>)
at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion.cc:2565
#11 0x0000000010367600 in placeLegionA (ctx=0x2034c4000900, runtime=0x16fb3eb0, A=..., rpoc=<optimized out>, c=<optimized out>)
at /g/g15/yadav2/taco/legion/solomonikMM/taco-generated.cu:119
#12 0x000000001033fdcc in operator() (__closure=0x2034d006b550) at /g/g15/yadav2/taco/legion/solomonikMM/main.cpp:89
#13 std::_Function_handler<void(), top_level_task(const Legion::Task*, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:297
#14 0x00000000103509f8 in operator() (this=0x20009efbd5d0) at /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:260
#15 benchmark(Legion::Internal::TaskContext*, Legion::Runtime*, std::vector<unsigned long, std::allocator<unsigned long> >&, std::function<void ()>) (ctx=0x2034c4000900,
runtime=0x16fb3eb0, times=std::vector of length 5, capacity 8 = {...}, f=...) at /g/g15/yadav2/taco/legion/src/legion_utils.cpp:35
#16 0x0000000010344180 in top_level_task (ctx=<optimized out>, runtime=<optimized out>, regions=..., task=<optimized out>)
at /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:87
#17 0x00000000103449ac in Legion::LegionTaskWrapper::legion_task_wrapper<&(top_level_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*))> (args=<optimized out>, arglen=<optimized out>, userdata=<optimized out>, userlen=<optimized out>, p=...)
at /g/g15/yadav2/taco/legion/legion/cmake-install/include/legion/legion.inl:22541
#18 0x0000000010dadb3c in Realm::LocalTaskProcessor::execute_task (this=0x171d7e10, func_id=<optimized out>, task_args=...)
at /g/g15/yadav2/taco/legion/legion/runtime/realm/bytearray.inl:150
#19 0x0000000010dfc444 in Realm::Task::execute_on_processor (this=0x2034c40224d0, p=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/runtime_impl.h:378
#20 0x0000000010dfc5a4 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1380
#21 0x000000001138d4fc in Realm::OpenMPTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (this=0x170197d0, task=0x2034c40224d0)
at /g/g15/yadav2/taco/legion/legion/runtime/realm/openmp/openmp_module.cc:122
#22 0x0000000010dff370 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x170197d0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1125
#23 0x0000000010dff928 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x170197d0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1231
#24 0x0000000010e02b34 in Realm::KernelThread::pthread_entry (data=0x17026030) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:774
#25 0x00002000000f8cd4 in start_thread (arg=0x20009efbf8b0) at pthread_create.c:309
#26 0x0000200006627e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
(gdb)
While all of the other nodes are doing a collective sync, this one is just waiting on an event, which is probably where the deadlock is.
Some information from this stack:
(gdb) f 9
#9 0x000000001064c964 in Legion::Internal::ReplFutureMapImpl::wait_all_results (this=0x2034d0075d40, silence_warnings=<optimized out>, warning_string=<optimized out>)
at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:4061
4061 get_all_futures(dummy_others);
(gdb) info locals
dummy_others = std::map with 0 elements
(gdb) f 8
(gdb) info locals
mutator = {<Legion::Internal::ReferenceMutator> = {_vptr.ReferenceMutator = 0x0}, mutation_effects = <error reading variable: Cannot access memory at address 0x4b000000a3>}
f_lock = {local_lock = @0x200000000000, previous = 0x0, exclusive = false, held = 165}
And that's why you need to look at the stack traces on all the nodes and not just some of them. This backtrace all but guarantees that one or more tasks didn't run on this shard for this particular index space launch.
Run again with -ll:defalloc 0.
And that's why you need to look at the stack traces on all the nodes and not just some of them.
Is there a more efficient way of doing this than ssh-ing into each node and attaching to the process with gdb? This is sort of annoying at 8 nodes, and prohibitive at 32 nodes (where I see a similar hang but on a different application).
This backtrace all but guarantees that one or more tasks didn't run on this shard for this particular index space launch.
That's true for this application, but isn't true for another application that I think is hanging the same way.
Run again with -ll:defalloc 0.
This causes my application to OOM (failed instance allocations in the mapper). When my application doesn't hang, it runs to completion without OOMs (when I don't use -ll:defalloc 0).
And that's why you need to look at the stack traces on all the nodes and not just some of them.
Is there a more efficient way of doing this than ssh-ing into each node and attaching to the process with gdb? This is sort of annoying at 8 nodes, and prohibitive at 32 nodes (where I see a similar hang but on a different application).
Try something like:
for JOB_COMPUTE_HOST in ...; do
  ssh $JOB_COMPUTE_HOST <legion-dir>/tools/print_backtraces.sh <executable-name> > $JOB_COMPUTE_HOST.txt
done
@manopapad said exactly what I was going to say for capturing the backtraces. There are very few things that we'll ask you to do that don't already have a tool.
That's true for this application, but isn't true for another application that I think is hanging the same way.
Then you'll need to find a reproducer for that code that hangs with -ll:defalloc 0.
This causes my application to OOM (failed instance allocations in the mapper). When my application doesn't hang, it runs to completion without OOMs (when I don't use -ll:defalloc 0).
This is a known problem with how deferred buffers are allocated today: they don't guarantee deadlock freedom with deferred allocation. There is a pending but incomplete fix in the newinsts branch. The group has instructed me to put all my Legion development time towards collective instances and control replication for the time being. If you want to change those priorities then you'll need to bring it up in a Legion meeting.
I can try to get a reproducer with -ll:defalloc 0, but my code doesn't use any deferred buffers.
I have a reproducer on sapling that you can use now. Go to /home/rohany/taco/build and run mpirun -H g0002:4,g0003:4,g0004:4 --bind-to none -npernode 4 ./runner.sh bin/solomonikMM-cuda -n 35000 -rpoc 4 -c 2 -rpoc3 4 -tm:untrack_valid_regions -ll:ocpu 1 -ll:othr 1 -ll:csize 3000 -ll:util 4 -dm:replicate 1 -ll:gpu 1 -ll:fsize 15000 -ll:bgwork 12 -ll:bgnumapin 1 -ll:force_kthreads. You should see a print every ~10 seconds. If the prints stop showing up then it's hanging.
There are other forms of deferred allocation besides deferred buffers/values. If you want to convince me that there is a hang that doesn't involve deferred allocation, then it needs to reproduce with the -ll:defalloc 0 flag, and this reproducer can't do that. Alternatively, you can find a cycle in the Realm event graph that doesn't involve deferred allocation, but I suspect that such a cycle does not exist.
@streichler also said that he was planning on adding support for detecting resource cycles to Realm's hang detection mechanism, so getting Realm's hang detection to tell us (with Legion Spy logging) what the event cycle is involving deferred allocation would be instructive, but I'm 99% positive that it's going to involve deferred allocation.
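For concreteness, a sketch of what attempting that with the reproducer above would look like; appending -ll:defalloc 0 at the end is just an assumed placement:
# Same sapling reproducer as above, with deferred allocation disabled.
mpirun -H g0002:4,g0003:4,g0004:4 --bind-to none -npernode 4 ./runner.sh bin/solomonikMM-cuda \
  -n 35000 -rpoc 4 -c 2 -rpoc3 4 -tm:untrack_valid_regions -ll:ocpu 1 -ll:othr 1 \
  -ll:csize 3000 -ll:util 4 -dm:replicate 1 -ll:gpu 1 -ll:fsize 15000 -ll:bgwork 12 \
  -ll:bgnumapin 1 -ll:force_kthreads -ll:defalloc 0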
I'm not successful in getting a reproduction with -ll:defalloc 0 -- everything either succeeds on a smaller problem size or OOMs on problem sizes close to my target. What are my options here?
newinsts branch
I need these applications to complete so that I have a full set of experiments for the PLDI deadline in mid-November. If work on newinsts wouldn't complete by then even with priority, then I'm a bit stuck for what to do.
It's possible that there is a faster solution to the problem, but we can't know that until we've characterized the exact nature of the deferred allocations that are causing the problem. Without that we can't say which approach is the best. I would try to modify the Realm cycle detection mechanism to add edges through deferred allocations so that it can "see" those kinds of cycles too.
I would try to modify the Realm cycle detection mechanism to add edges through deferred allocations so that it can "see" those kinds of cycles too.
I can give this a shot if you give me some pointers @streichler, or if you think it's more efficient for you to just do it, I can wait on it.
@lightsighter it will not be helpful (for this issue) to teach realm to find cycles through deferred allocations. Although there are deferred instance destructions pending in every one of the event waiter dumps @rohany has captured, in all cases the destructions are waiting on user events that have been created by legion but not triggered (i.e. neither deferred nor immediate), so there is no loop involving deferred allocations in the realm operation graph itself.
If figuring out the source of the deferred allocation hangs comes back to me, it's going to wait until after GTC, Supercomputing, and the PLDI deadline have all passed.
I was able to mostly work around this by backpressuring my task's mapping so that the working set never exceeded the amount of memory in the GPU. I ran into the unable to find distributed collectable bug again at 128 nodes, but I was able to run my app for fewer iterations to not hit it.
Run with -ll:force_kthreads and capture backtraces on every single node when you crash with the unable to find distributed collectable bug. Report the full error message for the crash along with the log files.
I can do that. Is there a way that I can get backtraces sent to a log file "automatically"? 128 node jobs on lassen take many hours to come back, often executing overnight. I can't guarantee that I'll be around to be able to manually run the script to gdb into each node to collect the backtraces.
along with the log files.
Is there a particular logger that you want me to turn on?
Is there a way that I can get backtraces sent to a log file "automatically"?
Would something like this work?
jsrun <appname> &
sleep <job-timelimit - 1h>
for HOST in $LSB_HOSTS; do
  ssh $HOST <legion-dir>/tools/print_backtraces.sh <appname> > $HOST.txt
done
wait
That's a good idea. I'll try that.
@lightsighter, I've collected the logs for you and placed them here: /home/rohany/dist-collectible-logs.tgz on sapling. The logs contain the actual output from the run in log.txt, and then a backtrace for each of the 128 nodes given by hostname.txt.
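In case it's useful, a hedged sketch of a first pass over those files once the archive is unpacked; the strings being grepped for are guesses at what to look for, not guaranteed matches:
# Where the crash message shows up in the run output, and which hosts'
# backtraces mention distributed collectables at all.
grep -in "unable to find distributed collectable" log.txt
grep -il "DistributedCollectable" *.txt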
@rohany Did we sort this one out?
There were a few points in the issue I think:
We can change the title once we have more information about what's going on. I see the following hang on several applications -- here are stacks from some of the nodes:
The thing that stands out in each is that they are all waiting on a collective from a future map:
I haven't had any luck reproducing this on Sapling, but I'm happy to gdb into a stuck process and give you the necessary information.
Based on looking into this with Sean, we believe that the bug is a Legion problem, as Realm is showing that each node is waiting on an event that hasn't been triggered yet.