StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

legion: hang on ~8 nodes of lassen #1145

Open rohany opened 3 years ago

rohany commented 3 years ago

We can change the title once we have more information about what's going on. I see the following hang on several applications -- here are stacks from some of the nodes:

Dump 1: https://pastebin.com/4TaKAz0w
Dump 2: https://pastebin.com/V0ckQEh0
Dump 3: https://pastebin.com/rbLtUF9E

The thing that stands out in each is that they are all waiting on a collective from a future map:

Thread 11 (Thread 0x20009ea8f8b0 (LWP 178175)):
#0  0x00002000000fe92c in __pthread_cond_wait (cond=0x20009ea889b0, mutex=0x191e9498) at pthread_cond_wait.c:153
#1  0x00000000127f9f8c in Realm::CondVar::wait (this=0x20009ea889a8) at /g/g15/yadav2/taco/legion/legion/runtime/realm/mutex.cc:231
#2  0x00000000127761cc in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x191e9490, switch_to=0x2034cc0015b0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1428
#3  0x0000000012773ddc in Realm::ThreadedTaskScheduler::thread_blocking (this=0x191e9490, thread=0x191f5cf0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:941
#4  0x000000001259d730 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x20009ea8a198: false)
    at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:218
#5  0x0000000012587ea4 in Realm::Event::wait_faultaware (this=0x20009ea8a4d8, poisoned=@0x20009ea8a198: false) at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:266
#6  0x00000000125879a4 in Realm::Event::wait (this=0x20009ea8a4d8) at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:214
#7  0x0000000011282a20 in Legion::Internal::LgEvent::wait (this=0x20009ea8a4d8) at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_types.h:2741
#8  0x00000000121978a0 in Legion::Internal::AllGatherCollective<false>::perform_collective_wait (this=0x20009ea8a390, block=true)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_replication.cc:11003
#9  0x0000000012176a04 in Legion::Internal::ShardCollective::perform_collective_sync (this=0x20009ea8a390, precondition=...)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_replication.cc:10567
Python Exception <class 'gdb.error'> There is no member or method named _M_value_field.:
#10 0x000000001217fa20 in Legion::Internal::FutureNameExchange::exchange_future_names (this=0x20009ea8a390, futures=std::map with 2 elements)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_replication.cc:13122
#11 0x00000000117c2718 in Legion::Internal::ReplFutureMapImpl::get_all_futures (this=0x2034cc03da50, others=std::map with 0 elements)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:4035
#12 0x00000000117c2978 in Legion::Internal::ReplFutureMapImpl::wait_all_results (this=0x2034cc03da50, silence_warnings=false, warning_string=0x0)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:4062
#13 0x0000000011267d9c in Legion::FutureMap::wait_all_results (this=0x20009ea8bb18, silence_warnings=false, warning_string=0x0)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion.cc:2543
#14 0x000000001113ba2c in placeLegionA (ctx=0x2034bc078220, runtime=0x1903ffc0, A=..., rpoc=<optimized out>, c=<optimized out>)
    at /g/g15/yadav2/taco/legion/solomonikMM/taco-generated.cu:119
#15 0x00000000111191ac in operator() (__closure=0x2034cc028930) at /g/g15/yadav2/taco/legion/solomonikMM/main.cpp:89
#16 std::_Function_handler<void(), top_level_task(const Legion::Task*, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:297
#17 0x0000000011128ed8 in operator() (this=0x20009ea8cdf0) at /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:260
#18 benchmark(Legion::Internal::TaskContext*, Legion::Runtime*, std::vector<unsigned long, std::allocator<unsigned long> >&, std::function<void ()>) (ctx=0x2034bc078220,
    runtime=0x1903ffc0, times=std::vector of length 0, capacity 0, f=...) at /g/g15/yadav2/taco/legion/src/legion_utils.cpp:35
#19 0x000000001111d530 in top_level_task (ctx=<optimized out>, runtime=<optimized out>, regions=..., task=<optimized out>)
    at /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:87
#20 0x000000001111dd5c in Legion::LegionTaskWrapper::legion_task_wrapper<&(top_level_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*))> (args=<optimized out>, arglen=<optimized out>, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /g/g15/yadav2/taco/legion/legion/cmake-install/include/legion/legion.inl:22539
#21 0x00000000126dc9b8 in Realm::LocalTaskProcessor::execute_task (this=0x193a4160, func_id=16, task_args=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/proc_impl.cc:1150
#22 0x0000000012770e7c in Realm::Task::execute_on_processor (this=0x2034bc07f260, p=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:306
#23 0x0000000012775eec in Realm::KernelThreadTaskScheduler::execute_task (this=0x191e9490, task=0x2034bc07f260) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1380
#24 0x0000000012fe2940 in Realm::OpenMPTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (this=0x191e9490, task=0x2034bc07f260)
    at /g/g15/yadav2/taco/legion/legion/runtime/realm/openmp/openmp_module.cc:125
#25 0x0000000012774aac in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x191e9490) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1127
#26 0x000000001277500c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x191e9490) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1231
#27 0x000000001277ffd0 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x191e9490)
    at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:97
#28 0x00000000127930ec in Realm::KernelThread::pthread_entry (data=0x191f5cf0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:774
#29 0x00002000000f8cd4 in start_thread (arg=0x20009ea8f8b0) at pthread_create.c:309
#30 0x0000200006117e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

I haven't had any luck reproducing this on Sapling, but I'm happy to gdb into a stuck process and give you the necessary information.

Based on looking into this with Sean, we believe the bug is a Legion problem, as Realm shows that each node is waiting on an event that hasn't been triggered yet.

lightsighter commented 3 years ago

Run with -lg:safe_ctrlrepl 1 -lg:inorder and report back what happens. If it still hangs then make me a reproducer on sapling.
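
For reference, a minimal sketch of how those flags might be appended to an existing launch line (the launcher and application names here are placeholders, not from this issue; only the two Legion flags are the actual suggestion):

jsrun <launcher-args> ./<application> <application-args> -lg:safe_ctrlrepl 1 -lg:inorder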

rohany commented 3 years ago

It appears not to be stuck with those arguments (though I imagine they have performance impacts?).

If it still hangs then make me a reproducer on sapling.

As I mentioned in the issue, I haven't been able to reproduce the bug on sapling.

rohany commented 3 years ago

I take that back -- it appears to get through all of my program and then gets stuck on shutdown with -lg:safe_ctrlrepl 1 -lg:inorder. There doesn't appear to be anything interesting in the stacks of the nodes I randomly sampled, either. This might be an unrelated problem.

lightsighter commented 3 years ago

I added a missing check related to this to the safe control replication checks. Pull, rebuild, and try again with the checks enabled.

If they still pass then attach a debugger to each of the shards and print out the value of collective_index on this line of frame 11 in the interesting threads: https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/runtime.cc?expanded=true#L4040
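
A hedged sketch of one way to do that from the command line (the PID and thread number are placeholders; the frame number and variable name come from the instruction above):

gdb -p <shard-pid> -batch -ex "thread <stuck-thread-number>" -ex "frame 11" -ex "print collective_index"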

rohany commented 3 years ago

The new checks didn't fire and all nodes have the same collective_index = 1140. I think I found some interesting details though:

While all of the other nodes are doing a collective sync, this one is just waiting on an event, which is probably the deadlock.

rohany commented 3 years ago

Some information from this stack:

(gdb) f 9
#9  0x000000001064c964 in Legion::Internal::ReplFutureMapImpl::wait_all_results (this=0x2034d0075d40, silence_warnings=<optimized out>, warning_string=<optimized out>)
    at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:4061
4061          get_all_futures(dummy_others);
(gdb) info locals
dummy_others = std::map with 0 elements
(gdb) f 8
(gdb) info locals
mutator = {<Legion::Internal::ReferenceMutator> = {_vptr.ReferenceMutator = 0x0}, mutation_effects = <error reading variable: Cannot access memory at address 0x4b000000a3>}
f_lock = {local_lock = @0x200000000000, previous = 0x0, exclusive = false, held = 165}
lightsighter commented 3 years ago

And that's why you need to look at the stack traces on all the nodes and not just some of them. This backtrace all but guarantees that one or more tasks didn't run on this shard for this particular index space launch.

Run again with -ll:defalloc 0.

rohany commented 3 years ago

And that's why you need to look at the stack traces on all the nodes and not just some of them.

Is there a more efficient way of doing this than ssh-ing into each node and attaching to the process with gdb? This is sort of annoying at 8 nodes, and prohibitive at 32 nodes (where I see a similar hang but on a different application).

This backtrace all but guarantees that one or more tasks didn't run on this shard for this particular index space launch.

That's true for this application, but isn't true for another application that I think is hanging the same way.

Run again with -ll:defalloc 0.

This causes my application to OOM (failed instance allocations in the mapper). When my application doesn't hang, it runs to completion without OOMs (when I don't use -ll:defalloc 0).

manopapad commented 3 years ago

And that's why you need to look at the stack traces on all the nodes and not just some of them.

Is there a more efficient way of doing this than ssh-ing into each node and attaching to the process with gdb? This is sort of annoying at 8 nodes, and prohibitive at 32 nodes (where I see a similar hang but on a different application).

Try something like:

for JOB_COMPUTE_HOST in ...; do
  ssh $JOB_COMPUTE_HOST <legion-dir>/tools/print_backtraces.sh <executable-name> > $JOB_COMPUTE_HOST.txt
done
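
On an LSF system such as lassen, the host list could presumably come from $LSB_HOSTS (as in the batch-script variant later in this thread).
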
lightsighter commented 3 years ago

@manopapad said exactly what I was going to say for capturing the backtraces. There are very few things that we'll ask you to do that don't already have a tool.

That's true for this application, but isn't true for another application that I think is hanging the same way.

Then you'll need to find a reproducer for that code that hangs with -ll:defalloc 0.

This causes my application to OOM (failed instance allocations in the mapper). When my application doesn't hang, it runs to completion without OOMs (when I don't use -ll:defalloc 0).

This is a known problem with how deferred buffers are allocated today: they don't guarantee deadlock freedom with deferred allocation. There is a pending but incomplete fix in the newinsts branch. The group has instructed me to put all my Legion development time towards collective instances and control replication for the time being. If you want to change those priorities, you'll need to bring it up in a Legion meeting.

rohany commented 3 years ago

I can try to get a reproducer with -ll:defalloc 0, but my code doesn't use any deferred buffers.

rohany commented 3 years ago

I have a reproducer on sapling that you can use now. Go to /home/rohany/taco/build and run mpirun -H g0002:4,g0003:4,g0004:4 --bind-to none -npernode 4 ./runner.sh bin/solomonikMM-cuda -n 35000 -rpoc 4 -c 2 -rpoc3 4 -tm:untrack_valid_regions -ll:ocpu 1 -ll:othr 1 -ll:csize 3000 -ll:util 4 -dm:replicate 1 -ll:gpu 1 -ll:fsize 15000 -ll:bgwork 12 -ll:bgnumapin 1 -ll:force_kthreads. You should see a print every ~10 seconds. If the prints stop showing up then it's hanging.

lightsighter commented 3 years ago

There are forms of deferred allocation other than deferred buffers/values. If you want to convince me that there is a hang that doesn't involve deferred allocation, then it needs to reproduce with the -ll:defalloc 0 flag, and this reproducer can't do that. Alternatively, you can find a cycle in the Realm event graph that doesn't involve deferred allocation, but I suspect that such a cycle does not exist.

lightsighter commented 3 years ago

@streichler also said that he was planning on adding support for detecting resource cycles to Realm's hang detection mechanism. Getting Realm's hang detection to tell us (together with Legion Spy logging) what the event cycle is would be instructive, but I'm 99% positive that it's going to involve deferred allocation.

rohany commented 3 years ago

I haven't been able to get a reproduction with -ll:defalloc 0 -- everything either succeeds at smaller problem sizes or OOMs at problem sizes close to my target. What are my options here?

I need these applications to complete so that I have a full set of experiments for the PLDI deadline in mid-November. If work on newinsts couldn't be completed by then even with priority, then I'm a bit stuck on what to do.

lightsighter commented 3 years ago

It's possible that there is a faster solution to the problem, but we can't know that until we've characterized the exact nature of the deferred allocations that are causing it. Without that we can't say which approach is best. I would try to modify the Realm cycle detection mechanism to add edges through deferred allocations so that it can "see" those kinds of cycles too.

rohany commented 3 years ago

I would try to modify the Realm cycle detection mechanism to add edges through deferred allocations so that it can "see" those kinds of cycles too.

I can give this a shot if you give me some pointers @streichler, or, if you think it's more efficient to just do it yourself, I can wait on it.

streichler commented 3 years ago

@lightsighter it will not be helpful (for this issue) to teach Realm to find cycles through deferred allocations. Although there are deferred instance destructions pending in every one of the event waiter dumps @rohany has captured, in all cases the destructions are waiting on user events that Legion has created but not yet triggered (i.e. neither deferred nor immediate), so there is no loop involving deferred allocations in the Realm operation graph itself.

lightsighter commented 3 years ago

If figuring out the source of the deferred allocation hangs comes back to me, it's going to wait until after GTC, Supercomputing, and the PLDI deadline have all passed.

rohany commented 3 years ago

I was able to mostly work around this by backpressuring my task's mapping so that the working set never exceeded the amount of memory on the GPU. I ran into the unable to find distributed collectable bug again at 128 nodes, but I was able to run my app for fewer iterations to avoid hitting it.

lightsighter commented 3 years ago

Run with -ll:force_kthreads and capture backtraces on every single node when you crash with the unable to find distributed collectable bug. Report the full error message for the crash along with the log files.

rohany commented 3 years ago

I can do that. Is there a way that I can get backtraces sent to a log file "automatically"? 128-node jobs on lassen take many hours to come back, often running overnight, and I can't guarantee that I'll be around to manually run the script and gdb into each node to collect the backtraces.

rohany commented 3 years ago

along with the log files.

Is there a particular logger that you want me to turn on?

manopapad commented 3 years ago

Is there a way that I can get backtraces sent to a log file "automatically"?

Would something like this work?

# launch the application in the background
jsrun <appname> &
# sleep until shortly before the job's time limit
sleep <job-timelimit - 1h>
# dump backtraces from every compute host into a per-host file
for HOST in $LSB_HOSTS; do
  ssh $HOST <legion-dir>/tools/print_backtraces.sh <appname> > $HOST.txt
done
# wait for the backgrounded application to exit
wait
rohany commented 3 years ago

That's a good idea. I'll try that.

rohany commented 3 years ago

@lightsighter, I've collected the logs for you and placed them here: /home/rohany/dist-collectible-logs.tgz on sapling. The archive contains the actual output from the run in log.txt, plus a backtrace for each of the 128 nodes in its own hostname.txt file.

streichler commented 2 years ago

@rohany Did we sort this one out?

rohany commented 2 years ago

There were a few points in the issue I think: