StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Crash in overdecomposition use case with tracing, without DCR or index launches #1085

Closed by elliottslaughter 3 years ago

elliottslaughter commented 3 years ago

This is in a branch based on control_replication commit e7e51ed1e22f51c8c616f926eefbebd17ddabcf7. I'm not 100% sure which variables matter, but here's what I know so far:

The relevant part of the backtrace looks like:

[2] Thread 8 (Thread 0x2aaaaac3c800 (LWP 29038) "circuit.noidx"):
[2] #0  0x00002aaaaf349217 in waitpid () from /lib64/libc.so.6
[2] #1  0x00002aaaaf2c676f in do_system () from /lib64/libc.so.6
[2] #2  0x00002aaaaccdd4bb in gasneti_bt_gdb () from /users/eslaught/regent-index-launch-sc21/language/circuit.run2_overdecompose_debug_2/libregent.so
[2] #3  0x00002aaaacce0dca in gasneti_print_backtrace () from /users/eslaught/regent-index-launch-sc21/language/circuit.run2_overdecompose_debug_2/libregent.so
[2] #4  0x00002aaaabf92979 in gasneti_defaultSignalHandler () from /users/eslaught/regent-index-launch-sc21/language/circuit.run2_overdecompose_debug_2/libregent.so
[2] #5  <signal handler called>
[2] #6  0x00002aaaac1d3645 in Legion::Internal::RegionTreePath::register_child (this=this@entry=0x2aab4024b4c8, depth=4294967295, color=33) at /users/eslaught/regent-index-launch-sc21/runtime/legion/legion_analysis.cc:19418
[2] #7  0x00002aaaac28b038 in Legion::Internal::RegionTreeForest::initialize_path (this=<optimized out>, child=0x2aab3cc88640, parent=0x2aab3cc81b80, path=...) at /users/eslaught/regent-index-launch-sc21/runtime/legion/region_tree.cc:5634
[2] #8  0x00002aaaac055a11 in Legion::Internal::VirtualCloseOp::initialize (this=this@entry=0x2aab4024b140, ctx=ctx@entry=0x2aab34a5b150, index=index@entry=1, req=..., target=0x2aab3440b560) at /users/eslaught/regent-index-launch-sc21/runtime/legion/legion_ops.cc:10212
[2] #9  0x00002aaaac11db5e in Legion::Internal::InnerContext::end_task (this=0x2aab34a5b150, res=<optimized out>, res_size=<optimized out>, owned=<optimized out>, deferred_result_instance=..., callback_functor=<optimized out>, result_kind=<optimized out>, freefunc=<optimized out>, metadataptr=<optimized out>, metadatasize=<optimized out>) at /users/eslaught/regent-index-launch-sc21/runtime/legion/legion_context.cc:10440
[2] #10 0x00002aaaac003884 in legion_task_postamble (runtime_=..., ctx_=..., retval=0x0, retsize=0) at /users/eslaught/regent-index-launch-sc21/runtime/legion/legion_c.cc:7727
[2] #11 0x000000000040d5ad in $__regent_task_init_piece_primary ()
[2] #12 0x00002aaaac820e51 in Realm::LocalTaskProcessor::execute_task (this=0x1613000, func_id=27, task_args=...) at /users/eslaught/regent-index-launch-sc21/runtime/realm/bytearray.inl:58
[2] #13 0x00002aaaac7165c3 in Realm::Task::execute_on_processor (this=0x2aab280ba950, p=...) at /users/eslaught/regent-index-launch-sc21/runtime/realm/tasks.cc:306
[2] #14 0x00002aaaac716656 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /users/eslaught/regent-index-launch-sc21/runtime/realm/tasks.cc:1646
[2] #15 0x00002aaaac718cc9 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x10dc890) at /users/eslaught/regent-index-launch-sc21/runtime/realm/tasks.cc:1127
[2] #16 0x00002aaaac6feb3f in Realm::UserThread::uthread_entry () at /users/eslaught/regent-index-launch-sc21/runtime/realm/threads.cc:1337
[2] #17 0x00002aaaaf2ceca0 in ?? () from /lib64/libc.so.6
[2] #18 0x0000000000000000 in ?? ()
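
One detail in frame #6 may be worth noting: `depth=4294967295` is the largest value a 32-bit unsigned integer can hold, which is the same bit pattern as -1 in a signed 32-bit integer. The sketch below only checks that arithmetic. It assumes the shell uses 64-bit integer arithmetic, and it does not establish where the value comes from or whether it causes the crash.

```shell
# Value of `depth` copied from frame #6 of the backtrace above.
depth=4294967295

# 1 if depth equals 2^32 - 1 (the maximum 32-bit unsigned value), else 0.
is_uint32_max=$(( depth == (1 << 32) - 1 ))

# The same bit pattern read as a signed 32-bit integer.
as_int32=$(( depth - (1 << 32) ))

echo "$is_uint32_max $as_int32"
```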

Reproduction Steps on Sapling

Copy my files:

cp -r /scratch2/eslaught/regent-index-launch-sc21 .
cd regent-index-launch-sc21
git stash
git reset --hard ec153e7d6da2cdeed0da69f5e8bb4bbe7b09e23b
git stash pop

Build:

salloc -N 1 -n 1 -p gpu --exclusive
srun --pty bash --login
./sc21_scripts/build_circuit.sh circuit.run_dir

Run:

cd circuit.run_dir
salloc -N 4 -n 4 -p gpu --exclusive
mpirun -n 4 -npernode 1 -bind-to none -x LD_LIBRARY_PATH=$PWD -x REALM_BACKTRACE=1 ./circuit.noidx -npp 500 -wpp 2000 -l 50 -p $(( 4 * 10 )) -pps 1 -prune 10 -hl:sched 1024 -ll:gpu 1 -ll:util 2 -ll:bgwork 2 -ll:csize 15000 -ll:fsize 15000 -ll:zsize 2048 -ll:rsize 512 -ll:gsize 0 -dm:replicate 0 -dm:memoize -lg:no_fence_elision -lg:parallel_replay 2

Note that this command reproduces the crash only about 50% of the time.
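
Because the failure is intermittent, it may help to rerun the command in a loop until it fails. The following is a minimal sketch, not part of the original report: `retry_until_fail` is a hypothetical helper name, and it assumes the crash makes the launched command exit with a non-zero status.

```shell
# Hypothetical helper: rerun a command until it exits non-zero,
# or give up after a maximum number of attempts.
# Usage: retry_until_fail MAX_ATTEMPTS COMMAND [ARGS...]
# Returns 0 if a failure was observed, 1 if every attempt succeeded.
retry_until_fail() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    echo "attempt $i of $max" >&2
    if ! "$@"; then
      echo "command failed on attempt $i" >&2
      return 0
    fi
    i=$((i + 1))
  done
  echo "no failure in $max attempts" >&2
  return 1
}
```

To use it, prefix the mpirun command above with `retry_until_fail 10` (the attempt count of 10 is arbitrary).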

lightsighter commented 3 years ago

@elliottslaughter Pull the latest control_replication branch and see if you can still reproduce this.

elliottslaughter commented 3 years ago

Fixed. Thanks!