I'm seeing crashes in my multi-node tests for resilient applications on fairly recent versions of master (regent-resilience based on 4284cf6dc654fa2d42faf71a4f3d60c517c34277).
Backtrace:
#5 0x00007fbc4d80488e in std::min<unsigned long> (__a=<error reading variable>, __b=@0x7fbbc8014f70: 2204)
at /usr/include/c++/9/bits/stl_algobase.h:203
#6 0x00007fbc4e0c074a in Legion::Internal::FutureInstance::copy_from (this=0x0, source=0x7fbbc8014f70, op=0x7fbbc80109c0,
precondition=...) at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:3370
#7 0x00007fbc4e0bc541 in Legion::Internal::FutureImpl::unpack_future_result (this=0x7fbbc8011240, derez=...)
at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:2394
#8 0x00007fbc4e0bef2f in Legion::Internal::FutureImpl::handle_future_result (derez=..., runtime=0x56392fa71130)
at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:3034
#9 0x00007fbc4e11b209 in Legion::Internal::Runtime::handle_future_result (this=0x56392fa71130, derez=...)
at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:25278
#10 0x00007fbc4e0e806f in Legion::Internal::VirtualChannel::handle_messages (this=0x7fbbca9094f0, num_messages=1,
runtime=0x56392fa71130, remote_address_space=0, args=0x7fbbf0033430 <incomplete sequence \344>, arglen=124)
at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:12936
#11 0x00007fbc4e0e6c2c in Legion::Internal::VirtualChannel::process_message (this=0x7fbbca9094f0, args=0x7fbbf0033414, arglen=144,
runtime=0x56392fa71130, remote_address_space=0) at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:12060
#12 0x00007fbc4e0e9299 in Legion::Internal::MessageManager::receive_message (this=0x7fbbc8201b00, args=0x7fbbf0033410, arglen=152)
at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:13855
#13 0x00007fbc4e11f89e in Legion::Internal::Runtime::process_message_task (this=0x56392fa71130, args=0x7fbbf003340c, arglen=156)
at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:26889
#14 0x00007fbc4e136a09 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fbbf0033400, arglen=160, userdata=0x563931f83ab0,
userlen=8, p=...) at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:32365
#15 0x00007fbc4afbe3fe in Realm::LocalTaskProcessor::execute_task (this=0x56392e7c8710, func_id=4, task_args=...)
at /scratch/eslaught/resilience-test-network/legion/runtime/realm/proc_impl.cc:1176
#16 0x00007fbc4b03b1ea in Realm::Task::execute_on_processor (this=0x7fbbf0036390, p=...)
at /scratch/eslaught/resilience-test-network/legion/runtime/realm/tasks.cc:326
#17 0x00007fbc4b040564 in Realm::UserThreadTaskScheduler::execute_task (this=0x56392e8b06e0, task=0x7fbbf0036390)
at /scratch/eslaught/resilience-test-network/legion/runtime/realm/tasks.cc:1687
#18 0x00007fbc4b03e2ff in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x56392e8b06e0)
at /scratch/eslaught/resilience-test-network/legion/runtime/realm/tasks.cc:1160
To reproduce:
cd /scratch/eslaught/resilience-test-network
source experiment/sapling/env.sh
salloc -n 1 -N 1 -c 40 -p cpu --exclusive
i=0; while REALM_FREEZE_ON_ERROR=1 srun -n 2 -c 20 ./build/tests/region_destroy -level resilience=4; do let i++; echo $i; done
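Since the crash is intermittent, the loop above simply reruns the test until it fails. A minimal sketch of the same idea as a reusable helper, assuming bash; the function name `run_until_failure` is hypothetical and not part of the test setup:

```shell
# Hypothetical helper: rerun a command until it exits non-zero,
# printing the number of successful runs so far after each one.
run_until_failure() {
  local i=0
  while "$@"; do
    i=$((i + 1))
    echo "$i"
  done
  echo "command failed after $i successful run(s)" >&2
}

# Usage with the command from this report (env sets the variable for srun):
# run_until_failure env REALM_FREEZE_ON_ERROR=1 srun -n 2 -c 20 ./build/tests/region_destroy -level resilience=4
```

The last number printed is how many runs passed before the crash, which gives a rough sense of how often the failure occurs.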
To rebuild:
cd /scratch/eslaught/resilience-test-network
source experiment/sapling/env.sh
srun -n 1 -N 1 -c 4 -p cpu --exclusive --pty bash --login
cd legion/build
make install -j20
cd ../../build
make clean && make -j20