StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Crash in resilient test program on multiple nodes #1750

Closed by elliottslaughter 2 months ago

elliottslaughter commented 2 months ago

I'm seeing crashes in my multi-node tests for resilient applications on fairly recent versions of master (regent-resilience based on 4284cf6dc654fa2d42faf71a4f3d60c517c34277).

Backtrace:

#5  0x00007fbc4d80488e in std::min<unsigned long> (__a=<error reading variable>, __b=@0x7fbbc8014f70: 2204)
    at /usr/include/c++/9/bits/stl_algobase.h:203
#6  0x00007fbc4e0c074a in Legion::Internal::FutureInstance::copy_from (this=0x0, source=0x7fbbc8014f70, op=0x7fbbc80109c0, 
    precondition=...) at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:3370
#7  0x00007fbc4e0bc541 in Legion::Internal::FutureImpl::unpack_future_result (this=0x7fbbc8011240, derez=...)
    at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:2394
#8  0x00007fbc4e0bef2f in Legion::Internal::FutureImpl::handle_future_result (derez=..., runtime=0x56392fa71130)
    at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:3034
#9  0x00007fbc4e11b209 in Legion::Internal::Runtime::handle_future_result (this=0x56392fa71130, derez=...)
    at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:25278
#10 0x00007fbc4e0e806f in Legion::Internal::VirtualChannel::handle_messages (this=0x7fbbca9094f0, num_messages=1, 
    runtime=0x56392fa71130, remote_address_space=0, args=0x7fbbf0033430 <incomplete sequence \344>, arglen=124)
    at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:12936
#11 0x00007fbc4e0e6c2c in Legion::Internal::VirtualChannel::process_message (this=0x7fbbca9094f0, args=0x7fbbf0033414, arglen=144, 
    runtime=0x56392fa71130, remote_address_space=0) at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:12060
#12 0x00007fbc4e0e9299 in Legion::Internal::MessageManager::receive_message (this=0x7fbbc8201b00, args=0x7fbbf0033410, arglen=152)
    at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:13855
#13 0x00007fbc4e11f89e in Legion::Internal::Runtime::process_message_task (this=0x56392fa71130, args=0x7fbbf003340c, arglen=156)
    at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:26889
#14 0x00007fbc4e136a09 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fbbf0033400, arglen=160, userdata=0x563931f83ab0, 
    userlen=8, p=...) at /scratch/eslaught/resilience-test-network/legion/runtime/legion/runtime.cc:32365
#15 0x00007fbc4afbe3fe in Realm::LocalTaskProcessor::execute_task (this=0x56392e7c8710, func_id=4, task_args=...)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/proc_impl.cc:1176
#16 0x00007fbc4b03b1ea in Realm::Task::execute_on_processor (this=0x7fbbf0036390, p=...)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/tasks.cc:326
#17 0x00007fbc4b040564 in Realm::UserThreadTaskScheduler::execute_task (this=0x56392e8b06e0, task=0x7fbbf0036390)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/tasks.cc:1687
#18 0x00007fbc4b03e2ff in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x56392e8b06e0)
    at /scratch/eslaught/resilience-test-network/legion/runtime/realm/tasks.cc:1160

To reproduce:

cd /scratch/eslaught/resilience-test-network
source experiment/sapling/env.sh
salloc -n 1 -N 1 -c 40 -p cpu --exclusive
i=0; while REALM_FREEZE_ON_ERROR=1 srun -n 2 -c 20 ./build/tests/region_destroy -level resilience=4; do let i++; echo $i; done
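
The loop above reruns the test until it fails, which means it never stops if the crash does not show up. A bounded variant could look like the sketch below. The helper name `repeat_until_failure` and the run cap are illustrative assumptions, not part of Legion or the test harness.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command repeatedly until it fails
# or until a maximum number of runs is reached.
# Usage: repeat_until_failure MAX_RUNS COMMAND [ARGS...]
# Returns 1 if the command failed, 0 if every run succeeded.
repeat_until_failure() {
  local max_runs=$1
  shift
  local i=0
  while [ "$i" -lt "$max_runs" ]; do
    if ! "$@"; then
      # Report how many runs completed before the failing one.
      echo "failed after $i successful runs" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $i runs" >&2
  return 0
}
```

Assuming the same allocation and environment as above, the reproducer could then be invoked as `repeat_until_failure 100 env REALM_FREEZE_ON_ERROR=1 srun -n 2 -c 20 ./build/tests/region_destroy -level resilience=4`, where the cap of 100 is an arbitrary choice.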

To rebuild:

cd /scratch/eslaught/resilience-test-network
source experiment/sapling/env.sh
srun -n 1 -N 1 -c 4 -p cpu --exclusive --pty bash --login
cd legion/build
make install -j20
cd ../../build
make clean && make -j20

lightsighter commented 2 months ago

Fix here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1447/diffs?commit_id=a01209b36d0379203adb52673575595054d69a48

lightsighter commented 2 months ago

Merged.