StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

`finder != operations.end()` assertion failure #1686

Closed by cmelone 1 month ago

cmelone commented 2 months ago

Running at 4 nodes, debug mode, with GPUs. This regression was introduced by https://gitlab.com/StanfordLegion/legion/-/commit/12d5a56fe5b07975c2f1d70b4df156fb9c684949

prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:10301: virtual void Legion::Internal::IssueCopy::execute(std::vector<Legion::Internal::ApEvent>&, std::map<unsigned int, Legion::Internal::ApUserEvent>&, std::map<Legion::Internal::ContextCoordinate, Legion::Internal::MemoizableOp*>&, bool): Assertion `finder != operations.end()' failed.

backtrace:

#0  0x00007fd8b7cc89fd in nanosleep () from /lib64/libc.so.6
#1  0x00007fd8b7cc8894 in sleep () from /lib64/libc.so.6
#2  0x00007fd8bb3f9086 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x00007fd8b7c39387 in raise () from /lib64/libc.so.6
#5  0x00007fd8b7c3aa78 in abort () from /lib64/libc.so.6
#6  0x00007fd8b7c321a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007fd8b7c32252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007fd8ba87bc09 in Legion::Internal::IssueCopy::execute (this=0x7fd824146010, events=..., user_events=..., operations=..., recurrent_replay=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:10301
#9  0x00007fd8ba8610a6 in Legion::Internal::PhysicalTemplate::execute_slice (this=0x7fd8259474e0, slice_idx=0, recurrent_replay=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:4703
#10 0x00007fd8ba870e40 in Legion::Internal::PhysicalTemplate::handle_replay_slice (args=0x7fd8183ab600) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:7998
#11 0x00007fd8baca3134 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fd8183ab600, arglen=20, userdata=0x447bd10, userlen=8, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/runtime.cc:32556
#12 0x00007fd8bb72bd44 in Realm::LocalTaskProcessor::execute_task (this=0x445db60, func_id=4, task_args=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/proc_impl.cc:1176
#13 0x00007fd8bb569526 in Realm::Task::execute_on_processor (this=0x7fd8183ab480, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:326
#14 0x00007fd8bb56d43e in Realm::KernelThreadTaskScheduler::execute_task (this=0x445ded0, task=0x7fd8183ab480) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1421
#15 0x00007fd8bb56c2bc in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x445ded0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1160
#16 0x00007fd8bb56c8d2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x445ded0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1272
#17 0x00007fd8bb573b4a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x445ded0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.inl:97
#18 0x00007fd8bb543e73 in Realm::KernelThread::pthread_entry (data=0x7fd81546df90) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.cc:831
#19 0x00007fd8b77e6ea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fd8b7d01b0d in clone () from /lib64/libc.so.6
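
For readers unfamiliar with the failing check: the assertion text and the `operations` parameter type in the message above suggest a lookup-then-assert on a `std::map`. The following is only a minimal sketch of that pattern, not the code in `legion_trace.cc`. `Coord`, `Op`, `find_op`, and `find_op_checked` are hypothetical names introduced for illustration and stand in for Legion's `ContextCoordinate` and `MemoizableOp`.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-ins for illustration only; not Legion's real types.
using Coord = std::pair<unsigned, unsigned>;  // stands in for ContextCoordinate
struct Op { int id; };                        // stands in for MemoizableOp

// Non-aborting lookup: returns nullptr when the key is missing, which is
// the situation the failed assertion reports.
inline Op* find_op(const std::map<Coord, Op*>& operations, const Coord& key) {
  auto finder = operations.find(key);
  if (finder == operations.end())
    return nullptr;
  return finder->second;
}

// Asserting lookup, mirroring the shape of the check in the error message.
inline Op* find_op_checked(const std::map<Coord, Op*>& operations,
                           const Coord& key) {
  auto finder = operations.find(key);
  assert(finder != operations.end());  // aborts in debug builds if key is absent
  return finder->second;
}
```

In this sketch, calling `find_op_checked` with a key that was never registered in `operations` aborts with the same kind of message when assertions are enabled, and `find_op` can be used to test for the missing key without aborting.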

@lightsighter

note to self: this is channel flow 8x2x2

lightsighter commented 2 months ago

Make me a reproducer on sapling. You guys had two months to test this. Why are you just reporting it now?

seemamirch commented 2 months ago

It doesn't reproduce on sapling with the same setup: ChannelFlow, 8x2x2, 4 nodes, GPUs, debug mode, using HTR Develop branch commit fbaf5141 and legion commit 12d5a56fe.

cmelone commented 2 months ago

I also cannot reproduce on sapling or on Lassen. The cluster I found the error on is down this week, so I will need to check again once it's back up.

lightsighter commented 2 months ago

This error should be deterministic when it does occur. Are we sure we are running exactly the same configuration on both machines?

cmelone commented 1 month ago

No longer able to reproduce