StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

`remove_nested_valid_ref` assertion failure #1672

Closed by cmelone 2 months ago

cmelone commented 3 months ago

I'm hitting this assertion failure on certain test cases (GPU, 1 node), and only in debug mode. This is on the latest master.

legion/runtime/legion/legion_views.h:2427: bool Legion::Internal::LogicalView::remove_nested_valid_ref(Legion::DistributedID, int): Assertion `current >= cnt' failed.
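
For readers unfamiliar with this kind of check, here is a minimal sketch of the invariant that an assertion like `current >= cnt` guards: a reference count must never be decremented by more than it currently holds. The class `RefCounted` and its methods are hypothetical stand-ins written for illustration. They are not Legion's `LogicalView` implementation, and the exception replaces the `assert` only so the failure can be observed without aborting.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical stand-in for a reference-counted object (NOT Legion code).
class RefCounted {
public:
  // Record cnt additional references.
  void add_ref(int cnt = 1) { refs_.fetch_add(cnt); }

  // Remove cnt references. Returns true when the count drops to zero,
  // meaning the caller would be allowed to reclaim the object.
  bool remove_ref(int cnt = 1) {
    // fetch_sub returns the value held before the subtraction.
    const int current = refs_.fetch_sub(cnt);
    // Analogue of the failing check: more references are being removed
    // than were ever added. A debug build would abort at this point.
    if (current < cnt)
      throw std::logic_error("reference count underflow");
    return current == cnt;
  }

private:
  std::atomic<int> refs_{0};
};
```

An unbalanced remove (one more `remove_ref` than `add_ref`) trips the check, and that is the class of bug this assertion reports.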

Backtrace:

[Switching to thread 4 (Thread 0x7f27b6ff6c80 (LWP 388153))]
#0  0x00007f28170a79fd in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f28170a79fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f28170a7894 in sleep () from /lib64/libc.so.6
#2  0x00007f281a7f2364 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x00007f2817018387 in raise () from /lib64/libc.so.6
#5  0x00007f2817019a78 in abort () from /lib64/libc.so.6
#6  0x00007f28170111a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f2817011252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007f2819bc5139 in Legion::Internal::LogicalView::remove_nested_valid_ref (this=0x7f27a80346d0, source=1008806316530992849, cnt=1) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_views.h:2427
#9  0x00007f2819e21cfb in Legion::Internal::EquivalenceSet::filter_set (this=0x7f27a0044c60, analysis=..., expr=0x7f27c65253b0, expr_covers=true, filter_mask=..., applied_events=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_analysis.cc:18577
#10 0x00007f2819df8280 in Legion::Internal::FilterAnalysis::perform_analysis (this=0x7f27a402fc80, set=0x7f27a0044c60, expr=0x7f27c65253b0, expr_covers=true, mask=..., applied_events=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_analysis.cc:11098
#11 0x00007f2819dfe392 in Legion::Internal::EquivalenceSet::analyze (this=0x7f27a0044c60, analysis=..., expr=0x7f27c65253b0, expr_covers=true, traversal_mask=..., deferral_events=..., applied_events=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_analysis.cc:12387
#12 0x00007f2819de4188 in Legion::Internal::PhysicalAnalysis::analyze (this=0x7f27a402fc80, set=0x7f27a0044c60, mask=..., deferral_events=..., applied_events=..., precondition=..., already_deferred=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_analysis.cc:7505
#13 0x00007f2819de4bc9 in Legion::Internal::PhysicalAnalysis::perform_traversal (this=0x7f27a402fc80, precondition=..., info=..., applied_events=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_analysis.cc:7673
#14 0x00007f2819df822c in Legion::Internal::FilterAnalysis::perform_traversal (this=0x7f27a402fc80, precondition=..., info=..., applied_events=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_analysis.cc:11086
#15 0x00007f2819f8bc50 in Legion::Internal::RegionTreeForest::detach_external (this=0x46b0010, req=..., detach_op=0x7f27ce50b4c0, index=0, version_info=..., instances=..., termination_event=..., trace_info=..., map_applied_events=..., filter_precondition=..., 
    second_analysis=true) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/region_tree.cc:2953
#16 0x00007f2819a0be4b in Legion::Internal::DetachOp::trigger_mapping (this=0x7f27ce50b4c0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/legion_ops.cc:21770
#17 0x00007f281a09b19d in Legion::Internal::Runtime::legion_runtime_task (args=0x7f27a8034370, arglen=12, userdata=0x49bcfc0, userlen=8, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/legion/runtime.cc:32268
#18 0x00007f281ab2553a in Realm::LocalTaskProcessor::execute_task (this=0x3221e90, func_id=4, task_args=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/proc_impl.cc:1176
#19 0x00007f281a962d70 in Realm::Task::execute_on_processor (this=0x7f27a80341f0, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/tasks.cc:326
#20 0x00007f281a966c88 in Realm::KernelThreadTaskScheduler::execute_task (this=0x32221d0, task=0x7f27a80341f0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/tasks.cc:1421
#21 0x00007f281a965b06 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x32221d0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/tasks.cc:1160
#22 0x00007f281a96611c in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x32221d0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/tasks.cc:1272
#23 0x00007f281a96d394 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x32221d0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/threads.inl:97
#24 0x00007f281a93d513 in Realm::KernelThread::pthread_entry (data=0x7f27c6578430) at /home/hpcc/gitlabci/psaap-ci/artifacts/6516394368/legion/runtime/realm/threads.cc:831
#25 0x00007f2816bc5ea5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f28170e0b0d in clone () from /lib64/libc.so.6

lightsighter commented 3 months ago

Make me a reproducer on Sapling.

lightsighter commented 3 months ago

If possible, build the reproducer with `-DDEBUG_LEGION_GC -DLEGION_GC`, but only if it still reproduces.

cmelone commented 3 months ago

Hitting a different assertion with those flags:

prometeo_CH41StMix.exec: /home/cmelone/april/legion/runtime/legion/garbage_collection.cc:475: bool Legion::Internal::DistributedCollectable::remove_base_gc_ref_internal(Legion::Internal::ReferenceSource, int): Assertion `finder != detailed_base_gc_references.end()' failed.

The reproducer is at `/home/cmelone/april`. Run `REBUILD=0 ./run.sh` to submit the Slurm job.
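
As a rough illustration of what a check like `finder != detailed_base_gc_references.end()` expresses, the sketch below tracks references per source in a map and rejects a removal from a source that has no recorded references. Everything here (`DetailedRefs`, the integer source IDs, the exceptions in place of `assert`) is an assumption made for illustration and does not reproduce Legion's `DistributedCollectable`.

```cpp
#include <map>
#include <stdexcept>

// Hypothetical per-source reference tracker (NOT Legion code).
class DetailedRefs {
public:
  // Record cnt references held on behalf of the given source.
  void add_ref(int source, int cnt = 1) { refs_[source] += cnt; }

  // Remove cnt references for the given source.
  // Returns true when no source holds any reference anymore.
  bool remove_ref(int source, int cnt = 1) {
    auto finder = refs_.find(source);
    // Analogue of the failing check: this source never added a reference,
    // or has already removed all of the references it added.
    if (finder == refs_.end())
      throw std::logic_error("no references recorded for this source");
    if (finder->second < cnt)
      throw std::logic_error("source removed more references than it added");
    finder->second -= cnt;
    if (finder->second == 0)
      refs_.erase(finder);  // Drop exhausted sources so find() fails next time.
    return refs_.empty();
  }

private:
  std::map<int, int> refs_;
};
```

Compared with a single aggregate counter, per-source tracking of this kind can point at which holder made the unbalanced removal, at the cost of extra bookkeeping.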

lightsighter commented 3 months ago

Try again with this branch: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1201. It's not going to fix the issue, but it will fix a false positive that you're getting right now in the reference checking code.

cmelone commented 2 months ago

Thanks, that assertion is now gone. I was having trouble reproducing the issue on Sapling, but it turns out there is only about a 1-in-10 chance of hitting the original error per run, so I updated the script to submit 10 jobs at a time.

The error still reproduces with the CXXFLAGS you requested.
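
To put the batch size in perspective, here is a small sketch of the arithmetic, assuming each run fails independently with the same probability (the thread only gives the rough 1-in-10 figure, so the independence assumption and both helper names are mine). The chance of seeing at least one failure in n runs is 1 - (1 - p)^n.

```cpp
#include <cmath>

// Probability that at least one of `runs` independent runs hits the failure,
// given that a single run fails with probability `per_run_chance`.
double chance_of_at_least_one_failure(double per_run_chance, int runs) {
  return 1.0 - std::pow(1.0 - per_run_chance, runs);
}

// Smallest number of independent runs needed so that the chance of seeing
// at least one failure reaches `target` (with 0 < target < 1).
int runs_needed(double per_run_chance, double target) {
  return static_cast<int>(
      std::ceil(std::log(1.0 - target) / std::log(1.0 - per_run_chance)));
}
```

Under these assumptions, a batch of 10 jobs hits the failure roughly 65% of the time, and about 29 jobs would be needed for a 95% chance, so a clean batch of 10 is weak evidence that a fix worked.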

cmelone commented 2 months ago

@elliottslaughter please add to #1032, thanks!

lightsighter commented 2 months ago

Where is your Legion source code and how do I rebuild if I change something?

cmelone commented 2 months ago

Legion is at `~/april/legion` and can be rebuilt by running `REBUILD=1 ./run.sh` (this won't recompile HTR).

lightsighter commented 2 months ago

Try the most recent Legion master branch.

cmelone commented 2 months ago

Looks great, thanks Mike!