StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

[HTR] segfault at 4 nodes #1677

Status: Closed (cmelone closed this issue 2 months ago)

cmelone commented 2 months ago

Segfault on GPUs at 4 nodes with the latest master. I cannot reproduce it in debug mode.

backtrace:

#0  0x00007f2e255319fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f2e25531894 in sleep () from /lib64/libc.so.6
#2  0x00007f2e26f3cc72 in Realm::realm_freeze (signal=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x00007f2e267c3ae6 in std::_Rb_tree<Legion::Internal::RtEvent, Legion::Internal::RtEvent, std::_Identity<Legion::Internal::RtEvent>, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >::_M_get_insert_unique_pos (this=this@entry=0x0, __k=...) at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/bits/stl_tree.h:2044
#5  0x00007f2e267c3b81 in std::_Rb_tree<Legion::Internal::RtEvent, Legion::Internal::RtEvent, std::_Identity<Legion::Internal::RtEvent>, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >::_M_insert_unique<Legion::Internal::RtEvent const&> (this=0x0, __v=...)
    at /opt/ohpc/pub/compiler/gcc/8.3.0/include/c++/8.3.0/bits/stl_tree.h:2098
#6  0x00007f2e2678fe03 in insert (__x=..., this=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/legion_ops.h:359
#7  add_mapping_dependence (dependence=..., this=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/legion_ops.h:359
#8  Legion::Internal::Operation::begin_dependence_analysis (this=this@entry=0x7f2ddf5fb2e0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/legion_ops.cc:2400
#9  0x00007f2e2679007b in Legion::Internal::Operation::execute_dependence_analysis (this=0x7f2ddf5fb2e0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/legion_ops.cc:1631
#10 0x00007f2e26859ba3 in Legion::Internal::InnerContext::process_dependence_stage() ()
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/legion_context.cc:8470
#11 0x00007f2e26859d09 in Legion::Internal::InnerContext::handle_dependence_stage (args=args@entry=0x7f2de8544990)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/legion_context.cc:12228
#12 0x00007f2e26b257a0 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f2de8544990, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/legion/runtime.cc:32255
#13 0x00007f2e2712aee5 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) ()
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/realm/bytearray.inl:150
#14 0x00007f2e27026f33 in Realm::Task::execute_on_processor(Realm::Processor) () at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/realm/runtime_impl.h:521
#15 0x00007f2e27026fc6 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/realm/tasks.cc:1687
#16 0x00007f2e27024b43 in Realm::ThreadedTaskScheduler::scheduler_loop() () at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/realm/tasks.cc:1158
#17 0x00007f2e2700ef0f in Realm::UserThread::uthread_entry() () at /home/hpcc/gitlabci/psaap-ci/artifacts/6544837928/legion/runtime/realm/threads.cc:1405
#18 0x00007f2e254b4190 in ?? () from /lib64/libc.so.6
#19 0x0000000000000000 in ?? ()

@elliottslaughter please add to #1032, thanks!

lightsighter commented 2 months ago

Make a reproducer, but build it with -g -O2. I'm going to be honest: this looks like memory corruption from the application to me.

cmelone commented 2 months ago

My mistake: the CI was missing a few of the latest commits. 9ec4b87f resolves the issue.