syamajala closed this issue 8 months ago.
That's odd. From the callstack we do seem to be entering the phase barriers.
@syamajala Are you running into this crash consistently? What branch are you on?
Yes, it is very consistent. I think I am on an older version of control_replication, but I'm not 100% sure. I know I used this commit to scale to 8192 nodes on Frontier in November and I want to do runs at that scale again, so that's why I'm using this commit.
I see this on the latest control replication as well.
s3d.x: /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/dynamic_table.inl:307: Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, Realm::DynamicTable<ALLOCATOR>::ET**, Realm::DynamicTable<ALLOCATOR>::ET**) [with ALLOCATOR = Realm::DynamicTableAllocator<Realm::BarrierImpl, 10, 4>; Realm::DynamicTable<ALLOCATOR>::ET = Realm::BarrierImpl; Realm::DynamicTable<ALLOCATOR>::IT = long long unsigned int]: Assertion `(child != 0) && (child->level == (n_level - 1)) && (index >= child->first_index) && (index <= child->last_index)' failed.
Legion process received signal 6: Aborted
Process 3857564 on node gpu002 is frozen!
Legion process received signal 11: Segmentation fault
Process 4124403 on node gpu001 is frozen!
Legion process received signal 11: Segmentation fault
Process 4124403 on node gpu001 is frozen!
Legion process received signal 11: Segmentation fault
Process 3857564 on node gpu002 is frozen!
Stack trace:
#0 0x00001555527be238 in nanosleep () from /lib64/libc.so.6
#1 0x00001555527be13e in sleep () from /lib64/libc.so.6
#2 0x000015552f645be4 in Realm::realm_freeze (signal=11) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/runtime_impl.cc:204
#3 <signal handler called>
#4 0x000015553302bee6 in Legion::Internal::ApBarrier::operator Realm::Barrier (this=0xbd73af410bcfad4b) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/legion_types.h:2964
#5 0x000015553311090f in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., reduce_value=0x0, reduce_value_size=0) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/runtime.h:5563
#6 0x00001555336034b4 in Legion::Internal::LegionHandshakeImpl::legion_handoff_to_ext (this=0xbd73af410bcfad0b) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/runtime.cc:6734
#7 0x00001555331a80b2 in Legion::LegionHandshake::legion_handoff_to_ext (this=0x46b1050) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/legion.cc:2242
#8 0x0000155554792ec4 in Legion::MPILegionHandshake::legion_handoff_to_mpi (this=0x46b1050) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion.h:4384
#9 0x000015555479138d in legion_handoff_to_mpi () at s3d_rank_wrapper.cc:334
#10 0x0000155551e57ce9 in $<HandoffToMPITask> () from /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/build/shelld/libmpi_tasks.so
#11 0x0000155551e31119 in $__regent_task_HandoffToMPITask_primary () from /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/build/shelld/libmpi_tasks.so
#12 0x000015552f61f046 in Realm::LocalTaskProcessor::execute_task (this=0x4a4d6a0, func_id=36, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/proc_impl.cc:1176
#13 0x000015552f695d10 in Realm::Task::execute_on_processor (this=0x155082794950, p=...) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:326
#14 0x000015552f699c3c in Realm::KernelThreadTaskScheduler::execute_task (this=0x4a4da40, task=0x155082794950) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:1421
#15 0x000015552f698abb in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4a4da40) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:1160
#16 0x000015552f6990de in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x4a4da40) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:1272
#17 0x000015552f6a021a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x4a4da40) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/threads.inl:97
#18 0x000015552f6ac5a1 in Realm::KernelThread::pthread_entry (data=0x155082505ce0) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/threads.cc:831
#19 0x0000155552ac02de in start_thread () from /lib64/libpthread.so.0
#20 0x00001555527f1e83 in clone () from /lib64/libc.so.6
What is the hex value of bar in frame 5?
Not able to print bar:
>>> frame 5
#5 0x000015553311090f in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., reduce_value=0x0, reduce_value_size=0) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/runtime.h:5563
5563 Realm::Barrier copy = bar.phase_barrier;
>>> p/x bar
$1 = (const Legion::PhaseBarrier &) <error reading variable>
This is a full debug build.
If you can't read a variable on the stack like that, something terrible has gone wrong, like the OS protected the memory on the stack while trying to clean up the process. What does *this show in frame 6?
It must be some sort of corruption. So, clearly the event given at frame #10 was valid:
BarrierImpl *RuntimeImpl::get_barrier_impl(Event e) {
ID id(e);
assert(id.is_barrier());
otherwise we would have hit this assert.
Then we go into:
BarrierImpl *impl = n->barriers.lookup_entry(id.barrier_barrier_idx(),
id.barrier_creator_node());
which is the dynamic table lookup. It's certainly not a bug in the dynamic table... at least not a recent one, and we don't seem to have a chance here to mess anything up with the barrier itself, since we are merely trying to fetch the underlying allocation.
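For context, here is a minimal sketch of the kind of two-level lookup that assertion guards (a simplified stand-in, not Realm's actual DynamicTable):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for a Realm-style dynamic table: an inner node covers
// a contiguous index range and points at children one level down; leaf
// entries live at level 0.
struct Node {
    int level;
    unsigned long long first_index, last_index;
    std::vector<Node*> children;
};

// Descend one level, checking the same invariants the failed assertion
// checks: the child exists, sits one level down, and covers the index.
Node* lookup_child(Node* n, unsigned long long index) {
    unsigned long long span =
        (n->last_index - n->first_index + 1) / n->children.size();
    Node* child = n->children[(index - n->first_index) / span];
    assert(child != 0 && child->level == n->level - 1 &&
           index >= child->first_index && index <= child->last_index);
    return child;
}
```

A corrupted index (i.e. a scribbled barrier ID) or a scribbled child pointer trips that assertion even when the table itself is intact.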
You need to be a bit careful. The backtrace that @syamajala reported above is not actually the one that hit the assertion. It never even made it into Realm. It's still trying to copy one barrier on the stack to another and getting a segfault when doing that (which is very strange). Somewhere else there is a thread that actually hit the assertion. This reeks of memory corruption somewhere that is probably corrupting the name of the other barrier and might even be scribbling badly on this stack. @syamajala can you make a reproducer on sapling?
I will try to reproduce on sapling.
I am seeing a strange issue when I try to run on sapling.
output.target_procs is empty in map_task:
s3d.x: rhst_mapper.cc:590: virtual void RHSTMapper::map_task(MapperContext, const Task&, const Legion::Mapping::Mapper::MapTaskInput&, Legion::Mapping::Mapper::MapTaskOutput&): Assertion `output.target_procs.size() == 1' failed.
You're expecting it to be non-empty because it's replicated? Is it because you're running on just one node, and therefore there is no replication, so output.target_procs is not prepopulated?
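If the single-node case is the cause, the usual fix is to default the field before asserting on it. Here is a sketch with mock stand-ins for the Legion types (the real Task and MapTaskOutput come from the mapper API; falling back to target_proc is an assumption, not necessarily the right policy for this mapper):

```cpp
#include <cassert>
#include <vector>

// Mock stand-ins so the pattern is testable here; in a real mapper these
// would be Legion's Task and Mapper::MapTaskOutput.
using Processor = int;
struct Task          { Processor target_proc; };
struct MapTaskOutput { std::vector<Processor> target_procs; };

// Without replication (e.g. a single-node run) the runtime may hand the
// mapper an empty target_procs, so populate it before asserting on it.
void choose_targets(const Task& task, MapTaskOutput& output) {
    if (output.target_procs.empty())
        output.target_procs.push_back(task.target_proc);
    assert(output.target_procs.size() == 1);
}
```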
Yes I'm only running on 1 node. I'm not sure if I fixed it properly because the error I'm seeing now after populating output.target_procs is this:
s3d.x: /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:2304: void Realm::BarrierImpl::adjust_arrival(Realm::EventImpl::gen_t, int, Realm::Barrier::timestamp_t, Realm::Event, Realm::NodeID, bool, const void*, size_t, Realm::TimeLimit): Assertion `base_arrival_count > 0' failed.
There is a process on g0003: 2941108.
This looks like an application bug to me, like the way that the handshakes are being used is no longer correct. Can you run with -level barrier=1 and get it to crash again with a log file for what the barriers are doing?
It did crash with -level barrier=1. The log is here: /scratch2/seshu/legion_s3d_tdb/Shell1D/DBO_Test_1/run/run_0.log
I left the process 2965166 running on g0003.
It's not the barriers. There is heap corruption going on in this application. This run died in the logger while iterating over the list of streams for that logger:
#3 0x00007f9a5aa42103 in Realm::realm_freeze (signal=11) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/runtime_impl.cc:204
#4 <signal handler called>
#5 Realm::Logger::log_msg (this=0x7f9a5c083400 <Realm::log_barrier>, level=Realm::Logger::LEVEL_INFO, msgdata=0x7f902afde130 "barrier adjustment: event=2000000000100004 delta=-1 ts=0\020T:\b\220\177", msglen=56) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/logging.cc:584
#6 0x00007f9a72dc4dfb in Realm::LoggerMessage::~LoggerMessage (this=0x7f902afde0d0, __in_chrg=<optimized out>) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/logging.inl:590
#7 0x00007f9a5a8e2416 in Realm::BarrierImpl::adjust_arrival (this=0x7f90083a5630, barrier_gen=4, delta=-1, timestamp=0, wait_on=..., sender=0, forwarded=false, reduce_value=0x0, reduce_value_size=0, work_until=...) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:2262
#8 0x00007f9a5a8da158 in Realm::Barrier::arrive (this=0x7f902afde430, count=1, wait_on=..., reduce_value=0x0, reduce_value_size=0) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:513
#9 0x00007f9a5e5dd66e in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., reduce_value=0x0, reduce_value_size=0) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/legion/runtime.h:5564
The STL vector of streams in the logger has been completely stomped on and holds nonsensical pointers:
(gdb) p streams
$13 = std::vector of length 2, capacity 2 = {{
s = 0xffffff9cffffff9c,
min_level = 4294967196,
delete_when_done = 156,
flush_each_write = 255
}, {
s = 0xffffff9cffffff9c,
min_level = 4294967196,
delete_when_done = 156,
flush_each_write = 255
}}
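For illustration only (this is not the actual S3D bug): a single oversized fill into a region that neighbors another object reproduces exactly this kind of dump, where a previously healthy object suddenly holds a repeated garbage pattern. The struct below mimics the logger's per-stream record, and the -100 scribble value is chosen only because -100 is 0xffffff9c, the pattern in the dump above:

```cpp
#include <cstddef>
#include <cstring>

// Mimics the logger's per-stream record from the gdb dump (simplified).
struct Stream { void* s; unsigned min_level; };

// Carve two logical objects out of one buffer so the overrun stays inside
// a valid allocation here; in the real program these would be separate
// heap objects. Returns the neighbor's min_level after the overrun.
unsigned stomp_neighbor() {
    unsigned char heap[2 * sizeof(Stream)] = {};
    Stream healthy{nullptr, 3};
    std::memcpy(heap + sizeof(Stream), &healthy, sizeof healthy);

    // Bug: a fill sized for the whole region instead of just the first
    // object writes the 32-bit value -100 (0xffffff9c) over the neighbor.
    const int scribble = -100;
    for (std::size_t off = 0; off + sizeof scribble <= sizeof heap;
         off += sizeof scribble)
        std::memcpy(heap + off, &scribble, sizeof scribble);

    Stream stomped;
    std::memcpy(&stomped, heap + sizeof(Stream), sizeof stomped);
    return stomped.min_level;   // no longer 3: the neighbor is garbage
}
```

The returned value is 4294967196 (0xffffff9c), the same nonsense min_level shown in the stomped vector above.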
Can you try running with valgrind?
The last time I tried, valgrind no longer worked with Regent, and it's pretty difficult to test it with S3D.
The version of S3D we're seeing this crash in merges 3 different branches together that each work individually, so it's most likely I messed up the merge somewhere. I might try to look through the code first before I try valgrind.
Do what you think is best, but this is definitely heap corruption.
Given the proximity in time to the handshake, I wouldn't be surprised if it had something to do with the packing/unpacking of arguments crossing to/from fortran and the APIs being different, especially since that property isn't type-checked by any of the compilers, but that is just a guess.
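A hedged sketch of that class of bug (the function and names below are hypothetical): Fortran arrays index from 1, so a copy loop carried over from that convention over-runs a 0-based C++ buffer by one element. With a raw pointer destination the extra write lands silently in the neighboring heap object; vector::at() makes the overrun visible:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copies src into dst using a 1..n loop bound carried over from Fortran.
// dst.at(n) is one past the end of a 0-based buffer of size n; with a raw
// double* this write would silently corrupt the adjacent heap object.
// Returns false when the overrun is caught.
bool copy_to_fortran(const std::vector<double>& src, std::vector<double>& dst) {
    try {
        for (std::size_t i = 1; i <= src.size(); ++i)   // 1..n, Fortran-style
            dst.at(i) = src[i - 1];
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

With equal-sized buffers the copy steps out of bounds on the last element, which is the delayed, looks-unrelated corruption pattern seen throughout this thread.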
I was stepping out of bounds when copying an array to Fortran. It runs now.