Closed: syamajala closed this issue 2 years ago.
@syamajala can you see if the error also happens with the gasnetex network layer?
How is this different from #1139? It's exactly the same backtrace, probably dying in exactly the same way. I'm betting the channel
value is corrupted just like in #1139; from Legion's perspective, that's literally the first couple of bytes in the message.
I guess it is the same.
Also, the stack trace I see with gasnetex is this:
[60] Thread 8 (Thread 0x7f1c20ae2700 (LWP 46302) "s3d.x"):
[60] #0 0x00007f1c3cf3e217 in waitpid () from /lib64/libc.so.6
[60] #1 0x00007f1c3cebb76f in do_system () from /lib64/libc.so.6
[60] #2 0x00007f1c37f72b60 in gasneti_system_redirected () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #3 0x00007f1c37f732d2 in gasneti_bt_gdb () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #4 0x00007f1c37f76bc5 in gasneti_print_backtrace () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #5 0x00007f1c37742825 in gasneti_defaultSignalHandler () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #6 <signal handler called>
[60] #7 0x00007f1c38d8679a in Legion::Internal::VirtualChannel::process_message (this=0x7f4118080000, args=0x7f007553f724, arglen=48, runtime=0x61, remote_address_space=379254096) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:11697
[60] #8 0x00007f1c38d86e00 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f007553f710, arglen=56, userdata=<optimized out>, userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31072
[60] #9 0x00007f1c379c9872 in Realm::LocalTaskProcessor::execute_task (this=0x55ed4c0, func_id=4, task_args=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/bytearray.inl:58
[60] #10 0x00007f1c37a0e373 in Realm::Task::execute_on_processor (this=0x7f007553f230, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:306
[60] #11 0x00007f1c37a0e416 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1646
[60] #12 0x00007f1c37a10a62 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x10e14aa0) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1127
[60] #13 0x00007f1c37a18250 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
[60] #14 0x00007f1c3cec3ca0 in ?? () from /lib64/libc.so.6
[60] #15 0x0000000000000000 in ?? ()
If this is failing with the gasnetex network layer, I'm less inclined to believe it's packet corruption, as there are checksums on every packet header and most payloads (including any payloads for task spawn messages). @syamajala can you see if the failure will reproduce with -level gexmsg=2 logging?
Yes, it still crashes with -level gexmsg=2.
@lightsighter thinks my 2048-rank job on Selene crashed in the same way. Here's a not-so-informative backtrace:
Signal 11 received by process 632454 (thread 7f105c0a1000) at: stack trace: 9 frames
[0] = /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f107d9c53c0]
[1] = /opt/legate/install/lib/liblegion.so(Legion::Internal::MessageManager::receive_message(void const*, unsigned long)+0x3) [0x7f1080aa8c83]
[/opt/legate/install/bin/bind.sh: line 106: 632454 Segmentation fault (core dumped) numactl "$@"
@streichler I put the log from the -level gexmsg=2 run here: http://sapling.stanford.edu/~seshu/perlmutter_crash/
@magnatelee and/or @syamajala can you dump the contents of the realm task arguments that are causing the segfault? E.g., go up to the Legion::Internal::Runtime::legion_runtime_task frame and do the following in gdb:
p/x ((unsigned char *)args)[0]@arglen
That way @lightsighter can explain what the corruption looks like and we can try to catch it earlier. In particular, we can see if it's already wrong before the message goes over the network.
My suspicion is that someone is stomping on the buffer either before or after it goes over gasnet. These parts of the message header are supposed to be immutable for an entire run: each virtual channel sets them up in its constructor, and they should never change no matter how many times that virtual channel's manager sends a message.
It does seem to have something to do with message tasks, though, because there's no evidence so far of anything similar happening with other kinds of Legion meta-tasks.
@streichler
(gdb) p/x ((unsigned char *)args)[0]@arglen
$1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x72, 0x0, 0x0, 0x0, 0x70, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0,
0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0xb5, 0x0 <repeats 11 times>, 0x14, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff,
0xff, 0xff, 0x1, 0x0, 0x0, 0x0}
The NERSC guys said there is an opportunity for me to run S3D on the full machine (Perlmutter). Is this an easy fix?
First eight bytes are the implicit provenance, no-op for what we care about (0 means unknown sender which is unusual, but not completely out of the question especially given the message ID below)
Next four bytes are the task ID, which in this case looks right with 114
Next four bytes are the sender ID, 0x570 or 1392 (were you running with at least 1392 ranks?)
Next four bytes are the channel ID, all zeros are the default channel which seems fine
Next four bytes are the header, 1 is a full message which looks good
Next four bytes are the number of messages, 1 also looks good
Next four bytes are the message kind; this is message 904, MPI_RANK_EXCHANGE, which is quite an unusual one: it happens at the beginning of a run to set up the MPI interop
Rest of the data is payload for that message that I'm not going to parse right now unless we need it.
Where was the segfault for this one? Everything seems to be in-bounds assuming you had that many ranks.
I believe the seg fault was in receive_message again. This was what I saw in gdb:
[Switching to thread 8 (Thread 0x7f1a88017700 (LWP 53493))]
#0 0x00007f1a9d27e4c0 in nanosleep () from /lib64/libc.so.6
(gdb) where
#0 0x00007f1a9d27e4c0 in nanosleep () from /lib64/libc.so.6
#1 0x00007f1a9d27e3ca in sleep () from /lib64/libc.so.6
#2 0x00007f1a97a78fd9 in gasneti_freezeForDebuggerNow ()
from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
#3 0x00007f1a97a81e8d in gasneti_defaultSignalHandler ()
from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
#4 <signal handler called>
#5 0x00007f1a990c6cb3 in Legion::Internal::MessageManager::receive_message (this=0x1d012e0000000002, args=0x7f11c9925240, arglen=48)
at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:13154
#6 0x00007f1a990c6d09 in Legion::Internal::Runtime::process_message_task (this=<optimized out>, args=<optimized out>,
arglen=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:25599
#7 0x00007f1a990c6e00 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f11c9925230, arglen=56, userdata=<optimized out>,
userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31072
#8 0x00007f1a97d08dc2 in Realm::LocalTaskProcessor::execute_task (this=0x55065b0, func_id=4, task_args=...)
at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/bytearray.inl:58
#9 0x00007f1a97d4d8c3 in Realm::Task::execute_on_processor (this=0x7f11c9924d50, p=...)
at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:306
#10 0x00007f1a97d4d966 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>)
at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1646
#11 0x00007f1a97d4ffb2 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x18c9e010)
at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1127
#12 0x00007f1a97d577a0 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
#13 0x00007f1a9d203ca0 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
The issue seems to be that the message manager pointer is aligned on a 2-byte boundary (this=0x1d012e0000000002), which should never happen. I can't see any obvious reason this would occur. Whatever is causing it will be very subtle and hard to find.
@lightsighter where is the pointer to the message manager stored? 0x1d012e0000000002 looks like a Realm processor ID for a utility processor on rank 302 (if the command line I have from @syamajala applies here). Maybe something before it in the containing data structure has an overrun for large node counts?
@syamajala and @magnatelee: what are you setting this to?
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_config.h#L170
@lightsighter aha, I was using the default (1024), so the crash was expected. I can increase the value and give it another try.
That said, I thought we had some safeguard for this case. I guess we need to figure out why I couldn't get a more informative error message.
It appears we don't have a check for this value, which I will add.
It looks like we did have a check because there was an error code for it, but it got deleted at some point during a refactoring.
I added some "goop" to the cmake config to expose these as parameters and also verify that they are a power-of-two. I'll provide a PR once I kick the tires a bit more.
That fixes the segfault for me. I'm still seeing unrelated GASNet issues at 384 nodes on Perlmutter, though. I'll ping the GASNet guys.
Ok, I'm closing this issue and #1139 now that I've added checks back into the runtime for this.
I'm seeing a segfault in receive_message when I scale S3D on Perlmutter using gasnet1 and the ucx conduit. At 384 nodes I start seeing this: