StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

legion: Seg fault in receive_message on Perlmutter #1159

Closed syamajala closed 2 years ago

syamajala commented 2 years ago

I'm seeing a seg fault in receive_message when I scale S3D on Perlmutter using gasnet1 and the ucx conduit. At 384 nodes I start seeing this:

[39] Thread 8 (Thread 0x7f3d012f8700 (LWP 60436) "s3d.x"):
[39] #0  0x00007f3d19b55217 in waitpid () from /lib64/libc.so.6
[39] #1  0x00007f3d19ad276f in do_system () from /lib64/libc.so.6
[39] #2  0x00007f3d14b864d0 in gasneti_system_redirected () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #3  0x00007f3d14b86c42 in gasneti_bt_gdb () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #4  0x00007f3d14b8a535 in gasneti_print_backtrace () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #5  0x00007f3d14364cdb in gasneti_defaultSignalHandler () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[39] #6  <signal handler called>
[39] #7  0x00007f3d1599e143 in Legion::Internal::MessageManager::receive_message (this=0x1d00560000000001, args=0x7f37fd2b7a70, arglen=48) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:13153
[39] #8  0x00007f3d1599e199 in Legion::Internal::Runtime::process_message_task (this=<optimized out>, args=<optimized out>, arglen=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:25599
[39] #9  0x00007f3d1599e290 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f37fd2b7a60, arglen=56, userdata=<optimized out>, userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31076
[39] #10 0x00007f3d145ebac2 in Realm::LocalTaskProcessor::execute_task (this=0x5670190, func_id=4, task_args=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/bytearray.inl:58
[39] #11 0x00007f3d146305c3 in Realm::Task::execute_on_processor (this=0x1963ca50, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:306
[39] #12 0x00007f3d14630666 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1646
[39] #13 0x00007f3d14632cb2 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x1280b010) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1127
[39] #14 0x00007f3d1463a4a0 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
[39] #15 0x00007f3d19adaca0 in ?? () from /lib64/libc.so.6
[39] #16 0x0000000000000000 in ?? ()
streichler commented 2 years ago

@syamajala can you see if the error also happens with the gasnetex network layer?

lightsighter commented 2 years ago

How is this different from #1139? It's exactly the same backtrace, probably dying in exactly the same way. I'm betting the channel value is corrupted just like in #1139; that value is literally in the first couple of bytes of the message from Legion's perspective.

syamajala commented 2 years ago

I guess it is the same.

Also, the stack trace I see with gasnetex is this:

[60] Thread 8 (Thread 0x7f1c20ae2700 (LWP 46302) "s3d.x"):
[60] #0  0x00007f1c3cf3e217 in waitpid () from /lib64/libc.so.6
[60] #1  0x00007f1c3cebb76f in do_system () from /lib64/libc.so.6
[60] #2  0x00007f1c37f72b60 in gasneti_system_redirected () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #3  0x00007f1c37f732d2 in gasneti_bt_gdb () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #4  0x00007f1c37f76bc5 in gasneti_print_backtrace () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #5  0x00007f1c37742825 in gasneti_defaultSignalHandler () from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
[60] #6  <signal handler called>
[60] #7  0x00007f1c38d8679a in Legion::Internal::VirtualChannel::process_message (this=0x7f4118080000, args=0x7f007553f724, arglen=48, runtime=0x61, remote_address_space=379254096) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:11697
[60] #8  0x00007f1c38d86e00 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f007553f710, arglen=56, userdata=<optimized out>, userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31072
[60] #9  0x00007f1c379c9872 in Realm::LocalTaskProcessor::execute_task (this=0x55ed4c0, func_id=4, task_args=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/bytearray.inl:58
[60] #10 0x00007f1c37a0e373 in Realm::Task::execute_on_processor (this=0x7f007553f230, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:306
[60] #11 0x00007f1c37a0e416 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1646
[60] #12 0x00007f1c37a10a62 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x10e14aa0) at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1127
[60] #13 0x00007f1c37a18250 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
[60] #14 0x00007f1c3cec3ca0 in ?? () from /lib64/libc.so.6
[60] #15 0x0000000000000000 in ?? ()
streichler commented 2 years ago

If this is failing with the gasnetex network layer, I'm less inclined to believe it's packet corruption, as there are checksums on every packet header and most payloads (including any payloads for task spawn messages). @syamajala can you see if the failure will reproduce with -level gexmsg=2 logging?

syamajala commented 2 years ago

Yes it still crashes with -level gexmsg=2.

magnatelee commented 2 years ago

@lightsighter thinks my 2048-rank job on Selene crashed in the same way. Here's a not-so-informative backtrace:

Signal 11 received by process 632454 (thread 7f105c0a1000) at: stack trace: 9 frames
  [0] = /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f107d9c53c0]
  [1] = /opt/legate/install/lib/liblegion.so(Legion::Internal::MessageManager::receive_message(void const*, unsigned long)+0x3) [0x7f1080aa8c83]
  [/opt/legate/install/bin/bind.sh: line 106: 632454 Segmentation fault      (core dumped) numactl "$@"
syamajala commented 2 years ago

@streichler i put the log from the -level gexmsg=2 run here: http://sapling.stanford.edu/~seshu/perlmutter_crash/

streichler commented 2 years ago

@magnatelee and/or @syamajala can you dump the contents of the Realm task arguments that are causing the segfault? E.g. go up to the Legion::Internal::Runtime::legion_runtime_task frame and run the following in gdb:

p/x ((unsigned char *)args)[0]@arglen

That way @lightsighter can explain what the corruption looks like and we can try to catch it earlier. In particular, we can see whether it's already wrong before the message goes over the network.

lightsighter commented 2 years ago

My suspicion is that someone is stomping on the buffer either before or after it goes over GASNet. These parts of the message header are supposed to be immutable for an entire run: each virtual channel sets them up in its constructor, and they shouldn't change any time that virtual channel sends a message.

It does seem to have something to do with message tasks though, because there's no evidence so far of anything similar happening with other kinds of Legion meta-tasks.

syamajala commented 2 years ago

@streichler

(gdb) p/x ((unsigned char *)args)[0]@arglen
$1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x72, 0x0, 0x0, 0x0, 0x70, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 
  0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0xb5, 0x0 <repeats 11 times>, 0x14, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 
  0xff, 0xff, 0x1, 0x0, 0x0, 0x0}
syamajala commented 2 years ago

The NERSC guys said there is an opportunity for me to run S3D on the full machine (Perlmutter). Is this an easy fix?

lightsighter commented 2 years ago

- First eight bytes are the implicit provenance; a no-op for what we care about (0 means unknown sender, which is unusual, but not completely out of the question, especially given the message kind below).
- Next four bytes are the task ID, which in this case looks right with 114.
- Next four bytes are the sender ID, 0x570 or 1392 (were you running with at least 1392 ranks?).
- Next four bytes are the channel ID; all zeros is the default channel, which seems fine.
- Next four bytes are the header; 1 is a full message, which looks good.
- Next four bytes are the number of messages; 1 also looks good.
- Next four bytes are the message kind; this is message 904 MPI_RANK_EXCHANGE, which is quite an unusual one. It happens at the beginning of a run to set up the MPI interop.
- The rest of the data is payload for that message, which I'm not going to parse right now unless we need it.

Where was the segfault for this one? Everything seems to be in-bounds assuming you had that many ranks.
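
To make that byte-by-byte reading easier to follow, here is a minimal decoding sketch. The struct and its field names are hypothetical stand-ins for the layout described above, not Legion's actual header definitions, and the byte values are copied from the gdb dump earlier in the thread:

// Hypothetical mirror of the header layout described above. The struct and
// field names are illustrative only, not the actual Legion definitions.
#include <cstdint>
#include <cstdio>
#include <cstring>

struct MessageHeaderSketch {
  uint64_t provenance;   // bytes 0-7:   implicit provenance (0 = unknown sender)
  uint32_t task_id;      // bytes 8-11:  0x72 = 114
  uint32_t sender_id;    // bytes 12-15: 0x570 = 1392
  uint32_t channel_id;   // bytes 16-19: 0 = default channel
  uint32_t header;       // bytes 20-23: 1 = full message
  uint32_t num_messages; // bytes 24-27: 1
  uint32_t message_kind; // bytes 28-31: message kind (MPI_RANK_EXCHANGE per the analysis above)
};

int main() {
  // First 32 bytes of the gdb dump posted earlier in this thread.
  const unsigned char dump[32] = {
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x72,0x00,0x00,0x00,
    0x70,0x05,0x00,0x00, 0x00,0x00,0x00,0x00, 0x01,0x00,0x00,0x00,
    0x01,0x00,0x00,0x00, 0xb5,0x00,0x00,0x00};
  MessageHeaderSketch h;
  std::memcpy(&h, dump, sizeof(h)); // assumes a little-endian host and no struct padding
  std::printf("task=%u sender=%u channel=%u header=%u count=%u kind=%u\n",
              (unsigned)h.task_id, (unsigned)h.sender_id, (unsigned)h.channel_id,
              (unsigned)h.header, (unsigned)h.num_messages, (unsigned)h.message_kind);
  return 0;
}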

syamajala commented 2 years ago

I believe the seg fault was in receive_message again. This was what I saw in gdb:

[Switching to thread 8 (Thread 0x7f1a88017700 (LWP 53493))]
#0  0x00007f1a9d27e4c0 in nanosleep () from /lib64/libc.so.6
(gdb) where
#0  0x00007f1a9d27e4c0 in nanosleep () from /lib64/libc.so.6
#1  0x00007f1a9d27e3ca in sleep () from /lib64/libc.so.6
#2  0x00007f1a97a78fd9 in gasneti_freezeForDebuggerNow ()
   from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
#3  0x00007f1a97a81e8d in gasneti_defaultSignalHandler ()
   from /global/homes/s/seshu/legion_s3d/legion/language/build/lib/librealm.so.1
#4  <signal handler called>
#5  0x00007f1a990c6cb3 in Legion::Internal::MessageManager::receive_message (this=0x1d012e0000000002, args=0x7f11c9925240, arglen=48)
    at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:13154
#6  0x00007f1a990c6d09 in Legion::Internal::Runtime::process_message_task (this=<optimized out>, args=<optimized out>, 
    arglen=<optimized out>) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:25599
#7  0x00007f1a990c6e00 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f11c9925230, arglen=56, userdata=<optimized out>, 
    userlen=<optimized out>, p=...) at /global/u1/s/seshu/legion_s3d/legion/runtime/legion/runtime.cc:31072
#8  0x00007f1a97d08dc2 in Realm::LocalTaskProcessor::execute_task (this=0x55065b0, func_id=4, task_args=...)
    at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/bytearray.inl:58
#9  0x00007f1a97d4d8c3 in Realm::Task::execute_on_processor (this=0x7f11c9924d50, p=...)
    at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:306
#10 0x00007f1a97d4d966 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>)
    at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1646
#11 0x00007f1a97d4ffb2 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x18c9e010)
    at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/tasks.cc:1127
#12 0x00007f1a97d577a0 in Realm::UserThread::uthread_entry () at /global/u1/s/seshu/legion_s3d/legion/runtime/realm/threads.cc:1337
#13 0x00007f1a9d203ca0 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
lightsighter commented 2 years ago

The issue seems to be that the message manager pointer is aligned on a 2-byte boundary, which should never happen (this=0x1d012e0000000002). I can't see any obvious reason this would occur. Whatever is causing it will be very subtle and hard to find.

streichler commented 2 years ago

@lightsighter where is the pointer to the message manager stored? 0x1d012e0000000002 looks like a Realm processor ID for a utility processor (if the command line I have from @syamajala applies here) on rank 302. Maybe something before it in the containing data structure has an overrun for large node counts?
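
As a quick sanity check of that reading, the middle bits of the bogus pointer do decode to 302. This is only a sketch: the assumed bit position of the node field is a guess, not the authoritative Realm ID encoding.

// Quick sanity check of the rank-302 reading. The assumption that the node
// index sits in bits 40-55 of the ID is hypothetical, not Realm's documented
// encoding; it just shows the value is consistent with a processor ID on rank 302.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t bogus_this = 0x1d012e0000000002ULL;   // the corrupted "this" pointer
  const uint64_t node = (bogus_this >> 40) & 0xffff;   // hypothesized node field
  std::printf("node = %llu\n", (unsigned long long)node); // prints 302 (0x12e)
  return 0;
}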

lightsighter commented 2 years ago

@syamajala and @magnatelee: what are you setting this to?

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_config.h#L170

magnatelee commented 2 years ago

@lightsighter aha, I was using the default (1024), so the crash was expected. I can increase the value and give it another try.

magnatelee commented 2 years ago

That said, I thought we had a safeguard for this case, didn't we? I guess we need to figure out why I didn't get a more informative error message.

lightsighter commented 2 years ago

It appears we don't have a check for this value, which I will add.
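
For illustration, the kind of check being discussed might look something like the sketch below. This is not the code that was actually added; the macro name, default, and call site are assumptions.

// Illustrative sketch of a startup check for the compile-time node limit.
// Not the actual fix; LEGION_MAX_NUM_NODES and where this would be called
// from are assumptions for the sake of the example.
#include <cstdio>
#include <cstdlib>

#ifndef LEGION_MAX_NUM_NODES
#define LEGION_MAX_NUM_NODES 1024 // assumed default from legion_config.h
#endif

static void check_total_ranks(unsigned total_ranks) {
  if (total_ranks > LEGION_MAX_NUM_NODES) {
    std::fprintf(stderr,
        "ERROR: launched on %u ranks but the runtime was compiled with "
        "LEGION_MAX_NUM_NODES=%u; recompile with a larger (power-of-two) value\n",
        total_ranks, (unsigned)LEGION_MAX_NUM_NODES);
    std::exit(1);
  }
}

int main() {
  check_total_ranks(2048); // e.g. the 2048-rank Selene job from this thread would fail fast here
  return 0;
}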

lightsighter commented 2 years ago

It looks like we did have a check because there was an error code for it, but it got deleted at some point during a refactoring.

pmccormick commented 2 years ago

I added some "goop" to the CMake config to expose these as parameters and also verify that they are a power of two. I'll provide a PR once I kick the tires a bit more.

syamajala commented 2 years ago

That fixes the seg fault for me. I'm still seeing unrelated GASNet issues at 384 nodes on Perlmutter though. I'll ping the GASNet guys.

lightsighter commented 2 years ago

Ok, I'm closing this issue and #1139 now that I've added checks back into the runtime for this.