StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: Seg fault in receive_message when using TRACE_ALLOCATION #1139

syamajala closed 2 years ago

syamajala commented 3 years ago

I am running S3D on Summit with -DTRACE_ALLOCATION at 256 nodes and seeing the following crash:

Thread 8 (Thread 0x2000dfa1f890 (LWP 594571)):
#0  0x000020000093a114 in nanosleep () from /lib64/power9/libc.so.6
#1  0x0000200000939f44 in sleep () from /lib64/power9/libc.so.6
#2  0x00002000033d5328 in Realm::realm_freeze (signal=<optimized out>) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/runtime_impl.cc:177
#3  <signal handler called>
#4  0x000020000280c0b0 in Legion::Internal::MessageManager::receive_message (this=0x3ff000001400742a, args=0x2016054a3fc0, arglen=48) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/runtime.cc:13133
#5  0x000020000280c12c in Legion::Internal::Runtime::process_message_task (this=<optimized out>, args=0x2016054a3fbc, arglen=<optimized out>) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/runtime.cc:25629
#6  0x000020000280c4c0 in Legion::Internal::Runtime::legion_runtime_task (args=0x2016054a3fb0, arglen=56, userdata=<optimized out>, userlen=<optimized out>, p=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/runtime.cc:31058
#7  0x00002000033b5fc0 in Realm::LocalTaskProcessor::execute_task (this=0x51ecbe20, func_id=<optimized out>, task_args=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/bytearray.inl:150
#8  0x000020000340b1c0 in Realm::Task::execute_on_processor (this=0x51f8e7e0, p=...) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/runtime_impl.h:378
#9  0x000020000340b384 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/tasks.cc:1646
#10 0x000020000340e04c in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4ec13390) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/tasks.cc:1125
#11 0x0000200003417090 in Realm::UserThread::uthread_entry () at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/realm/threads.cc:1337
#12 0x00002000008a7ffc in makecontext () from /lib64/power9/libc.so.6
#13 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Using commit 42b768e10fb6afbb0842 of control_replication.

lightsighter commented 3 years ago

How repeatable is this? The way you get a segfault there is by literally corrupting the first few bytes of the client payload of an inter-node message.

syamajala commented 3 years ago

It was pretty repeatable. I ran 3 times and hit the same issue each time.

syamajala commented 3 years ago

Also, it only appears at 256 nodes when I turn the trace allocation logging on (-level allocation=2). If I just build with -DTRACE_ALLOCATION but don't turn the logging on, it seems to work at 256 nodes. All other node counts from 4 to 128 nodes worked as expected.

lightsighter commented 3 years ago

@streichler Do you have an option to turn on CRC checksums on messages in Realm? Looks like some data is getting corrupted in a message (literally the first four bytes).

streichler commented 3 years ago

The gasnetex network layer does message checksums by default, so try building/running with that.
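
For readers unfamiliar with message checksums, here is a minimal sketch of the general idea: the sender prepends a CRC-32 of the payload, and the receiver recomputes it before trusting the bytes. This is not Realm's or GASNet-EX's actual implementation; the `[4-byte CRC][payload]` framing and the names `crc32`, `frame_message`, and `verify_message` are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320
// (the same variant used by zlib and Ethernet).
inline std::uint32_t crc32(const std::uint8_t* data, std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // If the low bit is set, shift and XOR in the polynomial; else just shift.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Hypothetical wire format (an assumption): [4-byte CRC][payload bytes].
inline std::vector<std::uint8_t> frame_message(
    const std::vector<std::uint8_t>& payload) {
  const std::uint32_t crc = crc32(payload.data(), payload.size());
  std::vector<std::uint8_t> framed(sizeof(crc) + payload.size());
  std::memcpy(framed.data(), &crc, sizeof(crc));
  if (!payload.empty()) {
    std::memcpy(framed.data() + sizeof(crc), payload.data(), payload.size());
  }
  return framed;
}

// Receiver side: recompute the CRC over the payload and compare it with
// the stored value. Returns false if the message is truncated or corrupted.
inline bool verify_message(const std::vector<std::uint8_t>& framed) {
  std::uint32_t stored = 0;
  if (framed.size() < sizeof(stored)) return false;
  std::memcpy(&stored, framed.data(), sizeof(stored));
  return stored == crc32(framed.data() + sizeof(stored),
                         framed.size() - sizeof(stored));
}
```

With a check like this on the receive path, a payload whose first bytes were overwritten in transit would be rejected at verification time rather than being dereferenced later, which is what would make a corrupted message easier to tell apart from a bug elsewhere.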

lightsighter commented 3 years ago

@syamajala Which version of GASNet are you using?

syamajala commented 3 years ago

Should be gasnetex-2021.3.0.

lightsighter commented 3 years ago

Can you try this again with all the GASNet-EX fixes that @streichler pushed?

syamajala commented 3 years ago

It is still seg faulting in receive_message.

lightsighter commented 2 years ago

Duplicate of #1159 and fixed.