Closed: syamajala closed this issue 2 years ago.
How repeatable is this? The way you get a segfault there is by literally corrupting the first few bytes of the client payload of an inter-node message.
It was pretty repeatable. I ran 3 times and hit the same issue each time.
Also, it only appears at 256 nodes when I turn the trace allocation logging on (-level allocation=2). If I just build with -DTRACE_ALLOCATION but don't turn the logging on, it seems to work at 256 nodes. All other node counts from 4 to 128 nodes worked as expected.
@streichler Do you have an option to turn on CRC checksums on messages in Realm? Looks like some data is getting corrupted in a message (literally the first four bytes).
The gasnetex network layer does message checksums by default, so try building/running with that.
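For anyone following along, here is a rough sketch of how a per-message checksum would catch the kind of corruption described above (the first four bytes of the payload getting clobbered). This is not Realm's or GASNet-EX's actual code or wire format. The `frame`, `unframe`, and `corrupt_first_bytes` helpers and the 4-byte CRC header layout are hypothetical, chosen only for illustration. The only real library call is Python's `zlib.crc32`.

```python
import struct
import zlib

HEADER_SIZE = 4  # hypothetical layout: 4-byte big-endian CRC-32, then the payload


def frame(payload: bytes) -> bytes:
    """Prepend a CRC-32 of the payload so the receiver can verify it."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(message) < HEADER_SIZE:
        raise ValueError("message too short to contain a checksum")
    (expected,) = struct.unpack(">I", message[:HEADER_SIZE])
    payload = message[HEADER_SIZE:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    return payload


def corrupt_first_bytes(message: bytes, count: int = 4) -> bytes:
    """Simulate the bug: flip every bit in the first `count` payload bytes."""
    damaged = bytearray(message)
    for i in range(HEADER_SIZE, min(HEADER_SIZE + count, len(damaged))):
        damaged[i] ^= 0xFF
    return bytes(damaged)
```

With a check like this, a receiver would reject the damaged message up front instead of dereferencing garbage and segfaulting later. For example, `unframe(corrupt_first_bytes(frame(b"data")))` raises `ValueError`, while `unframe(frame(b"data"))` returns the payload unchanged.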
@syamajala Which version of GASNet are you using?
Should be gasnetex-2021.3.0.
Can you try this again with all the GASNet-EX fixes that @streichler pushed?
It is still segfaulting in receive_message.
Duplicate of #1159 and fixed.
I am running S3D on Summit with -DTRACE_ALLOCATION at 256 nodes and seeing the following crash:
Using commit 42b768e10fb6afbb0842 of control_replication.