StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

GASNet debug assert failure in Stencil on Perlmutter #1660

Open elliottslaughter opened 3 months ago

elliottslaughter commented 3 months ago

Ok, I got an assertion failure with a debug GASNet and release Legion:

*** FATAL ERROR: Assertion failure (proc 103): in gasnetc_ofi_handle_am() at anguage/gasnet/GASNet-2023.9.0/ofi-conduit/gasnet_ofi.c:1725: isreq == header->isreq
   op1 :           1 (0x00000001) == isreq
   op2 :           0 (0x00000000) == header->isreq

Full log here.

Originally posted by @rupanshusoi in https://github.com/StanfordLegion/legion/issues/1449#issuecomment-2016915789

elliottslaughter commented 3 months ago

I moved this issue here from https://github.com/StanfordLegion/legion/issues/1449#issuecomment-2016915789 because this appears to be a different underlying root cause.

Summary:

I haven't seen this one before so maybe @bonachea @PHHargrove can comment?

PHHargrove commented 3 months ago

This assertion failure is one of the three known manifestations of "the FI_MULTI_RECV bug".
Specifically, the provider has delivered a message buffer to us that is all zeros, which happens to result in a detectable inconsistency in our header.
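
To illustrate why an all-zero buffer trips this particular check, here is a toy sketch. The two-byte header layout and the names `ToyHeader`, `parse_header`, and `header_is_consistent` are hypothetical and are not GASNet's actual data structures; only the comparison `isreq == header->isreq` comes from the assertion message above.

```python
from dataclasses import dataclass


@dataclass
class ToyHeader:
    # Hypothetical layout for illustration only, not GASNet's real header.
    isreq: int    # 1 for a request, 0 for a reply
    handler: int  # handler index


def parse_header(buf: bytes) -> ToyHeader:
    """Decode the toy header from the first two bytes of a receive buffer."""
    return ToyHeader(isreq=buf[0] & 1, handler=buf[1])


def header_is_consistent(expected_isreq: int, buf: bytes) -> bool:
    """Mirror the shape of the failing check: isreq == header->isreq."""
    return parse_header(buf).isreq == expected_isreq


good_request = bytes([1, 7])  # well-formed toy request
zeroed = bytes(2)             # buffer delivered as all zeros

print(header_is_consistent(1, good_request))  # True
print(header_is_consistent(1, zeroed))        # False: op1 is 1, op2 is 0, as in the log
print(header_is_consistent(0, zeroed))        # True: this check alone cannot flag it
```

In this toy model a zeroed buffer is only caught when the receiver expects a request, which is one way to read "happens to result in a detectable inconsistency".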

Please let me know immediately if this has occurred when running with GASNET_OFI_RECEIVE_BUFF_SIZE=recv, since that would not use FI_MULTI_RECV and therefore be a new/different issue.
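
For anyone rerunning with that setting, a minimal sketch of a wrapper that sets the variable for a child process follows. The variable name and value are taken from the comment above; the helper name `env_with_recv_buffers`, the launcher, and the binary path in the usage comment are assumptions to adapt to your own job script.

```python
import os


def env_with_recv_buffers(base_env=None):
    """Return a copy of the environment with the setting named above applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["GASNET_OFI_RECEIVE_BUFF_SIZE"] = "recv"
    return env


# Small self-check using an explicit base environment.
env = env_with_recv_buffers({"PATH": "/usr/bin"})
print(env["GASNET_OFI_RECEIVE_BUFF_SIZE"])

# Hypothetical launch (launcher and binary are placeholders):
#   import subprocess
#   subprocess.run(["srun", "./stencil"], env=env_with_recv_buffers(), check=True)
```

Exporting the variable in the batch script before the launch command should be equivalent, assuming the launcher forwards the environment to all ranks.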

Regarding "Does NOT reproduce with debug GASNet + debug Legion":
That could be either a timing or "chance" issue. In extensive work with the MetaHipMer team, this failure mode (one of three believed to be related to multi-recv buffer handling in the provider) probably accounted for at most 1% of their bad runs.

Please tell me if the failing run took place before or after the Perlmutter maintenance of March 20.
If it was after, then this is evidence that the issue is still present in SlingShot 2.1.2.

rupanshusoi commented 3 months ago

As far as I'm aware, this run did not use GASNET_OFI_RECEIVE_BUFF_SIZE=recv. And it was after the March 20 maintenance.