Open elliottslaughter opened 7 months ago
I moved this issue here from https://github.com/StanfordLegion/legion/issues/1449#issuecomment-2016915789 because this appears to be a different underlying root cause.
Summary:
I haven't seen this one before so maybe @bonachea @PHHargrove can comment?
This assertion failure is one of the three known manifestations of "the FI_MULTI_RECV bug".
Specifically, the provider has delivered a message buffer to us that is all zeros, which happens to result in a detectable inconsistency in our header.
Please let me know immediately if this has occurred when running with GASNET_OFI_RECEIVE_BUFF_SIZE=recv, since that would not use FI_MULTI_RECV and would therefore be a new/different issue.
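For anyone who wants to rerun with that setting, here is a minimal sketch of launching a job with the variable set for the child process only. The helper name `run_with_recv_buffers` and the stand-in probe command are assumptions made for illustration; the only thing taken from this thread is the setting GASNET_OFI_RECEIVE_BUFF_SIZE=recv. Replace the probe with your actual launch command.

```python
import os
import subprocess
import sys


def run_with_recv_buffers(cmd):
    """Run cmd with GASNET_OFI_RECEIVE_BUFF_SIZE=recv in the child's environment.

    Hypothetical helper: it copies the current environment, so the parent
    shell and process are left unchanged.
    """
    env = dict(os.environ)
    env["GASNET_OFI_RECEIVE_BUFF_SIZE"] = "recv"
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child that only echoes the variable, to confirm it is
    # propagated; substitute the real job launch command here.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('GASNET_OFI_RECEIVE_BUFF_SIZE'))",
    ]
    result = run_with_recv_buffers(probe)
    print(result.stdout.strip())
```

Recording the value of the variable alongside each failing log should make it easier to tell later whether a given failure could have involved FI_MULTI_RECV.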
Regarding "Does NOT reproduce with debug GASNet + debug Legion":
That could be either a timing or "chance" issue. In extensive work with the MetaHipMer team, this failure mode (one of three believed to be related to multi-recv buffer handling in the provider) probably accounted for at most 1% of their bad runs.
Please tell me if the failing run took place before or after the Perlmutter maintenance of March 20.
If it was after, then this is evidence that the issue is still present in SlingShot 2.1.2.
As far as I'm aware, this run did not use GASNET_OFI_RECEIVE_BUFF_SIZE=recv. And it was after the March 20 maintenance.
Ok, I got an assertion failure with a debug GASNet and release Legion:
Full log here.
Originally posted by @rupanshusoi in https://github.com/StanfordLegion/legion/issues/1449#issuecomment-2016915789