cmelone closed 7 months ago
I would need to rerun to give this info. Is it fundamental?
Yes, I need the name of at least one point in the index space that has bad data in order to be able to shrink down the analysis.
the field id of avg_rho
Do any other regions share the same field space?
If you're going to re-run to get the bad point then also get the UniqueID of the task that sees bad data.
Yes, I need the name of at least one point in the index space that has bad data in order to be able to shrink down the analysis.
Ok
Do any other regions share the same field space?
No, it is only utilized for that region
If you're going to re-run to get the bad point then also get the UniqueID of the task that sees bad data.
Can I print it from the task body?
Yes, on the Task*: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L4005
I believe the issue is related to a recent change in master to improve event creation scalability. The problem arises when an application creates ProcessorGroups in parallel at runtime: the global data structure added for managing the per-processor/processor-group freelist vector was not protected by a lock, so concurrent appends would race and corrupt it. I've added a lock in this PR, based off master, if you'd like to give it a try: https://gitlab.com/StanfordLegion/legion/-/merge_requests/802
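For anyone following along, here is a minimal sketch of the kind of fix described above: a shared vector of freelists whose appends are serialized by a mutex. This is not the actual Realm/Legion code. The names FreeList, FreeListRegistry, and register_in_parallel are hypothetical and only illustrate the pattern, assuming C++14 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-processor (or per-processor-group) freelist.
struct FreeList {
  explicit FreeList(std::size_t id) : owner_id(id) {}
  std::size_t owner_id;
};

// Hypothetical global registry; illustrative only, not Legion or Realm API.
class FreeListRegistry {
public:
  // Appends a new freelist and returns its index.
  // The lock serializes push_back, which may reallocate the vector's storage;
  // without it, concurrent appends are a data race.
  std::size_t add(std::size_t owner_id) {
    std::lock_guard<std::mutex> guard(mutex_);
    lists_.push_back(std::make_unique<FreeList>(owner_id));
    return lists_.size() - 1;
  }

  // Number of registered freelists, read under the same lock.
  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return lists_.size();
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::unique_ptr<FreeList>> lists_;
};

// Simulates several threads registering freelists in parallel, as would happen
// when processor groups are created concurrently. Returns the final count.
std::size_t register_in_parallel(std::size_t num_threads, std::size_t per_thread) {
  FreeListRegistry registry;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&registry, t, per_thread] {
      for (std::size_t i = 0; i < per_thread; ++i)
        registry.add(t * per_thread + i);
    });
  }
  for (auto& w : workers) w.join();
  return registry.size();
}
```

With the lock in place, a call such as register_in_parallel(8, 1000) should always end with every append accounted for. Dropping the lock_guard in add() makes the same call undefined behavior, which would be consistent with the non-deterministic crashes reported in this thread.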
This issue has been fixed by https://gitlab.com/StanfordLegion/legion/-/merge_requests/1158. @cmelone, can you please close the issue?
I am able to run one of our applications (16 ranks, 1 rank per node) on Legion commit cba415a857c2586b2ad2f4848d6d1cd75de7df00. However, on 9c6c90b9e3857196da2659a29140f2d7686832bb, I get segmentation faults and non-deterministic errors such as:
This program does run successfully with DEBUG=1. I am actively running this test case with smaller configurations to see if I can reproduce outside of this specific config.
Edit: 16 ranks, 4 ranks per node works