StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

[HTR] Segmentation faults at 16 nodes #1420

Closed: cmelone closed this issue 7 months ago

cmelone commented 1 year ago

I am able to run one of our applications (16 ranks, 1 rank per node) on Legion commit cba415a857c2586b2ad2f4848d6d1cd75de7df00.

However, on 9c6c90b9e3857196da2659a29140f2d7686832bb, I get segmentation faults and non-deterministic errors such as:

prometeo_ConstPropMix.exec: prometeo_variables.cc:75: static void UpdatePropertiesFromPrimitiveTask::cpu_base_impl(const UpdatePropertiesFromPrimitiveTask::Args&, const std::vector<Legion::PhysicalRegion>&, const std::vector<Legion::Future>&, Legion::Context, Legion::Runtime*): Assertion `args.mix.CheckMixture(acc_MolarFracs[p])' failed.
[5 - 7fbc93ba8840] 1193.644387 {6}{realm}: invalid event handle: id=7fbcab057570
prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/multi/codes/legion-cpu-release/runtime/realm/runtime_impl.cc:2509: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.

This program does run successfully with DEBUG=1. I am actively running this test case with smaller configurations to see whether I can reproduce the failure outside of this specific config.

Edit:

16 ranks, 4 ranks per node works

lightsighter commented 1 year ago

I would need to rerun to give this info. Is it fundamental?

Yes, I need the name of at least one point in the index space that has bad data in order to be able to shrink down the analysis.

the field id of avg_rho

Do any other regions share the same field space?

If you're going to re-run to get the bad point then also get the UniqueID of the task that sees bad data.
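
For illustration only, locating one bad point could be sketched as below. This is a standalone sketch, not Legion code: `Point3`, `first_bad_point`, the row-major layout, and the rule that "bad" means non-finite or non-positive are all assumptions made for the example. In a real task the loop would run over the task's own index space through a field accessor.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 3D point type; a Legion task would use points of its index space.
using Point3 = std::array<std::size_t, 3>;

// Scans a row-major nx*ny*nz array (a stand-in for the avg_rho field) and
// writes the first point whose value is non-finite or non-positive to `out`.
// Returns true if such a point was found, false if all values look valid.
// The validity rule is an assumption for this sketch.
bool first_bad_point(const std::vector<double> &data,
                     std::size_t nx, std::size_t ny, std::size_t nz,
                     Point3 &out) {
  for (std::size_t i = 0; i < nx; ++i) {
    for (std::size_t j = 0; j < ny; ++j) {
      for (std::size_t k = 0; k < nz; ++k) {
        const double v = data[(i * ny + j) * nz + k];
        if (!std::isfinite(v) || v <= 0.0) {
          out = Point3{{i, j, k}};
          return true;
        }
      }
    }
  }
  return false;
}
```

Reporting the coordinates of the first hit, together with the UniqueID of the task that observed it, is enough to name one point for the analysis.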

mariodirenzo commented 1 year ago

Yes, I need the name of at least one point in the index space that has bad data in order to be able to shrink down the analysis.

Ok

Do any other regions share the same field space?

No, it is used only for that region.

If you're going to re-run to get the bad point then also get the UniqueID of the task that sees bad data.

Can I print it from the task body?

lightsighter commented 1 year ago

Can I print it from the task body?

Yes, on the Task*: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L4005

muraj commented 1 year ago

I believe the issue is related to a recent change in master to improve event creation scalability. The problem arises when an application creates ProcessorGroups in parallel at runtime: the global data structure added to manage the per-processor/ProcessorGroup freelist vector was not protected by a lock, so concurrent appends raced and corrupted it. I've added a lock in this PR, based off master, if you'd like to give it a try: https://gitlab.com/StanfordLegion/legion/-/merge_requests/802
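
The general shape of that kind of fix can be sketched as follows. This is a minimal illustration under stated assumptions, not Realm's actual code: `FreeList`, `FreeListRegistry`, and `register_in_parallel` are hypothetical names, and the real data structure and locking primitive in the merge request may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-processor (or per-ProcessorGroup) freelist.
struct FreeList {
  int owner_id;
};

// Hypothetical global registry of freelists; not Realm's real data structure.
class FreeListRegistry {
public:
  // Appends a new freelist and returns its index. The lock serializes
  // concurrent appends; without it, two threads calling push_back at the
  // same time would race on the vector and could corrupt it.
  std::size_t add(int owner_id) {
    std::lock_guard<std::mutex> guard(mutex_);
    lists_.push_back(FreeList{owner_id});
    return lists_.size() - 1;
  }

  // Returns the number of registered freelists, read under the same lock.
  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return lists_.size();
  }

private:
  mutable std::mutex mutex_;
  std::vector<FreeList> lists_;
};

// Simulates several threads creating processor groups in parallel and
// returns how many freelists ended up registered.
std::size_t register_in_parallel(int num_threads, int per_thread) {
  FreeListRegistry registry;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&registry, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        registry.add(t * per_thread + i);
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
  return registry.size();
}
```

With the lock in place, every append is observed; removing the `lock_guard` in `add` makes the same program a data race with undefined behavior, which is consistent with failures that appear only in some configurations.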

mariodirenzo commented 7 months ago

This issue has been fixed by https://gitlab.com/StanfordLegion/legion/-/merge_requests/1158. @cmelone, can you please close the issue?