Closed syamajala closed 2 years ago
@syamajala Are we sure this isn't a duplicate of #915 ?
S3D has used a fence for a while now in between creating a partition and when it first gets used by a projection functor. That has not changed recently in the application code.
This issue only seems to appear on Summit and on the new commit, going back to an older Legion commit (ffca5154) everything works. I was able to run on Blaze with the newer commit without any problems. There have been no application changes in between updating the versions of Legion.
This issue only seems to appear on Summit and on the new commit, going back to an older Legion commit (https://github.com/StanfordLegion/legion/commit/ffca51544500f3149169d4f0b59651b063f23e77)
Ok, can you bisect exactly which commit causes the failure?
I bisected on Summit and the first bad commit is https://github.com/StanfordLegion/legion/commit/be62515f19143779a53c4c1a6d679515c2196a14
@elliottslaughter I'm not sure it's safe to go full context-free right now with control replication.
What do you want to do?
This was a recent C API change, but the context-free C++ API had been around for a while. There could be C++ clients relying on it.
I could put in a context-ful C API, just to work around this in Regent. It will have some knock-on effects on Regent's analysis (which is why I wanted it context-free in the first place), but we could probably live with that in the short term. But are we going to get back to the context-free API eventually, or are expecting this to continue to be a problem?
It depends on what the group decides we're going to do about #915 which we've discussed in the last two Legion meetings. It's looking more and more like the context-free API is going to be deprecated in a lot of places (although not all). I don't see a way to support context-free in all the places it might be invoked and still be sound. I'm probably going to be willing to keep the context-free API intact, but it will become a fatal error if the runtime can't find an implicit context.
What about inside projection functors? I think that's the reason why I originally pushed the context-free version of this API. Also, it's not obvious to me what it means to issue a fence (or if this is even desirable) inside of a projection functor.
@syamajala can you try this workaround?
diff --git a/bindings/regent/regent_partitions.cc b/bindings/regent/regent_partitions.cc
index cf800b737..2b50fb613 100644
--- a/bindings/regent/regent_partitions.cc
+++ b/bindings/regent/regent_partitions.cc
@@ -641,11 +641,13 @@ legion_terra_index_cross_product_get_subpartition_by_color_domain_point(
legion_domain_point_t color_)
{
Runtime *runtime = CObjectWrapper::unwrap(runtime_);
+ assert(Runtime::has_context());
+ Context ctx = Runtime::get_context();
IndexPartition partition = CObjectWrapper::unwrap(prod.partition);
DomainPoint color = CObjectWrapper::unwrap(color_);
- IndexSpace is = runtime->get_index_subspace(partition, color);
- IndexPartition ip = runtime->get_index_partition(is, prod.other_color);
+ IndexSpace is = runtime->get_index_subspace(ctx, partition, color);
+ IndexPartition ip = runtime->get_index_partition(ctx, is, prod.other_color);
return CObjectWrapper::wrap(ip);
}
diff --git a/runtime/legion/legion_c.cc b/runtime/legion/legion_c.cc
index 3293a6052..b51fbe556 100644
--- a/runtime/legion/legion_c.cc
+++ b/runtime/legion/legion_c.cc
@@ -2275,10 +2275,12 @@ legion_logical_partition_create(legion_runtime_t runtime_,
legion_index_partition_t handle_)
{
Runtime *runtime = CObjectWrapper::unwrap(runtime_);
+ assert(Runtime::has_context());
+ Context ctx = Runtime::get_context();
LogicalRegion parent = CObjectWrapper::unwrap(parent_);
IndexPartition handle = CObjectWrapper::unwrap(handle_);
- LogicalPartition r = runtime->get_logical_partition(parent, handle);
+ LogicalPartition r = runtime->get_logical_partition(ctx, parent, handle);
return CObjectWrapper::wrap(r);
}
This patch works.
So I guess the question for @lightsighter is whether this is a solution we want to go with, at least long enough to be worth committing into the repo?
What about inside projection functors? I think that's the reason why I originally pushed the context-free version of this API. Also, it's not obvious to me what it means to issue a fence (or if this is even desirable) inside of a projection functor.
All the same issues I brought up in the Legion meeting two weeks ago. :)
So I guess the question for @lightsighter is whether this is a solution we want to go with, at least long enough to be worth committing into the repo?
Seems like it's probably worth a more detailed conversation in the Legion meeting for this week now that we have some more context and an actual example.
I looked at this code again, and I can't explain why it changes any functionality in the current implementation. I'm not convinced that it actually fixes the original problem. It might just be perturbing the timing.
I dont think this issue is relevant anymore.
Updating to the latest control_replication (af833c94b086) on Summit, I'm seeing this:
[2 - 20004c33f890] 9.102109 {5}{runtime}: [error 482] LEGION ERROR: Unable to find entry for index partition 18. This is definitely a runtime bug. (from file /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_summit/legion/runtime/legion/region_tree.cc:4661)
Here is a stack trace: