StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
679 stars 145 forks source link

Legion: regression in padded constraint conflicts #1537

Closed mariodirenzo closed 1 year ago

mariodirenzo commented 1 year ago

One of the HTR unit tests has shown that 1626f5638181c210bc1b74c96838b05e5339ba6d has introduced a bug in the detection of the conflicts between padded constraints. In particular, if a constraint with 0 delta is tested against a constraint with positive delta, the conflict is not detected.

The following patch fixes the issue

diff --git a/runtime/legion/legion_constraint.cc b/runtime/legion/legion_constraint.cc
index d194b5263..7c41fe060 100644
--- a/runtime/legion/legion_constraint.cc
+++ b/runtime/legion/legion_constraint.cc
@@ -1765,32 +1765,15 @@ namespace Legion {
           return true;
         for (int idx = 0; idx < delta.get_dim(); idx++)
         {
-          if (delta.lo()[idx] < 0)
-          {
-            if (other.delta.lo()[idx] > 0)
-              return true;
-          }
-          else if (delta.lo()[idx] > 0)
-          {
-            if (other.delta.lo()[idx] < 0)
-              return true;
-            else if ((other.delta.lo()[idx] > 0) &&
-                (delta.lo()[idx] < other.delta.lo()[idx]))
-              return true;
-          }
-          if (delta.hi()[idx] < 0)
-          {
-            if (other.delta.hi()[idx] > 0)
-              return true;
-          }
-          else if (delta.hi()[idx] > 0)
-          {
-            if (other.delta.hi()[idx] < 0)
-              return true;
-            else if ((other.delta.hi()[idx] > 0) &&
-                (delta.hi()[idx] < other.delta.hi()[idx]))
-              return true;
-          }
+          if ((delta.lo()[idx]*other.delta.lo()[idx]) < 0)
+            return true;
+          else if (delta.lo()[idx] < other.delta.lo()[idx])
+            return true;
+
+          if ((delta.hi()[idx]*other.delta.hi()[idx]) < 0)
+            return true;
+          else if (delta.hi()[idx] < other.delta.hi()[idx])
+            return true;
         }
       }
       return false;

@elliottslaughter, could you please add this issue to #1032 with high priority?

lightsighter commented 1 year ago

That's not a bug. A delta of 0 means "don't care". In order to say you don't want any padding then you need to give a negative delta. I even wrote that down here: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_constraint.h?ref_type=heads#L704-707 Also note that that commit didn't change those semantics, just better abided by them.

mariodirenzo commented 1 year ago

I've evidence that your version of the function allows the runtime to map region requirements with padding constraints on instances without padding. Note that such mappings were not allowed before 1626f5638181c210bc1b74c96838b05e5339ba6d and should not be allowed

lightsighter commented 1 year ago

map region requirements with padding constraints

Non-zero padding constraints? Also, you're sure the safe mapper checks are on? They are off by default in release mode. Without them the runtime will let you do all sorts of illegal things.

It's worth noting that the old code was overly strict compared to the documented semantics. If you were relying on 0 bounds before to mean that there was no padding it would have worked but the implementation was inconsistent with the documentation. Your proposed fix is also overly strict and doesn't handle the cases with zero or negative values correctly.

lightsighter commented 1 year ago

If you want we can try switching the semantics so that 0 means absolutely no padding and any negative integer means "don't care".

mariodirenzo commented 1 year ago

The semantics 0 == "I do not care" can be applicable only when you are talking about region requirements. When you are dealing with the LayoutConstraints of an InstanceManager, 0 means that the underlying instance does not have any padding space. For instance, you are checking if the LayoutConstraints of a given InstanceManager are compatible with the constraints of the selected task variant at https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/mapper_manager.cc?ref_type=heads#L1848 . The current version of the code will not recognize a conflict between an instance that has zero padding and a region requirement with a positive padding constraint. A similar check is in place in many functions of the runtime.

Note 1: your new function would work as expected if you start checking if the layout constraints of the selected task variant conflict with the layout constraints of the selected instance

Note 2: the semantics 0 == "I do not care" is applied by my patch and by your old implementation as well. In fact, the constraints of the task variants are contained in the other.delta in the function. If other.delta is zero along one direction, you will never enter in any of the if statements

+          if ((delta.lo()[idx]*other.delta.lo()[idx]) < 0)
+            return true;
+          else if (delta.lo()[idx] < other.delta.lo()[idx])
+            return true;
+
+          if ((delta.hi()[idx]*other.delta.hi()[idx]) < 0)
+            return true;
+          else if (delta.hi()[idx] < other.delta.hi()[idx])
+            return true;

regardless of the value of delta, which rapresentes of the size of the padding space of the instance.

lightsighter commented 1 year ago

The semantics 0 == "I do not care" can be applicable only when you are talking about region requirements.

You missed this part: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_instances.cc?ref_type=heads#L3909-3918 There are no instances with 0 constraints.

mariodirenzo commented 1 year ago

Do you want a reproducer on sapling?

mariodirenzo commented 1 year ago

BTW, your comment is incorrect. If an instance is created with padding constraints only in one direction, it does not go through that path. For instance, using this padding constraint

const PaddingConstraint padding_1D_X(Point<3>{3, 0, 0}, Point<3>{3, 0, 0});

the runtime will most likely try to create an instance with 3 points of padding on the x sides and 0 points on the other. If afterwards you launch a task with

const PaddingConstraint padding_1D_Y(Point<3>{0, 3, 0}, Point<3>{0, 3, 0});

the runtime will not recognize that the instance created for the previous task is not compatible with this layout constraint.

mariodirenzo commented 1 year ago

If you want to keep your version of the conflict check, you will need to apply the following patch

diff --git a/runtime/legion/legion_instances.cc b/runtime/legion/legion_instances.cc
index 1f6e0e784..eba4ad0a1 100644
--- a/runtime/legion/legion_instances.cc
+++ b/runtime/legion/legion_instances.cc
@@ -3924,6 +3924,14 @@ namespace Legion {
         for (unsigned dim = 0; dim < num_dims; dim++)
           empty[dim] = -1; // no padding
         constraints.padding_constraint.delta = Domain(empty, empty);
+      } else {
+         DomainPoint lo = constraints.padding_constraint.delta.lo();
+         DomainPoint hi = constraints.padding_constraint.delta.hi();
+         for (unsigned dim = 0; dim < num_dims; dim++) {
+            if (lo[dim] == 0) lo[dim] = -1;
+            if (hi[dim] == 0) hi[dim] = -1;
+         }
+         constraints.padding_constraint.delta = Domain(lo, hi);
       }
       // Now let's find the layout constraints to use for this instance
       LayoutDescription *layout = field_space_node->find_layout_description(

Note that also the following fix is required for instances created with delta equal to -1 in some directions

diff --git a/runtime/legion/runtime.cc b/runtime/legion/runtime.cc
index 4bb49bc11..99ec6cb61 100644
--- a/runtime/legion/runtime.cc
+++ b/runtime/legion/runtime.cc
@@ -5323,8 +5323,15 @@ namespace Legion {
 #ifdef DEBUG_LEGION
               assert(domain.get_dim() == delta.get_dim());
 #endif
+              DomainPoint lo = delta.lo();
+              DomainPoint hi = delta.hi();
+              for (int dim = 0; dim < delta.get_dim(); dim++) {
+                lo[dim] = std::max(coord_t(0), lo[dim]);
+                hi[dim] = std::max(coord_t(0), hi[dim]);
+              }
+
               const Domain padded_bounds =
-                Domain(domain.lo() - delta.lo(), domain.hi() + delta.hi());
+                Domain(domain.lo() - lo, domain.hi() + hi);
               switch (domain.get_dim())
               {
 #define DIMFUNC(DIM) \
@@ -5436,7 +5443,13 @@ namespace Legion {
 #ifdef DEBUG_LEGION
             assert(bounds.get_dim() == delta.get_dim());
 #endif
-            outer = Domain(bounds.lo() - delta.lo(), bounds.hi() + delta.hi());
+            DomainPoint lo = delta.lo();
+            DomainPoint hi = delta.hi();
+            for (int dim = 0; dim < delta.get_dim(); dim++) {
+              lo[dim] = std::max(coord_t(0), lo[dim]);
+              hi[dim] = std::max(coord_t(0), hi[dim]);
+            }
+            outer = Domain(bounds.lo() - lo, bounds.hi() + hi);
           }
           else
             outer = bounds;
lightsighter commented 1 year ago

BTW, your comment is incorrect. If an instance is created with padding constraints only in one direction, it does not go through that path.

That's undefined behavior. You should never be trying to create an instance with "don't-care" padding. I'm just going to reverse the polarity of "don't care" and "no constraints" so that nobody else makes the same mistake that you did.

lightsighter commented 1 year ago

Pull and try again.

mariodirenzo commented 1 year ago

You still need to apply the patch above to runtime/legion/runtime.cc to make sure that the outer bounds of the instance are properly computed when delta is -1.

lightsighter commented 1 year ago

Look again, I already did that.

lightsighter commented 1 year ago

Although I still think that should be undefined behavior/an error.

mariodirenzo commented 1 year ago

the execution of the unit test still fails with

(gdb) bt
#0  0x00007f0411a8323f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x55c1159a4920, rem=rem@entry=0x55c1159a4920) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007f0411a88ec7 in __GI___nanosleep (requested_time=requested_time@entry=0x55c1159a4920, remaining=remaining@entry=0x55c1159a4920) at nanosleep.c:27
#2  0x00007f0411a88dfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f0415080041 in Realm::realm_freeze (signal=6) at /home/mariodr/legion3/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007f04119c8859 in __GI_abort () at abort.c:79
#7  0x00007f04119c8729 in __assert_fail_base (fmt=0x7f0411b5e588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f0415d08f87 "lo[i] >= 0", file=0x7f0415cfeaf0 "/home/mariodr/legion3/runtime/legion/region_tree.inl", line=505, function=<optimized out>)
    at assert.c:92
#8  0x00007f04119d9fd6 in __GI___assert_fail (assertion=0x7f0415d08f87 "lo[i] >= 0", file=0x7f0415cfeaf0 "/home/mariodr/legion3/runtime/legion/region_tree.inl", line=505,
    function=0x7f0415d0bca0 "Realm::InstanceLayoutGeneric* Legion::Internal::IndexSpaceExpression::create_layout_internal(const Realm::IndexSpace<N, T>&, const Legion::LayoutConstraintSet&, const std::vector<unsigned int>&, const"...) at assert.c:101
#9  0x00007f0414c6dd5d in Legion::Internal::IndexSpaceExpression::create_layout_internal<3, long long> (this=0x7f03f2869860, space=..., constraints=..., field_ids=std::vector of length 2, capacity 2 = {...}, field_sizes=std::vector of length 2, capacity 2 = {...},
    compact=false, piece_list=0x7f03f2f2d7f8, piece_list_size=0x7f03f2f2d800, num_pieces=0x7f03f2f2c2f0) at /home/mariodr/legion3/runtime/legion/region_tree.inl:505
#10 0x00007f0414c4acf2 in Legion::Internal::IndexSpaceNodeT<3, long long>::create_layout (this=0x7f03f2869530, constraints=..., field_ids=std::vector of length 2, capacity 2 = {...}, field_sizes=std::vector of length 2, capacity 2 = {...}, compact=false,
    piece_list=0x7f03f2f2d7f8, piece_list_size=0x7f03f2f2d800, num_pieces=0x7f03f2f2c2f0) at /home/mariodr/legion3/runtime/legion/region_tree.inl:4905
#11 0x00007f04145a00d6 in Legion::Internal::InstanceBuilder::create_physical_instance (this=0x7f03f2f2d5e0, forest=0x55c1123ca840, unsat_kind=0x0, unsat_index=0x0, footprint=0x7f03f2f2d408, precondition=...)
    at /home/mariodr/legion3/runtime/legion/legion_instances.cc:3814
#12 0x00007f04148d7c27 in Legion::Internal::MemoryManager::allocate_physical_instance (this=0x7f03f27109e0, builder=..., footprint=0x7f03f2f2ef90, unsat_kind=0x0, unsat_index=0x0) at /home/mariodr/legion3/runtime/legion/runtime.cc:10366
#13 0x00007f04148cfd5c in Legion::Internal::MemoryManager::find_or_create_physical_instance (this=0x7f03f27109e0, constraints=..., regions=std::vector of length 1, capacity 1 = {...}, result=..., created=@0x7f03f2f2eb29: false, processor=..., acquire=true, priority=0,
    tight_region_bounds=false, unsat_kind=0x0, unsat_index=0x0, footprint=0x7f03f2f2ef90, creator_id=398, remote=false) at /home/mariodr/legion3/runtime/legion/runtime.cc:8728
#14 0x00007f0414913653 in Legion::Internal::Runtime::find_or_create_physical_instance (this=0x55c11344d2e0, target_memory=..., constraints=..., regions=std::vector of length 1, capacity 1 = {...}, result=..., created=@0x7f03f2f2eb29: false, processor=..., acquire=true,
    priority=0, tight_bounds=false, unsat=0x0, footprint=0x7f03f2f2ef90, creator_id=398) at /home/mariodr/legion3/runtime/legion/runtime.cc:26056
#15 0x00007f0414a98446 in Legion::Internal::MapperManager::find_or_create_physical_instance (this=0x55c111d9e360, ctx=0x7f03f29ee4d0, target_memory=..., constraints=..., regions=std::vector of length 1, capacity 1 = {...}, result=..., created=@0x7f03f2f2eb29: false,
    acquire=true, priority=0, tight_region_bounds=false, footprint=0x7f03f2f2ef90, unsat=0x0) at /home/mariodr/legion3/runtime/legion/mapper_manager.cc:2043
#16 0x00007f041473c661 in Legion::Mapping::MapperRuntime::find_or_create_physical_instance (this=0x55c112382540, ctx=0x7f03f29ee4d0, target_memory=..., constraints=..., regions=std::vector of length 1, capacity 1 = {...}, result=..., created=@0x7f03f2f2eb29: false,
    acquire=true, priority=0, tight_bounds=false, footprint=0x7f03f2f2ef90, unsat=0x0) at /home/mariodr/legion3/runtime/legion/legion_mapping.cc:875
#17 0x00007f0414fe5940 in Legion::Mapping::DefaultMapper::default_make_instance (this=0x55c111d9df20, ctx=0x7f03f29ee4d0, target_memory=..., constraints=..., result=..., kind=Legion::Mapping::DefaultMapper::TASK_MAPPING, force_new=false, meets=true, req=...,
    footprint=0x7f03f2f2ef90) at /home/mariodr/legion3/runtime/mappers/default_mapper.cc:2291
#18 0x00007f0414fe3997 in Legion::Mapping::DefaultMapper::default_create_custom_instances (this=0x55c111d9df20, ctx=0x7f03f29ee4d0, target_proc=..., target_memory=..., req=..., index=1, needed_fields=std::set with 0 elements, layout_constraints=...,
    needs_field_constraint_check=false, instances=std::vector of length 1, capacity 1 = {...}, footprint=0x7f03f2f2ef90) at /home/mariodr/legion3/runtime/mappers/default_mapper.cc:1933
#19 0x00007f0414fe1a97 in Legion::Mapping::DefaultMapper::map_task (this=0x55c111d9df20, ctx=0x7f03f29ee4d0, task=..., input=..., output=...) at /home/mariodr/legion3/runtime/mappers/default_mapper.cc:1462
#20 0x00007f0414a91ad0 in Legion::Internal::MapperManager::invoke_map_task (this=0x55c111d9e360, task=0x7f03f2a222d0, input=0x7f03f2f2f110, output=0x7f03f2f2f160, info=0x7f03f29ee4d0) at /home/mariodr/legion3/runtime/legion/mapper_manager.cc:176
#21 0x00007f041430f593 in Legion::Internal::SingleTask::invoke_mapper (this=0x7f03f2a222d0, must_epoch_owner=0x0) at /home/mariodr/legion3/runtime/legion/legion_tasks.cc:3590
#22 0x00007f0414311beb in Legion::Internal::SingleTask::map_all_regions (this=0x7f03f2a222d0, must_epoch_op=0x0, defer_args=0x7f03f2a09560) at /home/mariodr/legion3/runtime/legion/legion_tasks.cc:3984
#23 0x00007f041431fdf9 in Legion::Internal::PointTask::perform_mapping (this=0x7f03f2a222d0, must_epoch_owner=0x0, args=0x7f03f2a09560) at /home/mariodr/legion3/runtime/legion/legion_tasks.cc:7181
#24 0x00007f041492c157 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f03f2a09560, arglen=52, userdata=0x55c11595eb40, userlen=8, p=...) at /home/mariodr/legion3/runtime/legion/runtime.cc:32257
#25 0x00007f041540036e in Realm::LocalTaskProcessor::execute_task (this=0x55c112381ae0, func_id=4, task_args=...) at /home/mariodr/legion3/runtime/realm/proc_impl.cc:1175
#26 0x00007f0415205d88 in Realm::Task::execute_on_processor (this=0x7f03f2a093e0, p=...) at /home/mariodr/legion3/runtime/realm/tasks.cc:326
#27 0x00007f041520b114 in Realm::UserThreadTaskScheduler::execute_task (this=0x55c112381ed0, task=0x7f03f2a093e0) at /home/mariodr/legion3/runtime/realm/tasks.cc:1687
#28 0x00007f0415208ebc in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55c112381ed0) at /home/mariodr/legion3/runtime/realm/tasks.cc:1160
#29 0x00007f0415211ef8 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x55c112381ed0) at /home/mariodr/legion3/runtime/realm/threads.inl:97
#30 0x00007f04151df542 in Realm::UserThread::uthread_entry () at /home/mariodr/legion3/runtime/realm/threads.cc:1355
#31 0x00007f0411a014e0 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6
#32 0x0000000000000000 in ?? ()
(gdb) f 9
#9  0x00007f0414c6dd5d in Legion::Internal::IndexSpaceExpression::create_layout_internal<3, long long> (this=0x7f03f2869860, space=..., constraints=..., field_ids=std::vector of length 2, capacity 2 = {...}, field_sizes=std::vector of length 2, capacity 2 = {...},
    compact=false, piece_list=0x7f03f2f2d7f8, piece_list_size=0x7f03f2f2d800, num_pieces=0x7f03f2f2c2f0) at /home/mariodr/legion3/runtime/legion/region_tree.inl:505
505           assert(lo[i] >= 0);
(gdb) p delta
$1 = (const Legion::Domain &) @0x7f03f2f2d670: {static MAX_RECT_DIM = 3, static NO_DOMAIN = {static MAX_RECT_DIM = 3, static NO_DOMAIN = <same as static member of an already seen type>, is_id = 0, is_type = 0, dim = 0, rect_data = {0, 0, 0, 0, 0, 0}}, is_id = 0,
  is_type = 0, dim = 3, rect_data = {3, -1, -1, 3, -1, -1}}
lightsighter commented 1 year ago

I pushed a fix to the master branch, but need @elliottslaughter to do the merge into control replication to resolve a CMake conflict.

lightsighter commented 1 year ago

The fix was merged into the control replication branch.

mariodirenzo commented 1 year ago

There is still a problem. An unset padding constraint conflucts with a zero padding constraint as you can see in the following case:

(gdb) p $11.padding_constraint
$13 = {<Legion::LayoutConstraint> = {_vptr.LayoutConstraint = 0x5651f277b670 <vtable for Legion::PaddingConstraint+16>}, static constraint_kind = LEGION_PADDING_CONSTRAINT, delta = {static MAX_RECT_DIM = 3, static NO_DOMAIN = {static MAX_RECT_DIM = 3,
      static NO_DOMAIN = <same as static member of an already seen type>, is_id = 0, is_type = 0, dim = 0, rect_data = {0, 0, 0, 0, 0, 0}}, is_id = 0, is_type = 0, dim = 0, rect_data = {0, 0, 0, 0, 0, 0}}}
(gdb) p $12.padding_constraint
$14 = {<Legion::LayoutConstraint> = {_vptr.LayoutConstraint = 0x5651f277b670 <vtable for Legion::PaddingConstraint+16>}, static constraint_kind = LEGION_PADDING_CONSTRAINT, delta = {static MAX_RECT_DIM = 3, static NO_DOMAIN = {static MAX_RECT_DIM = 3,
      static NO_DOMAIN = <same as static member of an already seen type>, is_id = 0, is_type = 0, dim = 0, rect_data = {0, 0, 0, 0, 0, 0}}, is_id = 0, is_type = 0, dim = 3, rect_data = {0, 0, 0, 0, 0, 0}}}
(gdb) p $11.padding_constraint.conflicts($12.padding_constraint)
$15 = true

This issue prevents a region requirement without padding constraints from being mapped on a region mapped with zero padding using the current logic available in the default mapper (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/mappers/default_mapper.cc#L1976)

This is fixed by the following patch

diff --git a/runtime/legion/legion_constraint.cc b/runtime/legion/legion_constraint.cc
index e0a757f0f..8f327b296 100644
--- a/runtime/legion/legion_constraint.cc
+++ b/runtime/legion/legion_constraint.cc
@@ -1763,8 +1763,16 @@ namespace Legion {
     {
       if (other.delta.get_dim() > 0)
       {
-        if (delta.get_dim() != other.delta.get_dim())
-          return true;
+        if (delta.get_dim() != other.delta.get_dim()) {
+          if (delta.get_dim() == 0) {
+            for (int idx = 0; idx < other.delta.get_dim(); idx++) {
+              if (other.delta.lo()[idx] > 0) return true;
+              if (other.delta.hi()[idx] > 0) return true;
+            }
+            return false;
+          } else
+            return true;
+        }
         for (int idx = 0; idx < delta.get_dim(); idx++)
         {
           if ((delta.lo()[idx] >= 0) && (other.delta.lo()[idx] >= 0))
lightsighter commented 1 year ago

I pushed a less restrictive change that will at least handle the case. Conflict testing should only fail if both constraint sets strictly conflict with each other. If either one is "don't care" then they shouldn't ever conflict with each other. For example, if either constraint has a zero dimension, that is considered a "don't care" and therefore will never be able to conflict with another set of constraints.

mariodirenzo commented 1 year ago

Sounds good. Thanks for fixing the issue

lightsighter commented 1 year ago

Thanks for dealing with the churn. It was necessary to make the implementation more robust.