Legion: (runtime) leak of index partitions / index expressions

StanfordLegion / legion

The Legion Parallel Programming System

https://legion.stanford.edu

Apache License 2.0

685 stars 145 forks source link

Legion: (runtime) leak of index partitions / index expressions #1394

Open rohany opened 1 year ago

rohany commented 1 year ago

The following Legion program is adapted from the Legate code described in https://github.com/nv-legate/legate.core/issues/548, where I'm developing a sparse machine learning system. The Legion program is making a similar sequence of calls that the Legate program is making under the hood.

#include "legion.h"

using namespace Legion;

enum TASK_IDS {
  TID_TOP_LEVEL,
};

enum FIELD_IDS {
  FID_1,
};

void top_level_task(const Task* task, const std::vector<PhysicalRegion>& regions, Context ctx, Runtime* runtime) {
  auto n = 1000000000;
  auto ispace = runtime->create_index_space(ctx, Rect<1>{0, n - 1});
  auto fspace = runtime->create_field_space(ctx);
  {
    auto alloc = runtime->create_field_allocator(ctx, fspace);
    alloc.allocate_field(sizeof(double), FID_1);
  }
  auto region = runtime->create_logical_region(ctx, ispace, fspace);
  runtime->fill_field(ctx, region, region, FID_1, double(0));

  int64_t i = 0;
  int64_t batchsize = 100;
  while (true) {
    if (i % 100 == 0) {
      runtime->issue_execution_fence(ctx).wait();
      std::cout << "i = " << i << std::endl;
    }

    auto lo = i * batchsize;
    auto hi = (i + 1) * batchsize;
    auto ipart = runtime->create_index_partition(ctx, ispace, Rect<1>{0, 0}, {
        {DomainPoint(Point<1>{0}), Domain({Rect<1>{lo, hi - 1}})}
    });
    auto lpart = runtime->get_logical_partition(ctx, region, ipart);
    auto subreg = runtime->get_logical_subregion_by_color(ctx, lpart, DomainPoint(Point<1>(0)));
    auto preg = runtime->map_region(ctx, RegionRequirement(
        subreg, LEGION_READ_ONLY, LEGION_EXCLUSIVE, region).add_field(FID_1));
    preg.wait_until_valid();
    runtime->unmap_region(ctx, preg);
    runtime->fill_field(ctx, subreg, reg, FID_1, double(0));
    runtime->destroy_index_partition(ctx, ipart);
    i++;
  }
}

int main(int argc, char** argv) {
  Runtime::set_top_level_task_id(TID_TOP_LEVEL);
  {
    TaskVariantRegistrar registrar(TID_TOP_LEVEL, "top_level");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    registrar.set_replicable();
    Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
  }
  return Runtime::start(argc, argv);
}

When compiled against control_replication with -DLEGION_GC and ran with -ll:csize 4000 -dm:exact_region, the following data is reported by legion_gc.py -l after 300 and 600 iterations, respectively. To get this printout without a legion_gc.py error, the script needs to be modified so that report_references is a noop.

LEAK SUMMARY
  LEAKED FUTURES: 1
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 3
  LEAKED MANAGERS: 307
  Pinned Managers: 0
  LEAKED VIEWS: 307
  LEAKED EQUIVALENCE SETS: 1
  LEAKED INDEX SPACES: 616
  LEAKED INDEX PARTITIONS: 307
  LEAKED INDEX EXPRESSIONS: 94507
  LEAKED FIELD SPACES: 1
  LEAKED REGIONS: 307
  LEAKED PARTITIONS: 307

LEAK SUMMARY
  LEAKED FUTURES: 1
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 3
  LEAKED MANAGERS: 603
  Pinned Managers: 0
  LEAKED VIEWS: 603
  LEAKED EQUIVALENCE SETS: 1
  LEAKED INDEX SPACES: 1208
  LEAKED INDEX PARTITIONS: 603
  LEAKED INDEX EXPRESSIONS: 364188
  LEAKED FIELD SPACES: 1
  LEAKED REGIONS: 603
  LEAKED PARTITIONS: 603

As far as I can tell, the application is freeing runtime data structures as eagerly as possible -- the IndexPartition created at each iteration is deleted, and the logical subregion and logical partition are not data structures that can be destroyed with the Legion API. The batchsize here is small enough that Legate (rightfully so) does not want to create a single partition and index into it as the number of subregions would be too large. Is there something the application should be doing here, or is the runtime not collecting things that it should be?

rohany commented 1 year ago

Things I've tried (suggestions from @magnatelee)

Running under a tight memory limit so that batches cause GC to get invoked and collect instances
Running without the default mapper's cache enabled so physical instance references don't accumulate

rohany commented 1 year ago

Something seems a bit suspect, because even running this for 10 iterations (and killing it before clean exit) generates a legion gc log that legion_gc.py explodes on:

Traceback (most recent call last):
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 1335, in <module>
    main()
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 1331, in main
    state.check_for_leaks(assert_on_error, verbose)
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 1203, in check_for_leaks
    if not index_space.check_for_leaks(assert_on_error, verbose):
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 398, in check_for_leaks
    self.report_references(printer, RESOURCE_REF_KIND, verbose)
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 512, in report_references
    src.report_references(printer, kind, verbose)
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 512, in report_references
    src.report_references(printer, kind, verbose)
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 512, in report_references
    src.report_references(printer, kind, verbose)
  [Previous line repeated 988 more times]
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 510, in report_references
    printer.println('Nested Resource '+repr(src)+' (Refs='+str(refs)+')')
  File "/Users/rohany/Documents/research/legion/examples/slicing/../../tools/legion_gc.py", line 668, in __repr__
    return 'Partition '+str(self.did)+' (Node='+str(self.node)+') ('+str(self.index_partition)+','+str(self.field_space)+','+str(self.tree_id)+')'
RecursionError: maximum recursion depth exceeded while getting the str of an object

lightsighter commented 1 year ago

This isn't that surprising to me. The managers are still alive (notice their counts double too) and they are holding references to their index sub-spaces to keep them alive. You're not even close to filling up the system memory with these instances so there is no garbage collection to trigger manager deletion. Try running with -ll:csize 1 and see what happens once you hit that 1MB footprint. The instances should start being collected, which will clean up the managers, and thereby release references on the index sub-spaces. If that occurs then we can discuss how to more eagerly collect instances. (Note this experiment still requires the default mapper not to be holding any references to the managers either.) If the manager count stops doubling but the index space count keeps doubling then I'll agree there is an issue with the runtime; as long as they are correlated though then everything is currently still working as intended.

Something seems a bit suspect, because even running this for 10 iterations (and killing it before clean exit) generates a legion gc log that legion_gc.py explodes on

The hand-over-hand reference counting that Legion does will sometimes temporarily introduce cycles into the reference graph which are resolved when the new references are added and the old ones removed. If you happen to kill Legion in the middle of it's execution when once of these cycles is in existence, then Legion GC will go into an infinite loop around the cycle (unless you run with the -c option to check for cycles), which will eventually overflow the Python interpreter's stack.

rohany commented 1 year ago

The managers are still alive (notice their counts double too) and they are holding references to their index sub-spaces to keep them alive. You're not even close to filling up the system memory with these instances so there is no garbage collection to trigger manager deletion. Try running with -ll:csize 1 and see what happens once you hit that 1MB footprint.

I tried this -- increased the batch size to 10000 and ran with -ll:csize 1 and still saw the same pattern of growth in legion_gc.py of index spaces, but not managers.

LEAK SUMMARY
  LEAKED FUTURES: 189
  LEAKD FUTURE MAPS: 87
  LEAKED CONSTRAINTS: 3
  LEAKED MANAGERS: 26
  Pinned Managers: 0
  LEAKED VIEWS: 27
  LEAKED EQUIVALENCE SETS: 2
  LEAKED INDEX SPACES: 602
  LEAKED INDEX PARTITIONS: 300
  LEAKED INDEX EXPRESSIONS: 23305
  LEAKED FIELD SPACES: 1
  LEAKED REGIONS: 213
  LEAKED PARTITIONS: 300

LEAK SUMMARY
  LEAKED FUTURES: 194
  LEAKD FUTURE MAPS: 92
  LEAKED CONSTRAINTS: 3
  LEAKED MANAGERS: 26
  Pinned Managers: 0
  LEAKED VIEWS: 27
  LEAKED EQUIVALENCE SETS: 2
  LEAKED INDEX SPACES: 1002
  LEAKED INDEX PARTITIONS: 500
  LEAKED INDEX EXPRESSIONS: 84305
  LEAKED FIELD SPACES: 1
  LEAKED REGIONS: 408
  LEAKED PARTITIONS: 500

(Note this experiment still requires the default mapper not to be holding any references to the managers either.)

The main place I know of this is the cache of mappings the default mapper maintains, which I disabled.

The hand-over-hand reference counting that Legion does will sometimes temporarily introduce cycles into the reference graph which are resolved when the new references are added and the old ones removed. If you happen to kill Legion in the middle of it's execution when once of these cycles is in existence, then Legion GC will go into an infinite loop around the cycle (unless you run with the -c option to check for cycles), which will eventually overflow the Python interpreter's stack.

Ok, that makes sense then. It's annoying because I can't see what's holding the references, and legion cleans eveyrthing up successfully on a normal exit, so I have a kill it in the middle.

lightsighter commented 1 year ago

Provide a diff for the change that you made to the default mapper.

rohany commented 1 year ago

➜  slicing git:(control_replication) ✗ cat diff.txt                                                                                                                           (base) 17:06:56
diff --git a/runtime/mappers/default_mapper.cc b/runtime/mappers/default_mapper.cc
index 624b5b374a..32ac8da69a 100644
--- a/runtime/mappers/default_mapper.cc
+++ b/runtime/mappers/default_mapper.cc
@@ -1692,7 +1692,7 @@ namespace Legion {
     //--------------------------------------------------------------------------
     {
       // Always cache task result.
-      return DEFAULT_CACHE_POLICY_ENABLE;
+      return DEFAULT_CACHE_POLICY_DISABLE;
     }

     //--------------------------------------------------------------------------
@@ -2394,6 +2394,7 @@ namespace Legion {
                                 bool meets_fill_constraints, bool reduction)
     //--------------------------------------------------------------------------
     {
+      return LEGION_GC_FIRST_PRIORITY;
       // Pretty simple: keep our big instances around
       // as long as possible, delete reduction instances
       // as soon as possible, otherwise we are ambivalent

rohany commented 1 year ago

Also, to get the GC behavior, the application must be modified slightly -- i've edited it in the post now.

lightsighter commented 1 year ago

I've dug into this program and the runtime is still behaving as expected. The reason the index spaces and index partitions are not being deleted is because they are still necessary to represent valid data of the root region. Consider the first iteration of the main loop which touches the points [0,99]. It then maps them, overwrites those subpoints with a fill, and then deletes the partition. However, that doesn't mean that the data in [0,99] are suddenly cleared. (Legion doesn't do value testing so it can't see that the fill to the subregion is the same value as the fill to the root at the beginning of the program; it thinks there is "dirty" fill data in [0,99].) If the program were to come along and ask to map the top-level region, Legion is still responsible for reporting the most recent updates to those points, which was done in that particular partition and subregion, so Legion keeps references to them around until they are no longer needed. It is precisely these references that are keeping the logical partition, it's logical subregion, and the corresponding index partition/subspace alive even after they have been deleted. You can test this hypothesis by changing the fill inside the loop to fill the root region again instead of just filling the subregion. Alternatively if you don't want to fill the whole root, you can create a new partition that aliases with the deleted one and do the fill there; it will have the same effect. If you do either of those, then the index spaces will still hang around until the managers are deleted by the garbage collector, but after that everything else gets cleaned up and we hit a steady state of memory usage.

You also had an additional leak of index spaces from using a deprecated index partitioning API. The color spaces for every single partition were leaking because the runtime was having to make them implicitly for the deprecated API and then the application wasn't cleaning them up.

With fixes for the two issues mentioned above, I can run the following program like this: -ll:csize 1 -dm:exact_region and see no memory growth.

#include "legion.h"

using namespace Legion;

enum TASK_IDS {
  TID_TOP_LEVEL,
};

enum FIELD_IDS {
  FID_1,
};

void top_level_task(const Task* task, const std::vector<PhysicalRegion>& regions, Context ctx, Runtime* runtime) {
  auto n = 1000000000;
  auto ispace = runtime->create_index_space(ctx, Rect<1>{0, n - 1});
  auto fspace = runtime->create_field_space(ctx);
  {
    auto alloc = runtime->create_field_allocator(ctx, fspace);
    alloc.allocate_field(sizeof(double), FID_1);
  }
  auto region = runtime->create_logical_region(ctx, ispace, fspace);
  runtime->fill_field(ctx, region, region, FID_1, double(0));

  int64_t i = 0;
  int64_t batchsize = 10000;
  while (true) {
    if (i % 100 == 0) {
      runtime->issue_execution_fence(ctx).wait();
      std::cout << "i = " << i << std::endl;
    }

    auto lo = i * batchsize;
    auto hi = (i + 1) * batchsize;
    auto ipart = runtime->create_index_partition(ctx, ispace, Rect<1>{0, 0}, {
        {DomainPoint(Point<1>{0}), Domain({Rect<1>{lo, hi - 1}})}
    });
    auto lpart = runtime->get_logical_partition(ctx, region, ipart);
    auto subreg = runtime->get_logical_subregion_by_color(ctx, lpart, DomainPoint(Point<1>(0)));
    auto preg = runtime->map_region(ctx, RegionRequirement(
        subreg, LEGION_READ_ONLY, LEGION_EXCLUSIVE, region).add_field(FID_1));
    preg.wait_until_valid();
    runtime->unmap_region(ctx, preg);
    runtime->fill_field(ctx, region, region, FID_1, double(0));
    auto color_space = runtime->get_index_partition_color_space_name(ctx, ipart);
    runtime->destroy_index_partition(ctx, ipart);
    runtime->destroy_index_space(ctx, color_space);
    i++;
  }
}

int main(int argc, char** argv) {
  Runtime::set_top_level_task_id(TID_TOP_LEVEL);
  {
    TaskVariantRegistrar registrar(TID_TOP_LEVEL, "top_level");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    registrar.set_replicable();
    Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
  }
  return Runtime::start(argc, argv);
}

rohany commented 1 year ago

Alternatively if you don't want to fill the whole root, you can create a new partition that aliases with the deleted one and do the fill there; it will have the same effect.

Can you elaborate here?

Ok, you've convinced me that the batches here (as written) are necessary for maintaining valid copies of the data. As expected, if I remove the fill on the subregion, then I will OOM, as the valid instances that are created from the inline mapping can't be collected (because they are the only valid instances). This would be the case even if the original region had been mapping to an instance and I was slicing out subregions of it into separate instances, unless I used untrack_valid_regions (however, this doesn't have the semantics that I think we'd want for Legate right now).

So the real question I have is how do I effectively loop through a region and slice out subregions (for a sequence of read-only operations) of it without blowing up any sort of internal Legion data structure? I'm working on a machine learning application which does exactly this -- it creates a large NumPy array of the full data (therefore having some valid instances across the machine), then loops over the array in batches, slicing out each batch to train the network on. The instances created by the slice should be mostly ephemeral, as they only have a lifetime of one iteration in the training. I guess that I could have a task that writes to the main partition of the full data at each iteration, but this sounds wasteful. I can't have the accumulation of these index expressions, as I've observed non-trivial slowdowns in the runtime as training goes on.

Does FlexFlow have a similar problem like this? How do you guys manage batches @lockshaw @jiazhihao?

You also had an additional leak of index spaces from using a deprecated index partitioning API. The color spaces for every single partition were leaking because the runtime was having to make them implicitly for the deprecated API and then the application wasn't cleaning them up.

Good to know, I have to check if Legate is using the old API then.

rohany commented 1 year ago

Your insights were helpful in helping me translate some of this into my Legate application, and I think the question now comes back to eager collections. I tested in my application doing a write to the equivalent of region in this application, and see that my per-iteration times grow until a memory limit is hit, and then there's a collection, the iteration time goes back to normal then slowly grows again.

Since I'm doing a write to region, deleting the created index partitions (and using the new API now, so not leaking color spaces anymore), the references to the index spaces and index expressions are the physical instances that aren't getting collected until memory pressure kicks in. As I mentioned before, that doesn't work for me here because the constraint isn't memory usage, but runtime cost of managing these extra data structures slowing things down.

So the question is how do we think eager collection of physical instances in this case might work? The recently proposed discard API is kind of related but not what we want. Legate at a high level knows when slices get dropped, so it theoretically could issue deletions, though there are heuristics around here. I don't know if such a thing should be mapper controlled, because I'm not sure the Legate mapper at least has the information to drive such a decision.

cc @magnatelee @manopapad if you have had any thoughts about this.

lightsighter commented 1 year ago

Can you elaborate here?

If you made a different partition that overlapped with [0,99], say [0,199], and wrote to that instead, that would invalidate the data in the [0,99] partition and allow the reclamation of the index space and index partition.

So the real question I have is how do I effectively loop through a region and slice out subregions (for a sequence of read-only operations) of it without blowing up any sort of internal Legion data structure?

I suspect it will involve a combination of the the discard feature that we discussed in the Legion meeting, which is a way for the application to promise that it doesn't need that data anymore which will ensure that instances are invalidated, and then a separate set of features in the mapper API for the mapper to request early deletion of the instances.

Does FlexFlow have a similar problem like this? How do you guys manage batches

The FlexFlow guys can correct me if I'm wrong, but from my observation of their code they don't do what your ML application is doing which is making one big region, and then slicing for each of the batches. Instead they make a single logical region for doing training on a batch, and they map multiple batches onto that same logical region across top-level training iterations. They can do that because they know they are doing training.

rohany commented 1 year ago

I suspect it will involve a combination of the the discard feature that we discussed in the Legion meeting, which is a way for the application to promise that it doesn't need that data anymore which will ensure that instances are invalidated,

I don't see how discard could be used, because it also invalidates that subset of the data in the parent region. The numpy array needs to be valid at all indices, so we can't just discard slices as we GC them in Python.

lightsighter commented 1 year ago

I tested in my application doing a write to the equivalent of region in this application, and see that my per-iteration times grow until a memory limit is hit, and then there's a collection, the iteration time goes back to normal then slowly grows again.

That's what I would expect with the current implementation as well.

So the question is how do we think eager collection of physical instances in this case might work?

Right, that is the part that is missing (assuming we have discard to invalidate instances).

I don't see how discard could be used, because it also invalidates that subset of the data in the parent region. The numpy array needs to be valid at all indices, so we can't just discard slices as we GC them in Python.

If that is the case, then it is not about invalidating and deleting instances, but "spilling" data from instances in a "constrained" memory to a less-constrained memory. Note though that this will keep around the index spaces and partitions, so your memory problem might still be an issue. I guess I'm interested in why the user felt compelled to write it like this, rather than do it like the FlexFlow people do where they map multiple batches onto the same array across iterations.

rohany commented 1 year ago

Note though that this will keep around the index spaces and partitions, so your memory problem might still be an issue.

This seems to be the issue for me, not the actual memory. The runtime overhead increasing I see seems to be related to the number of index expressions/spaces/instances that are "leaked".

I guess I'm interested in why the user felt compelled to write it like this

I think that was just the natural "numpy" way of writing the application. See here: https://github.com/elibol/sparse-autograd/blob/main/train-movielens.py#L59 and its usage https://github.com/elibol/sparse-autograd/blob/main/train-movielens.py#L263. The application loads up a large sparse matrix, randomly sorts the coordinates, and then selects slices from the matrix to use as the training set for that particular epoch. The application originally did an indirect gather where it generated a random set of indices and then selected those indices from the sparse matrix, but we moved to this approach because of the slowness of having indirect copies on the hotpath. That approach would actually avoid this problem I think, but I don't see another way of doing this.

rohany commented 1 year ago

The FlexFlow guys can correct me if I'm wrong, but from my observation of their code they don't do what your ML application is doing which is making one big region, and then slicing for each of the batches. Instead they make a single logical region for doing training on a batch, and they map multiple batches onto that same logical region across top-level training iterations. They can do that because they know they are doing training.

This was a good hint though. I think I can make progress on this by mapping the "data" array onto a numpy array, rather than a cunumeric array, and then attach slices of the numpy array into cunumeric arrays, avoiding this problem. An initial test seems to be working out for me, though it's kind of a hack IMO.

lightsighter commented 1 year ago

The runtime overhead increasing I see seems to be related to the number of index expressions/spaces/instances that are "leaked".

We should profile that. I'm betting the issue comes with the congruence testing. If you have a bunch of index spaces that are all the same size, it's really hard for Legion to test for congruence to deduplicate index space expressions right now. I know where the acceleration data structure for that lookup needs to go, but I would want to confirm that that is the cause of the problems before I implement it.

The application loads up a large sparse matrix, randomly sorts the coordinates, and then selects slices from the matrix to use as the training set for that particular epoch.

In general Legion partitions are much better when they correspond to something associated with the physical machine rather than something associated with the properties of the data, but we can try to make it work better when we have tons of partitions.

rohany commented 1 year ago

It doesn't show up that noticeably in profiles (at least to me) -- http://sapling.stanford.edu/~rohany/sparseml-no-mapper-rec-inst. All I can tell is that there's an increase in utility processor utilization near the end of this profile.

In general Legion partitions are much better when they correspond to something associated with the physical machine rather than something associated with the properties of the data, but we can try to make it work better when we have tons of partitions.

Yeah, I agree. However, that's not something a NumPy user is thinking about when slicing up their data.

lightsighter commented 1 year ago

All I can tell is that there's an increase in utility processor utilization near the end of this profile.

That profile still looks pretty limited on the Python processor to me. The runtime overhead is ticking up very gradually, but I'm not sure that is a meaningful signal. It could be but I wouldn't bother worrying about it until it starts to dominate performance at which point it would be much more obvious what is going on.

Yeah, I agree. However, that's not something a NumPy user is thinking about when slicing up their data.

I agree, I just think that it's Legate's job to figure out the partitions and map them on to Legion better. A little bit of bounded lazy evaluation and analysis of slices of the program would go a long way here.

rohany commented 1 year ago

That profile still looks pretty limited on the Python processor to me.

it's true, but there's nothing I can do about that for this application -- the workload is fixed.

It could be but I wouldn't bother worrying about it until it starts to dominate performance at which point it would be much more obvious what is going on.

The performance difference does become noticeable though (I guess I can't take a long enough profile to see it), but this is ~3 epochs. After 10-15 training epochs the epoch time becomes double the initial time.

lightsighter commented 1 year ago

The performance difference does become noticeable though (I guess I can't take a long enough profile to see it), but this is ~3 epochs. After 10-15 training epochs the epoch time becomes double the initial time.

Take a large profile and then prune out all but the last 20 seconds or so of execution time.