StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Remap what memory an instance lives in during the mapper #1615

Open rohany opened 8 months ago

rohany commented 8 months ago

As a follow-up to https://github.com/StanfordLegion/legion/issues/1614, a missing piece needed to support linear algebra computations like Cholesky effectively is a reasonable story for out-of-core support in Legion, where GPU memory can be used first, with a gradual fallback to CPU memory as needed. Concretely, I want to be able to implement a strategy like the one described in https://www.cs.huji.ac.il/w~feit/parsched/jsspp15/p4-tsujita.pdf. I will note that out-of-core support is how all of the "good" results for direct solvers for things like Cholesky or LU factorizations are achieved by systems like StarPU, PARSEC, and cuSOLVER, as the matrices need to be massive to see any benefits during strong/weak scaling.

At a high level, the Cholesky strategy keeps tiles on the GPU and then flushes those tiles back to the CPU when GPU memory becomes full. Such an operation is not possible in the mapper right now, as it requires a task launch that "writes" to the data in a new location to force the data to move. As with #1614, the application doesn't know when to issue this task launch, as it depends on the target machine.
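For concreteness, here is roughly what that application-level workaround looks like from inside an application task body (TOUCH_TASK_ID, tile_lr, and FID_TILE are hypothetical names); the mapper would then have to map the touch task's region requirement into the destination memory:

// Hypothetical workaround in application code: launch a no-op "touch" task
// with read-write privileges on the tile, so the runtime must produce a
// valid instance wherever the mapper places it, forcing the data to move
// out of the GPU framebuffer. Assumes 'runtime' and 'ctx' from the
// enclosing task body.
TaskLauncher launcher(TOUCH_TASK_ID, TaskArgument(NULL, 0));
launcher.add_region_requirement(
    RegionRequirement(tile_lr, READ_WRITE, EXCLUSIVE, tile_lr));
launcher.add_field(0/*requirement index*/, FID_TILE);
runtime->execute_task(ctx, launcher);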

I have a proposal for an API that I know is too hard to implement, but it's a starting point for a discussion.

class MapperRuntime {
  …
  bool remap_physical_instance(MapperContext ctx,
                               const PhysicalInstance &to_remap,
                               Memory target_memory,
                               const LayoutConstraintSet &constraints,
                               PhysicalInstance &result,
                               bool acquire = true,
                               GCPriority priority = 0,
                               bool tight_region_bounds = false,
                               size_t *footprint = NULL,
                               const LayoutConstraint **unsat = NULL) const;
  …
};
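
For illustration, here is how a mapper might invoke such a call if it existed; ExampleMapper, spill_instance, victim, and sysmem are hypothetical names, and the layout constraints are elided:

// Hypothetical call site inside a mapper, assuming the proposed API:
// copy a resident instance into system memory so the framebuffer copy
// can be collected.
void ExampleMapper::spill_instance(MapperContext ctx,
                                   const PhysicalInstance &victim,
                                   Memory sysmem)
{
  LayoutConstraintSet constraints;
  // ... constraints describing the desired host-side layout ...
  PhysicalInstance spilled;
  if (!runtime->remap_physical_instance(ctx, victim, sysmem,
                                        constraints, spilled))
  {
    // The host memory is also full: fall back to another memory, or
    // throttle new mappings until space frees up.
  }
}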

Such an operation is easy to implement when the mapper has privileges on the logical region data being moved, and extremely difficult when it does not have privileges. However, requiring privileges is akin to having control of this from the application side, so it is not a satisfactory solution.

What I've described above is an "optimistic" approach to memory management on the GPU, where we assume the GPU can fit all of our data and roll back allocations if it can't. With #1614, a "pessimistic" strategy is possible -- using postmap_task, always copy the region data produced by each task back out to the CPU, and leave a weak reference to the data on the GPU for future use. However, something like this is much more expensive than necessary when the data actually does fit in the GPU.
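
A minimal sketch of that pessimistic scheme, assuming a mapper that sets output.postmap_task = true in map_task so that postmap_task is invoked after each task runs (PessimisticMapper is a hypothetical name, and the layout constraints are elided beyond the memory kind):

// After each task runs, create (or reuse) a system-memory instance for
// every region the task wrote, so a valid copy always lives on the host
// and the GPU instance survives only as a collectable copy.
void PessimisticMapper::postmap_task(const MapperContext ctx,
                                     const Task &task,
                                     const PostMapInput &input,
                                     PostMapOutput &output)
{
  // Find a system memory on this node to hold the host copies.
  Machine::MemoryQuery sysmems(Machine::get_machine());
  sysmems.local_address_space().only_kind(Memory::SYSTEM_MEM);
  Memory sysmem = sysmems.first();
  for (unsigned idx = 0; idx < task.regions.size(); idx++)
  {
    // Only regions written by the task need a fresh host copy.
    if (!(task.regions[idx].privilege & LEGION_WRITE_PRIV))
      continue;
    LayoutConstraintSet constraints;
    constraints.add_constraint(MemoryConstraint(sysmem.kind()));
    // ... field and ordering constraints elided ...
    std::vector<LogicalRegion> regions(1, task.regions[idx].region);
    PhysicalInstance host_copy;
    bool created;
    if (runtime->find_or_create_physical_instance(ctx, sysmem, constraints,
                                                  regions, host_copy, created))
      output.chosen_instances[idx].push_back(host_copy);
  }
}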

lightsighter commented 8 months ago

I will note that out-of-core support is how all of the "good" results for direct solvers for things like Cholesky or LU factorizations are achieved by systems like StarPU, PARSEC, and cuSOLVER,

Pretty much all those systems make two assumptions that I don't ever want to introduce into Legion:

  1. They assume that they can make an "infinite" number of allocations on the host, so that there is always a spill location for any tile of data moving from the GPU back to the host.
  2. They assume that tasks always ask for privileges on whole tiles of data. They don't handle the case where tasks ask for privileges on partial subsets of tiles, and they don't consider how that might impact memory management.

In my opinion, they've all over-fit their implementations to dense linear algebra, which is an important subset, but hardly the majority of interesting memory-constrained problems.

At a high level, the Cholesky strategy keeps tiles on the GPU and then flushes those tiles back to the CPU when GPU memory becomes full.

I think an "instance" in Legion terminology should always refer to an allocation in a specific memory. If we're going to talk about spilling, I think we should frame it in terms of copying data from one instance to another so that the first instance can be collected. (Yes, I've toyed around with the idea of "virtual instances" that were bound to different Realm memories similar to the difference between an architectural and physical register file in a processor, but for a number of reasons I concluded that was a bad idea.)

Such an operation is easy to implement when the mapper has privileges on the logical region data being moved, and extremely difficult when it does not have privileges.

To be more precise, the mapper needs to have privileges on all the points and all the fields of the instance in order to spill it to an instance in another memory. While it's certainly possible to implement the feature under those circumstances, I'm skeptical that it will actually be useful: at the point where you are trying to spill a piece of data, you're usually not trying to use that data (otherwise why would you be spilling it?); instead, you're usually trying to map unrelated data into the constrained memory, so it seems unlikely you would have privileges on the data you're trying to spill.

With https://github.com/StanfordLegion/legion/issues/1614, a "pessimistic" strategy is possible -- using postmap_task, always copy the region data produced by each task back out to the CPU, and leave a weak reference to the data on the GPU for future use. However, something like this is much more expensive than necessary when the data actually does fit in the GPU.

I actually come down on the side of it being a reasonable approach. I don't think it's unreasonable to expect mappers to have some sense of whether they are in a memory-limited regime and whether to take the optimistic or the pessimistic approach to mapping data onto the GPU. At least giving them both options allows them the flexibility to make a reasonable decision in both scenarios. Sure, there might be some edge cases where the mapper guesses wrong and has to be pessimistic when it could have been optimistic, but I suspect that will happen infrequently, as most applications fall pretty concretely on one side or the other.

I'll note two other alternatives here for completeness, each of which comes with significant caveats:

  1. Speculative execution: allowing tasks to map speculatively and then rolling them back and rerunning them if we run out of memory later is certainly an option. The downside is that we'll need to wait for all the speculative execution machinery to be in place and working before we can use this, but it would allow mapping into the future with rollback when memory is exhausted.
  2. Program order mapping: if we mapped all tasks in program order, then when mapping each task you could effectively manipulate all the instances without needing privileges on them. I think we'd be sad about this for other reasons, as it would effectively eliminate any parallelism in the physical analysis (the most expensive part of the dependence analysis) and would not allow for latency hiding of the required communication. You'd make up a little bit of this cost by eliminating the logical analysis, but it would require a fence between every single task/operation across all the shards to keep them in sync when doing the physical analysis.

rohany commented 8 months ago

I actually come down on the side of it being a reasonable approach. I don't think it's unreasonable to expect mappers to have some sense of whether they are in a memory-limited regime and whether to take the optimistic or the pessimistic approach to mapping data onto the GPU. At least giving them both options allows them the flexibility to make a reasonable decision in both scenarios. Sure, there might be some edge cases where the mapper guesses wrong and has to be pessimistic when it could have been optimistic, but I suspect that will happen infrequently, as most applications fall pretty concretely on one side or the other.

I don't have a comment on anything else you said, but I want to push back a little here. I don't think the problem is a binary choice between memory-limited and not memory-limited. Different policies lie on a spectrum: if the amount of memory available on the GPU is short by just one Cholesky "tile", a pessimistic algorithm would flush much more data back over PCIe than an optimistic algorithm (the optimistic one flushes only one tile at a time, while the pessimistic one flushes after every task). I haven't worked out the details, but I suspect that a pessimistic algorithm could approximate this behavior by flushing tiles less frequently than after every task; however, I'm not sure how the mapper would always know which are the "right" tiles to flush.

lightsighter commented 8 months ago

I don't think the problem is a binary choice between memory-limited and not memory-limited.

Whether a (parallel) program fits in the available memory is dependent upon five factors:

  1. The size of the input data
  2. The algorithm selected
  3. The sizes of different memories in the target machine
  4. The mapping of data to particular memories for specific computations
  5. Whether the schedule always exploits the maximum available parallelism (while still maintaining correctness) or whether it artificially reduces the available parallelism to limit the amount of live data at a time

Let's assume that we fix 1-4, and for factor 5 we always want the best performance, so we exclude all schedules except the one that exploits the maximum available parallelism at any given point in the program. Under these conditions, whether the program fits in memory is absolutely a binary condition. It might not be decidable (for the mapper) whether the program fits in memory without running it (depending on the algorithm and input data), but the program absolutely will or will not fit in memory when you run it, and given the conditions above it will do so deterministically.

Obviously the conditions above represent the ideal execution scenario for peak performance: maximizing parallelism for the given mapping/algorithm/machine/input, and if we can do that, we absolutely want to run in that regime. If we think that strategy will execute without running out of memory, then we should optimistically allocate memory freely and without constraints; why would we ever want to do anything else? However, if we think (or know) we're going to run out of memory with that first approach, then and only then will we consider the pessimistic scheme, which introduces additional resource constraints on the schedule that limit the available parallelism so that there are fewer overlapping live ranges for allocations in the constrained memories.

At least for me, the first and foremost decision for mappers is to determine which of these two modes they are operating in. If it is the first, then there is nothing they need to do, since that is the regime most Legion mappers operate in today and it will yield the best performance. If and only if they are in the second regime will they need additional features and have to grapple with the "spectrum" of alternative schedules that reduce memory footprint at the cost of parallelism.
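
As a rough illustration of that first decision, assuming the mapper can estimate its working set up front (the helper name and the 10% headroom are arbitrary choices for the sketch):

// Pick the mapping regime once, by comparing an estimated working set
// against the framebuffer capacity reported by Realm.
bool use_pessimistic_mapping(Memory framebuffer,
                             size_t estimated_working_set_bytes)
{
  // Leave headroom for eager allocations and runtime overheads.
  size_t budget = framebuffer.capacity() - framebuffer.capacity() / 10;
  return estimated_working_set_bytes > budget;
}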