StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

feature request: support virtual mappings for reduction instances #1119

Open rohany opened 3 years ago

rohany commented 3 years ago

Is it possible to support virtually mapping regions with reduction privileges? Or at least to support this for inner tasks only (which is all that I need)? In my current use case, I have a task that runs on each node and operates on virtually mapped regions, and then launches a task on each GPU within the node that actually performs the desired operation. Without being able to virtually map the region for the node-level task, an unnecessary buffer is allocated and filled in system memory, and the results of the reduction appear to be reduced into that system-memory instance from the framebuffers, even though the operation that consumes the result of the reduction uses instances in framebuffer memory.

While it is possible to avoid a hierarchical task launch structure like this, I find in other applications that the hierarchical structure can lead to performance improvements, so I would like to keep this pattern for my applications with reduction privileges.
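For concreteness, a rough sketch of the launch structure I mean (the task ID, field ID, reduction op, GPU count, and partition color are placeholders, and partition creation is elided; this is not my actual generated code):

    #include "legion.h"
    using namespace Legion;

    // Hypothetical IDs for this sketch only.
    enum { GPU_GEMM_TASK_ID = 1 };
    enum { FID_VAL = 0 };
    enum { SUM_REDOP_ID = 1 };
    enum { GPUS_PER_NODE = 4 };

    // Sketch of the node-level (inner) task: it holds the output region with
    // reduction privileges but does no work itself; it only sub-partitions its
    // piece across the GPUs in the node and launches the per-GPU tasks.
    // Ideally its own reduction-privilege requirement would be mapped virtually.
    void node_level_task(const Task *task,
                         const std::vector<PhysicalRegion> &regions,
                         Context ctx, Runtime *runtime)
    {
      LogicalRegion node_lr = regions[0].get_logical_region();

      // Assume the per-GPU sub-partition was created earlier with color 0.
      LogicalPartition gpu_lp =
          runtime->get_logical_partition_by_color(ctx, node_lr, 0);

      // One point task per GPU in this node.
      Rect<1> gpu_domain(0, GPUS_PER_NODE - 1);
      IndexTaskLauncher launcher(GPU_GEMM_TASK_ID, gpu_domain,
                                 TaskArgument(), ArgumentMap());

      // Reduction-privilege requirement on the per-GPU subregions.
      RegionRequirement req(gpu_lp, 0 /*identity projection*/,
                            SUM_REDOP_ID, EXCLUSIVE, node_lr);
      req.add_field(FID_VAL);
      launcher.add_region_requirement(req);
      runtime->execute_index_space(ctx, launcher);
    }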

lightsighter commented 3 years ago

I think I would like to see some evidence that there's actually a performance penalty for not supporting virtual mappings on requirements with reduction privileges. Unlike normal instances, reduction instances are going to have to be folded back to a normal instance eventually anyway. Whether that happens directly or hierarchically through an intermediate copy shouldn't really matter that much, and in practice is likely to be significantly cheaper than the overhead of moving the meta-data back up the task tree for a virtual mapping.

an unnecessary buffer is allocated and filled in system memory

Why aren't you mapping this instance into the framebuffer also?

rohany commented 3 years ago

I think I would like to see some evidence that there's actually a performance penalty for not supporting virtual mappings on requirements with reduction privileges.

I have a profile here that provides some evidence: http://sapling.stanford.edu/~rohany/johnson-cuda/johnson-cuda/. Disregarding the very slow CPU fill, there is a stage where the reduction buffers from the child tasks on the GPUs are sent back over PCIe to the buffer in system memory, even though the following task that actually consumes the reduction buffers runs on the GPU using framebuffer instances.

Perhaps this is an artifact of just the CPU fills being slow, but this unnecessary memory movement seems like a problem.

Unlike normal instances, reduction instances are going to have to be folded back to a normal instance eventually anyway.

It seems like for inner tasks this isn't necessarily true (at least in my case). Unless the partition for the interior task is mapped directly, my program only operates on the sub-partitions at the leaves.

Why aren't you mapping this instance into the framebuffer also?

Because there isn't enough memory to place it in a framebuffer. The application has a node-level partition, and then a partition within each node for each GPU (which gives it some locality within the GPUs). There isn't enough memory on any one GPU to hold the entire node-level partition, which is why I want to virtually map it. (Also, isn't it illegal to map a region into a memory that the target processor can't reach?)

Whether that happens directly or hierarchically through an intermediate copy shouldn't really matter that much, and in practice is likely to be significantly cheaper than the overhead of moving the meta-data back up the task tree for a virtual mapping.

That's surprising -- at least in my programs the regions are pretty large (on the order of gigabytes), so I would imagine that fills / data movement would dominate any runtime analysis.

lightsighter commented 3 years ago

Disregarding the very slow CPU fill, there is a stage where the reduction buffers from the child tasks on the GPUs are sent back over PCIe to the buffer in system memory, even though the following task that actually consumes the reduction buffers runs on the GPU using framebuffer instances.

This is an artifact of your mapping, not anything the runtime is doing. Map those reduction instances on the GPU.
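For illustration, a minimal sketch of part of that suggestion (not the actual mapper in question; names are placeholders): in the mapper, choose the framebuffer with best affinity to the task's target GPU as the destination memory for the reduction-privilege requirement, rather than letting the instance land in system memory. The returned memory would then be the target when the mapper creates or finds the instance for that requirement in map_task.

    #include <cassert>
    #include "legion.h"
    using namespace Legion;

    // Sketch only: pick the framebuffer memory with best affinity to the
    // target GPU processor, so reduction instances for this task stay on
    // the GPU instead of falling back to system memory.
    static Memory select_gpu_fb_memory(Machine machine, Processor target_gpu)
    {
      Machine::MemoryQuery query(machine);
      query.only_kind(Memory::GPU_FB_MEM).best_affinity_to(target_gpu);
      assert(query.count() > 0);
      return query.first();
    }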

It seems like for inner tasks this isn't necessarily true (at least in my case). Unless the partition for the interior task is mapped directly then my program only operates on the sub-partitions at the leaves.

That's not what I mean. Eventually all reduction instances have to be applied to a normal instance in order to be read. Where the read takes place in the region tree is irrelevant to the need to apply reduction instances back to normal instances.

The application has a node-level partition, then a partition within the nodes for each GPU (which gives it some locality within the GPUs). There isn't enough memory on any one GPU to hold the entire node level partition, which is why I want to virtually map the node level partition.

This feels like we are back to the case you had before where your logical regions are not precisely describing exactly the data that needs to be accessed by the sub-tasks.

Also, isn't it illegal to map a region into a memory that the target processor can't reach?

It is not illegal if you add the NO_ACCESS flag to the region requirement's flags.
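For illustration, a minimal sketch of adding that flag (the helper, launcher, regions, and field are placeholders, not code from this application):

    #include "legion.h"
    using namespace Legion;

    // Sketch only: adding NO_ACCESS_FLAG to a region requirement says the task
    // itself will never touch the data through this requirement (it only passes
    // privileges on to child launches), which is what lifts the usual
    // restriction that the chosen instance be visible from the task's processor.
    void add_no_access_requirement(TaskLauncher &launcher,
                                   LogicalRegion region, LogicalRegion parent,
                                   FieldID fid)
    {
      RegionRequirement req(region, READ_WRITE, EXCLUSIVE, parent);
      req.add_field(fid);
      req.add_flags(NO_ACCESS_FLAG);
      launcher.add_region_requirement(req);
    }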

That's surprising -- at least in my programs the region are pretty large (order of gigabytes), so I imagine that fills / data movement would dominate any runtime analysis.

Running reduction kernels on modern GPUs with 800 GB/s to 2 TB/s of memory bandwidth should take no more than a few hundred microseconds at most to churn through a few GB of reduction data, much faster than the roughly 1 ms of amortized runtime analysis overhead.

rohany commented 3 years ago

I'm pretty confused now. I don't see how to map the instances onto the GPU without running out of memory. So that we're on the same page, my generated code looks like this:

a, b, c = partition matrices into tiles
index launch #1 for i, j, k in tiles onto each node:
  a, b, c = sub-partition this task's tile into smaller tiles
  index launch #2 for ii, jj in smaller tiles onto each GPU in the node:
    do a GEMM, reduce the result into a

At each task launch here, the bounds are as tight as I can make them. At index launch #1, the regions for a, b, and c are too large to fit in a single GPU's framebuffer. Can you describe how I should be mapping the regions for index launch #1? I don't think that I can use NO_ACCESS, because doesn't that mean that task #2 can't read the regions? At least in the mapping vocabulary that I know so far, a virtual mapping seems like what I want, since I don't really need the first-level tiling of a, b, c materialized in any memory.
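For reference, this is roughly how I understand a virtual mapping is requested from a mapper today for non-reduction requirements, and it is the mechanism I'd like to be able to use for the reduction-privilege requirement of index launch #1 (the mapper class and the policy condition are placeholders, and this sketch glosses over the DefaultMapper's own caching logic):

    #include "legion.h"
    #include "mappers/default_mapper.h"
    using namespace Legion;
    using namespace Legion::Mapping;

    // Sketch of a mapper that virtually maps a requirement in map_task: instead
    // of choosing physical instances, it hands back the virtual instance so no
    // data is materialized for that requirement in this task's memories.
    // (Today the runtime rejects this for reduction-privilege requirements,
    // which is what this issue asks to lift.)
    class VirtualMappingMapper : public DefaultMapper {
    public:
      VirtualMappingMapper(MapperRuntime *rt, Machine machine, Processor local)
        : DefaultMapper(rt, machine, local) { }

      virtual void map_task(const MapperContext ctx, const Task &task,
                            const MapTaskInput &input, MapTaskOutput &output)
      {
        DefaultMapper::map_task(ctx, task, input, output);
        for (unsigned idx = 0; idx < task.regions.size(); idx++) {
          // Placeholder policy: virtually map the requirements of the
          // node-level inner task, which only forwards them to child launches.
          if (wants_virtual_mapping(task, idx)) {
            output.chosen_instances[idx].clear();
            output.chosen_instances[idx].push_back(
                PhysicalInstance::get_virtual_instance());
          }
        }
      }
    private:
      bool wants_virtual_mapping(const Task &task, unsigned idx) const {
        return false; // application-specific policy goes here
      }
    };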

lightsighter commented 3 years ago

I guess I'm wondering what value there is in doing the hierarchical decomposition of the problem like that. I can see value in a hierarchical decomposition if one level were dynamic (Legion) and any lower levels were static for mapping onto individual GPUs (and maybe a static recursion onto the warps, threads, etc.). I would expect that figuring out the top-level decomposition as a function of the inputs is the only thing that really needs to be dynamic.

rohany commented 3 years ago

At least for matrix multiply, I find there is a ~10% improvement from hierarchically decomposing the problem like this in other algorithms that don't use reductions, since there is some locality in the larger tile that needs to be operated on. You're right that the top level is the only dynamic level and the lower levels are pretty much static, but I'm not sure how to expose that parallelism without launching tasks for each GPU. It does seem possible to do this with a flat index launch over all GPUs and then maybe a recursive task slicing, but I don't see a direct way from my end to generate a mapper from my IR that would do this.

rohany commented 3 years ago

Hey Mike, I wanted to follow up here and see what your thoughts are on this feature so I can plan around it!

lightsighter commented 3 years ago

It's something that I can put on the list but it probably is not going to get done until sometime in mid-2022 at the earliest. It is a very invasive change that messes with a bunch of different parts of the physical analysis.

rohany commented 3 years ago

I see. I'm shooting for the PLDI deadline on Nov 20, and having this would be necessary for some experiments that I want to run. Given that it's an invasive change, can either something hacky be done that I can keep on a separate branch for my project? Or, earlier you seemed to indicate that there was a way to map these instances to avoid the problems that I mentioned. I don't actually need virtual instances if I can avoid the allocation/fill of the instance + folding back of the instances into the node-level reduction instance. If that's simpler I'll try to implement that! The fills I'm trying to avoid are visible here -- http://sapling.stanford.edu/~rohany/johnson-bgwork/.

lightsighter commented 3 years ago

something hacky be done that I can keep on a separate branch for my project?

There's no good way to hack this. You're welcome to try if you want and I'll answer questions if you do decide to go that route, but you'll be on your own in terms of writing code.

earlier you seemed to indicate that there was a way to map these instances to avoid the problems that I mentioned

I still think doing GEMM with a single-level index space task launch is going to be best for Legion. The recursive GEMM structure works much better in static cases where you are generating code.

rohany commented 3 years ago

There's no good way to hack this. You're welcome to try if you want and I'll answer questions if you do decide to go that route, but you'll be on your own in terms of writing code.

How long would you estimate this would take for someone who hasn't touched any Legion internals? I could maybe give it a try if you gave me a checklist of things that need to be done.

I still think doing GEMM with a single-level index space task launch is going to be best for Legion. The recursive GEMM structure works much better in static cases where you are generating code.

I'm waiting on some GASNet bugs to be fixed before I can really test whether one is better than the other. I guess if my experiments are conclusive one way or the other, then I can decide whether to implement this.

rohany commented 3 years ago

I still think doing GEMM with a single-level index space task launch is going to be best for Legion. The recursive GEMM structure works much better in static cases where you are generating code.

Even if that's the case, I'd still like to know if there is a way to do this mapping. Perhaps it's not a good idea for GEMM, but there may be other operations where it would be beneficial, and it would be good if my compiler could handle them.

lightsighter commented 3 years ago

How long would you estimate this to take for someone who hasn't touched any legion internals? I could maybe give it a try if you gave me a checklist of things that needed to be done.

There have been two other people (one PhD student and one postdoc) who've learned enough of the Legion runtime internals to add a feature like this, and it took each of them more than a year of study and work to get to that point. If it's the only thing you worked on, I think you could probably learn enough in 6-8 months to do it, but it's not something that comes quickly in general.

Even if that's the case, I'd still like to know if there is a way to do this mapping.

Which mapping are you referring to here?

rohany commented 3 years ago

Which mapping are you referring to here?

Sorry, in response to this comment of mine above:

"earlier you seemed to indicate that there was a way to map these instances to avoid the problems that I mentioned. I don't actually need virtual instances if I can avoid the allocation/fill of the instance + folding back of the instances into the node level reduction instance. If that's simpler I'll try to implement that! The fills i'm trying to avoid are visible here -- http://sapling.stanford.edu/~rohany/johnson-bgwork/."