Make LDS one big pool so we can allocate/deallocate/reuse it

dhernandez0 commented 1 month ago

In this PR, we optimize LDS usage by applying graph coloring to create a single large allocation and replace existing allocations with views.

In some cases, this PR introduces an LDS barrier when reusing an existing memory chunk.

I implemented a greedy algorithm that sorts allocations by size in increasing order. Colors (a vector of consecutive colors) are then assigned to each allocation. This does not have any performance regression.

Given that LDS barriers are sometimes necessary, there’s a trade-off between reusing LDS and not reusing it. In some cases, it might be more efficient to avoid reuse. However, I opted to reuse LDS wherever possible, as avoiding it might reduce occupancy. I’d appreciate your input on heuristics to determine whether to apply this pass.

ticket: https://github.com/ROCm/rocMLIR-internal/issues/1487

manupak commented 1 month ago

Attention reuse was the original motivation for the ticket before we had output swizzling.

manupak commented 1 month ago

So in attention, I found re-use helps as opposed to not having the barriers.

However, my intuition is less LDS is better than less barriers -- if we had to trade. We only had to do this attention so far (pre output swizzling) and there it was always the case that it was better to re-use. I assume that is the case with OutputSwizzling as well.

Thus, I d say we can consider re-use to be the general strategy and we can make it the exception where certain things are not meant to be re-used.

In my past experience, the general solution led to the fact the create of alloc s define a "memory pool". By default, they all be pooled to a single buffer. However, if the user does not want certain (group of) allocs to be shared with another (group of) allocs, they can seperate them by the notion of pools.

manupak commented 1 month ago

Also, i didn't quite catch why you need to keep track of deadAllocs up to certain point. Can you elaborate?

To my mind, once you have the interference graph, you should be able to replace all the allocs with offsets off of a single pool.

manupak commented 1 month ago

Shall we move the discussion of the regression to the ticket please ? (Just to keep the review of the code seperate)

dhernandez0 commented 3 weeks ago

Also, i didn't quite catch why you need to keep track of deadAllocs up to certain point. Can you elaborate?

To my mind, once you have the interference graph, you should be able to replace all the allocs with offsets off of a single pool.

I use deadAllocs to keep track of colors that are not used anymore, that's needed to avoid introducing extra lds barriers.

dhernandez0 commented 3 weeks ago

I've modified the code so that we create an alloc per color. That fixes the performance regression, so I think that confirms this is probably an aliasing issue. When the output swizzle pass is performed, we create a single gpu alloc, so that means there's a potential reason to fix aliasing because it would improve performance, I've created a ticket: https://github.com/ROCm/rocMLIR-internal/issues/1581

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 82.73196% with 67 lines in your changes missing coverage. Please review.

Project coverage is 77.82%. Comparing base (26c8d17) to head (dc2a60a). Report is 11 commits behind head on develop.

Files	Patch %	Lines
mlir/lib/Dialect/Rock/Transforms/ReuseLDS.cpp	80.27%	39 Missing and 19 partials :warning:
...ialect/Rock/Transforms/GridwiseGemmToBlockwise.cpp	85.71%	5 Missing :warning:
mlir/lib/Dialect/Rock/IR/RockDialect.cpp	88.88%	4 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## develop #1611 +/- ## =========================================== + Coverage 77.67% 77.82% +0.14% =========================================== Files 95 96 +1 Lines 25782 26004 +222 Branches 3637 3695 +58 =========================================== + Hits 20026 20237 +211 - Misses 4284 4292 +8 - Partials 1472 1475 +3 ``` | [Flag](https://app.codecov.io/gh/ROCm/rocMLIR/pull/1611/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ROCm) | Coverage Δ | | |---|---|---| | [mfma](https://app.codecov.io/gh/ROCm/rocMLIR/pull/1611/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ROCm) | `77.82% <82.73%> (+0.14%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ROCm#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ROCm / rocMLIR

Make LDS one big pool so we can allocate/deallocate/reuse it #1611

Codecov Report