Closed dhernandez0 closed 3 weeks ago
Attention reuse was the original motivation for the ticket before we had output swizzling.
So in attention, I found that re-use helps compared with not having the barriers.
However, my intuition is that less LDS is better than fewer barriers, if we had to trade one for the other. So far we have only had to do this for attention (pre output swizzling), and there it was always better to re-use. I assume that is the case with OutputSwizzling as well.
Thus, I'd say we can consider re-use to be the general strategy and treat the cases where certain things are not meant to be re-used as the exception.
In my past experience, the general solution was that the creator of the allocs defines a "memory pool". By default, all allocs are pooled into a single buffer. However, if the user does not want a certain (group of) allocs to be shared with another (group of) allocs, they can separate them by the notion of pools.
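To make the pool idea concrete, here is a minimal Python sketch. It is not rocMLIR code: the `(name, size, pool)` tuples, the `DEFAULT_POOL` name, and the example allocation names are all hypothetical. It only shows the grouping step; allocs that land in the same pool are candidates for sharing one buffer, and allocs in different pools never alias.

```python
from collections import defaultdict

DEFAULT_POOL = "default"  # hypothetical name for the shared pool


def group_by_pool(allocs):
    """Group (name, size_bytes, pool_or_None) tuples by pool.

    Allocs without an explicit pool fall into DEFAULT_POOL.
    """
    pools = defaultdict(list)
    for name, size, pool in allocs:
        pools[pool or DEFAULT_POOL].append((name, size))
    return dict(pools)


# Hypothetical allocations: two may share a buffer, one is kept apart.
allocs = [
    ("q_tile", 4096, None),
    ("k_tile", 4096, None),
    ("scratch", 1024, "private"),
]
pools = group_by_pool(allocs)
```

Each pool would then be packed into its own buffer independently, so opting out of re-use is just a matter of naming a different pool.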
Also, I didn't quite catch why you need to keep track of deadAllocs up to a certain point. Can you elaborate?
To my mind, once you have the interference graph, you should be able to replace all the allocs with offsets off of a single pool.
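As a rough illustration of that suggestion, the Python sketch below builds an interference graph from live ranges and then places every alloc at an offset inside one pool. The live-range representation, the first-fit placement, and the example names are assumptions made for illustration; this is not the algorithm in the PR, and it ignores alignment and barrier placement.

```python
def build_interference(live):
    """live: {name: (first_use, last_use)}, inclusive program points.

    Two allocs interfere when their live ranges overlap.
    """
    names = list(live)
    graph = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = live[a], live[b]
            if s1 <= e2 and s2 <= e1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def assign_offsets(sizes, graph):
    """First-fit: put each alloc at the lowest byte offset that does not
    overlap any already-placed alloc it interferes with."""
    offsets = {}
    for name in sorted(sizes, key=sizes.get, reverse=True):
        busy = sorted(
            (offsets[n], offsets[n] + sizes[n])
            for n in graph[name] if n in offsets
        )
        off = 0
        for lo, hi in busy:
            if off + sizes[name] <= lo:
                break  # the gap before this busy range is large enough
            off = max(off, hi)
        offsets[name] = off
    pool_size = max(offsets[n] + sizes[n] for n in sizes)
    return offsets, pool_size


# Hypothetical allocs: a and c are never live at the same time.
sizes = {"a": 4096, "b": 2048, "c": 4096}
live = {"a": (0, 3), "b": (2, 5), "c": (4, 7)}
graph = build_interference(live)
offsets, pool_size = assign_offsets(sizes, graph)
```

Here `a` and `c` end up at the same offset, so the pool needs 6144 bytes instead of the 10240 bytes that three separate allocs would take.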
Shall we move the discussion of the regression to the ticket, please? (Just to keep the review of the code separate.)
> Also, I didn't quite catch why you need to keep track of deadAllocs up to a certain point. Can you elaborate?
> To my mind, once you have the interference graph, you should be able to replace all the allocs with offsets off of a single pool.
I use deadAllocs to keep track of colors that are no longer in use; that's needed to avoid introducing extra LDS barriers.
I've modified the code so that we create an alloc per color. That fixes the performance regression, so I think that confirms this is probably an aliasing issue. When the output swizzle pass is performed, we create a single gpu alloc, so there's a potential reason to fix aliasing because it would improve performance. I've created a ticket: https://github.com/ROCm/rocMLIR-internal/issues/1581
Attention: Patch coverage is 82.73196% with 67 lines in your changes missing coverage. Please review.
Project coverage is 77.82%. Comparing base (26c8d17) to head (dc2a60a). Report is 11 commits behind head on develop.
In this PR, we optimize LDS usage by applying graph coloring to create a single large allocation and replace existing allocations with views.
In some cases, this PR introduces an LDS barrier when reusing an existing memory chunk.
I implemented a greedy algorithm that sorts allocations by size in increasing order. Colors (a vector of consecutive colors) are then assigned to each allocation. This does not introduce any performance regression.
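The greedy step described above could look roughly like the following Python sketch. It is an interpretation, not the code in this PR: it assumes each color stands for one fixed-size slot of LDS (`slot` bytes, a made-up granularity), that the interference graph is already available, and that the first free run of colors is always taken.

```python
def color_allocs(sizes, graph, slot=1024):
    """Assign each alloc a run of consecutive colors.

    Assumption: one color == one `slot`-byte chunk of the shared buffer.
    Allocs are visited in increasing size order; each takes the lowest
    run of colors not used by any interfering, already-colored alloc.
    """
    colors = {}
    for name in sorted(sizes, key=sizes.get):
        need = -(-sizes[name] // slot)  # ceiling division
        taken = set()
        for neighbor in graph[name]:
            taken.update(colors.get(neighbor, ()))
        start = 0
        while any(c in taken for c in range(start, start + need)):
            start += 1
        colors[name] = list(range(start, start + need))
    return colors


# Hypothetical sizes and interference edges (a-b and b-c interfere).
sizes = {"a": 4096, "b": 2048, "c": 4096}
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
colors = color_allocs(sizes, graph)

# Total buffer size is the highest color in use plus one, times the slot size.
total_bytes = (max(c for run in colors.values() for c in run) + 1) * 1024
```

In this example `a` and `c` receive the same colors because they do not interfere, so the single large allocation covers six slots rather than ten. Each alloc would then become a view starting at `colors[name][0] * slot`.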
Given that LDS barriers are sometimes necessary, there’s a trade-off between reusing LDS and not reusing it. In some cases, it might be more efficient to avoid reuse. However, I opted to reuse LDS wherever possible, as avoiding it might reduce occupancy. I’d appreciate your input on heuristics to determine whether to apply this pass.
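For the occupancy side of this trade-off, a back-of-the-envelope Python sketch follows. The 64 KiB of LDS per compute unit and the per-workgroup sizes are assumed numbers chosen for illustration, and real occupancy is also bounded by registers and wave slots, which this ignores.

```python
LDS_PER_CU = 64 * 1024  # assumed LDS capacity per compute unit, in bytes


def workgroups_per_cu(lds_per_workgroup, lds_per_cu=LDS_PER_CU):
    """Upper bound on resident workgroups per CU imposed by LDS alone.

    Returns None when the kernel uses no LDS (LDS is not the limiter).
    """
    if lds_per_workgroup == 0:
        return None
    return lds_per_cu // lds_per_workgroup


# Hypothetical kernel: 40 KiB without reuse, 24 KiB with reuse.
without_reuse = workgroups_per_cu(40 * 1024)
with_reuse = workgroups_per_cu(24 * 1024)
```

Under these assumptions reuse doubles the LDS-limited occupancy, which is the kind of gain that would have to be weighed against the cost of the extra barriers in any heuristic.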
ticket: https://github.com/ROCm/rocMLIR-internal/issues/1487