
Code sinking pessimizes common GPU code patterns #42755

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR43785
Status NEW
Importance P enhancement
Reported by Connor Abbott (cwabbott0@gmail.com)
Reported on 2019-10-24 06:43:11 -0700
Last modified on 2019-11-01 03:05:11 -0700
Version unspecified
Hardware PC Linux
CC hfinkel@anl.gov, htmldeveloper@gmail.com, jay.foad@gmail.com, llvm-bugs@lists.llvm.org, Matthew.Arsenault@amd.com, nhaehnle@gmail.com, tpr.ll@botech.co.uk
Fixed by commit(s)
Attachments spilling_test.shader_test (631 bytes, text/plain)
spilling_test.ll (40627 bytes, text/plain)
Blocks
Blocked by
See also
In GPU shaders used by games, it's pretty common to average a bunch of texture
samples in a loop in the fragment shader, e.g. for blurring an image:

vec4 sample = vec4(0);
for (int i = 0; i < NUM_SAMPLES; i++)
    sample += texture(...);

...
... = sample;

The problem happens after we unroll this loop, especially when doing very
aggressive unrolling (as is usually beneficial on GPUs) and NUM_SAMPLES is
very large (say, 8 to 16). If there's any intervening control flow, then LLVM's
code sinking pass will try to pull the entire unrolled loop downwards, but
since texture() is convergent, it can't, so it leaves just the texture
operations:

sample1 = texture(...);
sample2 = texture(...);
...
sampleN = texture(...);
...
... = sample1 + sample2 + ... + sampleN;

And now all of the samples are live at the same time, which results in a huge
increase in register pressure. We noticed this when changing RadeonSI to use
NIR: once unrolling started happening before LLVM, a few shaders in
Civilization: Beyond Earth started spilling.

I'm not entirely sure what to do here. Code sinking can definitely be
beneficial in some cases for register pressure, and disabling it is largely a
wash. We could just disable it in LLVM for AMDGPU/Mesa and do something much
less aggressive in our stack, although this is a problem that likely affects
other non-Mesa GPU stacks using LLVM.
Quuxplusone commented 4 years ago

That sounds plausible. Do you have an .ll example that shows the problem?

Quuxplusone commented 4 years ago

Attached spilling_test.shader_test (631 bytes, text/plain): original GLSL source

Quuxplusone commented 4 years ago

Attached spilling_test.ll (40627 bytes, text/plain): LLVM file

Quuxplusone commented 4 years ago

This is a really interesting problem; take this with a grain of salt because I'm just basing this on your description, but in theory, I think that you want some mechanism to "spill into a reduction", where you recognize that, when spilling, the result is just going into a reduction, and perform the part of the reduction at the spill point instead of actually saving the value itself.
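As a very rough sketch (pseudocode only; spill() and reload() are hypothetical markers for what the register allocator would emit), the difference would be something like:

// ordinary spilling saves the raw value:
sampleK = texture(...);
spill(sampleK);                 // write sampleK out to scratch
...
... = sample1 + ... + reload(sampleK) + ... + sampleN;

// "spilling into the reduction" would instead perform the add at the
// spill point, so only one accumulator stays live (or gets spilled):
partial += texture(...);
...
... = partial + ...;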

Quuxplusone commented 4 years ago
(In reply to Hal Finkel from comment #4)
> This is a really interesting problem; take this with a grain of salt because
> I'm just basing this on your description, but in theory, I think that you
> want some mechanism to "spill into a reduction", where you recognize that,
> when spilling, the result is just going into a reduction, and perform the
> part of the reduction at the spill point instead of actually saving the
> value itself.

So, it's a bit more complicated than that. First off, this can cause problems
even when it isn't so bad that you start spilling, since on AMDGPU and probably
other GPU architectures, using more registers means that you have fewer threads
running in parallel. Of course, scheduling memory reads in parallel also helps,
but only up to a point. These sorts of trade-offs are normally made in the
scheduler, so having another component also try to make them might not be a
great idea, or might be pretty hard to pull off.

Secondly, what you're describing might not be possible. In my example, code
sinking is being conservative and refusing to sink the sample intrinsics, but
if the use is inside divergent control flow, then you legitimately can't sink
them down. And if we add the infrastructure to make it possible to determine
when we can rematerialize the texture sample with respect to convergence, then
code sinking could use the same logic to determine when it can be sunk, and
then we could fix those cases by making sinking more aggressive anyway.
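For illustration (sketch only), the problematic divergent case looks like:

sample = texture(...);      // convergent
if (divergent_cond) {       // condition differs between threads in a wave
    ... = sample;
}

// sinking texture() into the if would execute it with a different set of
// active threads, which is exactly what "convergent" is meant to rule out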
Quuxplusone commented 4 years ago
(In reply to Connor Abbott from comment #5)
> (In reply to Hal Finkel from comment #4)
> > This is a really interesting problem; take this with a grain of salt because
> > I'm just basing this on your description, but in theory, I think that you
> > want some mechanism to "spill into a reduction", where you recognize that,
> > when spilling, the result is just going into a reduction, and perform the
> > part of the reduction at the spill point instead of actually saving the
> > value itself.
>
> So, it's a bit more complicated than that. First off, this can cause
> problems even when it isn't so bad that you start spilling, since on AMDGPU
> and probably other GPU architectures, using more registers means that you
> have less threads running in parallel.

That's correct. On most GPUs, you have some number of registers per thread that
can be used at maximum occupancy, and beyond that, using more registers hurts
occupancy. The trade-offs here are complicated, and we might be able to do a
better job generally, but that seems independent of this problem.
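To put rough numbers on it (illustrative only; the exact limits vary by GPU generation), on GCN each SIMD has a budget of 256 VGPRs per lane:

 64 VGPRs per wave  ->  256 / 64  = 4 waves in flight
128 VGPRs per wave  ->  256 / 128 = 2 waves in flight

so doubling register usage in this range halves the number of waves available to hide texture latency.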

> Of course, scheduling memory reads in
> parallel also helps, but only up to a point. These sorts of trade-offs are
> normally made in the scheduler, so having another component also try to
> might not be a great idea, or be pretty hard to pull off.

I don't understand what you mean, exactly, but if you're saying that, for this
to work, the register-pressure heuristic in the scheduler would need to be
aware of it, then that makes sense to me. It's not clear to me that it's hard
relative to other solutions here, but it would be some work. The good news is,
however, that live ranges that end in reductions are easy to identify locally.

>
> Secondly, what you're describing might not be possible. In my example, code
> sinking is being conservative and refusing to sink the sample intrinsics,
> but if the use is inside divergent control flow, then you legitimately can't
> sink them down. And if if we add the infrastructure to make it possible to
> determine when we can rematerialize the texture sample wrt convergence, then
> code sinking could use the same logic to determine when it can be sunk, and
> then we could fix those cases by making sinking more aggressive anyways.

Okay. What's the property of the sample intrinsic that allows the sinking even
though it is convergent?
Quuxplusone commented 4 years ago
(In reply to Hal Finkel from comment #6)
> (In reply to Connor Abbott from comment #5)
> > (In reply to Hal Finkel from comment #4)
> > > This is a really interesting problem; take this with a grain of salt because
> > > I'm just basing this on your description, but in theory, I think that you
> > > want some mechanism to "spill into a reduction", where you recognize that,
> > > when spilling, the result is just going into a reduction, and perform the
> > > part of the reduction at the spill point instead of actually saving the
> > > value itself.
> >
> > So, it's a bit more complicated than that. First off, this can cause
> > problems even when it isn't so bad that you start spilling, since on AMDGPU
> > and probably other GPU architectures, using more registers means that you
> > have less threads running in parallel.
>
> That's correct. On most GPUs, you have some number of registers per thread
> that can be used at maximum occupancy, and beyond that, using more registers
> hurts occupancy. The trade-offs here are complicated, and we might be able
> to do a better job generally, but that seems independent of this problem.
>
> > Of course, scheduling memory reads in
> > parallel also helps, but only up to a point. These sorts of trade-offs are
> > normally made in the scheduler, so having another component also try to
> > might not be a great idea, or be pretty hard to pull off.
>
> I don't understand what you mean, exactly, but if you're saying that for
> this to work then the register-pressure heuristic in the scheduler would
> need to be aware of it, then that makes sense to me. It's not clear to me
> that it's hard relative to other solutions here, but it would be some work.
> The good news is, however, that live ranges that end in reductions are easy
> to identify locally.

By "live ranges that end in reductions" you mean instructions where all the
sources are killed? We could certainly try to hoist those up, and such a pass
would probably help with this. And yes, what I meant is that the scheduler
should be aware that it's possible to have both the original, low-register-
pressure schedule *and* a more aggressive schedule where we execute more
texture sample instructions in parallel, so that it can make the choice between
them based on its own heuristics. If this hypothetical hoisting pass sees
something like:

foo = texture(...)
bar = texture(...)
...
// in a different basic block
baz = foo + bar

and doesn't hoist up the addition, then, because the scheduler doesn't work
across basic blocks, it forces RA to allocate different registers for foo and
bar. In effect it has decided (in this instance) to prioritize latency hiding
over register pressure, and it doesn't have the sort of information that the
scheduler has to decide whether that's a good idea. We can't do it as part of
spilling for the same reason.
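For contrast, the hoisted form (sketch only) would keep just a running sum live
across the block boundary:

foo = texture(...)
bar = texture(...)
tmp = foo + bar            // addition hoisted; foo and bar die here
...
// in a different basic block
baz = tmp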

>
> Okay. What's the property of the sample intrinsic that allows the sinking
> even if it is convergence?

Sorry, I think I misunderstood your original proposal the first time, but for
the sake of completeness: I was just referring to the fact that, for convergent
operations, it's still possible to sink them across uniform branches. That is,
for something like,

foo = texture(...)
if (cond) {
    ... = foo;
}

it's possible to sink the texture down to its use if (and only if) "cond" is
the same for all threads. We probably can't start to do things like this until
something like D68994 lands and the semantics of "convergent" are put on a more
solid footing though.
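That is (sketch only), when "cond" is known to be uniform, the sunk form would
be legal:

if (cond) {
    foo = texture(...);    // cond is identical across the wave, so the set of
                           // active threads here matches the pre-branch set
    ... = foo;
}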
Quuxplusone commented 4 years ago

Leaving the convergent issue aside, there could be other factors such as memory aliasing preventing the texture loads from being moved below the branch.
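For instance (illustrative only; the sampler and image names are made up), a potentially aliasing write between the texture load and the branch would block the motion:

foo = texture(tex, ...);
imageStore(img, coord, data);   // if img may alias the memory backing tex,
                                // the load can't be moved past this store
if (cond) {
    ... = foo;
}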

So it does seem that either the earlier code sinking should be register-pressure aware or (if we consider the sinking to be a kind of canonicalization) we need some cross-basic-block scheduling later which helps clean this situation up again.

We have run into examples in the past where being able to move a memory instruction across if-statements would have been beneficial to get more latency hiding...

Quuxplusone commented 4 years ago

> These sorts of trade-offs are normally made in the scheduler

The default machine scheduler does consider register pressure and latency and tries to cluster memory operations, so it has all the information it needs to do a good job here, but I don't think it's sophisticated enough to make really good trade-offs. My gut feeling is that this is because it's a list scheduler, so it's always making local greedy decisions, and doesn't have a good way to balance their impact over a whole (large) basic block like yours.

Have you tried any alternative schedulers, e.g. with -mattr=si-scheduler?

Quuxplusone commented 4 years ago

> The default machine scheduler does consider register pressure and latency and tries to cluster memory operations, so it has all the information it needs to do a good job here

Sorry, I take that back. I have only just noticed that "add" instructions are sunk into a different basic block, so there's no way a per-basic-block scheduler can move them back again.