doitsujin / dxvk

Vulkan-based implementation of D3D8, 9, 10 and 11 for Linux / Wine

Discards are always emitted at the ends of shaders. #753

Closed gfxstrand closed 5 years ago

gfxstrand commented 5 years ago

This is a significant performance problem in some apps. On Megadimension Neptunia VIIR, for instance, moving discards early appears to give about a 20x perf improvement over keeping them late. The fundamental problem here is that D3D defines discards in such a way that helper invocations (required for derivatives) are well-defined after a D3D discard but not after an OpenGL or Vulkan discard.
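To illustrate with a hedged sketch (tex, uv, alpha and the 0.5 threshold are placeholders, not from any particular game): the problematic pattern is a kill followed by an implicit-LOD texture fetch, which needs the killed pixel to keep contributing to its 2x2 quad's derivatives. D3D guarantees that; a plain Vulkan discard moved up to the kill point does not.

#version 450

layout(binding = 0) uniform sampler2D tex;
layout(location = 0) in vec2 uv;
layout(location = 1) in float alpha;
layout(location = 0) out vec4 color;

void main()
{
    // In D3D the killed pixel keeps running as a helper invocation, so
    // the implicit-LOD fetch below still sees valid derivatives of uv.
    // With a Vulkan discard here, those derivatives become undefined,
    // which is why the discard currently gets deferred to the shader end.
    if (alpha < 0.5)
        discard;

    color = texture(tex, uv);
}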

I've experimented with trying to fix this inside the driver by writing a pass that attempts to move discards as far up the shader as possible. The pass just checks that it's not moving the discard past any derivatives or texture operations with implicit LOD. However, I'm not sure if driver fixing is really what we want because we can likely do better even if there are derivatives or texture operations with implicit LOD. For example, if you can make the assumption that derivatives (both explicit and those used for texturing) use a 2x2 quad which is always groups of 4 consecutive subgroup invocations, one could emit something like this:

void d3d_kill()
{
    do_discard = true;
    // Only take the real discard once all four invocations of the 2x2
    // quad want it; until then the invocation keeps running, so the
    // quad's derivatives stay defined.
    if (subgroupClusteredAnd(do_discard, 4))
        discard;
}

In this case, you can take the jump for some of the invocations even if not all invocations in the subgroup discard, and even in the presence of derivatives, as long as an entire 2x2 quad is killed. This is basically how we would implement a D3D-style discard in our driver.

Another option would be an extension which allows a new SPIR-V execution mode which says "discards don't affect derivatives" and then just let the driver do what it wants to do. This would likely be easier for DXVK but maybe harder to get the Vulkan working group to swallow as an extension. I'm pretty sure providing the subgroup invocation -> quad mapping is something I could sell as an extension fairly easily.

doitsujin commented 5 years ago

I did consider this before, but dropped the idea because derivatives are technically still undefined within an entire invocation group, so this approach might not work on all hardware. I'm open to revisiting this, but as far as I know there is no spec-compliant way to implement fast early discard.

How does one compile your code sample btw? glslangValidator -V just complains about it requiring SPIR-V 1.3 and I do not know which SPIR-V instruction subgroupClusteredAnd corresponds to. Edit: Figured it out.

gfxstrand commented 5 years ago

Not sure. I didn't try to compile it. :stuck_out_tongue_winking_eye: I imagine something like this is needed:

#extension GL_KHR_shader_subgroup_clustered : require
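For reference, a complete version that glslangValidator should accept might look roughly like this. It's a hedged sketch: the sampler, varyings and alpha test are placeholders, it needs a SPIR-V 1.3 target (e.g. --target-env vulkan1.1) for the subgroup ops, and the non-uniform control flow caveat discussed further down still applies.

#version 450
#extension GL_KHR_shader_subgroup_clustered : require

layout(binding = 0) uniform sampler2D tex;
layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 color;

bool do_discard = false;

// Same helper as in the sketch above: the real discard is only taken
// once the whole 2x2 quad has asked for it.
void d3d_kill()
{
    do_discard = true;
    if (subgroupClusteredAnd(do_discard, 4))
        discard;
}

void main()
{
    vec4 c = texture(tex, uv);
    if (c.a < 0.5)
        d3d_kill();
    color = c;
}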

I did consider this before, but dropped the idea because derivatives are technically still undefined within an entire invocation group, so this approach might not work on all hardware.

That's what extensions are for. :smile: I know the subgroup relation mentioned always holds on Intel since it's baked into the hardware. I'm pretty sure it also holds on AMD and NVIDIA. I'd be happy to draft the extension if you're willing to wire it up.

doitsujin commented 5 years ago

Sounds good. In the meantime I'll try to get your suggestion to work.

gfxstrand commented 5 years ago

FYI: Intel hardware always puts the 2x2 quads in groups of 4 consecutive invocations starting at a multiple of 4 and that's baked into the HW. I talked to someone working on radv and they do as well. So the above should "just work" there. I believe this is also true on NVIDIA. I've drafted a tiny EXT extension to advertise the relationship. A preview version of the spec and an implementation for anv can be found here:

https://patchwork.freedesktop.org/series/52122/

doitsujin commented 5 years ago

The wip-dxbc-fast-discard-v2 branch currently implements this without requiring the extension (v1 uses the extension).

Is there a benefit to using clustered subgroup operations over non-clustered ones in this case? AMD does not support clustered operations, and I would like to avoid having multiple code paths, but I will add it if it is potentially faster on your hardware.

gfxstrand commented 5 years ago

With non-clustered, you'll get the skip only if the entire wave (or subgroup) discards, whereas with clustered, you can jump for single quads. How much does that improve perf? It depends on the shader. If three out of four quads jump, you still execute the shader for the one remaining quad, but it may take less control flow and will probably consume less memory bandwidth. So doing clustered is likely better if you can. Ultimately, one would have to find some examples and benchmark them to know how much it helps, but unless the code to do the subgroup op is terrible, it will help a non-zero amount.
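For comparison, a hedged sketch of the non-clustered variant (d3d_kill_subgroup is just an illustrative name; it assumes GL_KHR_shader_subgroup_vote and the do_discard global from the earlier snippets), which only helps when the entire subgroup wants to discard:

void d3d_kill_subgroup()
{
    do_discard = true;
    // subgroupAll() spans the whole subgroup, so the jump is only taken
    // when every active invocation wants to discard, not per 2x2 quad.
    if (subgroupAll(do_discard))
        discard;
}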

Even without clustered, you can still jump for single quads with ballot:

void d3d_discard()
{
    do_discard = true;
    // Build a mask of the four ballot bits covering this invocation's
    // 2x2 quad (quads are groups of 4 consecutive subgroup invocations
    // starting at a multiple of 4).
    uvec4 quad_ballot = uvec4(0);
    quad_ballot[gl_SubgroupInvocationID >> 5] = 0xfu << (gl_SubgroupInvocationID & 0x1cu);
    if ((subgroupBallot(do_discard) & quad_ballot) == quad_ballot)
        discard;
}

The code is a decent bit uglier, but it's possible. I'd have to chat with some RADV people a bit about what will generate the best code for them; I think the ballot solution might actually be what they want. For Intel, it's impossible to express the "best" thing, because "best" is to just do the regular comparison and then use an ALL4H predicate on the if, which does "if (subgroupClusteredAnd(cond, 4))" in one smooth motion. For that, the clustered op is probably a bit easier to detect and optimize, but I think we could probably optimize the ballot easily enough.

doitsujin commented 5 years ago

I just merged this into master since it works as intended on all relevant implementations.

doitsujin commented 5 years ago

Disabled on AMD due to GPU hangs, so Intel GPUs are now the only ones that benefit.

doitsujin commented 5 years ago

Reopening. The concept of using subgroupAll and subgroupClusteredAnd seems flawed when the discard condition itself is being checked in non-uniform control flow. I've seen a (somewhat silly) shader like this:

if (someCondition) {
  ps_kill = ps_kill || true;
  if (subgroupAll(ps_kill))
    discard;
}

If I understand correctly, subgroupAll only takes active invocations into account, so it looks like we have to use subgroupBallot instead for correctness.

gfxstrand commented 5 years ago

Ugh... You're correct about that.

Unfortunately, the ballot trick, if done within non-uniform control flow, doesn't really work as intended either. Consider this case:

if (someCondition) {
  ps_kill = ps_kill || true;
  if (ballot_quad_all(ps_kill))
    discard;
} else if (otherCondition) {
  ps_kill = ps_kill || true;
  if (ballot_quad_all(ps_kill))
    discard;
}

In this case, at least on Intel hardware, if someCondition || otherCondition is true for every invocation in a quad, but neither condition is true across the whole quad on its own, the if gets executed first followed by the else if. You would then get a true out of the second ballot_quad_all(), but only some of the invocations would make the jump.

This case is still valid, since the only thing left is helper invocations; it's just not as efficient.

gfxstrand commented 5 years ago

Also, FYI, a spec change was landed as part of this week's update which makes all of this legal:

If the subgroupSize field of VkPhysicalDeviceSubgroupProperties is at least 4, the 2x2 neighborhood of fragments corresponds exactly to a subgroup quad. The order in which the fragments appear within the quad is implementation defined.

doitsujin commented 5 years ago

It is true that your example doesn't hit the fast path, but I don't think it's common enough to worry about. Even the non-uniform control flow issue appears to be rather uncommon; I only encountered it in a shader from Nier that hangs on RADV (unfortunately it still does with a simple ballot implementation).

gfxstrand commented 5 years ago

Yeah, I don't think it's hugely important to cover the crazy multiple non-uniform control flow case. As long as what we have is valid in that case, it should be ok.

I took a quick look at your ballot implementation and it won't work well on Intel. We frequently don't run with full subgroups (no, that's not a performance problem; I can explain why if needed), so doing bitcount(ballot(x)) == subgroupSize means we almost never get the optimization. Instead, it'd be better to do something where you have a global variable allBallot = ballot(true) at the top of the shader and then check ballot(x) == allBallot when you actually go to test for a jump. The other option would be to do the slightly more complicated quad thing I suggested above:

void d3d_discard()
{
    ps_kill = true;
    uvec4 quad_ballot = uvec4(0);
    quad_ballot[gl_SubgroupInvocationID >> 5] = 0xfu << (gl_SubgroupInvocationID & 0x1cu);
    if ((subgroupBallot(ps_kill) & quad_ballot) == quad_ballot)
        discard;
}
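For completeness, here is a hedged sketch of the ballot(true) variant described above (the sampler, varyings, alpha test and the allBallot / ps_kill names are placeholders, not DXVK's actual implementation; it assumes GL_KHR_shader_subgroup_ballot):

#version 450
#extension GL_KHR_shader_subgroup_ballot : require

layout(binding = 0) uniform sampler2D tex;
layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 color;

bool ps_kill = false;
uvec4 allBallot;

void d3d_discard()
{
    ps_kill = true;
    // subgroupBallot() only reports currently active invocations, so
    // this only succeeds when every invocation that started the shader
    // is still active here and wants to discard; inside non-uniform
    // control flow the fast path is simply skipped (valid, just slower).
    if (subgroupBallot(ps_kill) == allBallot)
        discard;
}

void main()
{
    // Capture the set of launched invocations while control flow is
    // still uniform.
    allBallot = subgroupBallot(true);

    vec4 c = texture(tex, uv);
    if (c.a < 0.5)
        d3d_discard();
    color = c;
}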

doitsujin commented 5 years ago

Thanks for the heads-up, I'll change the implementation to use ballot(true) before merging the fix.

I'm not entirely sure about the quad granularity hack though. It adds a lot of extra instructions, to the point where it might perform worse in the common case where the early discard cannot actually be performed. I can certainly experiment with it, but I don't want to make it the default without prior testing.

gfxstrand commented 5 years ago

Should be easy enough to build the quad_ballot at the top of main() instead of doing ballot(true). Assuming some optimization (which I haven't written yet!), I would expect it should only take 3 or 4 instructions on Intel. However, if we really only care about the uniform (or mostly uniform) CF case, ballot(true) at the top of the shader should be sufficient.