GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan

Expose primitive ordered pixel shaders #108

Closed. Degerz closed this issue 4 years ago.

Degerz commented 5 years ago

According to the Vega ISA documentation, this feature uses the SOPP scalar microcode format. Currently, this feature is only exposed in AMD's D3D12 drivers as "rasterizer ordered views", so I'd like to see its mirror equivalent, known as VK_EXT_fragment_shader_interlock, supported in Vulkan as well.

We need this feature to emulate a certain PowerVR GPU for our use case, and in particular we want the fragmentShaderPixelInterlock feature from the extension exposed, so can your team enable this for the Vulkan drivers? (Bonus if the team can also get fragmentShaderSampleInterlock exposed too.) Also, if you are working on this extension, can we get an estimate of when a driver supporting it will be released?
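
For reference, this is roughly what the requested feature enables on the shader side: a minimal GLSL sketch of primitive-ordered programmable blending under pixel interlock (the binding and the blend equation here are illustrative assumptions, not the actual emulator's code).

#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Pixel-granularity, primitive-ordered critical section
// (maps to the fragmentShaderPixelInterlock feature).
layout(pixel_interlock_ordered) in;
layout(binding = 0, rgba8) uniform coherent image2D framebuffer;
layout(location = 0) in vec4 srcColor;

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    // This read-modify-write is race-free and executes in primitive order,
    // which fixed-function blending can't express for arbitrary
    // shader-computed blend equations.
    vec4 dst = imageLoad(framebuffer, pixel);
    imageStore(framebuffer, pixel, srcColor + dst * (1.0 - srcColor.a));
    endInvocationInterlockARB();
}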

amayra commented 4 years ago

I guess my next GPU is Nvidia then?

jarrard commented 4 years ago

I'm sure somebody will step up and produce some code towards making it happen. AMD drivers are open source, after all. Nvidia, on the other hand: no chance.

Joshua-Ashton commented 4 years ago

@jarrard I really don't think you're going to find anyone who would want to spend time adding it to AMDVLK that isn't at AMD.

jarrard commented 4 years ago

Maybe not AMDVLK, but possibly the RADV driver.

oddMLan commented 4 years ago

@jinjianrong

Our stance is that we don’t want to implement it. It messes with subpasses and is not the right formulation

WTF? GL_INTEL_fragment_shader_ordering worked perfectly fine in the 17.x Radeon drivers. Actually, it's a requirement for certain emulation software. You're telling me you won't give us any alternative for fragment interlocking in OpenGL? It seems you only care about proprietary APIs (DirectX). Your OpenGL drivers suck big time. Literally everyone agrees on that point.

https://github.com/PCSX2/pcsx2/wiki/OpenGL-and-AMD-GPUs---All-you-need-to-know https://dolphin-emu.org/blog/2013/09/26/dolphin-emulator-and-opengl-drivers-hall-fameshame/

Maybe if I told everyone I know not to touch AMD GPUs with a 10 ft pole, and your sales started to dwindle, you'd start wanting to implement it. Oh yeah, but big thanks for the 3% speedup in certain PC games with each 1.2 GB driver update, while your OpenGL support remains terribly pitiful.

amayra commented 4 years ago

Friendship ended with ayymd. Now NVIDIA is my best friend.

SaltyBet commented 4 years ago

As a heads-up, official AMD ROV support might be coming (I'm guessing post-Navi):

AMD ROV Patent US20200202815A1

Joshua-Ashton commented 4 years ago

@SaltyBet That was filed in 2018.

RinMaru commented 4 years ago

Yeah, it's not coming. There are already a lot of emu devs looking for alternatives out of fear of pissing off AMD users.

gharland commented 4 years ago

I only skim-read it, so correct me if I'm wrong, but it looks like a software solution that would be just as slow as rolling your own.

A fast non-order-dependent critical section would still be nice.

RinMaru commented 4 years ago

I only skim-read it, so correct me if I'm wrong, but it looks like a software solution that would be just as slow as rolling your own.

A fast non-order-dependent critical section would still be nice.

That would be per-pixel linked-list OIT; IIRC it's been done in Redream and other DC emulators recently, basically to work around the issue.

Joshua-Ashton commented 4 years ago

Why is this closed? The issue is still not resolved.

RinMaru commented 4 years ago

Why is this closed? The issue is still not resolved.

Because AMD isn't going to resolve it; it's a feature that is hardly used outside the emulation community. Some devs are looking at other, slower ways to do this because they're afraid of pissing off AMD users.

Joshua-Ashton commented 4 years ago

You could work around it for programmable blending if the resource is in GENERAL layout (i.e. no DCC) by emitting a readback barrier and sampling the current framebuffer's image as a normal image, blending that way.
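
The shader side of that workaround would look roughly like this, assuming the color attachment (kept in GENERAL layout) is also bound as a sampled image and a self-dependency barrier has been emitted between overlapping draws (the binding is hypothetical):

#version 450

layout(binding = 0) uniform sampler2D lastFrame;
layout(location = 0) in vec4 srcColor;
layout(location = 0) out vec4 outColor;

void main()
{
    // Fetch what the attachment contained before this draw and blend
    // manually (premultiplied "over" chosen arbitrarily as an example).
    vec4 dst = texelFetch(lastFrame, ivec2(gl_FragCoord.xy), 0);
    outColor = srcColor + dst * (1.0 - srcColor.a);
}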

Triang3l commented 4 years ago

So I assume the only option we have on Vulkan on AMD is a "mutex buffer" with an R32_UINT per pixel + atomic CAS spinlock, though I'm not sure if that would preserve the order of polygons, especially translucent ones with programmable blending (for opaque, apart from manual depth testing, a "primitive index" buffer could possibly be used, rejecting if new draw–instance–primitive index < last written index, but with blending there's no 1:1 association between a pixel/sample and a primitive/draw anymore, also not sure how wrapping could be handled).
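
For illustration, a minimal GLSL sketch of that spinlock idea (bindings, formats and the blend equation are hypothetical). As said above, the lock only gives mutual exclusion, not primitive ordering, and spinning like this can live-lock on hardware without forward-progress guarantees between fragment invocations:

#version 450

layout(binding = 0, r32ui) uniform coherent uimage2D lockImage;
layout(binding = 1, rgba32f) uniform coherent image2D colorImage;
layout(location = 0) in vec4 srcColor;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    bool done = false;
    while (!done)
    {
        // Try to acquire the per-pixel lock: 0 = free, 1 = held.
        if (imageAtomicCompSwap(lockImage, p, 0u, 1u) == 0u)
        {
            vec4 dst = imageLoad(colorImage, p);
            imageStore(colorImage, p, srcColor + dst * (1.0 - srcColor.a));
            memoryBarrierImage();                   // make the store visible
            imageAtomicExchange(lockImage, p, 0u);  // release the lock
            done = true;
        }
    }
}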

The issue with the readback barrier partial workaround is that it needs to be placed in the command buffer, and thus can't work for self-overlapping draws.

Per-pixel linked lists and sorting by draw/instance/primitive during a resolve pass could work, but it would have a huge memory overhead, and it would impose a limitation on the number of overlaps, unlike what ROV or fixed-function blending provides.
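
For comparison, the core of the per-pixel linked list approach looks roughly like this (the resolve/sort pass is omitted; names, bindings and node layout are hypothetical). The preallocated node pool is where both the large memory overhead and the fixed overlap limit come from:

#version 450

layout(binding = 0, r32ui) uniform coherent uimage2D headPointers; // per-pixel list head
layout(binding = 1, std430) buffer NodeBuffer
{
    uint nodeCounter;  // global allocator
    uvec4 nodes[];     // x = packed color, y = depth bits, z = previous node
};
layout(location = 0) in vec4 srcColor;

void main()
{
    // Allocate a node; a real implementation must check for pool overflow.
    uint node = atomicAdd(nodeCounter, 1u);
    uint prev = imageAtomicExchange(headPointers, ivec2(gl_FragCoord.xy), node);
    nodes[node] = uvec4(packUnorm4x8(srcColor),
                        floatBitsToUint(gl_FragCoord.z),
                        prev, 0u);
    // A later fullscreen pass walks each list, sorts by depth and blends.
}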

RinMaru commented 4 years ago

Per-pixel linked lists are what Redream does for the DC's OIT, and yeah, the memory overhead gets bigger the higher the internal resolution is.

Triang3l commented 4 years ago

@JacobHeAMD @jinjianrong Anyway, arbitrarily dropping features that are "not recommended because of being inefficient" (poor architectural compatibility with subpasses is not a dead-end issue considering Intel and Nvidia are implementing this feature in Vulkan fine) sets a bad precedent and contributes to stagnation of the API (and graphics as a whole), and to even more divergence between APIs and issues for those who want to support them all.

Any tool may be used well, just like any tool may be used badly, but it still has its uses. When we choose the tool to do the job, we know the goals and requirements of the task, the limitations that we can agree upon, and we evaluate the advantages of potential solutions and the drawbacks of each.

Let's take the debatable topic of antialiasing as an example. MSAA offers a sharp, stable (no "defocusing" every time a small movement happens) image, preserving small details, and also provides some transparency effects through sample masking, so it's very nice for vivid, sandbox-feeling, immersive games, and pretty much the only option for VR. However, it's noticeably more expensive, for this reason relatively rarely used in this generation (because of consoles, and because it's a bit complicated to integrate into a rendering pipeline with post-processing effects that use depth), and does not help with shading aliasing. Would those be good reasons to drop MSAA completely in the API even though your GPU supports it fine? Leaving developers with two choices — either blurry options eating small details (TAA, less blurry, but jaggy in motion — FXAA, MLAA, SMAA), or falling back to true supersampling, which would take us to the goal, but result in something even worse than what you wanted by removing MSAA, because that would have much higher performance costs. But if we have some milliseconds to spare on our target hardware, or can simplify some art, and not that many high-frequency low-roughness objects so specular aliasing is not a big issue (or can selectively supersample some parts of shaders or the frame, that cause the most aliasing, but not everything) — why not? MSAA would solve our goal perfectly considering our requirements and limitations. (I really hope you never ever treat this paragraph as a suggestion… at least you're still implementing EQAA even and are not advocating for reconstruction techniques like DLSS, so I guess here we're pretty safe.)

ROV is also a tool with good and bad sides, but those are factors to consider when using it, not a reason to outright make the concept unusable. Yes, it has a flaw of being able to interlock only within individual pixels/samples and thus not letting users benefit from or recreate optimizations used in the fixed-function output-merger like color compression and sample deduplication. But we are aware of that; everything has limitations: rasterization has them, ray tracing has them. It's still a powerful and valid option for various tasks.

If you're doing programmable blending or order-independent transparency, you don't have to use ROV for every single translucent effect in your frame — you can sort on a coarse level, and only use ROV for fine sorting within certain objects that need that, maybe through an additional framebuffer with premultiplied alpha (possibly with even lower bandwidth and memory usage than with per-pixel linked lists, and without potential overflow in case of large overdraw).

If, like in the emulation case, you're using it for pixel packing — first of all, it's the only solution (apart from TBR-like subpass input, which, however, turns multisampling into supersampling) that allows for maximum accuracy (and unlike, maybe, art fidelity, it's not just some subjective "looks good to me" thing that allows tradeoffs: either the original visuals are reproduced correctly, or it's just broken, in some games less, in some games totally). You may also have plenty of milliseconds to use if you're emulating some 2000s console on a 2020 system, so performance may just not be an issue for you (remember cycle-accurate bsnes and higan also). Actually in Xenia, the naïve purely ROV-based output path (even for depth/stencil) is significantly faster than the traditional render target-based one, since the latter involves a lot of copying to support reinterpreting EDRAM data. However, even with ROV, there's still space for optimization. A conservative host depth buffer may be added so true early Z can work (unlike discarding in the shader based on whether the manual depth test passed in the whole quad via ddx/ddy, which still requires the wave to be launched). Another possible optimization is using the fixed-function output-merger where suitable, and only using ROV for formats that don't exist on the PC or for parts of the frame requiring unusual blending — that's not uncommon practice in GPU design too, for instance, where you have fallbacks for cases when cmask/htile may become unusable.

"Inefficient" is not absolutely measurable. It may be relatively less efficient in some cases, relatively more efficient in other, sometimes far more efficient, and it depends on what kind of efficiency you need in each individual case — efficient as in short frame time, or as in completely solving the problem, or even as in development time (Doom Eternal with its 500 FPS is more of an exception than a rule). And you don't need perfect efficiency in all cases, you just need the solution to be efficient enough for your requirements. ROV is efficient enough for us (including on AMD hardware where it's still faster than RTV/DSV in our case), and could work just fine. But now, instead of having it work just fine in our planned Vulkan version, we'll likely have to put a warning asking users to switch to the Direct3D 12 renderer, or to RADV if ROV support is present there, if they're using an AMD GPU. A situation with no winners. At least it's not some S3TC patent status that crippled the functionality for no good, but still there's nothing positive that comes out of simply removing useful tools.

Addition: Could you provide some clarification on "messes with subpasses"? Interlocks exist purely within the fragment shader stage of the pipeline; from the point of view of Vulkan resource dependencies, there are hardly any differences from using regular image/buffer load/store with shader atomics — which is handled fine by Vulkan (it's even stricter actually in the ROV case as you're not supposed to scatter when using a ROV). If the mapping between pixel/sample indices and image/buffer addresses changes, you need to insert what would be the Vulkan equivalent of a D3D UAV barrier into the command buffer — that's fine and makes complete sense, as you're losing interlock-based synchronization for accesses through those addresses, thus you need synchronization on another (pipeline) level. The only thing I can think of with regard to how interlocks may interact with the pipeline is that there's no way in the extension to explicitly break interlocking between subpasses if you need that — but who really cares about such a tiny optimization just to reduce false positives (we're not getting false negatives, so interlocking is still fully functioning), which can be added in another extension anyway?

Passingby1 commented 3 years ago

@jinjianrong @JacobHeAMD It's OK, really; our stance is that the green camp is a better all-around choice for PC users who want to use their machines in whatever way they see fit.

Awesome job, never change.

plonk264 commented 3 years ago

@jinjianrong @JacobHeAMD
Please don't "just close it" without responding to these well-thought-out and well-written responses by Triang3l, Degerz, and ryao.

chrismile commented 3 years ago

Just in case someone is interested: I tested the performance of fragment shader interlock (i.e., the OpenGL/Vulkan counterpart of ROVs) for order-independent transparency on NVIDIA hardware: https://chrismile.net/blog/2020/mlab-sync-comp/

TL;DR: In the tests I performed, using ordered fragment shader interlock for Multi-Layer Alpha Blending (MLAB) on NVIDIA hardware was 4% faster than using spinlocks. Furthermore, fragment shader interlock and ROVs can guarantee memory access ordering, while spinlocks can't. Using per-pixel linked lists for alpha compositing was significantly slower than MLAB with fragment shader interlock and has an unbounded memory requirement. I couldn't test the performance on AMD hardware (at least without first having to fully rewrite my program to use Direct3D), as fragment shader interlock is unfortunately not supported.

Furthermore, the authors of Moment-Based Order-Independent Transparency (MBOIT) used ROVs in their DirectX reference implementation of their rendering technique and got even better performance with ROVs than when using hardware-accelerated blending. A quote from the supplementary material of their paper (https://cg.cs.uni-bonn.de/en/publications/paper-details/muenstermann2018-mboit/): "Surprisingly, the resulting frame times on our test hardware are actually shorter than those obtained with hardware-accelerated additive blending. We do not have a conclusive explanation for this phenomenon but note that a low overhead from using rasterizer ordered views is expected since the shader program is very short."

I hope that maybe the decision to not support fragment shader interlock will be reconsidered at some point in the future. It would be a really handy feature to have, even if it were slightly less performant on AMD hardware than on Intel and NVIDIA hardware.

OtavioRaposo commented 3 years ago

Just don't use AMD GPUs.

jarrard commented 3 years ago

Just don't use AMD GPUs.

Not really a sensible solution, since AMD is the only GPU vendor with an open-source driver for Linux, except for Intel, who have yet to release their GPU(s). Does Intel support this in their open-source drivers yet? If so, then perhaps someone can hack that over to AMD's driver :)

v-fox commented 3 years ago

Just don't use AMD GPUs.

Not really a sensible solution, since AMD is the only GPU vendor with an open-source driver for Linux, except for Intel, who have yet to release their GPU(s). Does Intel support this in their open-source drivers yet? If so, then perhaps someone can hack that over to AMD's driver :)

Well, it actually does.

diego-rbb-93 commented 3 years ago

@jinjianrong still no feedback to all the emu dev users asking around? :/

jarrard commented 3 years ago

Shrug. If I were desperate for this feature, I'd be looking at what Intel is doing to support it and then trying to slap that into the RADV driver. But it seems 100% of the people wanting this feature have absolutely no coding or open-source experience, unfortunately.

gharland commented 3 years ago

Waste of time, Jarrard. Needs to be cross-platform.

devshgraphicsprogramming commented 3 years ago

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

Additionally, this is an inefficient method of performing the typical thing it's often advocated for - order independent transparency. For such an effect we would usually recommend using one of the many two-pass OIT algorithms out there, and making use of multiple subpasses, with the second pass doing the resolve. This is likely the most portably efficient mechanism you can use that works between desktop and mobile parts. We're thus not inclined to support it, as we'd rather not promote an inefficient technology.

However, if you're looking to do direct emulation, we are not sure that really helps you - perhaps you could elaborate on what it is you're trying to emulate exactly and we may be able to advise on an alternative method?

Except for Total War: Three Kingdoms.

The funny thing is that we're porting a game to ChromeOS to run as an Android application, inside the ANGLE sandbox prison, and even the GLES 3.1 implemented by ANGLE reports NV_fragment_shader_interlock.

And it seems AMD is a special boy.

marekolsak commented 2 years ago

We could add this into AMD's Mesa GL driver, or we would accept a 3rd-party contribution adding this feature there, at least to the extent of what DX supports.

Triang3l commented 1 year ago

What's happening on RDNA 3 with POPS, by the way, with src_pops_exiting_wave_id and HW_REG_POPS_PACKER having been removed? Are you not scheduling overlapping wavefronts until the overlapped ones have completed execution now (like EXEC_IF_OVERLAPPED = 0 on the earlier architecture revisions if I understand correctly what it means), or something more interesting?

ryao commented 1 year ago

We could add this into AMD's Mesa GL driver, or we would accept a 3rd-party contribution adding this feature there, at least to the extent of what DX supports.

In hindsight, that would be better than nothing.

Triang3l commented 1 year ago

@ryao I'm currently researching this (being an ISV and a wannabe contributor (froghacker specifically 🐸), not an AMD engineer), most unanswered questions currently are on the register setup side and things like potentially needed implicit barriers between changes related to multisampling and VRS, though I can't promise anything.

My current plan on the shader side (there on GCN5/RDNA/RDNA2 it consists of two parts — overlapped wave awaiting, and then a loop running the critical section code for each overlap layer within the current wave, effectively splitting a part of the shader into smaller "subgroups") so far is:

  1. Find the locations that dominate all begins and post-dominate all ends (taking advantage of SPV_EXT_fragment_shader_interlock's requirement that an end must be executed dynamically after a begin exactly once — thus even if the begin and the end are in different conditionals, it's still possible to estimate a conservative lower bound).
  2. Move the candidate critical section boundaries out of all outer loops, to handle cases when static domination doesn't imply dynamic precedence (like a loop with if (i == 0) { beginInvocationInterlock(); } else if (i == 1) { endInvocationInterlock(); }; see the sketch after this list), and so that there are no breaks from the original shader that need to be routed from the new loop to the original outer loop.
  3. Find the rectangle in the control flow tree that spans all the beginning and the end points moved outside loops. Check if all ends don't precede all begins (in this case, treat an unclosed critical section like a critical section until the end of the shader for simplicity).
  4. Maybe narrow this region to memory accesses inside it as long as loops are not re-entered as a result. This is especially useful if all the memory accesses in the critical section are conditional, while the begin/end are not (per both the GLSL extension requirement statically and the SPIR-V extension requirement dynamically) — AMD GPUs support returning without entering the critical section according to the implementation of D3D ROVs. Note that coherent should have no effect on this — ordered writes without reads may still be done with FSI without coherent in the Vulkan memory model, I think, availability/visibility is probably not needed for write-only access, though I'm not yet sure how this maps to L1 cache usage on AMD specifically with POPS.
  5. Expand the critical section to include all lane-aware operations such as ballots, so that they're unaffected by the new loop narrowing the exec mask, which is not exposed in any way in the SPIR-V control flow and spans an arbitrary conservative region. Note that it's not always possible to isolate the dependency tree of some ballot because its result may go to memory (or to variables in a non-SSA form, including local arrays). This may not be skipped even when there are no subgroup operations inside the critical section, even though the purpose is just ensuring consistency of their behavior between the critical section (sequentially running smaller sub-sub-groups for each overlap layer) and the rest of the shader (running code for all the layers at the same time in the full original fragment shader subgroup) — exactly for the reason of the results having unclear dependencies (store a result of a full-wave ballot before the critical section, load it in the critical section on the same control flow level — it's no longer valid even though from the point of view of SPIR-V it still should be).
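
For illustration, here is the step-2 case above as a whole (hypothetical) shader. The GLSL extension itself statically forbids this placement; this is the SPIR-V-level situation, written in GLSL syntax only for readability: each instruction executes dynamically exactly once, but the begin doesn't dominate the end, so the critical section has to be hoisted out of the loop by the compiler.

#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, rgba8) uniform coherent image2D img;

void main()
{
    for (int i = 0; i < 2; ++i)
    {
        if (i == 0)
            beginInvocationInterlockARB();
        else
            endInvocationInterlockARB();
        // Accesses made during the i == 0 iteration are inside the
        // critical section.
    }
}
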
Triang3l commented 1 year ago

On GCN 5, by the way, I'm getting horrible hangs if POPS_OVERLAP_NUM_SAMPLES doesn't match the (rasterizer?) sample count, though I can't find any setup code for it in PAL or XGL, which is pretty weird (but neither can I for EXEC_IF_OVERLAPPED, for example), and I haven't checked with sample-rate shading yet. So it looks like the per-pixel mode is the only one supported there (though per-sample must still be exposed for compatibility with apps relying on it, since more ordering is always okay: in an extreme example, a Vulkan implementation processing 1 quad at a time in order is technically valid too). I'm not sure if the glitches I'm getting in Nvidia's order-independent transparency sample with the per-pixel mode in my experiments are actually related to the interlock or to something else: the look of the edges is not stable in time, but the effect doesn't look anything like a race condition between waves (no rectangular patterns). This may be caused by the way I'm handling intrawave collisions in my prototype (not inserting memory barriers between iterations yet, for example), but I'm also getting some weirdness in the software spinlock mode in this sample, so I don't know. On more recent hardware, primarily RDNA 2 with VRS, I don't know yet how the sample count should be set up at all, but the registers don't seem to imply any flexibility in the interlock modes; apparently the hardware is designed only for the requirement of per-fine-pixel interlocking in D3D rasterizer-ordered views and GL_INTEL_fragment_shader_ordering, and possibly Metal raster order groups (though I haven't checked yet whether it's implemented as per-coarse-pixel on RDNA 2 or actually just per-fine-pixel).

Early test of a custom RADV POPS implementation on GCN 5

Update: the issue is at least partially in the sample itself.

Update 2: this is intended behavior. MSAA causes adjacent polygons covering the same pixel at the common edge to overlap each other with pixel_interlock, and thus performance naturally drops massively. However, POPS_OVERLAP_NUM_SAMPLES == MSAA_EXPOSED_SAMPLES seems to work perfectly for sample_interlock; this way POPS is pretty fast even with MSAA on GFX9, as long as you only need to access per-sample data, not per-pixel.

Triang3l commented 1 year ago

Since it's Halloween, I need to say that the way it's not the right formulation is extremely SCARY and spooky 🙀😿, especially when it comes to how operations aware of which lanes are currently active (like ballots) interact with the intrawave collision loop (specifically, intrawave collisions result in a part of the shader within one wave being executed first for overlapped lanes, then for overlapping lanes, then for other overlapping lanes, and so on — each time with a narrower set of active lanes than outside the CS). For example:

uint64_t before = ballotARB(true);
// let's say `before` is …0000111111111111
beginInvocationInterlockARB();
uint64_t during = ballotARB(true);
// first iteration: `during` = …0000000000001111
// second iteration: `during` = …0000111111110000
endInvocationInterlockARB();
uint64_t after = ballotARB(true);
// `after` is …0000111111111111 again

This is clearly not right from the point of view of GLSL and SPIR-V (or in an even more horrifying example, from the point of view of ROV loads/stores in HLSL) — there are no control flow constructs (conditionals, loops, returns) in the shader code that would suggest that during should be different than before and after — they're in the same control flow tree node, there are no returns between them. Interlocking is exposed purely on the level of a single invocation, which is not true in the implementation on AMD.

Yesterday I was thinking about how they could be handled more or less safely, but it looks like, for that, it would be necessary to locate all the dependency chains of every ballot that cross the boundaries of the critical section, and include them in the CS. However, this has at least two issues.

One inconvenience is that by including a dependency chain of a ballot crossing the CS boundaries into the CS, you're expanding the CS, thus changing its boundaries — and some ballots that could stay outside previously now may have to be moved into it. Though this can be solved by just running this pass again and again until it makes no more changes.

But a more severe problem, which I already explained in my previous message but whose painfulness I have to highlight, is that obtaining the dependency chain of something is highly non-trivial when variables (including dynamically indexed arrays) or, what's even worse, global memory are involved. Basically, if you're writing into a buffer or an image, potentially any buffer/image load (with restrict, for that specific resource; without restrict, for any non-restrict resource of a type with which aliasing is possible) preceded by that store may be a dependency of that store. And you don't even have to actually write the result of a ballot for the dependency to appear: that would be an extremely weird and rare use case (even more like just a test case). However, there's a much more realistic situation: any write to a manually non-uniform-indexed (for which ballots and first lane reads are commonly used) image/buffer is also a memory dependency on the ballot.

One potential solution that I thought about was forcing all ballots to be inside the critical section. But there's an obvious flaw in it, so I'm of course not going to use it — that would outright ruin the last ballot in the shader that was outside the CS in the original code. Specifically, that would change:

criticalSection {
  // ROV accesses here
}
uint64_t lanesRemaining = ballotARB(true);
while (lanesRemaining) {
  // some non-uniform resource access scalarized manually here
}

into:

uint64_t lanesRemaining;
criticalSection {
  // ROV accesses here
  lanesRemaining = ballotARB(true);
}
while (lanesRemaining) {
  // some non-uniform resource access scalarized manually here
}

thus lanesRemaining would belong not to the whole subgroup, but just to the uppermost overlap layer — but later it would be used for providing data for the entire subgroup. Again, for this to work correctly, all the dependencies of lanesRemaining will have to be moved into the critical section.

I'm really not sure that I want to spend a huge amount of time trying to untangle all this mess, so at least at first I'll probably just leave a // FIXME for now. While this would of course result in behavior that makes no sense from the SPIR-V or GLSL point of view, at least the change of the set of active lanes would happen in locations that are predictable and can be taken into account. Specifically, they will be the OpBeginInvocationInterlockEXT and OpEndInvocationInterlockEXT themselves, or, if they are in different control flow nodes, and thus the boundaries of the critical section need to be moved to a parent control flow node to include both, the critical section will start right before and/or end right after a control flow construct, such as a conditional or a loop — where the writer of the shader (unless they explicitly assume that the condition is uniform) would expect the active lanes to naturally potentially change. For this reason, I'll also drop my idea of narrowing the CS to the memory accesses inside it (which would help if the shader, for instance, is written in GLSL and thus adheres to the extremely strict control flow rules of GL_ARB_fragment_shader_interlock, and puts the CS begin/end instructions on the outermost control flow level, but does all the memory accesses between them conditionally), as that would essentially result in something even more broken, unpredictable and uncontrollable — basically in the way ballots interact with ROV accesses on Direct3D.

Of course we could use more radical solutions, such as going Intel's sendc way and disabling intrawave collisions completely if ballots are used, if that's possible on the hardware (loading intrawave collisions is switchable separately from loading overlapped wave info for some reason, but I'm not sure if disabling that reliably works) — or if not, wrapping the whole shader in the CS (but then it may be reasonable to switch off EXEC_IF_OVERLAPPED too, but again, I don't know if that can be relied on). However, this edge case of interaction of ballots and FSI is probably too weird for real usage — but what's much more realistic is that a shader would want the critical section to be as short as possible, especially if it does just programmable blending (like some simple overlay/hard light) somewhere in the end; it wouldn't make much sense not to do some heavy, but independent, lighting work earlier in the shader in parallel between overlapped and overlapping invocations, no matter if they are in the same wave or in different ones. So in the end, I think I'll just keep ballots broken, but broken in a controllable way, as the alternatives that I can imagine seem to be unreasonable.

Triang3l commented 1 year ago

@jpark37 While this was very long ago, if you still remember the details of your tests, could you please provide the settings you had in the Intel OIT sample?

Most importantly, was MSAA used in your test run on AMD, and what exact algorithm was used without ROV in your testing setup (if any OIT at all)? MSAA specifically has a massive performance hit with ROV on AMD due to adjacent primitives overlapping each other as I found out two comments above, but that applies only to the PixelInterlock modes. Without MSAA, or with MSAA in the SampleInterlock mode, in Nvidia's OIT sample, in a spinlock -> interlock comparison, I was getting a ratio similar to your Intel and Nvidia results (on the RX Vega 10, without MSAA, 22ms > 26ms if I recall correctly). MSAA with PixelInterlock, on the other hand, was closer to what you were getting on AMD, though even worse — a 15x-ish increase. However, the spinlock is also an approach that's very hostile to parallelism, so maybe the spinlock was just slow in the first place, and the interlock turned out to be just slightly slower.

Though I'll probably also try running it by myself when I finish other tasks. I also wonder, by the way, since it's a D3D11 sample, whether implicit UAV barriers might have caused a significant drop, or was ROV actually the bottleneck there.

jpark37 commented 1 year ago

While this was very long ago, if you still remember the details of your tests (https://github.com/GPUOpen-Drivers/AMDVLK/issues/108#issuecomment-525144229), could you please provide the settings you had in the Intel OIT sample?

Sorry, I don't.

Most importantly, was MSAA used in your test run on AMD

This much is unlikely though because I'll always disable MSAA if the option is in front of me.

devshgraphicsprogramming commented 1 year ago

Why is this closed? The issue is still not resolved.

Because AMD isn't going to resolve it; it's a feature that is hardly used outside the emulation community. Some devs are looking at other, slower ways to do this because they're afraid of pissing off AMD users.

Three AAA games use it, and AMD has this feature on DX12.

Also, we plan to use it to do CSG in a single pass for a CAD app; at this rate, we'll open a popup saying "Buy a real GPU" and open a browser with Amazon and eBay searches for Nvidia and Intel when we detect an AMD GPU.

Triang3l commented 1 year ago

Yes, and I did more research recently for my future blog post: the deterministic ordering, and thus the lack of temporal noise, in overflow handling makes fragment shader interlock a much more reliable solution even for order-independent transparency compared to other two-pass methods, such as ones based on a spinlock. This was also cited in the GRID 2 article (for them, even 2 nodes per pixel were enough for order-independent transparency, and 4 nodes for Adaptive Volumetric Shadow Maps for smoke lighting) and in the MLAB benchmark. Without fragment shader interlock, no matter how advanced your tail blending algorithm is, it will always be broken anyway, because you'll just have incoherent noise if any overflow happens. Like an analog TV with no antenna connected, on your trees or in your glass panes.

Additionally, with fragment shader interlock you can do OIT partially, coarsely sorting large batches of geometry (level map tiles, objects, meshlets) and doing fine OIT inside those batches and between nearby batches, including to handle intersecting polygons (which are very common in foliage). And if you sort batches by the farthest depth in them (conservatively is enough), you can compare the sort key of the current batch with the closest OIT fragment depth in the pixel so far; if it turns out to be closer, you can just safely resolve OIT as soon as that happens and free all your OIT layers for reuse (without causing any pipeline stalls for pixels that don't need OIT, unlike an explicit resolve pass, even a stencil-masked one, with pipeline barriers, which also wouldn't work with instancing or mesh shading), as you know that all new fragments will be closer from that moment on. This can effectively give you infinite layers in the view, with only a small number of layers within object "clusters" needed in RAM.

Other uses that come to mind are deferred decals — blending into the normal G-buffer, as demonstrated by Nvidia (especially useful for decals on curved surfaces); or drawing huge numbers of sorted particles with a custom blending equation (like Hard Light for both lightening and darkening), as well as with per-particle blending equation selection (especially useful with bindless textures — to have all the additive fireworks and all the alpha-blended smoke in a single draw command with correct ordering between each other — and you can't just put fireworks in one draw command and smoke in another, as you wouldn't be able to mix ordering of the particles between the two).
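
As a concrete sketch of the custom blending case: Hard Light both lightens and darkens depending on the source value, so it can't be mapped to fixed-function blend factors, but with interlock it's just shader code (binding and format here are hypothetical).

#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, rgba16f) uniform coherent image2D framebuffer;
layout(location = 0) in vec4 particleColor;

vec3 hardLight(vec3 src, vec3 dst)
{
    // 2*src*dst where src <= 0.5, 1 - 2*(1-src)*(1-dst) otherwise.
    return mix(2.0 * src * dst,
               1.0 - 2.0 * (1.0 - src) * (1.0 - dst),
               step(0.5, src));
}

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    vec4 dst = imageLoad(framebuffer, p);
    vec3 blended = hardLight(particleColor.rgb, dst.rgb);
    imageStore(framebuffer, p,
               vec4(mix(dst.rgb, blended, particleColor.a), dst.a));
    endInvocationInterlockARB();
}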

On the implementation side, by the way, I'm somewhat worried about the changes to the POPS setup introduced by GFX11. Specifically, the POPS_OVERLAP_NUM_SAMPLES setting is now gone. Currently it's difficult for me to allocate the money to purchase a testing device, so I can't check this by myself. But can someone (@Anteru possibly?) please confirm how POPS behaves on RDNA 3 with sample-rate shading?

Direct3D and Metal (and Intel's old extension) only require sample-granularity interlocking with sample-frequency shading. However, Vulkan and OpenGL fragment shader interlock gives explicit control of the interlock scope to the shader via its execution mode — so it's still possible to request pixel-level interlock, which offers wider guarantees, in a sample-rate shader (like via POPS_OVERLAP_NUM_SAMPLES = 0 on Vega/RDNA/RDNA2). And if the device supports the fragmentShaderPixelInterlock feature, it must expose whole-pixel-scope interlock regardless of the shading frequency of the shader. However, without the fragmentShaderPixelInterlock feature, it's not possible to support ROVs in DXVK and VKD3D at all, as PixelInterlock execution modes wouldn't be usable even conditionally, while PixelInterlockOrdered would be required for multisampling without sample-rate shading — even though in native Direct3D ROVs would just work naturally.
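
A small sketch of that Vulkan-specific case: reading gl_SampleID forces the shader to sample-rate shading, yet it still declares pixel-scope interlock, which is only valid if fragmentShaderPixelInterlock is exposed (the binding is hypothetical).

#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, r32ui) uniform coherent uimage2D perPixelOverdraw;
layout(location = 0) out vec4 outColor;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    // Serialized across all sample invocations of all overlapping
    // fragments covering this pixel, not just those of one sample.
    uint n = imageLoad(perPixelOverdraw, p).x;
    imageStore(perPixelOverdraw, p, uvec4(n + 1u, 0u, 0u, 0u));
    endInvocationInterlockARB();
    outColor = vec4(vec3(float(gl_SampleID)), 1.0); // forces sample-rate shading
}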

TheLastRar commented 1 year ago

You briefly mentioned custom blending equations, and many have also mentioned wanting this extension for programmable blending; it's possible to use feedback loops for that purpose: https://registry.khronos.org/vulkan/specs/1.2-extensions/html/vkspec.html#renderpass-feedbackloop

Mesa zink uses this to implement fbfetch (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12603), which Mesa uses to support the GL_KHR_blend_equation_advanced extension.

PCSX2 uses it to support non-standard blend modes. This requires splitting render passes and using barriers (with VK_DEPENDENCY_BY_REGION) to ensure synchronization. Performance is reasonable on Nvidia (even with a very large number of draws); however, AMD doesn't fully support VK_DEPENDENCY_BY_REGION and has worse performance as a result (see this Reddit post & AMD Community post), which seems to be a hardware limitation.

I don't know if OIT can be supported by the above approach, and I don't know how performance compares to using shader interlocks w.r.t. custom blending.
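
For reference, the shader side of that feedback-loop path is just a subpass input read plus manual blending; the render-pass splitting and the VK_DEPENDENCY_BY_REGION self-dependency barriers all live on the API side. A minimal sketch, with a multiply blend chosen arbitrarily as an example of a non-standard mode:

#version 450

layout(input_attachment_index = 0, binding = 0) uniform subpassInput lastColor;
layout(location = 0) in vec4 srcColor;
layout(location = 0) out vec4 outColor;

void main()
{
    // Read what the color attachment held before this draw and blend
    // in the shader.
    vec4 dst = subpassLoad(lastColor);
    outColor = vec4(srcColor.rgb * dst.rgb, srcColor.a);
}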

Triang3l commented 1 year ago

@TheLastRar Fragment shader interlock and shader framebuffer fetch (I mean Arm's ordered version of the latter, not an explicit barrier — which causes excessive synchronization, especially on AMD, which doesn't have BY_REGION as far as I know, though even on mobile tiled GPUs what's basically expected from BY_REGION at worst is simply not flushing/reloading tile memory, making the barrier just tile-local rather than global — and which doesn't support draw commands with overlap, let alone intersecting primitives) can both be used to implement programmable blending. However, they're quite different in the details, so it would be ideal if hardware supported both; but with Intel being the only PC graphics card developer currently supporting the latter (while all of the biggest 3 have fragment shader interlock in their hardware and in at least some of their drivers), FSI is effectively the only option on the PC now.

But in general, fragment shader interlock offers massively more flexibility than shader framebuffer fetch.

The only advantage of SFBF I can imagine is that it supports late depth/stencil test, so it works directly with things like alphatested surfaces. With fragment shader interlock, the write happens in the shader, so your only choices are early depth/stencil (with post-depth coverage with MSAA), which only works for opaque surfaces not modifying the depth or the stencil reference from the shader, or full-blown software depth testing.

However, FSI, being a shader part rather than an output-merger one, allows for arbitrary addressing, and that removes lots of limitations.

Also, I'm not entirely sure about the requirements of Arm's rasterizer-ordered attachment access (maybe subpassInputMS is allowed, but I don't know for sure), but compared to OpenGL ES SFBF, FSI has one advantage for MSAA: you can still use pixel-frequency shading with FSI, and access per-sample data based on the input coverage mask with sample interlock, or per-pixel data as usual with pixel interlock. The OpenGL ES SFBF specification says: "Reading the value of gl_LastFragData produces a different result for each sample. This implies that all or part of the shader be run once for each sample…", but in reality, as far as I know, it's always a full fallback to sample-rate shading, which cancels out the idea of MSAA on the performance side.
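
A sketch of that MSAA advantage: a pixel-frequency shader using sample-granularity interlock, touching only the samples in the input coverage mask (assumes 4x MSAA; the binding is hypothetical).

#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(sample_interlock_ordered) in;
layout(binding = 0, rgba8) uniform coherent image2DMS framebuffer;
layout(location = 0) in vec4 srcColor;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    for (int s = 0; s < 4; ++s)
    {
        // Only blend the samples this fragment actually covers.
        if ((gl_SampleMaskIn[0] & (1 << s)) != 0)
        {
            vec4 dst = imageLoad(framebuffer, p, s);
            imageStore(framebuffer, p, s, srcColor + dst * (1.0 - srcColor.a));
        }
    }
    endInvocationInterlockARB();
}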

Note that with FSI, you can still take advantage of texture tiling (if I understand correctly, it's even the same 64KB_R_X for both framebuffer attachments and storage images on RDNA and RDNA 2 normally), and as far as I know, modern AMD GPUs support internal compression for storage images as well.

But again, what's most important is that there's no SFBF anywhere on the PC except for Intel GPUs (and maybe Innosilicon, Moore Threads, though I don't know the details about them).

devshgraphicsprogramming commented 1 year ago

…numbers of sorted particles with a custom blending equation (like Hard Light for both lightening and darkening), as well as with per-particle blending equation selection (especially useful with bindless textures — to have all the additive fireworks and all the alpha-blended smoke in a single draw command with correct ordering between each other — and you can't just put fireworks in one draw command and smoke in another, as…

Another fun use for ROV is rendering to and blending non-renderable formats like RGB9E5 or some custom stuff.
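
A sketch of that: the image is stored as R32_UINT and the shader does the shared-exponent (un)packing itself inside the interlock. The packRgb9e5/unpackRgb9e5 helpers below are hand-written, simplified approximations of the RGB9E5 encoding (GLSL has no built-ins for it), and the bindings are hypothetical.

#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, r32ui) uniform coherent uimage2D sharedExpImage;
layout(location = 0) in vec4 srcColor;

vec3 unpackRgb9e5(uint v)
{
    // 5-bit shared exponent in bits 27..31 (bias 15), three 9-bit mantissas.
    float scale = exp2(float(int(v >> 27u) - 15 - 9));
    return vec3(uvec3(v, v >> 9u, v >> 18u) & 0x1FFu) * scale;
}

uint packRgb9e5(vec3 c)
{
    // Simplified encoder: clamps to the format's range and skips the
    // spec's exact rounding/overflow fixup.
    c = clamp(c, 0.0, 65408.0);
    float maxc = max(max(c.r, c.g), max(c.b, 1e-8));
    int e = clamp(int(ceil(log2(maxc))), -15, 16);
    uvec3 m = min(uvec3(c * exp2(float(9 - e)) + 0.5), uvec3(511u));
    return (uint(e + 15) << 27u) | (m.b << 18u) | (m.g << 9u) | m.r;
}

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    vec3 dst = unpackRgb9e5(imageLoad(sharedExpImage, p).x);
    vec3 blended = srcColor.rgb + dst * (1.0 - srcColor.a);
    imageStore(sharedExpImage, p, uvec4(packRgb9e5(blended), 0u, 0u, 0u));
    endInvocationInterlockARB();
}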

devshgraphicsprogramming commented 1 year ago

You briefly mentioned custom blending equations, and many have also mentioned wanting this extension for programmable blending; it's possible to use feedback loops for that purpose: https://registry.khronos.org/vulkan/specs/1.2-extensions/html/vkspec.html#renderpass-feedbackloop

Mesa zink uses this to implement fbfetch (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12603), which Mesa uses to support the GL_KHR_blend_equation_advanced extension.

PCSX2 uses it to support non-standard blend modes. This requires splitting render passes and using barriers (with VK_DEPENDENCY_BY_REGION) to ensure synchronization. Performance is reasonable on Nvidia (even with a very large number of draws); however, AMD doesn't fully support VK_DEPENDENCY_BY_REGION and has worse performance as a result (see this Reddit post & AMD Community post), which seems to be a hardware limitation.

I don't know if OIT can be supported by the above approach, and I don't know how performance compares to using shader interlocks w.r.t. custom blending.

IIRC, unless you have the brand-new Vulkan ARM/EXT extension meant to replace OpenGL ES Pixel Local Storage or framebuffer fetch, subpass feedback loops are limited to a single pixel overwrite cycle; after that you need a barrier (which can be by-region).

PLS + PSI are the best of both worlds, because you can use the local framebuffer/tiler memory to store your MLAB4 buckets rather than a coherent storage image.

Triang3l commented 1 year ago

I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷‍♂️ https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22250

John-Gee commented 1 year ago

I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷‍♂️ gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22250

Fantastic news, thank you for your work on this!

SopaDeMacaco-UmaDelicia commented 1 year ago

I guess this is not an April Fools' Day joke? 😜 Even though this situation is complete tragicomedy and farce 🤷‍♂️ https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22250

What a chad 💪😎

rijnhard commented 1 year ago

It just got merged into Mesa's RADV development branch.

r2rX commented 1 year ago

Congratulations, @Triang3l, and well done. AMD users are indebted to you; only appreciation and admiration for this awesome contribution. :)

oddMLan commented 1 year ago

Brilliant! The FOSS community has done it again 🎉 I guess this goes to show support on Windows is possible, right? ........... What do you mean there aren't any open source drivers on Windows? ........... What do you mean AMD has to implement it? But they just said they won't! ☹️

https://www.youtube.com/watch?v=Lo4DMz6fZG0

Moonlacer commented 1 year ago

Hello, AMD Vulkan developers! Has your stance on supporting this extension changed within the last 4 years? There seem to be plenty of examples given here of how this affects the user experience on any AMD card using the proprietary Windows drivers, so I would really like to know your current (updated) thoughts on this matter.

Best regards, Moonlacer

Squall-Leonhart commented 8 months ago

VK_EXT_fragment_shader_interlock has been added to amdvlk in 194a181da7e2cca5f70ec0f9e65119955b3d2b47

RinMaru commented 8 months ago

VK_EXT_fragment_shader_interlock has been added to amdvlk in 194a181da7e2cca5f70ec0f9e65119955b3d2b47

That's not Windows though, is it?