GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

Expose primitive ordered pixel shaders #108

Closed Degerz closed 3 years ago

Degerz commented 4 years ago

According to the Vega ISA documentation, this feature uses the SOPP scalar microcode format. Currently, this feature is only exposed in AMD's D3D12 drivers as "rasterizer ordered views", so I'd like to see the equivalent supported in Vulkan as well, known as VK_EXT_fragment_shader_interlock.

We need this feature to emulate a certain PowerVR GPU for our use case, and in particular we want the fragmentShaderPixelInterlock feature exposed from the extension, so can your team enable this for the Vulkan drivers? (Bonus if the team can also get fragmentShaderSampleInterlock exposed too.) Also, if you are working on this extension, can we get an estimate of when a driver supporting it will be released?

ryao commented 4 years ago

It won't hurt for sure :) That said, AMDVLK on Windows uses a different compiler, so some work in that area will remain. I can't say how much work that will be -- could be a lot, could be fairly trivial.

@Anteru If the community found a volunteer to do the work to implement this in AMDVLK on Linux, would you be willing to reciprocate by putting a few days into implementing it on Windows? If you could at least give us that much, I think all parties would be satisfied.

Of course, this would be dependent on one of the people who care either volunteering or finding a volunteer to go through the process of implementing and upstreaming support. If no one does, then it is fine to leave things as they are.

ryao commented 4 years ago

@Anteru @jinjianrong One more thing. If you guys start giving features to Direct3D and not Vulkan, you risk manufacturing a situation that is like Direct3D 10 vs OpenGL where developers switch to Direct3D 12 because they simply cannot get features from Vulkan. This has the effect of locking software into Windows, which hurts platforms that are backing Vulkan. People invested in open platforms do not want to see Windows get software that is difficult to port because AMD made Direct3D more capable than Vulkan on Windows.

Had no developers wanted to use this, no one would have cared that Direct3D had it, but not Vulkan. Since developers clearly do want to use this, the absence of support in AMD’s Vulkan drivers on Windows, and to a lesser extent on Linux, is an issue to a number of people here. Keeping it exclusive to Direct3D undermines both Vulkan and our platforms.

ghost commented 4 years ago

Well, seeing as things won't change here, maybe someone could open an issue on the mesa-aco GitHub page and ask them to implement it, seeing as it's actually used in a game sold on Steam, Just Cause 3. This is the repo if someone is interested: https://github.com/daniel-schuermann/mesa/issues

ryao commented 4 years ago

Well, seeing as things won't change here, maybe someone could open an issue on the mesa-aco GitHub page and ask them to implement it, seeing as it's actually used in a game sold on Steam, Just Cause 3. This is the repo if someone is interested: https://github.com/daniel-schuermann/mesa/issues

You could file that issue. However, DXVK is currently limited to Direct3D 11.1. It would need to support Direct3D 11.3 before that extension can be used for Just Cause 3. The DXVK author has Polaris hardware, so unless Mesa-ACO finds a way to get it working on Polaris, it could be some time before he implements it, provided that he is willing to do that.

As far as I know, he has no immediate plans to support higher Direct3D versions, although that could change if the drivers for his hardware supported the features that he needs to implement them. Additionally, he would need to implement Direct3D 11.2 first, and the things in it are not fun to implement:

https://docs.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-2-features

Joshua-Ashton commented 4 years ago

D3D11.3 does not imply ROV

ryao commented 4 years ago

@Joshua-Ashton I just learned something new. It looks like you can access them as long as hardware feature level 11_1 is supported and the ROVsSupported bit is set:

https://docs.microsoft.com/en-us/windows/win32/direct3d12/hardware-feature-levels

I am a little confused as to why this is called a Direct3D 11.3 feature if only the 11.1 runtime is needed:

https://docs.microsoft.com/en-us/windows/win32/direct3d11/rasterizer-order-views

In any case, the DXVK author doesn’t have hardware that supports the extension, so I wouldn’t expect ROV to be implemented in DXVK anytime soon.

pent0 commented 4 years ago

@Anteru Sorry for the ping, can I ask whether fragment shader ordering could be considered?

I think it's still beneficial to our case, and I've seen AMD implement it before on OpenGL (here), and not with interlock, so I think there may be some differences.

Degerz commented 4 years ago

@pent0 You might not like the way AMD implements fragment shader ordering on OpenGL (https://twitter.com/grahamsellers/status/403231348602597376). The AMD engineer makes it sound like their implementation only works for SSBOs and not image textures, so using fragment shader ordering with image textures may cause ordering bugs/race conditions to manifest. We also don't know whether using it with SSBOs will work properly or bug out.

pent0 commented 4 years ago

Thanks for the info.


jpark37 commented 4 years ago

I tested an Intel OIT sample, and out of the box, ROV performs worse on all vendors, so at first glance, AMD seems to be correct. However, the effect on AMD is dramatically worse than the competition.

I don't know what you guys are doing in your emulators, but you may want to consider non-ROV alternatives.

Sample: https://software.intel.com/en-us/articles/oit-approximation-with-pixel-synchronization
Code: https://github.com/GameTechDev/AOIT-Update

Intel HD Graphics 530, ROV off -> on (2-node):
Transparent (ms): 1.02 -> 1.62
Resolve (ms): 0.75 -> 0.58
Opaque (ms): 1.90 -> 1.92
Total (ms): 3.98 -> 4.44

NVIDIA RTX 2080 Ti, ROV off -> on (2-node):
Transparent (ms): 0.09 -> 0.30
Resolve (ms): 0.07 -> 0.03
Opaque (ms): 0.05 -> 0.05
Total (ms): 0.30 -> 0.40

AMD RX 5700, ROV off -> on (2-node):
Transparent (ms): 0.11 -> 2.18 !!!
Resolve (ms): 0.10 -> 0.07
Opaque (ms): 0.23 -> 0.19
Total (ms): 0.43 -> 2.48 !!!

pent0 commented 4 years ago

Performance is definitely going to be worse, since ordering means waiting for the last render to finish.

But we would rather have it work slowly than not at all. As far as I know, there is really no easy solution on our side other than interlock (we are translating shaders, and in a single scene multiple fragment shaders use different blending methods, so it's hard to control them).

Thanks though; AMD seems worse on this part. I don't really have another idea (Degerz advises me on this stuff too, but it seems there is no alternative solution he can provide this time). I will just use the extension for now and not worry about which GPUs support it.

Degerz commented 4 years ago

@jpark37

I don't know what you guys are doing in your emulators, but you may want to consider non-ROV alternatives.

We absolutely can not!

In the case of emulation, we can't control how the hardware was designed. If the software takes advantage of a corner case, such as primitive ordering inside a draw call/subpass on real hardware, then we are compelled, for accuracy, to emulate that behaviour as well.

Thanks for the numbers though. That explains it: AMD is roughly ~20x slower(!) when primitive ordering is enabled inside a pixel shader, and I can see why the driver team would want to avoid exposing this worst-case scenario.

All we can do is hope that AMD provides a better hardware design in the future to mitigate this performance drop, so that they'll feel more confident exposing it to developers, or wait until someone posts patches if AMD is unwilling on their end.

pent0 commented 4 years ago

Other graphics devs and I discussed it and decided to ignore the absence of this extension for now, as there is no other workaround on our side. Users can still use the OpenGL backend, which supports programmable blending through texture barriers on AMD, though it is not very fast.

I have the same wish as Degerz: I hope the hardware improves enough that maybe one day programmable blending will be standard.

jpark37 commented 4 years ago

How about a per-pixel linked list for overlapping primitives? Store off primitive id and final color, then sort and blend afterward. Similar to how (I think) OIT works, but not OIT. Sorry if that's a dumb idea.

gharland commented 4 years ago

@Anteru So if it's the primitive ordering that's the problem, the unordered aspect of the extension, which is all we want for voxels/volumetric/GI, would still be useful and performant:

"This extension can be useful for algorithms that need to access per-pixel data structures via shader loads and stores. Such algorithms using this extension can access such data structures in the critical section without worrying about other invocations for the same pixel accessing the data structures concurrently."

if the inefficiency lies with the serialization due to primitive ordering:

"Additionally, the ordering guarantees are useful for cases where the API ordering of fragments is meaningful. For example, applications may be able to execute programmable blending operations in the fragment shader, where the destination buffer is read via image loads and the final value is written via image stores."

could you please consider implementing just the unordered variant to begin with? VK_AMD_pixel_interlock_unordered?

might be useful for the linked-list approach to blending also

phire commented 4 years ago

I don't know what you guys are doing in your emulators, but you may want to consider non-ROV alternatives.

Basically, we need (want, I guess) bit-perfect emulation of exotic alpha blending equations.

We want to read the old color in, blend it with the new color with arbitrary pixel shader code, and then write the result out. No sorting, no managing data structures in memory. Just read, modify, write.

I'm not sure how relevant performance tests of OIT with/without ROV are to this use case.


How about a per-pixel linked list for overlapping primitives? Store off primitive id and final color, then sort and blend afterward. Basically how (I think) OIT works, but not OIT. Sorry if that's a dumb idea.

I guess linked lists are technically feasible, but it's nowhere near as simple as the OIT examples.
OIT examples use a single blending equation over the entire frame, so they can run a single second pass at the end, while an emulated game might continuously switch between many different blend modes within a single frame.

You would either need to do a resolve pass every single time the blend mode changed, or write the blend mode to the linked list.

And then it's very common for games on these old consoles to randomly start rendering with depth writes disabled. We would need to detect this and force a linked list resolve before each draw without depth tests enabled, or we would need to put these depth writes into the same linked list...

That's just a few examples of curve-balls that I can think up.

Trying to design a linked list implementation that works with every single edge case an emulated game might throw at us is a nightmare. In comparison, fragment shader interlock is dead simple to implement and even if it has a large speed penalty it might still be the best solution.

jpark37 commented 4 years ago

That makes sense. Well, I tried. :P

Degerz commented 4 years ago

@jpark37 Per-pixel linked lists still don't give us the needed primitive ordering I think ?

What we would need is, for every draw call/subpass with overlapping/intersecting geometry, to split the geometry into separate draw calls, but I don't even know whether that is viable.

We can already currently express ordering between draw calls/subpass but we can't do this inside a draw call/subpass.

pent0 commented 4 years ago

Correct

phire commented 4 years ago

@jpark37 Per-pixel linked lists still don't give us the needed primitive ordering I think ?

Oh right, I missed that.

You would need to store a draw call id and primitive id into the linked list and sort by that during resolve.
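That resolve step can be modeled on the CPU. A minimal C sketch, assuming an illustrative fragment record and a stand-in "over" blend (not any emulator's actual code): fragments arrive in rasterization order, then get sorted by (draw id, primitive id) so blending replays in API submission order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdint.h>

/* One entry of a per-pixel fragment list; field names are illustrative. */
typedef struct {
    uint32_t draw_id, prim_id; /* sort key: API submission order */
    float r, g, b, a;          /* premultiplied source color */
} Frag;

static int frag_cmp(const void *pa, const void *pb) {
    const Frag *a = pa, *b = pb;
    if (a->draw_id != b->draw_id) return a->draw_id < b->draw_id ? -1 : 1;
    if (a->prim_id != b->prim_id) return a->prim_id < b->prim_id ? -1 : 1;
    return 0;
}

/* Sort one pixel's fragments into submission order, then blend front
 * to back of the list with a standard "over" operator as the stand-in
 * for the real (programmable) blend. */
static void resolve(Frag *list, size_t n, float dst[4]) {
    qsort(list, n, sizeof *list, frag_cmp);
    for (size_t i = 0; i < n; i++) {
        float inv = 1.0f - list[i].a;
        dst[0] = list[i].r + dst[0] * inv;
        dst[1] = list[i].g + dst[1] * inv;
        dst[2] = list[i].b + dst[2] * inv;
        dst[3] = list[i].a + dst[3] * inv;
    }
}
```

On a GPU the sort would run per pixel in the resolve shader; the point is only that the (draw id, prim id) key restores the ordering the rasterizer may have scrambled.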

Degerz commented 4 years ago

@phire So does that mean we should do per-pixel linked lists to emulate programmable blending now ? :D

jpark37 commented 4 years ago

I don't think you would need to split the geometry. I was looking at the HLSL semantics list, and assumed SV_PrimitiveID behaved the way I wanted, but I've never used it. And yeah, if spanning draws, storing draw+primitive id with color is an idea.

phire commented 4 years ago

@Degerz we always had the option of submitting just one primitive per draw call to emulate programmable blending. It's just a question of performance.

Like I said, I guess per-pixel linked lists would work, I've just never really considered it before.

But the worst case would have us spamming that resolve pass so many times per frame that I have no idea what the performance of such an approach would be like.

Degerz commented 4 years ago

@phire Well considering POPS is roughly ~20x slower on red team I would think that per-pixel linked lists might actually end up being competitive on their hardware or even better in terms of performance!

pent0 commented 4 years ago

It would work, but pouring in the effort, which IMO is too complex and so hard it's near impossible, is not worth it for an emulator.

jpark37 commented 4 years ago

I wouldn't extrapolate too much about relative ROV performance yet. It's one test on a new GPU architecture with very young drivers, and just me doing the measurements. It would be nice if AMD confirmed that POPS is indeed just way slower for them, but I would be surprised if they were allowed to say that.

pent0 commented 4 years ago

Basically, if we want to do it, in the case of Vita3K: give each draw call an id, store it in the linked list, detect all subroutines that do blending in each shader used in a frame, generate a shader that does the blending, and reorder the blending code to match the corresponding calls, all in one frame. This is likely even slower than ROV in our case.

And we can't guarantee it works properly. Interlock would be one of the simplest options.

Degerz commented 4 years ago

It would be cool if we could see the Dolphin emulator team implement both approaches and benchmark them on all hardware vendors using both D3D ROVs and per-pixel linked lists.

It'd make for great material for a GPU Open blog post for emulation developers out there.

Edit: I also wonder whether we can accelerate the per-pixel linked lists approach using the VK_KHR_shader_atomic_int64 extension, as hinted by @Anteru, because then the comparison would have to be D3D ROVs vs VK per-pixel linked lists with int64 atomics. I don't know if 64-bit atomics are available in D3D12.
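For context, one well-known way 64-bit atomics get used in this space (an assumption on my part, not necessarily what was hinted at): pack a 24-bit depth into the high bits of a u64 and an RGBA8 color into the low bits, so a single atomic min keeps the nearest fragment per pixel with no list at all. A C model of the packing:

```c
#include <assert.h>
#include <stdint.h>

/* Pack depth (high bits) and color (low bits) so that comparing the
 * packed u64 compares depth first. Field widths are illustrative. */
static uint64_t pack_frag(uint32_t depth24, uint32_t rgba8) {
    return (uint64_t)(depth24 & 0xFFFFFFu) << 32 | rgba8;
}

/* Simulates what GLSL's atomicMin on a u64 image/buffer value would
 * do for a single pixel (the GPU version is a real atomic, of course). */
static void atomic_min_u64(uint64_t *pixel, uint64_t packed) {
    if (packed < *pixel) *pixel = packed;
}
```

Note this only solves opaque closest-fragment resolution; it doesn't give ordered blending, so it complements rather than replaces the linked-list approach.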

pent0 commented 4 years ago

Persona 4: Dancing All Night (JP) - Heartbeat, Heartbreak (Video & Let's Dance)

I still have my hope though

phire commented 4 years ago

It would be cool if we could see the Dolphin emulator team implement both approaches

Sounds like you are trying to bait me into doing your homework.

ghost commented 4 years ago

I know that Redream uses per-pixel linked lists for Dreamcast emulation, if anyone wonders which emulators used this approach before. Also, Play! implements shader interlock for PS2 emulation.

Degerz commented 4 years ago

@phire I'll be honest, I might very well have been! Your project has the most resources, so it's the ideal place for experimentation.

As for Redream, do per-pixel linked lists work perfectly in their case?

Found an interesting presentation from Nvidia on doing OIT. I think the AMD driver engineer was talking about doing an atomic loop in a single pass, but linked lists are still the best bet for the pathological case of a high layer count.

RinMaru commented 4 years ago

No, fragment shader interlock is still needed in Redream.

Degerz commented 4 years ago

@RinMaru Interesting, has the author tried phire's suggestion of storing the draw call ID and the primitive order ID in the linked lists and sorting by them during the resolve pass? (I assume the author is probably already doing this? It would be very nice if the author explained all the implementation details.)

I'm starting to think that no matter how much information we store in the linked lists, it may just not be possible to do programmable blending this way...

phire commented 4 years ago

@Degerz You know, I've just realized you are trying to implement a 5 series PowerVR gpu, which already has fully programmable blending via a framebuffer fetch capability.

To actually emulate that with per-pixel linked lists, you have to put the shader's entire outstanding state into the linked list, and then somehow resume execution from the resolve shader.

To resume execution, you either have to generate a resolve shader which inlines the tail of every single programmable blend shader it might encounter or write a resolve shader that executes arbitrary bytecode out of uniforms. The latter is not as crazy as it might seem, Dolphin's ubershaders are essentially interpreters that interpret TEV shader opcodes in the pixel shader.

fragment interlock would be a much more sane solution for that.

As for Dolphin, the only thing we actually can't do with the fixed function blend unit is dithered blends into 6-bit framebuffers, for example:

DST_COLOR = ((DST_COLOR<<2 * DST_ALPHA<<2 + SRC_COLOR * (255 - DST_ALPHA<<2)) + dither_table[screen_x_pos & 3, screen_y_pos & 3]) >> 2

The effect on the final image is usually so minor that it's hardly worth doing.

I'm currently experimenting with hacking per-pixel linked lists into dolphin, because you nerd sniped me.

pent0 commented 4 years ago

They're SGX 5xx GPUs, and the one Degerz mentioned is from the PS Vita.

To resume execution, you either have to generate a resolve shader which inlines the tail of every single programmable blend shader it might encounter or write a resolve shader that executes arbitrary bytecode out of uniforms. The latter is not as crazy as it might seem, Dolphin's ubershaders are essentially interpreters that interpret TEV shader opcodes in the pixel shader.

It's not a 2000s console anymore. Shaders on the Vita are more complex than that, and P4G may be an example: programmable blending is laid out across multiple opcodes, so we need to detect dependencies. It's also insane that you need to order them correctly, because multiple shaders use multiple blending methods in a scene; I think you said that before. Essentially, a highly efficient method turns into a bottleneck, because we would have to generate blend shaders every frame, and the shaders may also do complex operations, which makes keeping the framerate stable impossible.

If I want it to work, I just want to take the easier path. This is insane work; I don't like it.

pent0 commented 4 years ago

Overall, interlock may sacrifice some performance, but it's the easiest method and everyone likes it. It's also used for compute.

On AMD, the performance sacrifice may be bigger than on other vendors, so we can't take the easy path on AMD. We may need to wait for the hardware to improve or for more attention to come to this. I will take the easy path anyway.

Degerz commented 4 years ago

@phire

You know, I've just realized you are trying to implement a 5 series PowerVR gpu, which already has fully programmable blending via a framebuffer fetch capability.

Yeah, and as is common with mobile GPUs, I think they have dedicated tile memory to accelerate this, so programmable blending in their case is not nearly as bad as observed on red team.

fragment interlock would be a much more sane solution for that.

It is, definitely, from an implementation standpoint, but what about the performance cliff across vendors? On Intel, in their AOIT demo, a critical section costs about ~1.6x as much as running without one (2 layers?). On AMD, this rises to about ~20x (2 layers?), so this might very well be an opportunity for them to find alternatives.

I'm currently experimenting with hacking per-pixel linked lists into dolphin, because you nerd sniped me.

This seemed like an unexplored area with per-pixel linked lists so you could very well be on the verge of making a new discovery!

inolen commented 4 years ago

@Anteru @phire We use per-pixel linked lists in redream, but you run into issues with the rasterization order of overlapping primitives inside a single draw call where the depth is equal, which is common in the user interfaces of many Dreamcast games.

You can store a primitive id at the expense of performance (you'll likely blow out a single vec4 packing the color, depth, blend parameters, next id, prim id). If I recall correctly, at least on NVIDIA / Intel, relying on the rasterization order guarantees provided by fragment shader interlock is faster if your alternative is using more than a vec4 per-fragment to fit in the additional primitive id.

phire commented 4 years ago

@inolen It's a shame, dolphin looks like it is just over the size of a vec4.

32bits of color/alpha, 24bits of depth.

I think we can squash the blend state into 30 bits:
3 bits src factor
3 bits src factor alpha
3 bits dst factor
3 bits dst factor alpha
4 bits of logic ops mode
6 misc control bits
8 bits of constant alpha

So unless we can squash the next index and primitive_id into just 42 bits total, it's not going to fit.
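The 30-bit packing above (reading the fourth field as dst factor alpha) can be sanity-checked in C. The field names and bit positions here are illustrative, not Dolphin's actual register layout:

```c
#include <assert.h>
#include <stdint.h>

/* Proposed 30-bit blend state, one field per line of the list above. */
typedef struct {
    uint32_t src_factor;       /* 3 bits */
    uint32_t src_factor_alpha; /* 3 bits */
    uint32_t dst_factor;       /* 3 bits */
    uint32_t dst_factor_alpha; /* 3 bits */
    uint32_t logic_op;         /* 4 bits */
    uint32_t misc;             /* 6 bits */
    uint32_t const_alpha;      /* 8 bits */
} BlendState;

static uint32_t pack_blend(BlendState b) {
    return  (b.src_factor       & 0x7u)
         | ((b.src_factor_alpha & 0x7u)  << 3)
         | ((b.dst_factor       & 0x7u)  << 6)
         | ((b.dst_factor_alpha & 0x7u)  << 9)
         | ((b.logic_op         & 0xFu)  << 12)
         | ((b.misc             & 0x3Fu) << 16)
         | ((b.const_alpha      & 0xFFu) << 22);
}

static BlendState unpack_blend(uint32_t v) {
    BlendState b = {
        .src_factor       =  v        & 0x7u,
        .src_factor_alpha = (v >> 3)  & 0x7u,
        .dst_factor       = (v >> 6)  & 0x7u,
        .dst_factor_alpha = (v >> 9)  & 0x7u,
        .logic_op         = (v >> 12) & 0xFu,
        .misc             = (v >> 16) & 0x3Fu,
        .const_alpha      = (v >> 22) & 0xFFu,
    };
    return b;
}
```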


Edit: thinking about it more, the blend state is constant over the entire draw call. There are probably fewer than 100 unique blend states per frame. If we preprocessed these on the CPU and packed them into a uniform, then had a node of:

Then it would fit in a single vec4
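With the blend state hoisted into a uniform array, a node only needs a small index, and a vec4-sized (16-byte) node becomes plausible. A C sketch under assumed field widths (24-bit depth, 8-bit blend-state index, 16/16-bit draw/primitive sort key; none of this is Dolphin's actual layout):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte node, i.e. one uvec4 in shader terms. */
typedef struct {
    uint32_t color;       /* RGBA8 source color */
    uint32_t depth_blend; /* 24-bit depth | 8-bit blend-state index */
    uint32_t sort_key;    /* 16-bit draw id | 16-bit primitive id */
    uint32_t next;        /* index of the next node in the pixel's list */
} Node;

static Node make_node(uint32_t color, uint32_t depth24,
                      uint32_t blend_idx, uint32_t draw_id,
                      uint32_t prim_id, uint32_t next) {
    Node n;
    n.color = color;
    n.depth_blend = (depth24 & 0xFFFFFFu) | ((blend_idx & 0xFFu) << 24);
    n.sort_key = ((draw_id & 0xFFFFu) << 16) | (prim_id & 0xFFFFu);
    n.next = next;
    return n;
}
```

The resolve shader would then look up `blend_idx` in the preprocessed uniform array of packed blend states.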

jpark37 commented 4 years ago

Do you need blend state for every node? I'd think that data would be constant for at least a draw. Maybe you could index a buffer of blend states by custom draw id or something. Apologies if I'm not understanding this correctly.

Degerz commented 4 years ago

@phire I had another idea: could we extend DXR shader binding tables beyond the raytracing pipeline in the future, so that we can index the shader blending programs from the nodes in our linked lists?

DXR shader binding tables sound conceptually similar to function pointers.

phire commented 4 years ago

@jpark37 Correct, that's what I'm suggesting in my edit

@Degerz My understanding is that shader cores (NVIDIA, AMD, and Intel) are totally capable of doing indirect branches and have been for years. It's just not really exposed to pixel shaders. OpenGL 4.0 does have the ARB_shader_subroutine extension, but that only exposes uniform control flow that is static per draw call.

The downside is that the whole wave/warp follows the branch. To do dynamic per-thread indirect branching, you would have to disable lanes and loop, executing up to 64 different subroutines before continuing. Performance would be very dependent on how often threads branch to the same blend shader.

But I think GCN and later is technically capable of this. I wonder if Nvidia added hardware to Tesla to accelerate this operation?

Also, I think you would be surprised at the performance of literally writing a bytecode interpreter in your pixel shader and compiling blend programs to bytecode which you store in uniforms. Keep the bytecode simple: only the operations you need and only four or eight registers, spilling any extras to the stack. Your resolve shader will just use a switch statement and a bunch of dynamic indirect array accesses to execute it.

It won't be as fast as the option above, but potentially competitive, within an order of magnitude.
More importantly, it doesn't require the development of any special extensions and works across many GPUs (though watch out for dynamic array indexing performance on Nvidia; they don't have an "index into array of registers" operation like most other vendors).
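The interpreter idea can be shown with a tiny CPU-side model. The opcode set below is invented for illustration (on the GPU the program words would live in uniforms and the loop would run in the resolve shader):

```c
#include <assert.h>
#include <stdint.h>

/* A made-up, minimal blend bytecode: a few registers, a few ops. */
enum { OP_LOAD_SRC, OP_LOAD_DST, OP_ADD, OP_MUL, OP_STORE, OP_END };

typedef struct {
    uint8_t op;  /* opcode */
    uint8_t dst; /* destination register */
    uint8_t a;   /* source register a */
    uint8_t b;   /* source register b */
} Insn;

/* Interpret one blend program for a single channel: src is the new
 * fragment value, dstval the existing framebuffer value. */
static float run_blend(const Insn *prog, float src, float dstval) {
    float reg[4] = {0};
    float out = dstval;
    for (;; prog++) {
        switch (prog->op) {
        case OP_LOAD_SRC: reg[prog->dst] = src;    break;
        case OP_LOAD_DST: reg[prog->dst] = dstval; break;
        case OP_ADD: reg[prog->dst] = reg[prog->a] + reg[prog->b]; break;
        case OP_MUL: reg[prog->dst] = reg[prog->a] * reg[prog->b]; break;
        case OP_STORE: out = reg[prog->a]; break;
        case OP_END: return out;
        }
    }
}
```

A real version would add the factor selects and clamps the console's blender actually offers, but the shape (switch over opcodes, small register file) is the whole trick.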

jpark37 commented 4 years ago

Another idea, how about dividing up the index buffer into indirect draws?

Example: 10 triangles, 30 indices, Triangles 4, 6, and 7 have overlap.

Do a triangle list draw that does just enough work to store enough info to know which primitives overlapped: vkCmdDrawIndexed(commandBuffer, indexCount=30, instanceCount=1, firstIndex=0, vertexOffset=0, firstInstance=0);

Do some sort of pass that takes that information and builds a list of indirect draws:
VkBuffer argumentBuffer, sized to maxDrawCount: DrawIndexed [0:17], DrawIndexed [18:20], DrawIndexed [21:29]
VkBuffer countBuffer: 3

Set up regular draw state, and do multi-draw indirect: vkCmdDrawIndexedIndirectCountKHR(commandBuffer, argumentBuffer, 0, countBuffer, 0, maxDrawCount, stride);
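The batching step in the middle could look like the following CPU-side sketch, assuming a hypothetical overlaps() test (on the GPU this would be the "some sort of pass"): walk the triangles in submission order and start a new indirect draw whenever a triangle overlaps one already in the current batch. With triangles 4, 6, and 7 mutually overlapping, it reproduces the [0:17], [18:20], [21:29] split from the example.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors VkDrawIndexedIndirectCommand. */
typedef struct {
    uint32_t indexCount, instanceCount, firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
} DrawArgs;

/* Stand-in for real screen-space overlap detection. */
typedef int (*OverlapFn)(uint32_t tri_a, uint32_t tri_b);

/* Split a triangle-list draw into batches with no internal overlap.
 * Returns the number of indirect draws written to out (the caller
 * sizes out for the worst case of one draw per triangle). */
static uint32_t build_draws(uint32_t tri_count, OverlapFn overlaps,
                            DrawArgs *out) {
    uint32_t n = 0, batch_start = 0;
    for (uint32_t t = 0; t < tri_count; t++) {
        for (uint32_t p = batch_start; p < t; p++) {
            if (overlaps(p, t)) { /* flush the current batch */
                out[n++] = (DrawArgs){ (t - batch_start) * 3, 1,
                                       batch_start * 3, 0, 0 };
                batch_start = t;
                break;
            }
        }
    }
    out[n++] = (DrawArgs){ (tri_count - batch_start) * 3, 1,
                           batch_start * 3, 0, 0 };
    return n;
}
```

The resulting DrawArgs array is exactly what would be written to the argumentBuffer, with n going into the countBuffer.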

Initial questions:

oscarbg commented 4 years ago

Just two additional cents: @Degerz, if ROV use cases on Vulkan end up using VK_KHR_shader_atomic_int64, then we will have to ask Intel's Windows Vulkan driver team to implement it (or use the interlock EXT in that case); it's available everywhere else on Linux and Windows. EDIT: Intel VK_KHR_shader_atomic_int64 support on Windows looks possible, as it is supported by Anvil now.

In light of the shared tests where AMD's ROV use causes a 20x slowdown, I wonder whether AMD's big slowdowns are really due to the architecture not being as friendly to ROV as NVIDIA's and Intel's, or whether they additionally have to implement software workarounds for hardware bugs in the ROV path that cause more slowdowns than would otherwise be needed. I may be wrong, but one hint may be that Xenia's D3D12 ROV path had notable graphical issues running Red Dead Redemption on Vega the last two times I tested (and there were 6 good months between). Maybe @Triang3l can share thoughts on AMD rendering bugs in Xenia's D3D12 ROV path on Vega at least, and whether the new Navi 5700 cards are known to render correctly in Xenia's D3D12 ROV case.

oscarbg commented 4 years ago

Hey guys, good news to share: we will have the interlock extension supported in MoltenVK on macOS very soon:

https://github.com/KhronosGroup/SPIRV-Cross/pull/1138

So even AMD GPUs on macOS (Vega only right now) will support it. It might be interesting, if DXVK or VKD3D gained ROV support, to run the Intel demos mentioned earlier and see whether the big slowdowns on AMD GPUs are still there on Metal.

Degerz commented 4 years ago

Also, do we tell them to just sort the linked lists by primitive id and store the shadow map framebuffer combiner state too as a solution?

Attempting a bytecode interpreter to generate blend shader programs for our case could be something to consider in the future, once system configurations with GPUs capable of 10+ TFLOPS / 500+ GB/s become common enough, which won't be too far off...

MadByteDE commented 4 years ago

It's not only relevant to emulation, it's related to everything. DXVK needs interlock to implement ROV, and game developers need it to do programmable blending.

Hey guys, I'm just a regular consumer and would like to know whether the stuff you're discussing here may be the cause of this kind of artifacting seen on Navi / Raven Ridge in games using DXVK. Seems like the devs aren't going to answer it any time soon, so this could prevent new people from creating more issues about it over and over again.

Joshua-Ashton commented 4 years ago

@MadByteDE No.

mcoffin commented 4 years ago

@MadByteDE nope (I think) this is just about how to expose some new ngg capabilities