KhronosGroup / Vulkan-Ecosystem

Public repository for Vulkan Ecosystem issues
Apache License 2.0

Plans for providing raster order views (aka ROV)-like functionality in Vulkan? #27

Closed oscarbg closed 5 years ago

oscarbg commented 6 years ago

Seeing the stream output request thread motivated me to open this request. I'm asking because this is now supported by every desktop vendor (by Nvidia and Intel iGPUs for almost three years): AMD Vega now joins Nvidia Maxwell and newer and Intel Skylake and newer.. It was added to D3D12, and even to D3D11 (in v11.3), at the same time as other features like conservative rasterization, which has been supported in Vulkan since early this year.. It should be useful for projects like VKD3D and DXVK (possibly some D3D12 games already use it opportunistically?). EDIT: forgot to say that Metal2 supports it as an optional cap (they call it RasterOrderGroupsSupported). I checked, and it's exposed on the Intel Skylake and AMD Vega Metal drivers, so as a plus it could be supported by Vulkan on macOS (aka MoltenVK) without much work, as soon as the SPIRV-Cross Metal output supports it..

Also, as a real use case, there is a GDC17 talk showing how to mix conservative raster + ROV for "improved AA":

Raster Ordered Views and Conservative Rasterization: Rahul Sathe (NVIDIA) and Evgeny Makarov (NVIDIA) will discuss how conventional MSAA rasterization can result in a lot of overhead throughout the rendering pipeline. They show how hardware MSAA can be combined with ROVs instead of conventional render targets during rasterization for better AA at reduced cost. They also demonstrate a technique that uses conservative rasterization to place the sample(s) in a fully programmable way. They use raster order views to ensure that pixels along the shared edges of the triangles generated by clipper are handled correctly.

HansKristian-Work commented 6 years ago

I think some questions should be answered before raising this with the working group:

Basically, I think we need to establish how important this is compared to all other things that could be added/fixed in Vulkan.

oscarbg commented 6 years ago

OK, not such a fast response.. it took some time to gather data.. hope you appreciate the effort..

1) Yes, some demos/content with source code are available (even engines like Lumberyard):

I attach links to samples (with source code available for all of them) using ROV from D3D11(.3), D3D12, OpenGL and Metal2.. *ROV on Metal2: forgot to say that A11 also adds support for it on mobile GPUs..

see both of: https://developer.apple.com/documentation/metal/about_gpu_family_4/about_raster_order_groups https://developer.apple.com/videos/play/fall2017/605/

Raster order groups allow Metal 2 apps to precisely control the order of parallel fragment shader threads accessing the same pixel coordinates. Learn how A11 extends raster order groups with support for multiple groups and adds new capabilities for accessing threadgroup memory. See how you can improve the performance of single pass deferred shading and order independent transparency.

two samples:

1)Deferred Lighting with Raster Order Groups https://developer.apple.com/sample-code/metal/Deferred-Lighting-with-Raster-Order-Groups.zip

2)Order Independent Transparency with Imageblocks (uses ROV too) https://developer.apple.com/sample-code/metal/Order-Independent-Transparency-with-Imageblocks.zip

*ROV usage from OpenGL: https://software.intel.com/en-us/articles/fragment-shader-ordering-with-opengl-42

*ROV usage for OIT (DX11_3), Intel dev demo:

https://software.intel.com/en-us/articles/programmable-blend-with-pixel-shader-ordering (download "code samples add win 10 support"; the other one uses the early Intel D3D extension)

*Lumberyard 1.10 added OIT on D3D12 using ROV; see: https://docs.aws.amazon.com/lumberyard/latest/userguide/graphics-rendering-order-independent-transparency.html which says:

OIT requires the following:

Hardware requirements: DirectX 12_1 feature level compatible graphics card (Nvidia Maxwell & Pascal, 4th generation Intel core processors)

Software requirements: DirectX 11.3 and 12 runtime on Windows 10 compiled with Windows 10 SDK

note you can explore the Lumberyard source code for the implementation..

*Voxelization using ROVs and conservative raster (D3D12) ("My weekend project: voxelizing on the GPU using Rasterizer Order Views": https://twitter.com/MyNameIsMJP/status/937552329359859712) code here: https://github.com/TheRealMJP/BakingLab commit adding ROV: https://github.com/TheRealMJP/BakingLab/commit/5b29f2ee198b80582b8a2c99571ac956556999be commit adding conservative raster: https://github.com/TheRealMJP/BakingLab/commit/54d8e2f6340b4589416368b44b5b43b48f6e15d1

2) I don't know about current AAA games.. I don't expect any D3D11 game to use D3D11.3 ROV.. but perhaps, as said before, D3D12 games use it opportunistically.. it's not known because current D3D to Vulkan wrappers are either D3D11 (DXVK) or D3D12 but not mature enough (VKD3D).. I expect that once VKD3D gets integrated into Wine for D3D12 support, people running D3D12 games will find out whether games use it or not..

3) The feature is considered "fast".. for example, running "https://software.intel.com/en-us/articles/programmable-blend-with-pixel-shader-ordering" with ordering enabled vs disabled, performance doesn't slow down as significantly as it would with an "emulation" via a K-buffer-like technique using atomic counters or something similar, although I'm no expert.. also see this, taken from the Nvidia Maxwell OpenGL extensions post:

Fragment Shader Interlock (NV_fragment_shader_interlock):

This extension exposes an hardware-accelerated critical section for the fragment shader, allowing hazard-free read-modify-write operations on a per-pixel basis. It also allows enforcing primitive-ordering for threads entering the critical section. It provides new GLSL calls beginInvocationInterlockNV() and endInvocationInterlockNV() defining a critical section which is guaranteed to be executed only for one fragment at a time. Interlock can be done on a per-pixel or a per-sample basis if multi-sampled rasterization is used. This feature is useful for algorithms that need to access per-pixel data structures via shader load and store operations, while avoiding race conditions. Obvious applications are OIT and programmable blending for instance.

but it also doesn't come for free vs not ordered: Nvidia says in its D3D12 do's and don'ts document:

Don’ts
• Don’t use Raster Order View (ROV) techniques pervasively
◦ Guaranteeing order doesn’t come for free
◦ Always compare with alternative approaches like advanced blending ops and atomics

4) No, this applies to mobile as well, with Apple A11 (iPhone 8/X) GPUs having ROV support exposed in the Metal2 API.. seeing that Android HW SoC IHVs try to "copy" all of Apple's new things.. hope they (ARM/Qualcomm) are working on their GPUs supporting ROV..

Finally, note this is a relatively easy extension to "write", as we have the existing ARB_fragment_shader_interlock exposing the new GLSL bits, which are basically the layout-qualifier-ids pixel_interlock_ordered, pixel_interlock_unordered, sample_interlock_ordered and sample_interlock_unordered, plus void beginInvocationInterlockARB(void); and void endInvocationInterlockARB(void);. All we need is a way to map this to SPIR-V and drivers implementing it, and we are done.. Seeing the D3D11/12 support, we can probably even omit beginInvocationInterlockARB() and endInvocationInterlockARB(), as marking buffers/textures/samplers with pixel_interlock_ordered/sample_interlock_ordered is enough, as shown by how ROV is supported in the HLSL world..
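To make the difference between the ordered and unordered qualifiers concrete, here is a rough CPU-side model in C++. It is only a sketch of the semantics as described in the extension text quoted above, not shader code and not how any driver implements the interlock. The names `Fragment`, `blend_over`, `resolve_ordered` and `resolve_unordered` are made up for illustration, and it assumes the primitives covering one pixel are numbered 0..n-1 without gaps.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical fragment record: which primitive produced it, plus a
// single-channel color and alpha (illustrative names, not from any API).
struct Fragment {
    std::size_t primitive;
    float color;
    float alpha;
};

// Order-dependent read-modify-write: the classic "over" blend.
inline float blend_over(float dst, const Fragment& f) {
    return f.color * f.alpha + dst * (1.0f - f.alpha);
}

// Model of pixel_interlock_ordered for ONE pixel. Fragments may arrive in
// any order, but each one runs its critical section only after all fragments
// from earlier primitives have finished theirs.
// Assumption: primitives covering this pixel are numbered 0..n-1, no gaps.
inline float resolve_ordered(const std::vector<Fragment>& arrivals, float dst) {
    std::map<std::size_t, Fragment> parked;  // fragments waiting at "begin"
    std::size_t next = 0;                    // primitive allowed to enter next
    for (const Fragment& f : arrivals) {
        assert(f.primitive >= next && parked.count(f.primitive) == 0);
        parked.emplace(f.primitive, f);
        // Admit every parked fragment whose turn has come.
        for (auto it = parked.find(next); it != parked.end();
             it = parked.find(next)) {
            dst = blend_over(dst, it->second);  // the critical section
            parked.erase(it);
            ++next;
        }
    }
    return dst;
}

// Model of pixel_interlock_unordered: still one fragment at a time, but in
// whatever order the fragments happen to arrive.
inline float resolve_unordered(const std::vector<Fragment>& arrivals, float dst) {
    for (const Fragment& f : arrivals) {
        dst = blend_over(dst, f);
    }
    return dst;
}
```

With three half-transparent fragments of color 1.0, 0.0 and 0.5 (primitives 0, 1, 2) over a black pixel, `resolve_ordered` returns 0.375 no matter how the arrivals are shuffled, while `resolve_unordered` returns 0.3125 when they arrive as 2, 0, 1. That is why programmable blending wants the ordered variant, while order-insensitive updates can get by with the unordered one.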

feel free to ask more questions..

oscarbg commented 6 years ago

Hi, just more updates after yesterday's post, after some sleep..

Basically, I think we need to establish how important this is compared to all other things that could be added/fixed in Vulkan.

Sorry in advance if this sounds like a bold claim, but IMHO having a feature exposed in all other graphics APIs (D3D11/12, OpenGL and Metal2) and not in Vulkan should raise a red flag in the minds of Khronos Vulkan people.. especially having it in Metal, as Apple is (call it what you want) lazy/smart enough not to implement features that are not widely used (according to them, of course), such as geometry shaders, shader subroutines and the like. It seems Apple sees ROV (call it "UAV serialization" like Intel or "hardware-accelerated critical section for the fragment shader" like Nvidia) as an important enough graphics feature to dedicate transistors to it..

I don't know what features the Khronos Vulkan WG is evaluating adding, but honestly I can't think of any other missing one exposed in so many other APIs and with such broad hardware vendor support..

As said previously, having support from 3 desktop vendors in the form of Vega, Maxwell and Intel Skylake iGPUs, and being exposed on mobile GPUs in A11, should also indicate something..

(outside demos + experiments)?

OK, I didn't read the "outside" part very well: as noted, Amazon Lumberyard supports it.. one AAA game using Lumberyard is Star Citizen, although I don't know if it has D3D12 and, even then, if it uses the OIT support..

Anyway, I found 2 AAA games that may be using it (at least the blogs mention that engineers implemented it in the game engine, so it is possibly available with some game update).. sorry in advance for not testing, as I don't own either of these games.

1) Just Cause 3 had a talk at GDC 2016, “More Explosions, More Chaos, and Definitely More Blowing Stuff Up: Optimizations and New DirectX Features in ‘Just Cause 3’”: https://software.intel.com/sites/default/files/managed/20/d5/2016_GDC_Optimizations-and-DirectX-features-in-JC3_v0-92_X.pdf there is also a blog: https://software.intel.com/en-us/articles/optimizations-enhance-just-cause-3-on-systems-with-intel-iris-graphics

The original approach–using HDR–consisted of using one 32-bit buffer to store depth nodes, and another for color and alpha. HDR rendering could not be used in this case, however. This was solved by packing the eighth bit of alpha in the depth buffer, leaving the other buffer for a R11G11B10F HDR buffer. As an additional optimization, the team switched to using a Texture2DArray to store each node (instead of a structured buffer), which brought some performance benefit when using less than four AOIT nodes (JC3 uses two). With JC3 being an open world, with a lot of far vegetation, Intel and Avalanche engineers realized that Salvi’s approach would be extremely costly. In order to scale the performance from high-end PCs to mainstream systems with integrated graphics, the developers decided to add quality levels for the OIT, by simply using OIT on the first LOD at low settings, versus all levels of detail at high settings.

2) Codemasters GRID 2 and GRID Autosport..

(this is for sure implemented/supported, although using (at the time) Intel D3D11 extensions (aka PixelSync), as neither DX12 nor DX11.3 was ready..) https://software.intel.com/en-us/articles/oit-approximation-with-pixel-synchronization-update-2014

The Order Independent Transparency sample using Intel® Iris™ graphics extension for pixel synchronization shows a real-time solution using the extensions available on 4th Generation Intel® Core™ processors. Codemasters used the algorithm in GRID* 2 and GRID Autosport to improve rendering of foliage and semi-transparent track side objects as shown in Figure 1.

In addition, this blog mentions another sample which has been updated to use standard D3D11.3 APIs: https://github.com/GameTechDev/AOIT-Update

To close: seeing these games from the 2014/2015/2016 era having support for it is a hint that perhaps some AAA D3D12 games ship with support for it (someone needs to investigate The Division, Gears of War 4, Quantum Break UWP and the like).

Just to finish: it "might" be that some GameWorks DX12 libs are using ROV, especially the voxelization-related ones like VXGI (for example, VXAO is included in Rise of the Tomb Raider's DX12 mode); NV Flow & Flex also seem good candidates to be using ROV features..

HansKristian-Work commented 6 years ago

Thanks for the lengthy comments, this adds a lot more context to the discussion.

Based on the current discussion, I see that the APIs expose this functionality in different ways, so, looking ahead, I think an important step is to get some feedback on how the functionality should be exposed in Vulkan.

Basically, this boils down to "programmable blending with a twist", and all APIs seem to have done different things.

Do developers have any preference which style they'd like to see? Vulkan already has a reasonably expressive subpass system, but it can't express ordering between overlapping pixels in a draw call (just GL_ARB_texture_barrier-like things with subpass self-dependency).

oscarbg commented 6 years ago

Thanks for the lengthy comments, this adds a lot more context to the discussion.

Glad to be useful..

Just let me add that personally I would prefer the D3D-based one, so that D3D11/D3D12->Vulkan wrappers need minimal translation changes.. adding the D3D reference: https://msdn.microsoft.com/en-us/library/windows/desktop/dn914601(v=vs.85).aspx where we see very simple changes:

Rasterizer ordered views (ROVs) are declared with the following new High Level Shader Language (HLSL) objects, and are only available to the pixel shader:

RasterizerOrderedBuffer..
RasterizerOrderedTexture1D..

Use these objects in the same manner as other UAV objects (such as RWBuffer etc.).

Regarding the GL extension: I forgot to say that even the AMD proprietary GL driver supports the non-ARB one, INTEL_fragment_shader_ordering.. so that makes all 3 vendors (Intel+AMD+NV) supporting a GL extension as well.. even Mesa has had patches for a year, though still not merged: https://lists.freedesktop.org/archives/mesa-dev/2017-April/152340.html recent up-to-date patches: https://lists.freedesktop.org/archives/mesa-dev/2018-April/191361.html the implementation as seen there is super simple: "We achieve the interlock and fragment ordering by issuing a memory fence via sendc."

Some comments on differences between the Intel and ARB exts (NV and ARB seem equal from a quick inspection): a) The Intel GL ext only has the begin..() call, not the end one (it assumes the end() call at the very end of the shader):

Note there is no explicit built-in function to signal the end of the region that should be ordered. Instead, the region that will be ordered logically extends to the end of fragment shader execution.

b) The Intel GL ext doesn't have the tagging part:

     layout-qualifier-id
        pixel_interlock_ordered
        pixel_interlock_unordered
        sample_interlock_ordered
        sample_interlock_unordered

so it assumes all image read/modify/write operations between the begin() call and the end have "ordered access guaranteed".. to me that is like implicitly tagging all UAV accesses between these calls with pixel_interlock_ordered and sample_interlock_ordered..

Comparing D3D to these 2 GL ones, it seems D3D is like the ARB one in having the tagging, but instead of using something like pixel_interlock_ordered to tag, you tag by using RasterizerOrderedTexture1D (equivalent to pixel_interlock_ordered) or classic Texture1D (pixel_interlock_unordered). Regarding the missing begin/end calls, it seems D3D doesn't support them and assumes they are placed implicitly at the beginning and end of the shader respectively, i.e. psmain() { beginInvocationInterlockNV(); ... endInvocationInterlockNV(); }, or it has some compiler intelligence built in..

Question 8 of the GL spec gives the reasoning for having these calls:

(8) Should we provide an explicit mechanisms for shaders to indicate a critical section? Or should we just automatically infer a critical section by analyzing shader code? Or should we just wrap the entire fragment shader in a critical section?

  RESOLVED:  Provide an explicit critical section.

  We definitely don't want to wrap the entire shader in a critical section
  when a smaller section will suffice..

oscarbg commented 6 years ago

Let me finish with two more bits:

1) I asked Unity and Unreal devs to join the discussion: https://twitter.com/oscarbg81/status/987269914758189056

Question to @UnrealEngine devs (@BrianKaris,@RCalocaO,@nickpwd) & @unity3d devs (@aras_p can point somebody?): Any plans to support/expose D3D11/12 raster order views in your engines? I say because I'm requesting same for Vulkan: https://github.com/KhronosGroup/Vulkan-Ecosystem/issues/27 … please join discussion..

2) With Qualcomm Adreno having a D3D11/12 driver for the new Windows on ARM devices (https://www.techspot.com/review/1599-windows-on-arm-performance/), I wouldn't be surprised if they are progressively implementing some D3D optional caps in hardware.. for example, according to vulkaninfo, Adreno 630 has added depth bounds test support, and from a new Qualcomm OpenCL SDK released around GDC it also seems Adreno 630 has new shuffle and other subgroup ops support.. I don't know if it's pure coincidence, but wave ops and depth bounds test support are relatively new optional caps in D3D12 (and D3D11.x, I think).. so perhaps the Adreno 630 internal D3D12 driver already exposes them (we have no devices).. I wouldn't be surprised if the 630 already supported ROV but they can't expose it in Vulkan, having no extension for it..

HansKristian-Work commented 6 years ago

I've pinged an internal issue linking to this thread.

oscarbg commented 6 years ago

Nice.. thanks.. Hope it helps gaining some traction internally..

nsubtil commented 6 years ago

@oscarbg thanks for the detailed write up! This sort of feedback really helps a lot.

Tobski commented 6 years ago

Hi All,

Just as an update, we want to say thanks for the feature request, and the great discussion around it!

The Vulkan WG have got this on our internal list of potential new features, and have definitely given some thought to how this might be implemented - we're evaluating the priority of this feature before taking it any further.

Any additional feedback, requests, or simple calls for this to be supported would be helpful in prioritizing this :)

Thanks, Tobias

thokra1 commented 6 years ago

Just wanted to bring another paper to your attention. This more recent OIT approach, which is amazingly close to the depth-peeling ground truth, uses ROVs in its sample implementations.

oscarbg commented 6 years ago

Oh, thanks.. this is an I3D paper from this year, so very relevant.. they give you the shader source code and use the RasterizerOrderedTexture2DArray variant in lots of formats:

#if MOMENT_GENERATION
/*! Generation of moments in case that rasterizer ordered views are used. 
    This includes the case if moments are stored in 16 bits. */
#if ROV
RasterizerOrderedTexture2DArray<float> b0 : register(u0);
#if SINGLE_PRECISION
#if NUM_MOMENTS == 6
RasterizerOrderedTexture2DArray<float2> b : register(u1);
#if USE_R_RG_RBBA_FOR_MBOIT6
RasterizerOrderedTexture2DArray<float4> b_extra : register(u2);
#endif
#else
RasterizerOrderedTexture2DArray<float4> b : register(u1);
#endif
#else
#if NUM_MOMENTS == 6
RasterizerOrderedTexture2DArray<unorm float2> b : register(u1);
#if USE_R_RG_RBBA_FOR_MBOIT6
RasterizerOrderedTexture2DArray<unorm float4> b_extra : register(u2);
#endif
#else
RasterizerOrderedTexture2DArray<unorm float4> b : register(u1);
#endif
#endif

oscarbg commented 5 years ago

Hi, this is a bit old, but just pointing out that "The Forge" engine 1.16 added an "Order-Independent Transparency Unit Test" showing various transparency techniques, one of them being the aforementioned AOIT: "Adaptive Order Independent Transparency with Raster Order Views (paper by Intel, supports DirectX 11, 12 only)". Generally "The Forge" tries to have all included samples running on both D3D12 and Vulkan (and also Metal2), but this transparency mode isn't implementable on Vulkan because of the issue opened here..

oscarbg commented 5 years ago

Hi, I recently saw a video showing that the Xenia (Xbox 360) emulator is capable of running Red Dead Redemption fairly well (in terms of speed at least): https://www.youtube.com/watch?v=5KJuLatL1hk&t=2s In Xenia's title bar "D3D12 ROV" can be seen, which was a surprise, as I wasn't aware that the Xenia D3D12 backend was using the Raster Order Views feature.. ROV usage is present at least in the most up-to-date Xenia D3D12 branch here: https://github.com/xenia-project/xenia/tree/d3d12 I asked the main D3D12 dev @Triang3l about the motivation for ROV usage on Twitter:

D3D12 ROV functionality helps in emulating eDRAM, right? But it helps in faster emulator or more accurate emu or both? Just curious..

and he answered:

Both — it lets us alias render target data arbitrarily (no expensive copy and compute when RT configuration changes) and emulate custom formats like 7e3 floats, SNORM16 with a larger range, formats with PWL gamma, with correct blending, also 20e4 depth.
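As a small illustration of what "emulating a custom format with correct blending" involves, here is a hedged C++ sketch that decodes one 10-bit 7e3 channel and blends against it in full float precision. The bit layout is an assumption on my side, not something stated in this thread: 3 exponent bits above 7 mantissa bits, unsigned, exponent bias 3, no infinities or NaNs. `decode_7e3` and `blend_7e3_over` are made-up names, and the re-encode step that a real shader would need before storing the result is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one 10-bit "7e3" channel. Assumed layout (see text above):
// bits 9..7 = exponent (bias 3), bits 6..0 = mantissa, no sign bit.
inline float decode_7e3(std::uint32_t bits) {
    assert(bits < 1024u);
    const std::uint32_t mantissa = bits & 0x7Fu;
    const std::uint32_t exponent = (bits >> 7) & 0x7u;
    if (exponent == 0) {
        // Denormal: (mantissa / 128) * 2^(1 - 3) = mantissa * 2^-9
        return std::ldexp(static_cast<float>(mantissa), -9);
    }
    // Normal: (1 + mantissa / 128) * 2^(exponent - 3)
    //       = (128 + mantissa) * 2^(exponent - 10)
    return std::ldexp(static_cast<float>(128u + mantissa),
                      static_cast<int>(exponent) - 10);
}

// The "programmable blend" part: read the stored value, decode it, and blend
// in float. Inside an interlock (or with a ROV) this read-modify-write is
// safe against overlapping fragments on the same pixel.
inline float blend_7e3_over(std::uint32_t dst_bits, float src, float src_alpha) {
    return src * src_alpha + decode_7e3(dst_bits) * (1.0f - src_alpha);
}
```

Under these assumptions the largest encodable value is `decode_7e3(0x3FF)`, which is 31.875, and `decode_7e3(3u << 7)` is exactly 1.0. Fixed-function blending cannot do this decode step for a format the host GPU doesn't know, which is where the shader-side read-modify-write comes in.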

As Xenia also has a Vulkan backend in progress (https://www.youtube.com/watch?v=17jjIOU-djo&t=404s), a ROV Vulkan ext. could be helpful to this emulator's Vulkan backend..

pdaniell-nv commented 5 years ago

Thanks for enumerating all the use cases for Raster Order Views on Vulkan. Since our response above, there is now a multi-vendor extension under development to expose OpenGL-style fragment shader interlock for Vulkan. HLSL’s ROV functionality can be implemented on top of this lower level primitive. This will likely be an EXT extension and should be available in the next few months. A couple of vendors have expressed interest in supporting it so far.

oscarbg commented 5 years ago

@pdaniell-nv nice news.. thanks for the update!

gharland commented 5 years ago

Any updates on this? Hopefully pixel_interlock_unordered is also included so we can say goodbye to atomicCompSwap spin-locks and perhaps still be able to use VK_AMD_rasterization_order in relaxed mode at the same time, if possible.
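For readers who haven't seen the idiom, this is roughly what such a per-pixel spin-lock looks like, transcribed to C++ with `std::atomic` standing in for the shader's compare-and-swap on a lock image. It is a CPU-side sketch only; `PixelLock`, `try_acquire`, `release` and `locked_update` are hypothetical names, and real shader versions need extra care because of how GPU threads are scheduled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical per-pixel lock word: 0 = free, 1 = held.
struct PixelLock {
    std::atomic<std::uint32_t> word{0};
};

// One attempt of the "compare-and-swap 0 -> 1" idiom.
// Returns true if this caller took the lock.
inline bool try_acquire(PixelLock& lock) {
    std::uint32_t expected = 0;
    return lock.word.compare_exchange_strong(expected, 1u,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed);
}

// Hand the lock back so another fragment can enter.
inline void release(PixelLock& lock) {
    lock.word.store(0u, std::memory_order_release);
}

// Guarded read-modify-write of one pixel value. Mutual exclusion only:
// nothing here says WHICH waiting fragment gets in next.
template <typename Update>
inline void locked_update(PixelLock& lock, float& pixel, Update update) {
    while (!try_acquire(lock)) {
        // spin until the current holder releases
    }
    pixel = update(pixel);
    release(lock);
}
```

A call such as `locked_update(lock, px, [](float v) { return v * 0.5f + 0.25f; })` gives the same guarantee an unordered interlock is meant to give (one fragment at a time, no ordering), which is why pixel_interlock_unordered would be a natural replacement for it.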

pdaniell-nv commented 5 years ago

Yes, coming real soon. Thanks for your patience.

gharland commented 5 years ago

Awesome, thank you!

chrismile commented 5 years ago

Good news: https://www.phoronix.com/scan.php?page=news_item&px=Vulkan-1.1.110-Released VK_EXT_fragment_shader_interlock is part of Vulkan 1.1.110. Thanks for your hard work!

jeffbolznv commented 5 years ago

Release checklist for VK_EXT_fragment_shader_interlock is at https://github.com/KhronosGroup/Vulkan-Docs/issues/975.

pdaniell-nv commented 5 years ago

Driver for NVIDIA can be found here https://developer.nvidia.com/vulkan-driver

oscarbg commented 5 years ago

Good news to start the week.. at least in my time zone.. congrats to the Khronos Vulkan WG.. also to Nvidia for 0-day driver support.. closing the issue..