GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

Expose primitive ordered pixel shaders #108

Closed Degerz closed 3 years ago

Degerz commented 4 years ago

According to the Vega ISA documentation, this feature uses the SOPP scalar microcode format. Currently, this feature is only exposed in AMD's D3D12 drivers as "rasterizer ordered views", so I'd like to see the equivalent supported in Vulkan as well, where it is known as VK_EXT_fragment_shader_interlock.

We need this feature to emulate a certain PowerVR GPU for our use case, and in particular we want the fragmentShaderPixelInterlock feature of the extension exposed. Can your team enable this in the Vulkan drivers? (Bonus if the team can also get fragmentShaderSampleInterlock exposed.) Also, if you are working on this extension, can we get an estimate of when a driver supporting it will be released?

Degerz commented 4 years ago

Can I get a response from the team ?

jinjianrong commented 4 years ago

Our stance is that we don’t want to implement it. It messes with subpasses and is not the right formulation.

Degerz commented 4 years ago

What exactly do you mean by "not the right formulation" ? Is this extension somehow the wrong abstraction to map to this feature inside the hardware ?

If so, is there a better way to expose it like a potential framebuffer fetch extension ? (I don't think AMD HW supports multiple render targets so things could get sketchy)

How is the interlocks extension different compared to your "primitive ordered pixel shaders" ? Because we badly need this extension or something similar to a framebuffer fetch.

Degerz commented 4 years ago

How would you like to proceed with this ?

Can we at least get a vendor extension from AMD exposing this directly if your team doesn't like how interlocks are specified? Or would you prefer to close this issue if you have no intention of exposing similar functionality in Vulkan?

I am requesting this because there's arguably a stronger case to have ROV-like functionality exposed in Vulkan rather than in D3D12 because there's a higher interest from open source projects in using it than AAA game engine developers.

oscarbg commented 4 years ago

@Degerz @jinjianrong it's sad to see AMD has no plans to support this extension in the AMD Vulkan driver, even now that VK_EXT_fragment_shader_interlock is a de facto "standard" by virtue of being supported by all other vendors (NV & Intel) on both Windows and Linux. On Windows it is supported on "recent" NV (>= Maxwell) and Intel (>= gen9 Skylake) GPUs: https://vulkan.gpuinfo.org/listdevices.php?platform=windows&extension=VK_EXT_fragment_shader_interlock Similarly, on Linux it is supported on NV & Intel: https://vulkan.gpuinfo.org/listdevices.php?platform=linux&extension=VK_EXT_fragment_shader_interlock

Heck, since Metal 2.0 on macOS we even have support for the exact same feature: on Vega cards in the iMac Pro we get "RasterOrderGroupsSupported".

Adding more use cases: it should be useful for the VKD3D project for supporting D3D12 "rasterizer ordered views" in case any D3D12 games use them.

Well, in fact, the Xenia emulator's D3D12 backend uses the ROV feature for better emulation of the Xbox EDRAM hardware. Xenia is also working towards adding Linux support: https://github.com/xenia-project/xenia/issues/1430 and has a more immature Vulkan backend that the Linux port would use. The Xenia VK backend could take advantage of VK_EXT_fragment_shader_interlock for better/faster emulation of Xbox hardware, so I'm adding @Triang3l to the discussion in case they want to discuss further.

EDIT: it could even be supported by the MoltenVK Vulkan driver on macOS, so I asked for it: https://github.com/KhronosGroup/MoltenVK/issues/630

Degerz commented 4 years ago

@oscarbg Good idea to get more people interested in this functionality, I think I'll do the same as well! While you're at it, can you reach out to other AMD engineers like @Anteru on Twitter to show that the community wants this functionality on AMD HW in Vulkan?

cc @tadanokojin @hrydgard @pent0

The above have actively expressed interest in and/or are already using functionality similar to shader interlock in their projects. One of their main motivations for using Vulkan is getting access to modern GPU features like interlock, so we'd prefer it if AMD didn't make us move over to platform-specific APIs like Metal or D3D12 to be able to use this feature!

Vulkan subpasses are possibly not powerful enough for their purposes. I don't care if AMD doesn't ever expose VK_EXT_fragment_shader_interlock but please at least give another viable alternative for their sake even if it is an AMD specific extension!

pent0 commented 4 years ago

I just want to do programmable blending. If you guys can provide other primitives that would also be OK, but this is best. Texture barrier (for OpenGL) is what I am using, but it's not really the fastest path (same for Vulkan, if applicable). I don't really know how you guys would do it, though.

ryao commented 4 years ago

How would you like to proceed with this ?

Here is my suggestion. Use the extension and tell users to switch to either Intel or Nvidia graphics hardware because AMD refuses to support the extension and cite this issue. Watch AMD backpedal on this very quickly after an executive hears about the situation.

Also, ask the RADV developers to implement support so that Windows users who want to use software that depends on it have the option of switching to Linux for it.

Triang3l commented 4 years ago

From the passes point of view, what's different in this from regular image/buffer stores?

Degerz commented 4 years ago

@ryao Seems like an unlikely scenario that it would reach to the very high echelons in the company and I don't know if mesa developers are all that interested since I haven't seen any patches related to this issue ...

@Triang3l Here are some insights from Sascha. Along with the stated limitations, I do not think that Vulkan subpasses are capable of handling self-intersecting draws just like OpenGL's texture barrier.

RussianNeuroMancer commented 4 years ago

Seems like an unlikely scenario that it would reach to the very high echelons in the company

News article on Phoronix could help with this a bit.

ryao commented 4 years ago

@ryao Seems like an unlikely scenario that it would reach to the very high echelons in the company and I don't know if mesa developers are all that interested since I haven't seen any patches related to this issue ...

All that you need is for end users to start telling each other that AMD graphics hardware is not friendly to emulators after they start asking why it doesn’t work on AMD graphics hardware. It will reach the upper echelons when they are trying to figure out why they did not meet their sales projections.

As for the mesa developers, they might not know that this extension has any use cases. I was under the impression that those working on RADV were volunteers, so if you don’t ask them about it, they seem less likely to implement it.

Degerz commented 4 years ago

@ryao TBH, I feel it is more constructive for developers like @pent0 to just express their desire to expose this feature and just list out their use cases instead ...

At the end of the day, advanced system emulation doesn't even account for a fraction of AMD's customers, and the emulation community is already aware that AMD has a checkered history with them, so the leading hardware vendor is already favoured over there.

I'd prefer it if we can show that their driver manager's position is out of touch with the community's position because unlike with higher ups such as executives there's no guarantee that they'd understand this issue or that they'd be specialists regarding GPU drivers to help us out.

pent0 commented 4 years ago

I love you guys. I know you can do it, however hard it is. Go go go! We all want this feature.

Also, programmable blending is not only what emulators want; it's also what many game developers want in order to achieve nice and godly effects in their games, which fixed blending cannot do. I can't give an example for PC, but here is an example of one doing it with Metal on iOS.

Programmable shader pipeline has been here for 15 years, so programmable blending should be too. It's the de facto standard nowadays. (I copied this quote from this article.)

I am really bad at wording, hehe; I'm just expressing what many people want. Reconsider please :)

jarrard commented 4 years ago

Who exactly does this affect anyway? just PowerVR GPU users? If so, I can understand why AMD doesn't want to dedicate valuable development time to this endeavour. Nothing is stopping the community from adding it themselves thanks to OPEN-SOURCE drivers!

jarrard commented 4 years ago

Who exactly does this affect anyway? just PowerVR GPU users?

In relation to the topic and example given! How is this not obvious?

PowerVR GPU

Then that's not the best example; it would have been better to give examples that are not fringe cases but in more common use. Also, referring to people as retarded is why stuff like this gets ignored. It's quite an anti-open-source attitude to have, and it only derails things.

Take a chill pill mate! THE END

pent0 commented 4 years ago

It's not only relevant to emulation, it's related to everything. DXVK needs interlock to implement ROVs, and game developers need it to do programmable blending.

It affects many things, hence this extension exists. Please think more.

jfdhuiz commented 4 years ago

@pent0 I get that you are angry. If you feel misunderstood, express that feeling. Make your points and cut out the strong language. Strong language won't help your cause (for innocent bystanders it looks like you're grasping at straws), and it is disrespectful. Your point is much, much stronger without the strong language.

pent0 commented 4 years ago

Really sorry for the bother! My point still stands anyway: it helps many things, not just emulation of PowerVR. It's explained above.

jinjianrong commented 4 years ago

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

Additionally, this is an inefficient method of performing the typical thing it's often advocated for - order independent transparency. For such an effect we would usually recommend using one of the many two-pass OIT algorithms out there, and making use of multiple subpasses, with the second pass doing the resolve. This is likely the most portably efficient mechanism you can use that works between desktop and mobile parts. We're thus not inclined to support it, as we'd rather not promote an inefficient technology.

However, if you're looking to do direct emulation, we are not sure that really helps you - perhaps you could elaborate on what it is you're trying to emulate exactly and we may be able to advise on an alternative method?

ghost commented 4 years ago

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

It's being used in Just Cause 3 and GRID 2. https://software.intel.com/en-us/articles/optimizations-enhance-just-cause-3-on-systems-with-intel-iris-graphics

https://software.intel.com/en-us/articles/oit-approximation-with-pixel-synchronization

illusion0001 commented 4 years ago

Quote from @oscarbg: Xenia (Xbox 360 Emulator) D3D12 backend uses ROV feature for better emulation of Xbox EDRAM hardware.

pent0 commented 4 years ago

I don't know much about this stuff, so I will let the guys who know discuss. I will try to get a workaround for now.

Hi, in our case, we are trying to emulate a feature of the PowerVR GPU. It lets you fetch the last fragment data written to a texel in the color buffer and use it for blending inside the fragment shader. It's like blending, except not fixed-function: it happens inside the shader (programmable).

For OpenGL on AMD, we are using texture barrier. On Vulkan I'm not sure if that's available (the only thing I know of so far is pipeline barrier, but I will look more). What would you advise me to do in this case?

Edit: @degerz was asking for our case, thanks! I was not aware of you asking this before you pinged me.
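
As a rough illustration of the programmable blending described above, the following CPU-side C++ sketch reads the current destination color, applies a blend written by the shader author, and writes the result back, processing overlapping fragments in submission order. It is only an analogy for the shader-side read-modify-write, not shader or driver code; `Color`, `differenceBlend` and `shadePixel` are hypothetical names chosen for this example.

```cpp
#include <cmath>
#include <vector>

struct Color { float r, g, b; };

// Hypothetical shader-defined blend: absolute difference of source and destination.
inline Color differenceBlend(Color src, Color dst) {
    return { std::fabs(src.r - dst.r), std::fabs(src.g - dst.g), std::fabs(src.b - dst.b) };
}

// Applies overlapping fragments to one pixel in the order they were submitted.
// Each step must observe the value stored by the previous step, which is the
// ordering guarantee this thread is asking for.
inline Color shadePixel(Color framebuffer, const std::vector<Color>& fragmentsInOrder) {
    for (const Color& src : fragmentsInOrder) {
        Color dst = framebuffer;                 // "framebuffer fetch" of the last written value
        framebuffer = differenceBlend(src, dst); // programmable blend, then store
    }
    return framebuffer;
}
```

Starting from a mid-gray pixel (0.5), drawing white (1.0) and then dark gray (0.25) leaves 0.25, while the reverse order leaves 0.75. The result depends on the order in which overlapping fragments read and write the pixel, which is why self-overlapping draws need a defined order.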

Degerz commented 4 years ago

@jinjianrong Thank you very much for the response!

Support for interlocks/ROVs isn't that compelling in D3D because engine developers are more interested in targeting higher hardware compatibility than in using the latest features, like I mentioned. By comparison, there are already open-source desktop(!) OpenGL applications out there that are using interlocks or framebuffer fetch, and we would like to be able to target both Windows and Linux on AMD's Vulkan drivers.

Also, we don't want to implement order independent transparency with Vulkan subpasses. We want to have the same capability to do programmable blending for emulation purposes and for this reason alone Vulkan subpasses are not a powerful enough mechanism for this purpose since it possibly(?) can't handle self-intersecting draws like we see with texture barrier. I understand from the hardware people's point of view that primitive ordered pixel shaders can place an ordering constraint when executing fragment shaders and that certainly has undesirable effects in terms of increased latency due to the stalling it causes.

This feature helps us to emulate systems that have non-standard fixed function blending pipelines and systems that are capable of programmable blending as well via shader framebuffer fetch. The biggest reason why interlocks/ROVs/fetch have an advantage over Vulkan subpasses is that the latter do not cover the edge case of self-intersecting geometry and thus, in our case, give incorrectly rendered content!

Edit: If I had to rate the severity of lacking this feature, it would be almost as bad as not having transform feedback/stream-output available for DXVK. Your team added support for that fundamentally hardware-unfriendly feature, so just as with transform feedback, we also need interlocks to cover some more cases, even if they have undesirable performance characteristics.

hrydgard commented 4 years ago

As another data point as the author of PPSSPP, the popular Sony PSP emulator, the PSP also has a few blend modes that cannot be replicated without fragment shader interlock or similar programmable blending functionality. Now, games don't actually use them much and don't generally use them for self-overlapping draws, so framebuffer copies work in practice to emulate them, but for fully hardware-accurate emulation this would be useful.

I'm not directly involved with Xenia, have only followed its development from the side, but it needs this functionality to simulate some framebuffer formats that only exist on the Xbox 360 and are heavily used by games. They're not practically feasible to emulate in other ways.

Triang3l commented 4 years ago

@jinjianrong Xenia needs this for pretty much everything in render target emulation:

gharland commented 4 years ago

The unordered variant of this extension is essential for voxelization/global illumination/volumetric rendering. Otherwise, what option is there for avg or max blending of voxels other than clunky atomicCompSwap spinlocks? Couldn't we at least have the unordered variant? Even if there were no use case, why can't developers just have another tool in the toolbox for coming up with new algorithms?

The extension is also requested here, please come over and register your interest.

https://community.amd.com/message/2927066

https://community.amd.com/message/2926956

Degerz commented 4 years ago

@jinjianrong I'm sure you've realized it by now but our stance on this issue as a community is non-negotiable so we do not desire to seek the 'alternatives' that you speak of.

I understand the anxiety your team is facing right now since they're going to expose an unfriendly hardware feature and if you absolutely cannot have the general public wanting to access this feature then I have a solution which is applying whitelists to these community projects specifically to be able to access this feature in the driver.

Is whitelisting certain applications a viable solution at your end for our case ?

Triang3l commented 4 years ago

@Degerz Whitelisting will make adding this to new projects impossible, it would never exist in Xenia if we had to go through any procedure of being added to a whitelist (and ROV usage there began as an experiment anyway), and that's the opposite of how PC gaming works.

Degerz commented 4 years ago

@Triang3l Then what other solutions do you suggest to an unwilling driver team ?

If new projects just pop up, then they should arguably just file an appeal since AMD does not like the way applications could potentially use this feature.

Triang3l commented 4 years ago

@Degerz To modify the extension so it's more clear about subpass dependencies, if needed — Vulkan extensions are versioned, as long as no new "must"s are added, it should be fine. Whitelisting is completely orthogonal to the actual issue.

@jinjianrong What are the exact issues that interlocking causes with subpasses? I don't know Vulkan much, I've mostly used D3D12, but is it an implicit dependency across subpasses (shader executions for different subpasses running in parallel may still interlock when they don't need to)?

Well, I don't think it should be a big issue. Since it seems to be a hardware limitation, then probably ¯\_(ツ)_/¯ for now, and if more granular interlocking control is added in the future, it could be exposed via a device feature flag + a new pNext in VkRenderPassCreateInfo?

Or what is incorrect in the extension? But if something is just a hardware limitation making things a bit less optimal, it should be okay to implement it anyway.

Degerz commented 4 years ago

@Triang3l I am not here to play around with ideological purity because AMD engineers here are very clearly spooked by anyone using this feature with implications of high performance pitfalls.

What I want to do is to strive to find a common ground between us which is the community and their driver team and if this means making compromises like whitelists to be able to even access the feature at all then you should at the very least consider this approach as well.

I assume that AMD will be reasonable enough to grant a whitelist to certain applications as long as the user(s) who filed an issue can explain why this feature fixes their issue for their application and if they absolutely need it.

Triang3l commented 4 years ago

@Degerz This is just a graphics programming GitHub issue tracker, not the UN Security Council, calm down a bit please :)

Whitelisting is an ideological workaround (that would go against the idea of open drivers and also add unnecessary management burden), we need to settle on something technical and not very time-consuming to do. APIs have always been defined in part by what is already there in production GPUs, with tiers in D3D and Vulkan being very explicit about things like device features, a "tier 1" implementation based on the current extension would be enough to expose the existing hardware feature and to cover all the current usage cases, but for flexibility, it may be expanded later.

It's not prohibitively slow (in Xenia's case it's also significantly faster than the alternative), every tool has its own performance implications and developers are more or less aware of them.

Degerz commented 4 years ago

@Triang3l

every tool has its own performance implications and developers are more or less aware of them

Wrong assumption to make that developers are aware of them. There's a good reason why AMD doesn't expose push descriptors despite being a Khronos promoted extension and that's because they're dirt slow on AMD hardware and without AMD engineers reeducating the developers to do the 'alternatives' then the performance of those applications would've been a disaster on their hardware.

You clearly haven't considered the driver team's feeling on this issue. Taking a hardline stance on this matter is shortsighted when we're at the absolute mercy of their team's decision. I feel if we have to give them a sense of security for them to come to a compromise then whitelists become a reality no matter how much you dislike them.

gharland commented 4 years ago

@jinjianrong, could you suggest an alternative to this without interlock?

This is the kind of code we have to write on AMD to blend RGBA8 voxel fragments. With interlock it's a simple load-modify-store inside a critical section.

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
{
    uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
    uint prevUint = 0;
    uint currUint;

    vec4 currVec4;

    vec3 average;
    uint count;

    //"Spin" while threads are trying to change the voxel
    while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
    {
        prevUint = currUint;                    //store packed rgb average and count
        currVec4 = unpackUnorm4x8(currUint);    //unpack stored rgb average and count

        average =      currVec4.rgb;        //extract rgb average
        count   = uint(currVec4.a*255.0f);  //extract count

        //Compute the running average
        average = (average*count + nextVec3) / (count+1);

        //Pack new average and incremented count back into a uint
        nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));
    }
}

Degerz commented 4 years ago

The irony of it all will be lost on some people but it's a tragedy how this community wants a feature that AMD has identified to be a slow-path which is exactly what they were hoping to avoid with OpenGL and their minefields of slow-paths since quite a few developers couldn't be trusted to hit the fast paths.

If the AMD driver team have nightmares about OpenGL and are afraid of Vulkan turning into the same disaster then please consider exercising some extra precaution like whitelists or vendor specific extensions! I want to make it as easy as possible for them in this difficult situation.

ryao commented 4 years ago

@Degerz Here is the feedback from our Vulkan team regarding the extension:

Whilst we could potentially support this feature, we don't see any use of this functionality in typical applications using our drivers (mostly desktop gaming). This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

@jinjianrong Is your Vulkan team’s solution for developers that need this to switch to Direct3D? It sounds like their options are to use Direct3D just for AMD hardware on Windows or drop AMD hardware support. Nobody should be forced to implement a Direct3D render path on Windows because of AMD hardware. I assume that RADV will be giving this to Linux in the future, so it is really just Windows where AMD is pushing developers to do rewrites to get this functionality by refusing to support this in Vulkan.

By the way, if people are willing to do rewrites because of AMD, I suggest that willingness be redirected into reimplementing Vulkan on Windows using Direct3D 12 to deduplicate effort. Microsoft backported Direct3D 12 to Windows 7, so this seems like a viable alternative, even if it is a huge amount of work that would not need to be done if AMD were to provide the same functionality in Vulkan that they provide in Direct3D 11.3 and Direct3D 12. It would also likely penalize AMD hardware more than if AMD were to implement this, but that would be AMD’s fault.

Anyone starting such a project would likely find third parties willing to help because of the relevance of it to bringing software to the Xbox One (another piece of AMD hardware where Vulkan is crippled in favor of Direct3D to put it mildly). I know that the developer doing Direct3D 9 -> Vulkan in D9VK has expressed interest in Vulkan -> Direct3D 12 as a possible future project. He cannot be the only one outside the emulation community interested in doing that.

jinjianrong commented 4 years ago

@Degerz and others, thanks for the feedback on the use cases. I will pass this on to our team for further discussion

Joshua-Ashton commented 4 years ago

This functionality is exposed via DirectX (ROVs) and sees no real use outside of a handful of demo applications.

Literally not true, Just Cause 3 uses them in D3D11.3 + Xenia emulator would like them

Degerz commented 4 years ago

@jinjianrong Please give us an update on the results of the discussion once you're finished and thanks for hearing out our deliberations!

Regardless of the decision you and your team make, I will come to an understanding.

ryao commented 4 years ago

@Degerz If AMD does not implement this extension, please consider my suggestion of pooling resources with others affected by this on a Vulkan -> Direct3D 12 translation layer just for AMD hardware. Also, @Joshua-Ashton is the developer doing Direct3D 9 -> Vulkan who has talked about doing Vulkan -> Direct3D 12 as a future project, so you are in good company there. :)

Anteru commented 4 years ago

ROVs have seen use in games, but it’s extremely limited. Even worse, if you just use ROVs to do blending, performance is typically much worse than alternative solutions, as fundamentally they require a serialization/locking on every single pixel. Due to those reasons, ROVs are of low priority for us to implement. We understand that this priority setting is not super-helpful for your specific case where you’re trying to emulate some specific bit of hardware, but we need to weigh it against all the other requests we’re seeing and how many titles will benefit.

If it’s just blending you're interested in, you might want to look at techniques like per-pixel linked lists which allow arbitrary programmable blending (with good performance, assuming you have only a few layers), or 64-bit atomics (which allow comparing color + depth simultaneously.)

As mentioned previously, AMDVLK is an open source project, and we’d be happy to review patches if you really want to have it implemented, but there’s not enough bandwidth on our side at this point in time to commit to this feature. Should things change, we’ll let you know.
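
To make the 64-bit atomics suggestion above more concrete, here is a CPU-side C++ sketch of one common reading of it: depth goes in the upper 32 bits and an RGBA8 color in the lower 32 bits, so a single atomic minimum both performs the depth comparison and carries the winning color with it. This is an assumption about what was meant, not AMD's recommended implementation; `packDepthColor`, `atomicMin` and `colorOf` are hypothetical names, and the sketch only selects the nearest fragment rather than blending.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Packs depth into the high 32 bits and RGBA8 color into the low 32 bits.
// For non-negative IEEE-754 floats, the raw bit pattern sorts like the value,
// so comparing the packed words compares depth first.
inline std::uint64_t packDepthColor(float depth, std::uint32_t rgba8) {
    assert(depth >= 0.0f); // the bit-pattern ordering trick only holds for non-negative depth
    std::uint32_t depthBits;
    std::memcpy(&depthBits, &depth, sizeof depthBits);
    return (static_cast<std::uint64_t>(depthBits) << 32) | rgba8;
}

// Atomic minimum written as a compare-exchange loop.
inline void atomicMin(std::atomic<std::uint64_t>& pixel, std::uint64_t candidate) {
    std::uint64_t current = pixel.load(std::memory_order_relaxed);
    while (candidate < current &&
           !pixel.compare_exchange_weak(current, candidate, std::memory_order_relaxed)) {
        // On failure, compare_exchange_weak refreshes `current`; the loop re-tests it.
    }
}

// Extracts the color of whichever fragment currently wins the depth comparison.
inline std::uint32_t colorOf(std::uint64_t packed) {
    return static_cast<std::uint32_t>(packed);
}
```

A pixel initialised with `packDepthColor(1.0f, 0)` ends up holding the color of the nearest fragment regardless of the order in which the fragments arrive. That covers depth-resolved writes, but not the order-dependent blends that the emulators in this thread describe.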

pent0 commented 4 years ago

Hi, thanks for answering. I understand that this has to be weighed, and I respect that, to be honest. But you would consider working on it once it gets more attention and more titles use it, so I hope that happens someday, though I don't think emulators carry much weight here. Thanks.

For our case it's hard to do the per-pixel linked list (yes, I just looked it up a couple of minutes ago). We are translating shaders from their bytecode and have no control over it, so we don't know where the subroutine that does the blending is, or where the subroutine that does the color calculation is. It's up to the shader writer, honestly, so we can't collect all the layers' colors and do the blending all together (we don't even have control over where the blending is).

Generally though, for normal developers, I think per-pixel linked lists give good performance (at the cost of memory). But I think some people like me (emu devs maybe) will sacrifice performance for accuracy (and something I don't know how to describe, but like implementability?) if possible, so I'm still asking for VK_EXT_fragment_shader_interlock to be supported. Count me in :D

I hope one day this will be the issue that you guys implement, hehe. To be honest, I don't know anything about drivers and how they do it, so I can't implement it, but we still want it if possible. Thanks for responding though. I will wait and try to find another workaround in the meantime.

ryao commented 4 years ago

It seems that there is already a project doing Vulkan -> Direct3D 12. It implements a Vulkan-like API that it translates into Direct3D 12, Metal and others. It has a wrapper that translates Vulkan into that API that has been demonstrated primarily for Vulkan -> Metal, but it should be able to do Vulkan -> Direct3D 12 from what I have read:

https://github.com/gfx-rs/gfx https://github.com/gfx-rs/portability

It could probably be extended to implement VK_EXT_fragment_shader_interlock through ROVs to get it on AMD hardware on Windows.

ryao commented 4 years ago

@Anteru Would patches to AMDVLK translate into getting the extension on Windows?

Anteru commented 4 years ago

It won't hurt for sure :) That said, AMDVLK on Windows uses a different compiler, so some work in that area will remain. I can't say how much work that will be -- could be a lot, could be fairly trivial.

Degerz commented 4 years ago

@Anteru This wasn't the ideal response we were expecting but if the driver team does not want to implement what they find to be a distasteful feature then I won't press them any further. Thanks for trying to hear us out anyways!

If things do change, please let us know.

oscarbg commented 4 years ago

@Joshua-Ashton "Just Cause 3 uses them in D3D11.3": so this extension would be useful for DXVK anyway, right?

jpark37 commented 4 years ago

Is bad performance specific to AMD architectures? I'd be interested to see OIT benchmarks with and without ROV usage for red, green, and blue GPUs.

Triang3l commented 4 years ago

@Anteru How much work was needed for ROV support, just for reference? :) Though ROV semantics are even less compiler-friendly than mutex semantics probably.