Closed shmerl closed 6 years ago
Looks like a transform feedback-like functionality. Simply use general-purpose storage buffer to stream out the vertex processing output?
Apparently doing that isn't trivial or efficient enough.
I see here, that transform feedback was actually planned to be added to Vulkan at some point:
Is it still the case?
That slide just indicates some major things that are in OpenGL but not in Vulkan. It doesn't mean to imply that the working group is necessarily planning to add them to Vulkan.
Well, the question is whether there are plans for it or not now. And those features were highlighted, so I suppose there was a reason to do it, since developers could need them.
We can't really comment on WG plans. And while we can't promise that the WG will take action on any particular issue, we can certainly help bring feedback on missing features into the WG.
@shmerl can you add more detail on why using a buffer to emit the same output is not feasible?
Would also be good to get feedback from other applications that might benefit from this. Probably worth leaving this issue open for a bit to gather more data.
I talked about this to other people already, but since @shmerl insists on me commenting here:
The main issue with mapping Stream Output to SSBO writes is satisfying the ordering guarantees for the output data - to loosely quote the OpenGL spec, writing out the transformed vertices in the same order they were received.
If the number of vertices emitted by a geometry shader is the same across all invocations, we could simply use a multiple of PrimitiveId
as an index into the stream output buffers, but this won't work in case a geometry shader conditionally emits data to a stream. Atomics cannot be used in this case because the output order would become undefined.
In some cases this could be worked around by simply ignoring the ordering if the application is known not to rely on it, but for a generic API wrapper like DXVK, this is not an option.
Indeed, true StreamOut cannot be emulated by trivially writing to SSBOs with atomics because of ordering guarantees.
I've raised this internally in Khronos.
Question for @doitsujin . The GS stage has a statically declared maximum number of output vertices - could you do something like allocate a region for the GS instance based on primitive ID, and also record the number of vertices written? Then, if gaps are detected (possibly by static analysis detecting branching) run a compaction/defrag step? This isn't performant, but my understanding is that outputting dynamic numbers of vertices from the GS is generally not expected to be performant.. could issue some kind of performance warning in this case.
Or compute the final indices beforehand.
I wonder if there is some HW under this feature, or something like above hack is done in the driver anyway on modern unified shader architecture.
@null77 Yes, that should be possible. I'm aware that this workaround exists, but I don't like it very much due to the overhead involved. It requires temporary buffers, CPU<>GPU sync for indirect draw calls in order to determine the required size of the temporary buffers, and basically BeginRenderPass->Draw->EndRenderPass->Dispatch
for every single draw call.
@krOoze how would you compute the indices in advance?
@doitsujin By replacing all EmitVertex
with ++counter
, you can then get a buffer of counts. From which you can then compute indices for each first emitted vertex (partial sums). (At least that is how it works in my head...).
Hm, I guess that would work, although basically with the same drawbacks as the contraction pass mentioned above. It would also require vertex and geometry shaders to not have any side effects of their own (through SSBO stores and atomics), which is potentially an issue for D3D11.1 applications.
+1 TLDR: Please do it! IMHO for me sounds like exposing stream output in Vulkan API is an important thing to work on now.. I'm optimistic and hope Vulkan WG agrees to work on it so it will be exposed eventually.. Anyway I think will be much faster if a vendor exposed it (like NV, but no pressure on them), then DXVK used it for running AAA D3D games and others will follow surely.. as an example NV works very fast, after all we asked for NV upping max descriptor limits (useful for projects like DXVK) and from a GTC Vulkan presentantion they are gonna do it soon: (I must personally say thanks to NV!)
after two years of Vulkan getting mature/robust/etc and projects like DXVK are finding they need features like stream output it seems a reasonable think to ask for (from this thread seems stream output seems can't be emulated efficiently & there are AAA games using stream output it like Mafia III (https://github.com/doitsujin/dxvk/issues/135)) ) this is not like D3D11 shader subroutine also missing but that feature seems was only implemented on some Direct3D11 SDK sample and nothing more (at least DXVK testers haven't found any game using it, right?).. Note Wine also doesn't support "shader subroutines" even it could with GL_ARB_shader_subroutine ext..
We discussed this in our weekly ecosystem call again today. It would be very useful to get more detailed feedback on how shipping applications are using stream output in DX. Transform feedback is a pretty big topic and it would be good to try to get a sense of what it might be used for, to help inform priorities on the various bits of functionality involved.
@doitsujin, do you guys have insights on this from your work in dxvk?
@oscarbg offtopic, but since you brought it up: is hiding the "real" limits (as well as switching to less performant path behind the user's back) supposed to be a good thing?
@krOoze well it depends.. all the apps that worked OK before this change on Nvidia (and hopefully doesn't changing descriptor usage depending on querying these limits) should use less than old limits so no spilling.. for projects like VKD3D and DXVK allows them don't worrying about D3D limits as these Vulkan driver limits are greater than D3D limits.. I'm afraid to ask Nvidia but perhaps adding some new devince queries "no spilling" variants like maxPerStageDescritporUniformBufferfsNospilling or setting some new flag "give me true no spilling limits" before getting these queries could be useful..
Our hardware doesn't necessarily map 1:1 to the limits exposed by the API. We expect most applications that go above the original limits won't actually suffer any performance loss in practice, so having a separate query wouldn't be very useful either.
I'd prefer if we kept this thread focused on the transform feedback requests. We can discuss implementation limits offline if needed.
@nsubtil:
I asked one of the REDengine developers who worked on The Witcher 3 about this, and he said:
AFAIR we used stream out for the decals (blood, etc) on the characters, I can't recall any other use case. That was rather simple implementation.
In other games and projects I recall that streamout was mainly used for skinned geometry either for decals or as an optimization when given geometry was rendered multiple times (cascade shadowmaps, etc). Usually the initial geometry's vertex shader was too expensive to evaluate multiple times (like for more complicated characters or particles) or it was just inconvenient to integrate proper support for animated geometry at the target and stream out allowed to "cheat" and provide a pre-made vertex buffers.
Though in my experience with TW3, stream output affects some monster models the most (thus they are messed up or invisible in dxvk now).
@nsubtil: Do you still need more input to move this forward?
We use stream output for mesh skinning on OpenGL es 3.0 Android device. On iOS we use Metal compute shader. It seems compute shader is a superset of stream output, so Metal do not provide stream output support.
@alek314 Well, if you need the output from geometry and\or tessellation shader, you would have to implement those in the compute shader. You might also lose some performance doing so.
It would be great to have a solution for this since this issue is the biggest blocker for many games to reach ~gold status on Wine + DXVK (The Witcher 3 being a prominent one among them).
Unless this is resolved, DXVK cannot fix / implement Stream Output related issues and many otherwise great-performing games remain subpar on this stack.
@doitsujin: Do you have any specific preference for how it should be implemented in Vulkan? I think projects like dxvk and vkd3d would be the primary users of it, so your input would be valuable.
cc @jozefkucia.
@krOoze:
I wonder if there is some HW under this feature, or something like above hack is done in the driver anyway on modern unified shader architecture.
Yes, there is HW, at least on some architectures: AMD_Evergreen-Family_Instruction_Set_Architecture.pdf, 2.1.3 describes the data flow in case geometry shader is used:
- The VS program ends by writing vertices out to the VS ring buffer.
- The GS program reads multiple vertices from the VS ring buffer, executes its geometry functions, and outputs one or more vertices per input vertex to the GS ring buffer. The VS program can only write a single vertex per single input; the GS program can write a large number of vertices per single input. Every time a GS program outputs a vertex, it indicates to the vertex VGT that a new vertex has been output (using
EMIT_*
instructions). The VGT counts the total number of vertices created by each GS program. The GS program divides primitive strips by issuingCUT_VERTEX
instructions.- The GS program ends when all vertices have been output. No position or parameters is exported.
- The DC program reads the vertex data from the GS ring buffer and transfers this data to the parameter cache and position buffer using one of the
MEM*
memory export instructions.
So it seems that real transform feedback is needed in Vulkan to take advantage of that.
@michalsrb It describes parameter cache and position buffer, not stream output. Also instructions is miles away from HW. GLSL is translated to instructions. The question is if they are something that cannot be expressed in GLSL explicitly.
PS: it mentions stream(out) buffers but it is not immediatelly obvious to me from the doc how they are supposed to work. In GCN1 doc they seem to be mentioned even less so.
PPS: And if there is HW there, I wonder if it can do more for me than the old API of transform feedback.
@krOoze You are right, this does not describe how it really works internally and I admit I don't know. My comment is only based on reading the documentation and the open source drivers.
The stream out functionality seems to be configured by setting VGT_STRMOUT_*
registers. My guess is that either the hardware or the "DC program" writes the vertices to the buffers in addition to the parameter cache and position buffer. If it could be used for more than the OpenGL style transform feedback is good question.
@nsubtil: Is there any rough ETA on how long this can be worked on (spec wise)?
Some members of the Vulkan working group are developing a multi-vendor EXT extension for transform feedback with the primary goal of satisfying the needs of the DXVK, vkd3d and ANGLE translation layers. The Vulkan working group does not plan to promote this functionality as a KHR extension or as core functionality because it believes there are better, more forward-looking ways of processing and capturing vertex data with the GPU. The multi-vendor EXT extension should be available soon and is likely to be implemented on those platforms where DXVK, vkd3d and ANGLE translation is required.
What great news! Thank you!
Don't know what it means, but sounds awesome.
It means they are adding an extension to Vulkan so this issue can be resolved, but the extension will not become part of the mandatory standard ones.
I would like to mention that the majority of virtual reality games are broken on SteamOS. That is unlikely to change until this extension becomes available.
The multi-vendor EXT extension should be available soon
Do you expect it within weeks or months?
but the extension will not become part of the mandatory standard ones.
KHR extensions are not mandatory. And all published extensions are standard (except perhaps the provisional ones).
We’re aiming for weeks not months. There has been a lot of progress since my last comment.
I believe that because of Vulkan being implemented lower level, this can theoretically be implemented fairly efficiently in the shader handling code, without the massive bloat that it would create implementing it on Direct3D.
However, just like any routine using generic functions for specialized tasks, this is likely still a sub-optimal way to handle it as it would create access-masks/locks/holds and other latency sources that would not occur in fully hardware/driver accelerated setup, as well as the utilization of additional GPU cycles to achieve the same result.
The VK_EXT_transform_feedback extension specification has now been published as part of the Vulkan 1.1.88 release here https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VK_EXT_transform_feedback.
It looks like DXVK has already added support along with NVIDIA and RADV drivers. More support will follow including Vulkan CTS coverage.
Congrats and kudos to everyone involved! Note that Wine still needs patches to use it, but it should land upstream sometime before the next release.
For the technical background on this effort, see a very good write up by Jason Ekstrand.
Are there are any plans to add stream output-like functionality to Vulkan?
Developers of translation layers that implement D3D11 and D3D12 through Vulkan (dxvk and vkd3d) have hard time implementing this kind of functionality with current Vulkan options.