Plans for providing transform feedback / stream output-like functionality in Vulkan?

shmerl commented 6 years ago

Are there are any plans to add stream output-like functionality to Vulkan?

Developers of translation layers that implement D3D11 and D3D12 through Vulkan (dxvk and vkd3d) have hard time implementing this kind of functionality with current Vulkan options.

krOoze commented 6 years ago

Looks like a transform feedback-like functionality. Simply use general-purpose storage buffer to stream out the vertex processing output?

shmerl commented 6 years ago

Apparently doing that isn't trivial or efficient enough.

I see here, that transform feedback was actually planned to be added to Vulkan at some point:

transform_feedback_vulkan

Is it still the case?

pdaniell-nv commented 6 years ago

That slide just indicates some major things that are in OpenGL but not in Vulkan. It doesn't mean to imply that the working group is necessarily planning to add them to Vulkan.

shmerl commented 6 years ago

Well, the question is whether there are plans for it or not now. And those features were highlighted, so I suppose there was a reason to do it, since developers could need them.

nsubtil commented 6 years ago

We can't really comment on WG plans. And while we can't promise that the WG will take action on any particular issue, we can certainly help bring feedback on missing features into the WG.

@shmerl can you add more detail on why using a buffer to emit the same output is not feasible?

Would also be good to get feedback from other applications that might benefit from this. Probably worth leaving this issue open for a bit to gather more data.

doitsujin commented 6 years ago

I talked about this to other people already, but since @shmerl insists on me commenting here:

The main issue with mapping Stream Output to SSBO writes is satisfying the ordering guarantees for the output data - to loosely quote the OpenGL spec, writing out the transformed vertices in the same order they were received.

If the number of vertices emitted by a geometry shader is the same across all invocations, we could simply use a multiple of PrimitiveId as an index into the stream output buffers, but this won't work in case a geometry shader conditionally emits data to a stream. Atomics cannot be used in this case because the output order would become undefined.

In some cases this could be worked around by simply ignoring the ordering if the application is known not to rely on it, but for a generic API wrapper like DXVK, this is not an option.

HansKristian-Work commented 6 years ago

Indeed, true StreamOut cannot be emulated by trivially writing to SSBOs with atomics because of ordering guarantees.

HansKristian-Work commented 6 years ago

I've raised this internally in Khronos.

null77 commented 6 years ago

Question for @doitsujin . The GS stage has a statically declared maximum number of output vertices - could you do something like allocate a region for the GS instance based on primitive ID, and also record the number of vertices written? Then, if gaps are detected (possibly by static analysis detecting branching) run a compaction/defrag step? This isn't performant, but my understanding is that outputting dynamic numbers of vertices from the GS is generally not expected to be performant.. could issue some kind of performance warning in this case.

krOoze commented 6 years ago

Or compute the final indices beforehand.

I wonder if there is some HW under this feature, or something like above hack is done in the driver anyway on modern unified shader architecture.

doitsujin commented 6 years ago

@null77 Yes, that should be possible. I'm aware that this workaround exists, but I don't like it very much due to the overhead involved. It requires temporary buffers, CPU<>GPU sync for indirect draw calls in order to determine the required size of the temporary buffers, and basically BeginRenderPass->Draw->EndRenderPass->Dispatch for every single draw call.

@krOoze how would you compute the indices in advance?

krOoze commented 6 years ago

@doitsujin By replacing all EmitVertex with ++counter, you can then get a buffer of counts. From which you can then compute indices for each first emitted vertex (partial sums). (At least that is how it works in my head...).

doitsujin commented 6 years ago

Hm, I guess that would work, although basically with the same drawbacks as the contraction pass mentioned above. It would also require vertex and geometry shaders to not have any side effects of their own (through SSBO stores and atomics), which is potentially an issue for D3D11.1 applications.

oscarbg commented 6 years ago

+1 TLDR: Please do it! IMHO for me sounds like exposing stream output in Vulkan API is an important thing to work on now.. I'm optimistic and hope Vulkan WG agrees to work on it so it will be exposed eventually.. Anyway I think will be much faster if a vendor exposed it (like NV, but no pressure on them), then DXVK used it for running AAA D3D games and others will follow surely.. as an example NV works very fast, after all we asked for NV upping max descriptor limits (useful for projects like DXVK) and from a GTC Vulkan presentantion they are gonna do it soon: (I must personally say thanks to NV!)

after two years of Vulkan getting mature/robust/etc and projects like DXVK are finding they need features like stream output it seems a reasonable think to ask for (from this thread seems stream output seems can't be emulated efficiently & there are AAA games using stream output it like Mafia III (https://github.com/doitsujin/dxvk/issues/135)) ) this is not like D3D11 shader subroutine also missing but that feature seems was only implemented on some Direct3D11 SDK sample and nothing more (at least DXVK testers haven't found any game using it, right?).. Note Wine also doesn't support "shader subroutines" even it could with GL_ARB_shader_subroutine ext..

nsubtil commented 6 years ago

We discussed this in our weekly ecosystem call again today. It would be very useful to get more detailed feedback on how shipping applications are using stream output in DX. Transform feedback is a pretty big topic and it would be good to try to get a sense of what it might be used for, to help inform priorities on the various bits of functionality involved.

@doitsujin, do you guys have insights on this from your work in dxvk?

krOoze commented 6 years ago

@oscarbg offtopic, but since you brought it up: is hiding the "real" limits (as well as switching to less performant path behind the user's back) supposed to be a good thing?

oscarbg commented 6 years ago

@krOoze well it depends.. all the apps that worked OK before this change on Nvidia (and hopefully doesn't changing descriptor usage depending on querying these limits) should use less than old limits so no spilling.. for projects like VKD3D and DXVK allows them don't worrying about D3D limits as these Vulkan driver limits are greater than D3D limits.. I'm afraid to ask Nvidia but perhaps adding some new devince queries "no spilling" variants like maxPerStageDescritporUniformBufferfsNospilling or setting some new flag "give me true no spilling limits" before getting these queries could be useful..

nsubtil commented 6 years ago

Our hardware doesn't necessarily map 1:1 to the limits exposed by the API. We expect most applications that go above the original limits won't actually suffer any performance loss in practice, so having a separate query wouldn't be very useful either.

I'd prefer if we kept this thread focused on the transform feedback requests. We can discuss implementation limits offline if needed.

shmerl commented 6 years ago

@nsubtil:

I asked one of the REDengine developers who worked on The Witcher 3 about this, and he said:

AFAIR we used stream out for the decals (blood, etc) on the characters, I can't recall any other use case. That was rather simple implementation.

In other games and projects I recall that streamout was mainly used for skinned geometry either for decals or as an optimization when given geometry was rendered multiple times (cascade shadowmaps, etc). Usually the initial geometry's vertex shader was too expensive to evaluate multiple times (like for more complicated characters or particles) or it was just inconvenient to integrate proper support for animated geometry at the target and stream out allowed to "cheat" and provide a pre-made vertex buffers.

Though in my experience with TW3, stream output affects some monster models the most (thus they are messed up or invisible in dxvk now).

shmerl commented 6 years ago

@nsubtil: Do you still need more input to move this forward?

alek314 commented 6 years ago

We use stream output for mesh skinning on OpenGL es 3.0 Android device. On iOS we use Metal compute shader. It seems compute shader is a superset of stream output, so Metal do not provide stream output support.

krOoze commented 6 years ago

@alek314 Well, if you need the output from geometry and\or tessellation shader, you would have to implement those in the compute shader. You might also lose some performance doing so.

asoltesz commented 6 years ago

It would be great to have a solution for this since this issue is the biggest blocker for many games to reach ~gold status on Wine + DXVK (The Witcher 3 being a prominent one among them).

Unless this is resolved, DXVK cannot fix / implement Stream Output related issues and many otherwise great-performing games remain subpar on this stack.

shmerl commented 6 years ago

@doitsujin: Do you have any specific preference for how it should be implemented in Vulkan? I think projects like dxvk and vkd3d would be the primary users of it, so your input would be valuable.

shmerl commented 6 years ago

cc @jozefkucia.

michalsrb commented 6 years ago

@krOoze:

I wonder if there is some HW under this feature, or something like above hack is done in the driver anyway on modern unified shader architecture.

Yes, there is HW, at least on some architectures: AMD_Evergreen-Family_Instruction_Set_Architecture.pdf, 2.1.3 describes the data flow in case geometry shader is used:

The VS program ends by writing vertices out to the VS ring buffer.

The GS program reads multiple vertices from the VS ring buffer, executes its geometry functions, and outputs one or more vertices per input vertex to the GS ring buffer. The VS program can only write a single vertex per single input; the GS program can write a large number of vertices per single input. Every time a GS program outputs a vertex, it indicates to the vertex VGT that a new vertex has been output (using EMIT_* instructions). The VGT counts the total number of vertices created by each GS program. The GS program divides primitive strips by issuing CUT_VERTEX instructions.

The GS program ends when all vertices have been output. No position or parameters is exported.

The DC program reads the vertex data from the GS ring buffer and transfers this data to the parameter cache and position buffer using one of the MEM* memory export instructions.

So it seems that real transform feedback is needed in Vulkan to take advantage of that.

krOoze commented 6 years ago

@michalsrb It describes parameter cache and position buffer, not stream output. Also instructions is miles away from HW. GLSL is translated to instructions. The question is if they are something that cannot be expressed in GLSL explicitly.

PS: it mentions stream(out) buffers but it is not immediatelly obvious to me from the doc how they are supposed to work. In GCN1 doc they seem to be mentioned even less so.

PPS: And if there is HW there, I wonder if it can do more for me than the old API of transform feedback.

michalsrb commented 6 years ago

@krOoze You are right, this does not describe how it really works internally and I admit I don't know. My comment is only based on reading the documentation and the open source drivers.

The stream out functionality seems to be configured by setting VGT_STRMOUT_* registers. My guess is that either the hardware or the "DC program" writes the vertices to the buffers in addition to the parameter cache and position buffer. If it could be used for more than the OpenGL style transform feedback is good question.

shmerl commented 6 years ago

@nsubtil: Is there any rough ETA on how long this can be worked on (spec wise)?

pdaniell-nv commented 6 years ago

Some members of the Vulkan working group are developing a multi-vendor EXT extension for transform feedback with the primary goal of satisfying the needs of the DXVK, vkd3d and ANGLE translation layers. The Vulkan working group does not plan to promote this functionality as a KHR extension or as core functionality because it believes there are better, more forward-looking ways of processing and capturing vertex data with the GPU. The multi-vendor EXT extension should be available soon and is likely to be implemented on those platforms where DXVK, vkd3d and ANGLE translation is required.

mozo78 commented 6 years ago

What great news! Thank you!

bobafetthotmail commented 6 years ago

Don't know what it means, but sounds awesome.

It means they are adding an extension to Vulkan so this issue can be resolved, but the extension will not become part of the mandatory standard ones.

ryao commented 6 years ago

I would like to mention that the majority of virtual reality games are broken on SteamOS. That is unlikely to change until this extension becomes available.

shmerl commented 6 years ago

The multi-vendor EXT extension should be available soon

Do you expect it within weeks or months?

krOoze commented 6 years ago

but the extension will not become part of the mandatory standard ones.

KHR extensions are not mandatory. And all published extensions are standard (except perhaps the provisional ones).

pdaniell-nv commented 6 years ago

We’re aiming for weeks not months. There has been a lot of progress since my last comment.

Ruedii commented 6 years ago

I believe that because of Vulkan being implemented lower level, this can theoretically be implemented fairly efficiently in the shader handling code, without the massive bloat that it would create implementing it on Direct3D.

However, just like any routine using generic functions for specialized tasks, this is likely still a sub-optimal way to handle it as it would create access-masks/locks/holds and other latency sources that would not occur in fully hardware/driver accelerated setup, as well as the utilization of additional GPU cycles to achieve the same result.

pdaniell-nv commented 6 years ago

The VK_EXT_transform_feedback extension specification has now been published as part of the Vulkan 1.1.88 release here https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VK_EXT_transform_feedback.

It looks like DXVK has already added support along with NVIDIA and RADV drivers. More support will follow including Vulkan CTS coverage.

shmerl commented 6 years ago

Congrats and kudos to everyone involved! Note that Wine still needs patches to use it, but it should land upstream sometime before the next release.

For the technical background on this effort, see a very good write up by Jason Ekstrand.

KhronosGroup / Vulkan-Ecosystem

Plans for providing transform feedback / stream output-like functionality in Vulkan? #26