KhronosGroup / SPIRV-Cross

SPIRV-Cross is a practical tool and library for performing reflection on SPIR-V and disassembling SPIR-V back to high level languages.

Support multiple SPIR-V modules compiled together into one MSL shader #719

Open cdavis5e opened 5 years ago

cdavis5e commented 5 years ago

This is unfortunately necessary for #120. Tessellation works differently in Metal vs. Vulkan (and D3D). In Vulkan, two shader stages (tessellation control and tessellation evaluation) and a fixed function stage (tessellation proper) are inserted between the vertex shader and rasterization (or the geometry shader, if present). In Metal, however, the vertex and tess. control stages are omitted entirely. Instead, you are expected to run a compute shader that populates the buffer(s) used by the tessellator. The tess. evaluation stage is then run as a "post-tessellation vertex function." Therefore, for Vulkan->Metal translation, as MoltenVK is doing, the vertex shader and tess. control shader must be compiled together into one Metal compute kernel, which is responsible for running the original vertex and tess. control functions and using their output to populate the tessellation input buffers. (See also Apple's guidance for porting D3D11 shaders to Metal.)
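
Concretely, the Metal-side split looks something like this (a hand-written sketch, not actual SPIRV-Cross output; the names, buffer indices, and factor-struct layout are all illustrative):

```
#include <metal_stdlib>
using namespace metal;

// Mirrors the layout of MTLTriangleTessellationFactorsHalf on the host side.
struct TessFactors { half edge[3]; half inside; };

// Compute kernel standing in for the Vulkan vertex + tess. control stages:
// it must fill the factor buffer read by Metal's fixed-function tessellator,
// plus whatever per-patch/per-vertex data the eval stage needs.
kernel void vert_tesc_kernel(device TessFactors* factors   [[buffer(0)]],
                             device float4*      patch_out [[buffer(1)]],
                             uint gid [[thread_position_in_grid]])
{
    // ... run the original vertex shader, then the tess. control shader ...
    factors[gid] = TessFactors{ { half(1), half(1), half(1) }, half(1) };
}

// "Post-tessellation vertex function" standing in for the tess. eval stage.
[[patch(triangle, 3)]]
vertex float4 tese_main(float3 bary     [[position_in_patch]],
                        uint   patch_id [[patch_id]],
                        const device float4* patch_in [[buffer(1)]])
{
    // Placeholder: weight the first control point by one barycentric coord.
    return patch_in[patch_id * 3] * bary.x;
}
```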

Geometry shaders are another place where this may be necessary, since Metal does not yet support them natively. The geometry shader must thus run as a compute kernel, feeding its output through a buffer and a pass-through vertex shader to the rasterizer. But the original vertex shader must run prior to the geometry shader. If the original vertex shader and geometry shader (assuming no tessellation) were run normally, it would require one render pipeline with rasterization disabled, followed by a compute pipeline running the geometry shader, followed by yet another render pipeline for the rest of the rendering process. (And indeed, this is what we'll likely end up doing for combined geometry and tessellation shaders.) But if the vertex and geometry shaders were combined into one compute kernel, the first render pipeline could be omitted (along with the buffer needed to hold the output of the vertex shader to pass to the geometry shader).

cdavis5e commented 5 years ago

I should probably add why I think we need this specifically instead of just decompiling the shaders separately and concatenating the source strings.

When a shader uses something that isn't directly supported by Metal, the decompiler adds a definition of a function that will be called by the generated shader code to implement the missing functionality. Normally, each shader is compiled into its own translation unit, so it isn't necessary to worry about things like multiple definitions. But if two shaders are concatenated into the same source string, they will now be compiled into the same translation unit. If both shaders need one of these generated functions, there will now be two definitions of each needed function, which is not allowed.
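
For example, with a hypothetical helper named spvHelper (the real emitted helpers vary), naive concatenation produces something like:

```
#include <metal_stdlib>
using namespace metal;

// --- shader A ---
uint spvHelper(uint x) { return x & 0xffffu; } // emitted for shader A
kernel void shader_a(device uint* out [[buffer(0)]]) { out[0] = spvHelper(1u); }

// --- shader B, concatenated into the same source string ---
uint spvHelper(uint x) { return x & 0xffffu; } // error: redefinition of 'spvHelper'
kernel void shader_b(device uint* out [[buffer(0)]]) { out[0] = spvHelper(2u); }
```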

Multiple definitions of user shader functions and variables are also a problem, but that's orthogonal to my concern above, and can be fixed by adding a prefix to each named function or global variable.

HansKristian-Work commented 5 years ago

Disclaimer: To be clear, I don't know much about how Metal does tessellation. So I might just be talking nonsense here. :p

Hm ... So, if I understand it correctly, Metal expects you to run the vertex and tessellation control shaders on your own (in compute, I assume).

To me, the vertex and tess. control stages are quite separate: the vertex shader just works on patch vertices in isolation, while tess. control works on a chunk of vertex shader outputs based on patch size and builds up tessellation factors, which are then consumed by the fixed-function tessellator. I guess it's this buffer we need to create manually. The control shader will also need to pass data down to the post-tessellation vertex shader, I guess.

I see this as two separate compute stages (or maybe more!) that we need in order to implement this in Metal, based on Vulkan's way of doing tessellation. Is there a particular reason why vertex and tessellation control need to be merged into a single compute shader?

HansKristian-Work commented 5 years ago

Of course, there could be a separate discussion on how to support emitting multiple stages in the MSL output (from the same SPIR-V module). However, vertex and tessellation control shaders come from two completely separate SPIR-V modules in Vulkan. Merging those into one MSL file sounds very complicated.

cdavis5e commented 5 years ago

They don't strictly need to be combined... but if they were separate, we'd then need another temporary buffer to hold the output of the vertex shader to be passed to the tess. control shader.

HansKristian-Work commented 5 years ago

Right, merging is perhaps more of an optimization, but I think there are correctness implications as well.

The workgroup size in a control shader is application-defined, so it might be very difficult to get a clean mapping of threads, data, and dependencies. The number of vertex threads and tess. control threads is not the same, so we might be forced down this path anyway.

E.g., you could have a patch size of 3 vertices, but the control shader could have 7 threads per patch, or something weird like that. The thread mapping (especially since control shaders can call barrier()) would be non-trivial, and suboptimal in the extreme case.
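
To make that concrete, a merged kernel for that example would need something like the following thread layout (a sketch under the assumption of one patch per threadgroup):

```
#include <metal_stdlib>
using namespace metal;

// Threadgroup size = max(3 patch vertices, 7 control invocations) = 7.
kernel void merged_vert_tesc(uint tid      [[thread_index_in_threadgroup]],
                             uint patch_id [[threadgroup_position_in_grid]])
{
    if (tid < 3) {
        // Run the vertex shader for vertex 'tid' of patch 'patch_id',
        // writing its outputs to threadgroup memory.
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // Run tess. control invocation 'tid' on all 7 threads. Any barrier()
    // inside the control shader is now also reached by the 4 threads that
    // sat idle during the vertex stage, and a 7-thread group wastes most
    // of a 32- or 64-wide SIMD unless multiple patches share a group.
}
```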

billhollings commented 5 years ago

MSL supports namespaces. Could we use namespaces to separate combined shaders semantically?

By that I mean wrap each shader in a namespace declaration when combining?

And then perhaps have an msl_options member that can be set to indicate whether to output the boilerplate metal headers...to avoid duplicating them in a single shader file?

cdavis5e commented 5 years ago

MSL supports namespaces. Could we use namespaces to separate combined shaders semantically?

By that I mean wrap each shader in a namespace declaration when combining?

That actually might work. Then we could have the real entry point call the namespaced entry points in turn. It might also solve the problem with duplicate definitions. LLVM (which Apple naturally uses as the backend of metalfe) should be smart enough to deduplicate the identical functions.
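
Something like this (a hand-written sketch; the namespace and entry-point names are made up):

```
#include <metal_stdlib>
using namespace metal;

namespace spv_vert {
    // Decompiled vertex shader; its entry point is left undecorated.
    static float4 main0(float4 pos) { return pos; }
}

namespace spv_tesc {
    // Decompiled tess. control shader, likewise undecorated.
    static void main0(device half* factors, uint tid) { factors[tid] = half(1); }
}

// The one real entry point invokes the namespaced entry points in turn.
kernel void spv_combined(device half* factors [[buffer(0)]],
                         uint tid [[thread_position_in_grid]])
{
    float4 v = spv_vert::main0(float4(0.0f));
    // (In practice the vertex outputs would be staged for the control stage.)
    spv_tesc::main0(factors, tid);
    (void)v;
}
```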

And then perhaps have an msl_options member that can be set to indicate whether to output the boilerplate metal headers...to avoid duplicating them in a single shader file?

I don't know... Sometimes, the boilerplate is a bit different, depending on what's in the shader. For example, if there are atomics, SPIRV-Cross inserts an #include <metal_atomic>. (Shouldn't that get pulled in by #include <metal_stdlib> anyhow?)

HansKristian-Work commented 5 years ago

If the kernel entry points can be part of a namespace that sounds like a near-perfect solution for combining two SPIR-V modules into one translation unit. I assume having to declare types using the namespace in some outer entry point could get annoying.

For Metal headers, I assume we would just have some way of saying "these headers are already included, don't bother emitting them in the next translation unit." An alternative would be emitting manual include guards in the MSL output.
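
The include-guard variant could be as simple as wrapping the emitted boilerplate (hypothetical macro name):

```
#ifndef SPIRV_CROSS_MSL_PRELUDE
#define SPIRV_CROSS_MSL_PRELUDE
#include <metal_stdlib>
#include <metal_atomic>
using namespace metal;
// ... shared generated helper functions; note this only works if every
// shader in the concatenation emits the same helper set ...
#endif
```

Since concatenated shaders end up in one translation unit, the second copy of this block would simply preprocess away.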

cdavis5e commented 4 years ago

All right, it's time for me to do something about this. Clients are complaining that MoltenVK's tessellation is too slow, and we think that running the vertex and tessellation control stages separately is the cause.

Based on the discussion so far, I should have a workable solution, except for one thing: generating the actual entry point for the combined shader. For that, I need to know the union of all resources and inputs used by both the vertex and tess. control shaders, so I can generate the combined entry point prototype and the code to invoke the two shaders in turn. The easiest way to do that is to pass both shaders... which puts us right back where we started. The other way is to specify all that manually. I guess we already do that for resources (sort of), but inputs, particularly special inputs, would need work.

The other thing is that MSL doesn't allow you to call shader entry points (i.e. functions with a vertex, fragment, or kernel qualifier) directly. Unlike standard C/C++, where compilers will usually let you call main() even though you're not supposed to, Metal Clang straight up forbids this. So we'd need to ensure that the vertex and tess. control entry points are undecorated.
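
In other words (a sketch of the constraint, with made-up names):

```
#include <metal_stdlib>
using namespace metal;

// Decorated entry point: cannot be called from other MSL code.
kernel void tesc_entry(device uint* out [[buffer(0)]]) { out[0] = 1u; }

// Undecorated version of the same body: fine to call.
static void tesc_inner(device uint* out) { out[0] = 1u; }

kernel void combined(device uint* out [[buffer(0)]])
{
    // tesc_entry(out); // error: Metal Clang forbids calling a kernel function
    tesc_inner(out);    // OK
}
```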

HansKristian-Work commented 4 years ago

Oh ... boy. Before we delve into this, I'd like to understand what we can do better without adding even more complexity to the tessellation implementation. I'm very wary about supporting this as it will touch so many parts of the implementation and likely require a ton more APIs.

First, I think we need some evidence that this is indeed the bottleneck and not something else.

There are some potential avenues for improvement with the existing structure. We currently emit a barrier() to deal with loading vertex inputs into a threadgroup block. By having a barrier there and only emitting threadgroups with size == output vertices, we are likely starving AMD GPUs severely since we won't be able to run multiple patches per warp. If there is a bottleneck, I'm pretty sure this is the one.

Rather than full barriers, we can then just use a subgroup barrier, which is effectively a no-op. Going further, we might be able to elide the store-control-point-input-to-threadgroup path by using SIMD broadcasts. One way we can do this is to pass the SIMD size down to SPIRV-Cross and submit multiple control points per workgroup (a 2D workgroup size should work). Hopefully this is possible w.r.t. attribute strides, etc. This is a path that can be taken if SIMD size >= max(vertices per patch, control point count), which should always be true on AMD since the maximum number of control points is 32. On iOS we might need the old-style path, but the common case is control points <= 4, I think.
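
A sketch of the SIMD-broadcast idea (MSL 2.0; the per-patch lane layout is my assumption, not something SPIRV-Cross emits today):

```
#include <metal_stdlib>
using namespace metal;

kernel void tesc_simd(device float4* out [[buffer(0)]],
                      ushort lane [[thread_index_in_simdgroup]],
                      uint gid [[thread_position_in_grid]])
{
    const ushort verts_per_patch = 3;    // assume a patch fits in one SIMD-group
    float4 my_vert = float4(float(gid)); // stand-in for the vertex stage output

    // Nothing was written to threadgroup memory, so a SIMD-group barrier
    // (effectively free) replaces threadgroup_barrier(mem_flags::mem_threadgroup).
    simdgroup_barrier(mem_flags::mem_none);

    // Control stage: read vertex 0 of this thread's patch straight from the
    // lane that produced it, instead of loading it from a threadgroup block.
    ushort patch_base = (lane / verts_per_patch) * verts_per_patch;
    out[gid] = simd_broadcast(my_vert, patch_base);
}
```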

cdavis5e commented 4 years ago

The problem with SIMD barriers or broadcasts is that the thread execution width (and therefore the SIMD-group size) is allowed to vary depending on the kernel's register use, a bit like VK_EXT_subgroup_size_control, except it's always on. Metal won't let us know what execution width the driver picked until we compile the kernel into a pipeline. The other thing is that the number of input points is set at draw time. So I'm not sure I can actually guarantee that taking advantage of SIMD-groups would work.

Also, SIMD-group functionality isn't exposed at all until Metal 2.0. For Metal 1.2, we'd have to do it the old way.

This jumped out at me:

By having a barrier there and only emitting threadgroups with size == output vertices, we are likely starving AMD GPUs severely since we won't be able to run multiple patches per warp

I think I'm going to try to fix that first. This should also work for Metal 1.2.