FS input `location`s with `flat` differing between `component`s incompatible with modern GPUs?

SPIR-V and GLSL make it possible to declare multiple fragment shader input variables within the same location using the component layout qualifier.

The GLSL specification defines the following requirements for variables assigned to components of the same location:

> Location aliasing is causing two variables or block members to have the same location number.
>
> […] when location aliasing, the aliases sharing the location must have the same underlying numerical type and bit width (floating-point or integer, 32-bit versus 64-bit, etc.) and the same auxiliary storage and interpolation qualification.

(Section 4.4.1, "Input Layout Qualifiers", of the GLSL 4.60 specification.)
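To make the quoted rule concrete, here is a minimal sketch of a checker for it. The `FsInput` model, its field names and the example variables are all hypothetical, invented for illustration; this is not how glslang or any driver represents shader interfaces.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FsInput:
    """One fragment shader input variable (hypothetical model, not a real API)."""
    name: str
    location: int
    component: int        # first component occupied, 0..3
    numeric_class: str    # "float" or "int" (the rule groups signed/unsigned)
    bit_width: int        # 32 or 64
    interpolation: str    # "smooth", "flat" or "noperspective"
    auxiliary: str = ""   # e.g. "centroid" or "sample"; empty if none


def glsl_aliasing_violations(inputs):
    """Return the sorted locations whose aliases break the quoted GLSL rule."""
    by_location = defaultdict(list)
    for var in inputs:
        by_location[var.location].append(var)

    bad = []
    for location, aliases in by_location.items():
        # Every alias must agree on type class, width, auxiliary and interpolation.
        signatures = {
            (v.numeric_class, v.bit_width, v.auxiliary, v.interpolation)
            for v in aliases
        }
        if len(signatures) > 1:
            bad.append(location)
    return sorted(bad)


inputs = [
    FsInput("uv", 0, 0, "float", 32, "smooth"),
    FsInput("fog", 0, 2, "float", 32, "smooth"),   # same signature as uv: allowed
    FsInput("depth", 1, 0, "float", 32, "smooth"),
    FsInput("layer", 1, 1, "float", 32, "flat"),   # differs only in interpolation
]
print(glsl_aliasing_violations(inputs))  # prints [1]
```

Location 1 is exactly the case this issue is about: GLSL rejects it, while the SPIR-V and Vulkan specifications do not appear to.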

However, I am unable to find any similar limitation in the Vulkan specification (either 1.0 or 1.3 with extensions) or in the SPIR-V specification. In "Interpolation decorations", "Location Assignment" and "Component Assignment", nothing seems to prevent variables with aliasing Location decorations from having different interpolation decorations. glslang also seems to generate SPIR-V without complaint in such a case.

If this is true, however, it is a problematic decision or oversight. If I understand correctly, some hardware, including desktop GPUs produced today, requires flat shading to be enabled for whole 4-component fragment input vectors (each basically corresponding to a Location in Vulkan), so you can't mix Flat and non-Flat, and thus also floating-point and integer, variables within the same Location. Specifically, there are at least two implementations where this seems to be the case:

- All generations of AMD RDNA, GCN and TeraScale, including the most recent RDNA 3, toggle flat shading for a whole 4-component pixel shader input via the single-bit FLAT_SHADE field of the SPI_PS_INPUT_CNTL_[0-31] registers. Starting with RDNA 2, it may be possible to emulate per-component flat shading in software, since VK_KHR_fragment_shader_barycentric is supported (RDNA 2 is able to make the shader aware of which vertex is the provoking one). On earlier AMD hardware, however, without FLAT_SHADE the per-vertex values arrive in the shader in an undefined order, so it's not possible to load the one for the provoking vertex.
- I have no experience with Intel GPU internals, but according to what I can find in Mesa, the ConstantInterpolationEnable field of the 3DSTATE_SBE structure is a 32-bit mask, and just like on AMD, each bit seems to control flat shading for an entire 4-component vector as opposed to individual scalars.
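The constraint common to both implementations can be sketched as a toy model: one flat bit per location, with no way to express a location that mixes modes. The function name and the tuple layout are hypothetical, and the model assumes hardware input slots map one-to-one to Vulkan locations, which real drivers need not do. It is not actual register programming for either vendor.

```python
def flat_shade_mask(inputs):
    """Build a one-bit-per-location flat mask and list locations that mix modes.

    `inputs` is an iterable of (location, component, is_flat) tuples.
    Returns (mask, conflicts): bit N of mask is set when location N is
    entirely flat; conflicts holds locations with no valid bit value.
    """
    seen = {}           # location -> is_flat of the first variable seen there
    conflicts = set()
    for location, _component, is_flat in inputs:
        if location in seen and seen[location] != is_flat:
            conflicts.add(location)
        seen.setdefault(location, is_flat)

    mask = 0
    for location, is_flat in seen.items():
        if is_flat and location not in conflicts:
            mask |= 1 << location
    return mask, sorted(conflicts)


mask, conflicts = flat_shade_mask([
    (0, 0, False), (0, 2, False),   # location 0: all smooth
    (1, 0, True),                   # location 1: all flat
    (2, 0, False), (2, 3, True),    # location 2: mixed, neither bit value works
])
print(f"{mask:#x}", conflicts)  # prints 0x2 [2]
```

Whatever value the bit for location 2 takes, one of the two variables there is interpolated incorrectly, which is the situation described above.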

Since Vulkan was originally designed around monolithic pipelines, relaxing those requirements may have been intentional: it might have been expected that this would be resolved during VS–FS linkage, because with monolithic pipelines VS/TES/GS and FS are aware of each other's interfaces and may do remapping if needed. (Note that interpolation decorations only need to be provided in the FS; they have no effect in the vertex stages.)

However, the design has since moved towards separate compilation of stages and fast linkage. The graphics pipeline library extension contains the device property graphicsPipelineLibraryIndependentInterpolationDecoration; if it's VK_FALSE, the application must specify the interpolation decorations not only in the fragment shader but also in the last vertex stage, where they must match. It may be helpful in this situation, or it may not, I'm not sure. But the biggest user of graphics pipeline libraries, DXVK, requires that property to be true, because in Direct3D shader bytecode interpolation modifiers are specified only in the pixel shader. (You can't mix interpolation modifiers within one vec4 in Direct3D shader bytecode either, though. In the HLSL source, you have to specify the interpolation modifiers in both VS and PS so that the compiler doesn't compact variables with different interpolation modifiers into one vec4, but this information is not written to the VS bytecode; it only affects location assignment.) And the more modern VK_EXT_shader_object has no equivalent of that property, while letting applications freely mix different vertex and fragment shaders without even creating pipelines.

Doing the remapping on the GPU at runtime, for example via subroutines in the hardware shader machine code that rearrange components so that smooth and flat components end up in different vectors (both at the end of the VS and at the beginning of the FS), doesn't seem like a viable approach to me, for at least two reasons:

- It would impose very specific requirements on the hardware, namely that shaders must be able to jump to a subroutine located elsewhere. This may be possible on AMD GCN, for example, but AMD TeraScale (which I'm currently writing a Vulkan driver for) simply has no way of jumping anywhere outside the "control flow program" within the shader, and the kernel driver wouldn't even make it possible, since the subroutine would potentially be in a memory allocation separate from the program itself. This is despite everything involved in VK_EXT_shader_object translating to hardware concepts on TeraScale even more nicely and transparently than on the more modern GCN and RDNA.
- The subroutine calls would cost GPU performance, especially on the VS side, where the remapping would have to be done unconditionally (since the VS is unaware of the interpolation decorations and thus of whether there are any flat outputs). You'd either have a single subroutine, in which case all outputs would have to sit in precious general-purpose registers at the moment of the call (so you wouldn't be able to export them early), or you'd have to make lots of per-location subroutine calls. The same would apply to the FS, but with the result general-purpose registers.

But even if some kind of linkage resolved this situation by remapping all smooth and all flat varyings to different 4-component vectors, that still wouldn't cover all cases. Specifically, with maxFragmentInputComponents and its vertex counterpart set to 128, if you declare 125 smooth components and 3 flat ones, then even after compaction you need 32 vectors for the smooth components plus 1 for the flat ones, one more than the 32 available, so one vector would have to contain both smooth and flat variables, which is not possible on the hardware. To avoid that, you'd have to reduce maxFragmentInputComponents so that there is always one free vec4 in hardware for this purpose, but this would make the Vulkan limits here inferior to the Direct3D 11 ones on existing modern hardware, and that would harm DXVK and VKD3D.

Was this relaxation in Vulkan compared to OpenGL intentional? And would it be possible to retroactively modify the specification to reintroduce that limitation from OpenGL? There apparently are existing popular drivers where mixing interpolation decorations within a location produces an incorrect result, and that is basically not fully fixable on many GPUs that are still actively used.

KhronosGroup / Vulkan-Docs, issue #2170