Closed by trenouf 2 months ago
This looks good. Thanks.
As a bystander, I really appreciate your summary, Tim. It's great that you gave this area more structure and provided a high-level overview of the design space -- usually a few folks would just come in with some corner case in the design that they were aware of, and it was very difficult for me to connect the dots when that happened. Many things are much clearer to me now, although I still don't understand the details.
I have opened #545 LGC shader compilation interface proposal, to detail how the front-end would call LGC (the middle-end) to do shader compilation and linking.
Now that I have pushed #720 fetch shader for review, here are some ideas on how to go about implementing the color export shader:
That is exactly what I was thinking. Thanks.
LGC: shader compilation proposal
There are several different efforts to move away from whole-pipeline compilation in LLPC, or that will affect LLPC in the future. This proposal is to unify them in new LGC (LLPC middle-end) functionality.
There is a "partial pipeline compilation" scheme in LLPC that kind of hacks into LGC's otherwise whole-pipeline compilation, and does ELF linking in the front-end using ad-hoc ELF reading and writing code, rather than LLVM code.
Steven et al have started work on their scheme to be able to compile separate shaders (VS, FS, CS) offline to pre-populate a shader cache, with some pipeline state missing, and some pipeline state guessed with multiple combinations per shader. This builds on the front-end linking functionality above. See Github issues Cache creator tool, Relocatable elf vertex input handling, Handling descriptor offsets as relocations.
There are AMD-internal discussions about shader compilation.
This proposal is to unify these different efforts to use new LGC (LLPC middle-end) functionality. The link stage in particular requires knowledge that should be in the middle-end, such as the workings of PAL metadata, and ELF reading and writing, and needs to be shared and used by potential multiple LLPC front-ends.
Background
Existing whole pipeline compilation
Whole-pipeline compilation in LLPC works like this:
Existing(ish) shader and partial pipeline caching
Existing partial pipeline compilation
There are some changes on top of this to handle a "partial pipeline compilation" mode. Part way through step 2, LGC calls a callback provided by the front-end with a hash of each shader and the pipeline state and input/output info pertaining to it. The callback in the front-end can ask to omit a shader stage, if it finds it already has a cached ELF containing that shader. Then, the front-end has a post-compilation ELF linking step to use the part of that cached ELF for the omitted shader. This only works for VS-FS, and has some other provisos, because of the way that it plucks the part of the pipeline it needs out of a whole pipeline ELF.
This scheme has some disadvantages, especially the way that it lets the middle-end believe it is compiling a whole pipeline and then post-processes the ELF to extract the part it needs. A more holistic approach would be for the middle-end to know that it is not compiling a whole pipeline, and for the link stage to be in the middle-end, where knowledge of (for example) PAL metadata should be confined.
Steven et al's shader caching
Steven's scheme is to offline compile shaders to pre-populate a shader cache. This would involve compiling a shader with most of the pipeline state missing (principally resource descriptor layout, vertex buffer info and color export info), and with some "bounded" items in the pipeline state set to a guessed value. The resulting compiled shader ELF would be cached keyed on the input SPIR-V and (I assume) the "bounded" parts of the pipeline state that were set.
The proposal
This proposal outlines a shader compilation scheme using relocs, prologs and epilogs, and a pipeline linking stage, all handled in LGC (the LLPC middle-end).
Shader compilation vs pipeline compilation
This proposal does not cover how and when a driver decides to do shader compilation. Of the two compilation modes:
there is scope for API and/or driver changes to use shader compilation first, then kick off a background thread to do the optimized compilation and swap the result in at the next opportunity.
Early vs late shader caching
We can divide existing and proposed shader caching schemes into two types:
Early shader caching caches the shader keyed on just its input language (SPIR-V for Vulkan), possibly combined with some of the pipeline state. Steven's scheme is an example.
Late shader caching caches the shader after some part of the compilation has taken place, and keys it on the state of the compilation at that point. The existing partial pipeline compilation scheme is an example.
I propose to focus here on early shader caching, which has the following pros and cons:
Nicolai also suggests taking the existing partial pipeline compilation scheme, a late shader caching scheme, and tidying up its interface and implementation (see Inter-shader data cache tracking). One problem is that we pretty much have to choose one or the other; within one application run, you can't use both at the same time, because doing so means a shader gets cached both early and late, and the next time the same shader is seen, the early cache check always succeeds.
The choice partly depends on how you view the existing partial pipeline compilation scheme: was a late shader caching scheme chosen for the possibility of VS-FS optimizations, or was it chosen because it could be implemented without the relocs, prologs and epilogs in this proposal? I suspect the latter, and I reckon we're better off with an early shader caching scheme for the two pros I list above.
What shaders are cached
This proposal makes no attempt to cache VS, TCS, TES, GS shaders that make up part of a geometry or tessellation vertex-processing stage. The FS in such a pipeline can still be cached though. So the shader types that can be cached are:
In addition, we can compile the whole vertex-processing stage (VS-GS, VS-TCS-TES, or VS-TCS-TES-GS) without the FS, or with an already-compiled FS.
Failure of shader compilation or pipeline linking
There needs to be scope for shader compilation or pipeline linking to fail, in which case the front-end needs to do full pipeline compilation instead:
Shader compilation can fail if the compiler can tell in advance that the shader does something that will not work in the shader compilation model, for example a VS that is obviously not a standalone VS.
Pipeline linking can fail because the pipeline uses something that is not possible to implement in this model, for example:
This kind of failure is different to normal compilation failure, in that it needs to exit cleanly and clean up, because the driver or front-end is going to retry as a full pipeline compilation. If any such condition is detected in an LLVM pass flow, we need to come up with a clean exit mechanism, such as deleting all the code in the module and detecting that at the end.
Prologs and epilogs
Compiling shaders with some or all pipeline state missing and without the other shader to refer to means that the pipeline linker needs to generate prologs and epilogs.
CS prolog
If the compilation of a CS without resource descriptor layout puts its user data sgprs in the wrong order for the layout in the pipeline state, then the linker needs to generate a CS prolog that loads and/or swaps around user data sgprs. The linker picks up the descriptor set to sgpr mapping that the CS compilation used from the user data registers in the PAL metadata.
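As an illustrative sketch only (the function and metadata shapes here are hypothetical, not actual LGC code), the linker could compute the copies a CS prolog must make by comparing the descriptor-set-to-sgpr mapping recorded in the PAL metadata against the layout the pipeline state requires:

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: "have" gives the descriptor set each user data sgpr
// holds after CS compilation; "want" gives the descriptor set each sgpr must
// hold per the pipeline state. Returns (srcSgpr, dstSgpr) copies the CS
// prolog performs. A real implementation would also break copy cycles with a
// scratch sgpr, and load any descriptor sets not present in "have" at all.
std::vector<std::pair<int, int>> computeUserDataMoves(
    const std::vector<int> &have, const std::vector<int> &want) {
  std::map<int, int> setToSrcSgpr; // descriptor set -> sgpr currently holding it
  for (int sgpr = 0; sgpr != (int)have.size(); ++sgpr)
    setToSrcSgpr[have[sgpr]] = sgpr;
  std::vector<std::pair<int, int>> moves;
  for (int dst = 0; dst != (int)want.size(); ++dst) {
    auto it = setToSrcSgpr.find(want[dst]);
    if (it != setToSrcSgpr.end() && it->second != dst)
      moves.push_back({it->second, dst}); // s_mov_b32 s[dst], s[src]
  }
  return moves;
}
```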
VS prolog
If vertex buffer information is unavailable at VS compile time, then the linker needs to generate a VS prolog (a "fetch shader") that loads vertex buffer values required by the VS. The VS expects the values to be passed in vgprs, and the linker picks up details of which vertex buffer locations and in what format from extra pre-link metadata attached to the VS ELF.
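For illustration, the link-time join between the VS's pre-link metadata and the pipeline's vertex input state might look something like this (the struct shapes and names are made up for the example, not the real metadata format):

```cpp
#include <cassert>
#include <vector>

// Hypothetical metadata shapes, for illustration only.
struct VsInputUse { unsigned location, vgpr; };            // recorded by the VS compile
struct VertexAttr { unsigned location, binding, offset; }; // from pipeline state at link time
struct Fetch { unsigned vgpr, binding, offset; };          // one load the fetch shader emits

// The fetch shader only fetches attributes the VS actually consumes in a
// vgpr; attributes with no matching use are skipped.
std::vector<Fetch> planFetches(const std::vector<VsInputUse> &uses,
                               const std::vector<VertexAttr> &attrs) {
  std::vector<Fetch> fetches;
  for (const VsInputUse &use : uses)
    for (const VertexAttr &attr : attrs)
      if (attr.location == use.location)
        fetches.push_back({use.vgpr, attr.binding, attr.offset});
  return fetches;
}
```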
VS epilog
If the VS (or whole vertex-processing stage) is compiled without information on how the FS packs its parameter inputs, then the VS compilation does not know how to export parameters, and the linker needs to generate a VS epilog. The VS (or last vertex-processing-stage shader) exits with the parameter values in vgprs, and the VS epilog takes those and exports them. The linker picks up information on what parameter locations are in which vgprs and in what format from extra pre-link metadata attached to the VS ELF, and information on how parameter locations are packed and arranged from extra pre-link metadata attached to the FS ELF.
No FS prolog
No FS prolog is ever needed. FS compilation decides how to pack and arrange its input parameters.
FS epilog
If the FS is compiled without color export pipeline state, then it does not know how to do its exports, and the linker needs to generate an FS epilog. The FS exits with its color export values in vgprs (and the exec mask set to the surviving pixels after kills/demotes), and the FS epilog takes those and exports them. The linker picks up information on what color exports are in which vgprs and in what format from extra pre-link metadata attached to the FS ELF.
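To make the epilog's job concrete, here is a minimal sketch (the metadata struct and the textual "instructions" are invented for the example) of turning the FS's pre-link color export metadata into an export sequence:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical pre-link metadata: one entry per color export left in vgprs.
struct ColorExportInfo { unsigned mrt, firstVgpr, numChannels; };

// Sketch of FS epilog generation: one export per MRT, reading the vgprs the
// FS left its values in. Real code would also apply the export format from
// the color export pipeline state (e.g. 32-bit float vs 16-bit packed).
std::vector<std::string> buildFsEpilog(const std::vector<ColorExportInfo> &exports) {
  std::vector<std::string> code;
  for (const ColorExportInfo &e : exports)
    code.push_back("exp mrt" + std::to_string(e.mrt) + ", v" +
                   std::to_string(e.firstVgpr) + "..v" +
                   std::to_string(e.firstVgpr + e.numChannels - 1));
  code.push_back("s_endpgm"); // the epilog, not the FS proper, ends the program
  return code;
}
```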
Prolog/epilog compilation notes
A prolog has the same input registers as the shader it will be attached to, minus the vgprs that are generated by the prolog for passing to the shader proper. That is, the shader's SPI register settings that determine what registers are set up at wave dispatch apply to the prolog.
For a VS prolog where the VS is part of a merged shader (including the NGG case), the code to set exec needs to be in the prolog.
The exact same set of registers are also outputs from the prolog, plus the vgprs that are generated by the prolog.
A prolog/epilog is generated as an IR module then compiled. The compiled ELF is cached with the hash of the inputs to the prolog/epilog IR generator being the key.
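A minimal sketch of that cache, assuming the front-end can already hash the IR generator's inputs (the class and method names are illustrative, not a proposed API):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Sketch of caching compiled prologs/epilogs keyed on a hash of the inputs
// to the prolog/epilog IR generator.
using Elf = std::vector<char>;

class GlueShaderCache {
public:
  // Returns the cached ELF, compiling via `build` only on a cache miss.
  const Elf &lookup(size_t inputHash, const std::function<Elf()> &build) {
    auto it = cache.find(inputHash);
    if (it == cache.end())
      it = cache.insert({inputHash, build()}).first;
    return it->second;
  }

private:
  std::map<size_t, Elf> cache;
};
```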
In the context of a prolog being generated as IR then compiled, the prolog's compilation records its own register usage in the SPI_SHADER_RSRC1_VS register in PAL metadata. The linker needs to take the maximum usage of that and the shader proper.

An epilog's input registers are the same as the shader's output registers, which is the vgprs containing the values to export. (This may need to change to also have some sgprs passed for VS epilog parameter export on gfx11, if parameter exports are going to be replaced by normal off-chip memory writes.)
Prolog/epilog generation even in pipeline compilation
In a case where a particular prolog or epilog is not needed (e.g. the VS prolog when vertex buffer information is available at VS compilation time), I propose that LGC internally uses the same scheme of setting up a shader as if it is going to use the prolog/epilog (including setting up the metadata for the linker), and then uses the same code to generate the IR for the prolog/epilog as would otherwise be used at link time. Then it would merge the prolog/epilog into the shader at the IR stage, allowing optimizations from there.
The advantage of that is that there is less different code in LGC between the shader and pipeline compilation cases.
A change this causes is that the vertex buffer loads are all at the start of the VS, even in a pipeline compilation. I'm not sure whether that is good, bad or neutral for performance. (Ignoring the NGG culling issue for now.)
NGG culling
An early version of this feature should probably just ignore this case, because it is quite complex.
With NGG culling, it is advantageous to delay vertex buffer loads that are only used for parameter calculations until after the culling. Thus, for an NGG VS, there should be two VS prologs (fetch shaders). The VS compilation needs to generate the post-culling part as a separate shader, such that the second fetch shader can be glued in between them. At that point (the exit of the first shader), sgprs and vgprs need to be as at wave dispatch, except that the vgprs (vertex index etc) have been copied through LDS to account for the vertices being compacted. Also exec needs to reflect the compacted vertices.
Jumping between prolog, shader and epilog
I'm not sure how possible this is, or if there is a better idea, but:
We want the generated code to reflect that it is going to jump to the next part of the shader. So, when generating the prolog, or when generating the shader proper when there will be an epilog, we want to have an s_branch with a reloc, rather than an s_endpgm. Perhaps we could tell the backend that by defining a new function attribute giving the symbol name to s_branch to when generating what would otherwise be an s_endpgm.

Linking a prolog, shader and epilog would then just work with the s_branch. Linking could optimize that by ensuring the chunks of code are glued together in the right order, and removing a final s_branch. Alignment is a consideration: any gap can be padded with s_nops, except that any final s_waitcnts should be moved to after the s_nops as an optimization.

The LGC interface
I propose that we extend LGC (LLPC middle-end) to handle the various requirements.
Currently LGC has an interface that says:
That interface needs to be extended to allow compilation of a shader with missing or incomplete pipeline state, and to allow linking of previously-compiled shader ELFs and pipeline state.
We would probably want to implement compilation of a geometry and/or tessellation pipeline by providing LGC with IR modules for non-FS shaders, a previously-compiled shader ELF for the FS, and the pipeline state. That allows the other shaders to be compiled knowing which attribute exports will be unused by the FS so can be removed.
Compilation modes
The compilation modes LGC would support (in probable order of implementation priority) are:
Note that the above modes do not include any case where a shader is compiled separately, and then in the link stage needs to be combined with another shader to create a merged shader or an NGG prim shader.
Tuning options
As proposed by Rob, tuning options should always be made available at shader compilation time. This probably means that all tuning has to be done per shader, not per pipeline. Most tuning options are per-shader anyway, except the NGG ones, which obviously apply only to the VS in a VS-FS pipeline.
Use of the LGC interface by the front-end
VS-FS parameter optimization
As pointed out by Nicolai, the use of early shader caching limits the parameter optimizations that can be done between VS and FS, and how that is limited depends on whether you compile the VS first or the FS first. I consider that it is worth taking this hit because of the saving in compile time in the cache-hit case.
FS first
In this scheme, at VS compilation time, we know exactly how parameters are packed by the FS, so we can generate the parameter exports and we do not need a VS epilog. We can also see where the FS does not use a parameter at all, and DCE it and its calculation in the VS. However we cannot do constant parameter propagation into the FS.
VS first
In this scheme, VS compilation does not know how parameters will be laid out by the FS, so we need a VS epilog. This does allow constant parameter propagation into the FS, because the VS's parameter metadata can include an indication that a parameter is a constant and so is not being returned in a vgpr at all. FS compilation will see this metadata, and propagate the constant into the FS, saving an export/import. (Note that LLPC doesn't do this at all currently.) However, the dead parameter (one not used by the FS) optimization is limited to the VS epilog spotting that it does not need to export it. The calculation of the dead parameter, and any vertex buffer load needed only for that, does not get DCEd.
Other VS-FS parameter optimizations we miss out on
Here are some examples of potential optimizations Nicolai mentioned that we miss out on by using early shader caching:
All these are possible when doing a full pipeline compile.
LLPC front-end changes
The LLPC interface would need to change so that a partial pipeline state (and tuning options) is provided to the shader compile function. That function would then check the shader cache, and, if a compile is needed, do front-end compilation then call the LGC interface with the partial pipeline state.
The pipeline compile function would check the cache for its shaders or partial pipeline. The difficulty here is that it does not know how much of the pipeline state was known at shader compile time, so there may need to be some mechanism for multiple shader ELFs to be stored for a particular shader in the cache, with a way of finding one whose compile-time pipeline state assumptions are compatible.
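One possible shape for that compatibility check (everything here is a sketch under the assumption that compile-time state assumptions can be recorded as item/value pairs; none of these names are real LLPC API):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch: each cached entry records which pipeline state items
// were assumed at shader compile time, as (item, value) pairs. An entry is
// compatible with a pipeline if every assumption matches the pipeline's
// actual state; items the shader compile left unknown constrain nothing.
struct CachedShader {
  std::map<std::string, unsigned> assumedState;
  std::vector<char> elf;
};

const CachedShader *
findCompatible(const std::vector<CachedShader> &entries,
               const std::map<std::string, unsigned> &pipelineState) {
  for (const CachedShader &entry : entries) {
    bool ok = true;
    for (const auto &kv : entry.assumedState) {
      auto it = pipelineState.find(kv.first);
      if (it == pipelineState.end() || it->second != kv.second) {
        ok = false;
        break;
      }
    }
    if (ok)
      return &entry;
  }
  return nullptr;
}
```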
amdllpc
Steven proposes using a modified amdllpc as his offline shader compile tool. Thus, that will be calling the LLPC shader compile function with an incomplete pipeline state containing values for the "bounded" items.
The proposed un-pipeline-linked ELF module
Such an ELF is the result of anything other than full pipeline compilation. It contains various things to represent the parts of the pipeline state or inter-shader-stage linking information that was unavailable at the time it was compiled.
Representation of metadata needed for linking
Some of the items below list metadata that needs to be left in the unlinked ELF for the link stage to read. I propose that we will define a new section in the PAL metadata msgpack tree to put these in. The link stage will remove that metadata.
Representation of final PAL metadata
Some parts of the PAL metadata can be directly generated in a shader compile before linking. Hopefully all the link stage needs to do is merge the two msgpack trees, ORing together any register that appears in both. That handles the case that the same register has a part used by VS and a part used by FS.
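Reduced to the register map inside the msgpack tree, the merge is simple; a sketch (using a plain std::map in place of the real msgpack document):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sketch of the link-time merge of PAL metadata register maps from two
// shader compiles: registers present in only one tree are copied, and a
// register present in both (e.g. one with a VS-owned field and an FS-owned
// field) has its two values ORed together.
using RegMap = std::map<uint32_t, uint32_t>; // register offset -> value

RegMap mergeRegisters(const RegMap &a, const RegMap &b) {
  RegMap out = a;
  for (const auto &kv : b)
    out[kv.first] |= kv.second; // operator[] default-constructs 0 if absent
  return out;
}
```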
Resource descriptor layout
If resource descriptor layout was unavailable at shader compile time, then the load of a descriptor from its descriptor table has a reloc on its offset where the symbol name gives the descriptor set and binding. Such relocs are resolved at link time, when the resource descriptor layout pipeline state is available. This work is already underway by Steven from Gibraltar.
In addition, an array of image or sampler descriptors needs a reloc for the array stride. That is different depending on whether it is actually an array of combined image+samplers, and you can't tell at shader compile time.
For a descriptor set pointer that can fit into a user data sgpr, the PAL metadata register for that user data sgpr contains the descriptor set number. The link stage updates that to give the spill table offset. Work on this mechanism is underway by David Zhou in AMD (although in the context of the front-end ELF linking mechanism). There needs to be some way of telling whether the PAL metadata register represents a fully-linked spill table offset, or an unlinked descriptor set number. I believe David's work already does that.
For a descriptor set pointer that cannot fit into a user data sgpr, it is loaded from the spill table with a reloc on the offset whose symbol gives the descriptor set. That reloc is resolved at link time.
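A sketch of how such a reloc could be resolved at link time. The symbol naming scheme here ("descoffset.&lt;set&gt;.&lt;binding&gt;") is invented purely for the example; the real symbol format is whatever Steven's work defines:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative reloc resolution: the symbol name encodes the descriptor set
// and binding, and the linker looks the pair up in the resource descriptor
// layout from the pipeline state to get the byte offset within the
// descriptor table.
using DescLayout = std::map<std::pair<unsigned, unsigned>, uint32_t>;

uint32_t resolveDescriptorOffset(const std::string &symbol,
                                 const DescLayout &layout) {
  // Expected form: "descoffset.<set>.<binding>" (hypothetical naming).
  size_t dot1 = symbol.find('.');
  size_t dot2 = symbol.find('.', dot1 + 1);
  unsigned set = std::stoul(symbol.substr(dot1 + 1, dot2 - dot1 - 1));
  unsigned binding = std::stoul(symbol.substr(dot2 + 1));
  return layout.at({set, binding});
}
```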
We will have to ban the driver putting any descriptors into the top level of the descriptor layout:
Currently, if a descriptor set contains both dynamic and non-dynamic descriptors, the driver puts the dynamic ones in the top level. This proposal would not be able to find them.
Banning that also avoids the use of compact descriptors, which we also cannot cope with in this proposal.
A compute shader's user data has a restriction on which spill table entries can be put into user data sgprs, and in what order. For that reason, the link stage may need to prepend code to load and/or swap around sgprs for descriptor set pointers.
Vertex inputs
If vertex input information is unavailable at VS compile time, then vertex inputs are passed into the vertex shader in vgprs, with metadata saying which inputs they are and what type. The link stage then constructs a "fetch shader", and glues it on to the front of the shader.
The fetch shader has an ABI where the vertex shader's input registers are also the fetch shader's inputs and outputs, except that the vertex input values are obviously not part of the fetch shader's inputs.
Color exports
If color export information is unavailable at FS compile time, then color exports are passed out of the fragment shader in vgprs, with metadata saying which exports they are and what type. The link stage then constructs an FS epilog, and glues it on to the back of the shader. The shader exits with exec set to pixels that are not killed/demoted.
The following pipeline state items also affect color export code, so the absence of any of them also forces the use of an FS epilog:
Parameter exports and attribute inputs
In a shader compile, parameter exports are passed out of the last stage vertex-processing shader in vgprs, with metadata saying which parameters they are. In an unlinked fragment shader, attributes are packed and there is metadata saying how that is done. The link stage then ties them up, and adds an epilog to the last stage vertex-processing stage.
enableMultiView
enableMultiView has several impacts, including what gl_Layer and gl_ViewIndex actually are. It looks like the best way of handling this, if enableMultiView is unavailable at VS compile time, is to compile the two alternatives for each affected thing inside an if..else..endif with a reloc as the condition.
perSampleShading
If the perSampleShading item is unavailable at FS compile time, and the FS uses gl_SampleMask or gl_PointCoord, then the compiler needs to generate code for both alternatives inside an if..else..endif where the condition is a reloc.

PAL metadata items
Certain pipeline state items do not affect compilation except for being copied straight into PAL metadata registers:
In a shader compile with a link stage, it is the link stage that copies these items into PAL metadata.
Relocatable items
As pointed out by Steven's document pipeline state - Sheet1 (1).pdf, the following items are relocatable. That is, if the item is unavailable in pipeline state at shader compile time, a simple 32-bit constant load with a reloc will work, so it can be resolved at link time:
We should probably add the shadow descriptor table high 32 bits to this too.
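Resolving such a reloc is just patching a 32-bit literal in the instruction stream. A simplified sketch (real code would use the ELF reloc records rather than this hand-rolled struct):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of resolving 32-bit constant relocs: each reloc names a symbol and
// the word offset of the literal to patch. This is a simplification of real
// ELF relocs against the instruction stream.
struct Reloc32 {
  size_t wordOffset;
  std::string symbol;
};

void applyRelocs(std::vector<uint32_t> &code,
                 const std::vector<Reloc32> &relocs,
                 const std::map<std::string, uint32_t> &values) {
  for (const Reloc32 &r : relocs)
    code[r.wordOffset] = values.at(r.symbol); // throws if the symbol is unknown
}
```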
Specialization constants
Steven's document claims that SPIR-V specialization constants can be handled by relocs. That is only partly true:
Where a specialization constant is used somewhere a reloc can be used (an operand to an instruction in function code), then the SPIR-V reader could call a new Builder function "get reloc value". The name of the symbol referenced by the reloc is private to the SPIR-V LLPC front-end, and is not understood by LGC.
Where a specialization constant is used somewhere a reloc cannot be used (e.g. the size of an array type), then the SPIR-V reader uses the default value for that constant, and it somehow needs to record what value it used so the linker can later check that the specialization constants supplied with the pipeline do not clash with that. If they do clash, then the link fails and the front-end needs to start again compiling that shader.
At the link stage, the front-end needs to supply a list of (symbol, value) pairs to the linker to satisfy the relocs. I'm not sure whether it is worth encapsulating that in an ELF.
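The clash check for baked-in defaults could be as simple as the following sketch (assuming the shader compile records its baked-in values as id/value pairs; the function name is made up):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sketch of the clash check: the shader compile records the default value it
// baked in for each spec constant it could not defer to a reloc; at link
// time, any pipeline-supplied value for one of those constants must match,
// or the link fails and the front-end recompiles the shader.
using SpecConstants = std::map<uint32_t, uint32_t>; // spec id -> value

bool specConstantsCompatible(const SpecConstants &bakedDefaults,
                             const SpecConstants &supplied) {
  for (const auto &kv : bakedDefaults) {
    auto it = supplied.find(kv.first);
    if (it != supplied.end() && it->second != kv.second)
      return false; // clash: the shader baked in a different value
  }
  return true;
}
```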
Bounded items that we need to make relocatable
These are pipeline state items that Steven's document lists as "bounded", that is, there is a limited range of values that each one can take. Gibraltar's proposal to handle this in their offline shader cache populating scheme is to compile a shader multiple times with these items set to the most popular values, in the hope of covering most cases that the shader is used in a pipeline.
The implication of this is that the shader cache needs to be able to keep multiple ELFs for the same shader, with different assumptions about these pipeline state items. When a pipeline compile looks for a cached shader, there needs to be some mechanism where it can find the one with a compatible state for these items.
However, for the purposes of app runtime shader compilation, we need to find some way of making these fixuppable by the link stage. In some cases, that might involve generating code that can handle all possibilities, and then having a branch with a reloc to select the required alternative.
NGG control items
These items are supplied to the compiler through pipeline state to save needing to load them at runtime from the primitive shader table. If they are unavailable at shader compile time, then the compiler is forced to load from the primitive shader table.
These items are similar, except certain settings also need to force NGG pass-through mode. Therefore, if the items are unavailable at shader compile time, we need to force NGG pass-through mode.
Items only needed for tessellation or geometry
These pipeline state items are only used for tessellation or geometry. Because this proposal insists that a vertex-processing half-pipeline with tessellation or geometry has to be compiled with full pipeline state, these items do not need to be handled by a reloc:
The link stage
The link stage needs to:
A prolog is generated to end with an s_branch with a reloc to branch to the VS.

Where an FS needs an epilog (color export information was unavailable at shader compile time), it is generated with an s_branch with a reloc instead of an s_endpgm, to branch to its epilog code.

In both cases, we can optimize by gluing sections in the right order, and applying the optimization that a chunk of code that ends with an s_branch can have the s_branch removed and turned into a fallthrough. There may need to be special handling for a prolog to ensure that the CS or VS remains instruction-cache-line-aligned, such as inserting s_nop padding before the fetch shader.

Prologs will be generated as IR then compiled. They will be cached, so that will not happen very often.
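The gluing optimization above can be sketched like this (chunks as lists of textual instructions purely for illustration; the cache-line alignment and s_waitcnt refinements are not shown):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the glue optimization: chunks are concatenated in execution
// order, so a chunk whose final instruction is the s_branch to the next
// chunk can drop it and fall through. (s_nop padding for instruction-cache
// alignment, and moving a final s_waitcnt past the padding, would be a
// further refinement.)
using Chunk = std::vector<std::string>;

Chunk glueChunks(const std::vector<Chunk> &chunks) {
  Chunk out;
  for (size_t i = 0; i != chunks.size(); ++i) {
    Chunk c = chunks[i];
    bool hasNext = i + 1 != chunks.size();
    if (hasNext && !c.empty() && c.back().rfind("s_branch", 0) == 0)
      c.pop_back(); // the next chunk follows immediately: fall through
    out.insert(out.end(), c.begin(), c.end());
  }
  return out;
}
```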