GPUOpen-Drivers / llpc

LLVM-Based Pipeline Compiler

LGC: shader compilation proposal #507

Closed: trenouf closed this 2 months ago

trenouf commented 4 years ago


There are several different efforts to move away from whole-pipeline compilation in LLPC, or that will affect LLPC in the future. This proposal is to unify them in new LGC (LLPC middle-end) functionality. The link stage in particular requires knowledge that belongs in the middle-end, such as the workings of PAL metadata and ELF reading and writing, and it needs to be shared by the potential multiple LLPC front-ends.

Background

Existing whole pipeline compilation

Whole-pipeline compilation in LLPC works like this:

  1. For each shader, run the front-end shader compilation: SPIR-V reader and various "lowering" passes use Builder to construct the IR for a shader. This phase does not use pipeline state.
  2. LGC (the middle-end) is given the pipeline state, and it links the shader IR modules into a pipeline IR module.
  3. LGC runs its middle-end passes and optimizations, then passes the resulting pipeline IR module to the AMDGPU back-end for pipeline ELF generation.
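
As a minimal sketch of this flow (all the names here are illustrative stand-ins, not the real LLPC/LGC API):

```cpp
#include <cstdint>
#include <memory>
#include <vector>
#include "llvm/IR/Module.h"

struct SpirvBlob { std::vector<uint32_t> words; };
struct PipelineState; // descriptor layout, vertex buffer info, color export info, ...

// Assumed helpers, one per phase:
std::unique_ptr<llvm::Module> frontEndCompile(const SpirvBlob &spirv);
std::unique_ptr<llvm::Module> lgcLink(std::vector<std::unique_ptr<llvm::Module>> shaders,
                                      const PipelineState &state);
std::vector<uint8_t> lgcGenerate(std::unique_ptr<llvm::Module> pipelineModule);

std::vector<uint8_t> compileWholePipeline(const PipelineState &state,
                                          const std::vector<SpirvBlob> &shaders) {
  std::vector<std::unique_ptr<llvm::Module>> modules;
  for (const SpirvBlob &spirv : shaders)
    modules.push_back(frontEndCompile(spirv));               // 1: no pipeline state used
  auto pipelineModule = lgcLink(std::move(modules), state);  // 2: link into pipeline module
  return lgcGenerate(std::move(pipelineModule));             // 3: passes + AMDGPU back-end
}
```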

Existing(ish) shader and partial pipeline caching

Existing partial pipeline compilation

There are some changes on top of this to handle a "partial pipeline compilation" mode. Partway through step 2, LGC calls a callback provided by the front-end with, for each shader, a hash of the shader plus the pipeline state and input/output info pertaining to it. The callback in the front-end can ask to omit a shader stage if it finds it already has a cached ELF containing that shader. The front-end then has a post-compilation ELF linking step that reuses the relevant part of the cached ELF for the omitted shader. This only works for VS-FS, and has some other provisos, because of the way it plucks the part of the pipeline it needs out of a whole-pipeline ELF.

This scheme has some disadvantages, especially the way that it lets the middle-end think it is compiling a whole pipeline, but then post-processes the ELF to extract the part it needs. A more holistic approach would be for the middle-end to know that it is not compiling a whole pipeline, and for the link stage to live in the middle-end, where knowledge of (for example) PAL metadata should be confined.

Steven et al's shader caching

Steven's scheme is to compile shaders offline to pre-populate a shader cache. This would involve compiling a shader with most of the pipeline state missing (principally resource descriptor layout, vertex buffer info and color export info), and with some "bounded" items in the pipeline state set to a guessed value. The resulting compiled shader ELF would be cached, keyed on the input SPIR-V and (I assume) the "bounded" parts of the pipeline state that were set.

The proposal

This proposal outlines a shader compilation scheme using relocs, prologs and epilogs, and a pipeline linking stage, all handled in LGC (the LLPC middle-end).

Shader compilation vs pipeline compilation

This proposal does not cover how and when a driver decides to do shader compilation. Of the two compilation modes (quick shader compilation with partial pipeline state, and fully optimized whole-pipeline compilation), there is scope for API and/or driver changes to use shader compilation first, then kick off a background thread to do the optimized compilation and swap the result in at the next opportunity.

Early vs late shader caching

We can divide existing and proposed shader caching schemes into two types: early shader caching, where a shader is cached as compiled, before the full pipeline state is known; and late shader caching, where a shader's part of a compiled pipeline is extracted and cached afterwards, as the existing partial pipeline compilation scheme does.

I propose to focus here on early shader caching. Its pros are that a cache hit avoids compiling the shader at all, and that it is not restricted to VS-FS pipelines; its con is that it limits the inter-shader optimizations that can be done (see "VS-FS parameter optimization" below).

Nicolai also suggests taking the existing partial pipeline compilation scheme, a late shader caching scheme, and tidying up its interface and implementation (see "Inter-shader data cache tracking"). One problem is that we pretty much have to choose one or the other; within one application run, you can't use both at the same time. Trying to means that a shader gets cached both early and late, and the next time the same shader is seen, the early cache check always succeeds.

The choice partly depends on how you view the existing partial pipeline compilation scheme: was a late shader caching scheme chosen for the possibility of VS-FS optimizations, or was it chosen because it could be implemented without the relocs, prologs and epilogs of this proposal? I suspect the latter, and I reckon we're better off with an early shader caching scheme for the two pros listed above.

What shaders are cached

This proposal makes no attempt to cache the VS, TCS, TES and GS shaders that make up part of a geometry or tessellation vertex-processing stage. The FS in such a pipeline can still be cached though. So the shader types that can be cached are:

  * CS
  * FS
  * VS in a non-tessellation, non-geometry pipeline

In addition, we can compile the whole vertex-processing stage (VS-GS, VS-TCS-TES, or VS-TCS-TES-GS) without the FS, or with an already-compiled FS.

Failure of shader compilation or pipeline linking

There needs to be scope for shader compilation or pipeline linking to fail, in which case the front-end needs to do full pipeline compilation instead.

This kind of failure is different to normal compilation failure, in that it needs to exit cleanly and clean up, because the driver or front-end is going to retry as a full pipeline compilation. If any such condition is detected in an LLVM pass flow, we need to come up with a clean exit mechanism, such as deleting all the code in the module and detecting that at the end.

Prologs and epilogs

Compiling shaders with some or all pipeline state missing and without the other shader to refer to means that the pipeline linker needs to generate prologs and epilogs.

CS prolog

If the compilation of a CS without resource descriptor layout puts its user data sgprs in the wrong order for the layout in the pipeline state, then the linker needs to generate a CS prolog that loads and/or swaps around user data sgprs. The linker picks up the descriptor set to sgpr mapping that the CS compilation used from the user data registers in the PAL metadata.

VS prolog

If vertex buffer information is unavailable at VS compile time, then the linker needs to generate a VS prolog (a "fetch shader") that loads vertex buffer values required by the VS. The VS expects the values to be passed in vgprs, and the linker picks up details of which vertex buffer locations and in what format from extra pre-link metadata attached to the VS ELF.

VS epilog

If the VS (or whole vertex-processing stage) is compiled without information on how the FS packs its parameter inputs, then the VS compilation does not know how to export parameters, and the linker needs to generate a VS epilog. The VS (or last vertex-processing-stage shader) exits with the parameter values in vgprs, and the VS epilog takes those and exports them. The linker picks up information on what parameter locations are in which vgprs and in what format from extra pre-link metadata attached to the VS ELF, and information on how parameter locations are packed and arranged from extra pre-link metadata attached to the FS ELF.

No FS prolog

No FS prolog is ever needed. FS compilation decides how to pack and arrange its input parameters.

FS epilog

If the FS is compiled without color export pipeline state, then it does not know how to do its exports, and the linker needs to generate an FS epilog. The FS exits with its color export values in vgprs (and the exec mask set to the surviving pixels after kills/demotes), and the FS epilog takes those and exports them. The linker picks up information on what color exports are in which vgprs and in what format from extra pre-link metadata attached to the FS ELF.
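
To make the "extra pre-link metadata" idea concrete, here is one possible shape for the color export information the FS leaves for the linker. The field names are invented for illustration; the real schema would live in the PAL metadata msgpack tree (see "Representation of metadata needed for linking" below):

```cpp
#include <cstdint>
#include <vector>

// One entry per color export left unexported by an FS compiled without
// color export state. The linker reads these to build the FS epilog.
struct UnlinkedColorExport {
  unsigned location;      // color attachment (MRT) location
  unsigned firstVgpr;     // first vgpr holding the value on FS exit
  unsigned numComponents; // consecutive vgprs occupied
  bool isInteger;         // component type, to pick the export conversion
  bool isSigned;          // signedness for integer formats
};

using ColorExportMetadata = std::vector<UnlinkedColorExport>;
```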

Prolog/epilog compilation notes

A prolog has the same input registers as the shader it will be attached to, minus the vgprs that are generated by the prolog for passing to the shader proper. That is, the shader's SPI register settings that determine what registers are set up at wave dispatch apply to the prolog.

For a VS prolog where the VS is part of a merged shader (including the NGG case), the code to set exec needs to be in the prolog.

Exactly the same set of registers is also output from the prolog, plus the vgprs that the prolog generates.

A prolog/epilog is generated as an IR module, then compiled. The compiled ELF is cached, keyed on a hash of the inputs to the prolog/epilog IR generator.
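
A sketch of that cache, assuming the key is a hash over everything the IR generator consumes (register layout, formats, and so on):

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

using GlueShaderHash = std::array<uint8_t, 16>; // e.g. an MD5 of the generator inputs
using ElfBlob = std::vector<uint8_t>;

class GlueShaderCache {
public:
  // Returns the cached ELF, or runs generateAndCompile() (IR generation plus
  // back-end compilation) and caches its result.
  template <typename GenerateFn>
  const ElfBlob &getOrCompile(const GlueShaderHash &key, GenerateFn generateAndCompile) {
    std::lock_guard<std::mutex> lock(m_mutex);
    auto it = m_cache.find(key);
    if (it == m_cache.end())
      it = m_cache.emplace(key, generateAndCompile()).first;
    return it->second;
  }

private:
  std::mutex m_mutex;
  std::map<GlueShaderHash, ElfBlob> m_cache;
};
```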


An epilog's input registers are the same as the shader's output registers, which is the vgprs containing the values to export. (This may need to change to also have some sgprs passed for VS epilog parameter export on gfx11, if parameter exports are going to be replaced by normal off-chip memory writes.)

Prolog/epilog generation even in pipeline compilation

In a case where a particular prolog or epilog is not needed (e.g. the VS prolog when vertex buffer information is available at VS compilation time), I propose that LGC internally uses the same scheme of setting up a shader as if it is going to use the prolog/epilog (including setting up the metadata for the linker), and then uses the same code to generate the IR for the prolog/epilog as would otherwise be used at link time. Then it would merge the prolog/epilog into the shader at the IR stage, allowing optimizations from there.

The advantage of that is that less code in LGC differs between the shader and pipeline compilation cases.

A change this causes is that the vertex buffer loads are all at the start of the VS, even in a pipeline compilation. I'm not sure whether that is good, bad or neutral for performance. (Ignoring the NGG culling issue for now.)
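
One way to realize the merge step, assuming the prolog generator produces a separate IR module (the generator function here is a hypothetical name):

```cpp
#include <memory>
#include "llvm/IR/Module.h"
#include "llvm/Linker/Linker.h"

// Assumed: builds the same IR that the pipeline linker would build for this
// glue shader at link time.
std::unique_ptr<llvm::Module> generateVsPrologIr(llvm::LLVMContext &context);

// Merges the prolog into the shader module; returns true on error. After
// this, normal middle-end optimization (inlining, DCE, scheduling of the
// vertex fetches) can happen across the prolog/shader boundary.
bool mergePrologIntoShader(llvm::Module &shaderModule) {
  std::unique_ptr<llvm::Module> prolog = generateVsPrologIr(shaderModule.getContext());
  return llvm::Linker::linkModules(shaderModule, std::move(prolog));
}
```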

NGG culling

An early version of this feature should probably just ignore this case, because it is quite complex.

With NGG culling, it is advantageous to delay vertex buffer loads that are only used for parameter calculations until after the culling. Thus, for an NGG VS, there should be two VS prologs (fetch shaders). The VS compilation needs to generate the post-culling part as a separate shader, such that the second fetch shader can be glued in between them. At that point (the exit of the first shader), sgprs and vgprs need to be as at wave dispatch, except that the vgprs (vertex index etc) have been copied through LDS to account for the vertices being compacted. Also exec needs to reflect the compacted vertices.

Jumping between prolog, shader and epilog

I'm not sure how possible this is, or if there is a better idea, but:

We want the generated code to reflect that it is going to jump to the next part of the shader. So, when generating the prolog, or when generating the shader proper when there will be an epilog, we want to have an s_branch with a reloc, rather than an s_endpgm. Perhaps we could tell the backend that by defining a new function attribute giving the symbol name to s_branch to when generating what would otherwise be an s_endpgm.
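
Expressed as code, the front-end/middle-end side of that idea could be as small as this; the attribute name and symbol name are invented, and the AMDGPU back-end would need a matching change to honor the attribute where it would otherwise emit s_endpgm:

```cpp
#include "llvm/IR/Function.h"

void markShaderWithEpilog(llvm::Function &fsEntry) {
  // Hypothetical attribute: emit "s_branch <symbol>" (with a reloc on the
  // symbol) in place of the final s_endpgm.
  fsEntry.addFnAttr("amdgpu-tail-branch-symbol", "_amdgpu_ps_epilog");
}
```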

Linking a prolog, shader and epilog would then just work with the s_branch. Linking could optimize that by ensuring the chunks of code are glued together in the right order, and removing a final s_branch. Alignment is a consideration: the shader proper needs to stay instruction-cache-line-aligned when a prolog is glued on in front of it (see "The link stage" below).

The LGC interface

I propose that we extend LGC (LLPC middle-end) to handle the various requirements.

Currently LGC has an interface for whole-pipeline compilation only: given the shaders' IR modules and the full pipeline state, it produces a pipeline ELF.

That interface needs to be extended to allow compilation of a shader with missing or incomplete pipeline state, and to allow linking of previously compiled shader ELFs together with pipeline state.

We would probably want to implement compilation of a geometry and/or tessellation pipeline by providing LGC with IR modules for the non-FS shaders, a previously compiled shader ELF for the FS, and the pipeline state. That allows the other shaders to be compiled knowing which attribute exports will be unused by the FS and so can be removed.

Compilation modes

The compilation modes LGC would support (in probable order of implementation priority) are:

  1. Pipeline compilation, as now. Must be provided with full pipeline state. Generates a pipeline ELF satisfying the PAL pipeline ELF spec.
  2. Compilation of a single shader with missing or partial pipeline state. The shader must be CS, FS, or VS in a non-tessellation non-geometry pipeline. For VS or FS, this may or may not be provided with the other shader already compiled, which would provide parameter information. Generates an ELF that needs to be pipeline linked. Then there is a link stage in LGC that takes such ELFs and generates a pipeline ELF satisfying the PAL pipeline ELF spec.
  3. Compilation of the vertex-processing part of a geometry or tessellation pipeline, with full pipeline state. This may or may not be provided with the already-compiled FS ELF, which would supply parameter layout information. Generates an ELF that needs to be pipeline linked.

Note that the above modes do not include any case where a shader is compiled separately, and then in the link stage needs to be combined with another shader to create a merged shader or an NGG prim shader.
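
A rough shape for the extended entry points implied by these modes (a sketch only; #545 details the actual interface proposal):

```cpp
#include <cstdint>
#include <memory>
#include <vector>
#include "llvm/IR/Module.h"

struct PipelineState; // possibly partial for modes 2 and 3
using ElfBlob = std::vector<uint8_t>;

// Modes 1-3: compile IR modules with whatever pipeline state is available.
// Full state gives a final pipeline ELF; otherwise needsLink is set and the
// result is an unlinked ELF carrying extra pre-link metadata.
ElfBlob compile(std::vector<std::unique_ptr<llvm::Module>> modules,
                const PipelineState &state, bool &needsLink);

// The link stage: combines previously compiled unlinked ELFs with the
// now-complete pipeline state, generating prologs/epilogs and resolving
// relocs. An empty result signals the "retry as full pipeline compile" case.
ElfBlob link(const std::vector<ElfBlob> &unlinkedElfs, const PipelineState &state);
```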

Tuning options

As proposed by Rob, tuning options should always be made available at shader compilation time. This probably means that all tuning has to be done per shader, not per pipeline. Most tuning options are per-shader anyway, except the NGG ones, which apply only to the vertex-processing stage (obviously the VS in a VS-FS pipeline).

Use of the LGC interface by the front-end

VS-FS parameter optimization

As pointed out by Nicolai, the use of early shader caching limits the parameter optimizations that can be done between VS and FS, and how that is limited depends on whether you compile the VS first or the FS first. I consider that it is worth taking this hit because of the saving in compile time in the cache-hit case.

FS first

In this scheme, at VS compilation time, we know exactly how parameters are packed by the FS, so we can generate the parameter exports and we do not need a VS epilog. We can also see where the FS does not use a parameter at all, and DCE it and its calculation in the VS. However we cannot do constant parameter propagation into the FS.

VS first

In this scheme, VS compilation does not know how parameters will be laid out by the FS, so we need a VS epilog. This does allow constant parameter propagation into the FS, because the VS's parameter metadata can include an indication that a parameter is a constant so is not being returned in a vgpr at all. FS compilation will see this metadata, and propagate the constant into the FS, saving an export/import. (Note that LLPC doesn't do this at all currently.) However, the dead parameter (one not used by the FS) optimization is limited to the VS epilog spotting it does not need to export it. The calculation of the dead parameter, and any vertex buffer load needed only for that, does not get DCEd.

Other VS-FS parameter optimizations we miss out on

Nicolai mentioned some further potential optimizations between VS and FS that we miss out on by using early shader caching; all of them are possible when doing a full pipeline compile.

LLPC front-end changes

The LLPC interface would need to change so that a partial pipeline state (and tuning options) is provided to the shader compile function. That function would then check the shader cache, and, if a compile is needed, do front-end compilation then call the LGC interface with the partial pipeline state.

The pipeline compile function would check the cache for its shaders or partial pipeline. The difficulty here is that it does not know how much of the pipeline state was known at shader compile time, so there may need to be some mechanism for multiple shader ELFs to be stored for a particular shader in the cache, with a way of finding one whose known pipeline state at the time is compatible.
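
A sketch of such a cache, where isCompatible() is the open question from the paragraph above (deciding whether the state assumed at shader compile time matches the pipeline now being built):

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct AssumedState;  // the subset of pipeline state baked into a compile
struct PipelineState;
using ShaderHash = uint64_t; // hash of the input SPIR-V (plus tuning options)
using ElfBlob = std::vector<uint8_t>;

bool isCompatible(const AssumedState &assumed, const PipelineState &actual);

struct CachedShaderElf { AssumedState assumed; ElfBlob elf; };

class ShaderCache {
public:
  const ElfBlob *lookup(ShaderHash hash, const PipelineState &actual) const {
    auto it = m_entries.find(hash);
    if (it == m_entries.end())
      return nullptr;
    for (const CachedShaderElf &entry : it->second)
      if (isCompatible(entry.assumed, actual))
        return &entry.elf; // first compatible variant wins
    return nullptr;
  }

private:
  // Multiple ELFs per shader, compiled under different state assumptions.
  std::map<ShaderHash, std::vector<CachedShaderElf>> m_entries;
};
```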

amdllpc

Steven proposes using a modified amdllpc as his offline shader compile tool. Thus it will call the LLPC shader compile function with an incomplete pipeline state containing values for the "bounded" items.

The proposed un-pipeline-linked ELF module

Such an ELF is the result of anything other than full pipeline compilation. It contains various things to represent the parts of the pipeline state or inter-shader-stage linking information that were unavailable at the time it was compiled.

Representation of metadata needed for linking

Some of the items below list metadata that needs to be left in the unlinked ELF for the link stage to read. I propose that we will define a new section in the PAL metadata msgpack tree to put these in. The link stage will remove that metadata.
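
For illustration, writing such a section with LLVM's msgpack document API could look like this; the section name and key are invented, and only the llvm::msgpack::Document usage is real:

```cpp
#include "llvm/BinaryFormat/MsgPackDocument.h"

void addPreLinkMetadata(llvm::msgpack::Document &doc) {
  auto root = doc.getRoot().getMap(/*Convert=*/true);
  // Hypothetical section holding pre-link info; the link stage deletes it
  // after reading it.
  auto preLink = root[doc.getNode("amdpal.prelink")].getMap(/*Convert=*/true);
  preLink[doc.getNode("param_export_count")] = doc.getNode(uint64_t(3));
}
```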

Representation of final PAL metadata

Some parts of the PAL metadata can be directly generated in a shader compile before linking. Hopefully all the link stage needs to do is merge the two msgpack trees, ORing together any register that appears in both. That handles the case that the same register has a part used by VS and a part used by FS.
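
The register part of the merge could be as simple as this (plain containers rather than msgpack nodes, for brevity):

```cpp
#include <cstdint>
#include <map>

using RegisterMap = std::map<uint32_t, uint32_t>; // register offset -> value

// OR together any register that appears in both maps, covering the case where
// one part of the register is used by the VS and another part by the FS.
void mergeRegisters(RegisterMap &dest, const RegisterMap &src) {
  for (const auto &[reg, value] : src)
    dest[reg] |= value; // operator[] default-initializes a new register to 0
}
```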

Resource descriptor layout

If resource descriptor layout was unavailable at shader compile time, then the load of a descriptor from its descriptor table has a reloc on its offset where the symbol name gives the descriptor set and binding. Such relocs are resolved at link time, when the resource descriptor layout pipeline state is available. This work is already underway by Steven from Gibraltar.

In addition, an array of image or sampler descriptors needs a reloc for the array stride. The stride differs depending on whether it is actually an array of combined image+samplers, and you can't tell that at shader compile time.

For a descriptor set pointer that can fit into a user data sgpr, the PAL metadata register for that user data sgpr contains the descriptor set number. The link stage updates that to give the spill table offset. Work on this mechanism is underway by David Zhou in AMD (although in the context of the front-end ELF linking mechanism). There needs to be some way of telling whether the PAL metadata register represents a fully-linked spill table offset, or an unlinked descriptor set number. I believe David's work already does that.

For a descriptor set pointer that cannot fit into a user data sgpr, it is loaded from the spill table with a reloc on the offset whose symbol gives the descriptor set. That reloc is resolved at link time.
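
A sketch of the link-time resolution, assuming a symbol naming scheme of the form "doff_<set>_<binding>" (the real scheme is whatever Steven's in-flight change defines):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

struct DescriptorLayout; // the resource descriptor layout pipeline state
// Assumed helper: byte offset of (set, binding) within its descriptor table.
uint32_t lookupDescriptorOffset(const DescriptorLayout &layout,
                                unsigned set, unsigned binding);

// Returns true and sets *value if this reloc symbol is a descriptor offset;
// the caller then patches the instruction word(s) the reloc refers to.
bool resolveDescriptorReloc(const DescriptorLayout &layout,
                            const std::string &symbolName, uint32_t *value) {
  unsigned set = 0, binding = 0;
  if (std::sscanf(symbolName.c_str(), "doff_%u_%u", &set, &binding) != 2)
    return false;
  *value = lookupDescriptorOffset(layout, set, binding);
  return true;
}
```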

We will have to ban the driver from putting any descriptors into the top level of the descriptor layout.

A compute shader's user data has a restriction on which spill table entries can be put into user data sgprs, and in what order. For that reason, the link stage may need to prepend code to load and/or swap around sgprs for descriptor set pointers.

Vertex inputs

If vertex input information is unavailable at VS compile time, then vertex inputs are passed into the vertex shader in vgprs, with metadata saying which inputs they are and what type. The link stage then constructs a "fetch shader", and glues it on to the front of the shader.

The fetch shader has an ABI where the vertex shader's input registers are also the fetch shader's inputs and outputs, except that the vertex input values are obviously not part of the fetch shader's inputs.

Color exports

If color export information is unavailable at FS compile time, then color exports are passed out of the fragment shader in vgprs, with metadata saying which exports they are and what type. The link stage then constructs an FS epilog, and glues it on to the back of the shader. The shader exits with exec set to pixels that are not killed/demoted.

Several other pipeline state items also affect color export code, so the absence of any of them likewise forces the use of an FS epilog.

Parameter exports and attribute inputs

In a shader compile, parameter exports are passed out of the last stage vertex-processing shader in vgprs, with metadata saying which parameters they are. In an unlinked fragment shader, attributes are packed and there is metadata saying how that is done. The link stage then ties them up, and adds an epilog to the last stage vertex-processing stage.

enableMultiView

enableMultiView has several impacts on the code the compiler needs to generate.

It looks like the best way of handling this if enableMultiView is unavailable at VS compile time is to compile the two alternatives for each thing inside an if..else..endif with a reloc as the condition.
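
In IR terms, one way to express the condition is the amdgcn.reloc.constant intrinsic, which materializes a 32-bit value via a relocation; the linker would then define the (invented) symbol as 0 or 1. The perSampleShading case below would look the same:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/Metadata.h"

llvm::Value *createMultiViewCondition(llvm::IRBuilder<> &builder) {
  llvm::LLVMContext &ctx = builder.getContext();
  auto *symbol = llvm::MDNode::get(ctx, llvm::MDString::get(ctx, "$enableMultiView"));
  // i32 whose value is supplied by a relocation resolved at link time.
  llvm::Value *value = builder.CreateIntrinsic(
      llvm::Intrinsic::amdgcn_reloc_constant, {},
      {llvm::MetadataAsValue::get(ctx, symbol)});
  // Branch on it: both alternatives are compiled, the reloc picks one.
  return builder.CreateICmpNE(value, builder.getInt32(0));
}
```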

perSampleShading

If the perSampleShading item is unavailable at FS compile time, and the FS uses gl_SampleMask or gl_PointCoord, then the compiler needs to generate code for both alternatives inside an if..else..endif where the condition is a reloc.

PAL metadata items

Certain pipeline state items do not affect compilation, except for being copied straight into PAL metadata registers.

In a shader compile with a link stage, it is the link stage that copies these items into PAL metadata.

Relocatable items

As pointed out by Steven's document pipeline state - Sheet1 (1).pdf, a number of pipeline state items are relocatable. That is, if such an item is unavailable in the pipeline state at shader compile time, a simple 32-bit constant load with a reloc will work, so it can be resolved at link time.

We should probably add the shadow descriptor table high 32 bits to that list too.

Specialization constants

Steven's document claims that SPIR-V specialization constants can be handled by relocs. That is only partly true: a reloc can stand in for a spec constant used as a plain 32-bit scalar value, but a spec constant can also affect things that a link-time constant cannot, such as array sizes, the workgroup size, and control flow that the compiler folds at compile time.

Bounded items that we need to make relocatable

These are pipeline state items that Steven's document lists as "bounded", that is, there is a limited range of values that each one can take. Gibraltar's proposal to handle this in their offline shader cache populating scheme is to compile a shader multiple times with these items set to the most popular values, in the hope of covering most of the cases where the shader is used in a pipeline.

The implication of this is that the shader cache needs to be able to keep multiple ELFs for the same shader, with different assumptions about these pipeline state items. When a pipeline compile looks for a cached shader, there needs to be some mechanism where it can find the one with a compatible state for these items.

However, for the purposes of app runtime shader compilation, we need to find some way of making these fixuppable by the link stage. In some cases, that might involve generating code that can handle all possibilities, and then having a branch with a reloc to select the required alternative.

NGG control items

The NGG control items are supplied to the compiler through pipeline state to save needing to load them at runtime from the primitive shader table. If they are unavailable at shader compile time, then the compiler is forced to load them from the primitive shader table.

Some further items are similar, except that certain settings of them also need to force NGG pass-through mode. Therefore, if those items are unavailable at shader compile time, we need to force NGG pass-through mode.

Items only needed for tessellation or geometry

These pipeline state items are only used for tessellation or geometry. Because this proposal insists that a vertex-processing half-pipeline with tessellation or geometry has to be compiled with full pipeline state, they do not need to be handled by a reloc.

The link stage

The link stage needs to glue together the compiled shader ELFs, generate any required prologs and epilogs, resolve relocs, and finalize the PAL metadata.

A prolog is generated to end with an s_branch with a reloc to branch to the VS.

Where an FS needs an epilog (color export information was unavailable at shader compile time), it is generated with an s_branch with a reloc instead of an s_endpgm, to branch to its epilog code.

In both cases, we can optimize by gluing sections in the right order, and applying the optimization that a chunk of code that ends with an s_branch can have the s_branch removed and turned into a fallthrough. There may need to be special handling for a prolog to ensure that the CS or VS remains instruction-cache-line-aligned, such as inserting s_nop padding before the fetch shader.
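
A byte-level sketch of that gluing, with the s_nop encoding and the cache-line size as assumptions for illustration:

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kSNop0 = 0xBF800000; // s_nop 0 (assumed SOPP encoding)
constexpr size_t kICacheAlign = 64;     // assumed instruction cache line size

// Appends the prolog (its final s_branch already dropped, since the shader is
// glued in straight after), then pads with s_nop so the shader proper that
// follows starts instruction-cache-line-aligned.
void appendPrologPadded(std::vector<uint32_t> &out, const std::vector<uint32_t> &prolog) {
  out.insert(out.end(), prolog.begin(), prolog.end());
  while ((out.size() * sizeof(uint32_t)) % kICacheAlign != 0)
    out.push_back(kSNop0);
}
```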

Prologs will be generated as IR, then compiled. They will be cached, so that will not happen very often.

s-perron commented 4 years ago

This looks good. Thanks.

kuhar commented 4 years ago

As a bystander, I really appreciate your summary, Tim. It's great you gave this area more structure and provided a high-level overview of the design space -- usually a few folks would just come in with some corner case of the design that they were aware of, and it was very difficult for me to connect the dots when that happened. Many things are much clearer to me now, although I still don't understand the details.

trenouf commented 4 years ago

I have opened #545 LGC shader compilation interface proposal, to detail how the front-end would call LGC (the middle-end) to do shader compilation and linking.

trenouf commented 4 years ago

Now that I have pushed #720 fetch shader for review, here are some ideas on how to go about implementing the color export shader:

  1. Analogous to "New vertex fetch pass" in #720, handle color exports in a similar way: use a new lgc.output.export.color call (instead of lgc.output.export.generic) for writing to a color export in InOutBuilder, and add a new pass into the existing FragColorExport.cpp that runs before PatchEntryPointMutate to lower the lgc.output.export.color calls to export intrinsics (a rough sketch of the new call follows this list).
  2. In that new pass, spot that it is an unlinked compile and no color export info was provided. In that case, write the info from the color export calls to metadata, mutate the shader to return a struct containing the export values, and hook up the return value elements to the inputs to the color export calls. The FS is then an "exportless" FS.
  3. In the linker, spot that it is an "exportless" FS (perhaps by the presence of the metadata), and create a color export shader (new subclass of GlueShader), analogous to a fetch shader. Actually it is quite a bit simpler than a fetch shader, because it does not need to ask PalMetadata to tell it how many sgprs and vgprs it has on entry, or where any entry register is.
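
As a rough sketch of item 1's new call (the call name is from the list above, but its exact signature is a guess; the real calls would presumably be type-mangled per overload, as other lgc.* calls are):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Emit an lgc.output.export.color call recording that 'value' is written to
// color attachment 'location'; a later pass lowers it to export intrinsics
// (pipeline compile) or turns it into metadata plus return values
// ("exportless" unlinked compile).
llvm::CallInst *createColorExport(llvm::IRBuilder<> &builder, llvm::Module &module,
                                  unsigned location, llvm::Value *value) {
  llvm::FunctionCallee callee = module.getOrInsertFunction(
      "lgc.output.export.color", builder.getVoidTy(), builder.getInt32Ty(),
      value->getType());
  return builder.CreateCall(callee, {builder.getInt32(location), value});
}
```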
s-perron commented 4 years ago

That is exactly what I was thinking. Thanks.