DaemonEngine / Daemon

The Dæmon game engine. With some bits of ioq3 and XreaL.
https://unvanquished.net
BSD 3-Clause "New" or "Revised" License

Better material data packing for material system #1448

Open · VReaperV opened 6 days ago

VReaperV commented 6 days ago

(I wrote this for #1414, but it's too lengthy and out of scope for that issue)

Right now, the highest total number of shader stages I've seen on a map is 1551, and I believe that can be decreased further, since some of those stages will only differ in the textures they use.

Currently, these are the non-global uniforms used by the material shaders:

genericMaterial:
// 32 components (128 bytes)
struct Material {
  mat4 u_TextureMatrix;
  vec4 u_ColorModulate;
  vec4 u_Color;
  uvec2 u_ColorMap;
  uvec2 u_DepthMap;
  float u_AlphaThreshold;
  float u_DepthScale;
  int material_padding0;
  int material_padding1;
};

lightMappingMaterial:
// 48 components (192 bytes)
struct Material {
  vec4 u_ColorModulate;
  vec4 u_Color;
  mat4 u_TextureMatrix;
  vec3 u_NormalScale;
  int u_NormalScale_padding;
  uvec2 u_DiffuseMap;
  uvec2 u_NormalMap;
  uvec2 u_HeightMap;
  uvec2 u_MaterialMap;
  uvec2 u_LightMap;
  uvec2 u_DeluxeMap;
  uvec2 u_GlowMap;
  vec2 u_SpecularExponent;
  float u_AlphaThreshold;
  float u_LightFactor;
  float u_ReliefDepthScale;
  float u_ReliefOffsetBias;
};

skyboxMaterial:
// 24 components (96 bytes)
struct Material {
  mat4 u_TextureMatrix;
  uvec2 u_ColorMapCube;
  uvec2 u_CloudMap;
  float u_CloudHeight;
  float u_AlphaThreshold;
  int material_padding0;
  int material_padding1;
};

fogQuake3Material:
// 8 components (32 bytes)
struct Material {
  vec4 u_Color;
  uvec2 u_FogMap;
  int material_padding0;
  int material_padding1;
};

heatHazeMaterial:
// 36 components (144 bytes)
struct Material {
  vec4 u_ColorModulate;
  vec4 u_Color;
  mat4 u_TextureMatrix;
  vec3 u_NormalScale;
  int u_NormalScale_padding;
  uvec2 u_CurrentMap;
  uvec2 u_NormalMap;
  uvec2 u_HeightMap;
  float u_DeformMagnitude;
  int material_padding0;
};

liquidMaterial:
// 38 components (152 bytes)
struct Material {
  mat4 u_TextureMatrix;
  vec3 u_NormalScale;
  int u_NormalScale_padding;
  vec3 u_FogColor;
  int u_FogColor_padding;
  uvec2 u_NormalMap;
  uvec2 u_HeightMap;
  vec2 u_SpecularExponent;
  float u_RefractionIndex;
  float u_FresnelPower;
  float u_FresnelScale;
  float u_FresnelBias;
  float u_ReliefDepthScale;
  float u_ReliefOffsetBias;
  float u_FogDensity;
  int material_padding0;
};

reflectionMaterial:
// 32 components (128 bytes)
struct Material {
  mat4 u_TextureMatrix;
  vec3 u_NormalScale;
  int u_NormalScale_padding;
  uvec2 u_ColorMapCube;
  uvec2 u_NormalMap;
  uvec2 u_HeightMap;
  float u_ReliefDepthScale;
  float u_ReliefOffsetBias;
};

I'll ignore the skybox, liquid and reflection ones for now since they're either not really working or unused.

Almost all of the textures, along with u_TextureMatrix, would be moved to a different buffer: since texture handles are more or less addresses, they cannot be quantised, and because their values are unpredictable, lossless compression isn't all that useful either. The texture matrix, meanwhile, is only ever used with those specific textures, as part of the bundle. Additionally, it can be changed from a mat4 to a mat3x2, yielding a struct of 20 components (80 bytes):

struct TexData {
  mat3x2 u_TextureMatrix;
  uvec2 u_ColorMap;
  uvec2 u_NormalMap;
  uvec2 u_HeightMap;
  uvec2 u_SpecularMap;
  uvec2 u_MaterialMap;
  uvec2 u_LightMap;
  uvec2 u_DeluxeMap;
};
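As a quick illustration of why mat3x2 loses nothing here: a texture matrix is just a 2D affine transform, so two rows of rotation/scale/shear plus a translation column are enough. Something like this (the function name is made up):

vec2 TransformTexCoords( in mat3x2 textureMatrix, in vec2 texCoords ) {
  // columns 0-1: rotation/scale/shear; column 2: translation
  return textureMatrix * vec3( texCoords, 1.0 );
}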

This is the only buffer that has to be accessed by a dynamically uniform expression - the rest of the data doesn't have to be.
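For illustration, with ARB_bindless_texture the uvec2 handles turn back into samplers after a single dynamically uniform fetch. A rough sketch; the SSBO binding and the in_drawID input are assumptions:

#extension GL_ARB_bindless_texture : require

flat in uint in_drawID; // dynamically uniform per draw (assumption)
in vec2 texCoords;
out vec4 outputColor;

layout(std430, binding = 6) restrict readonly buffer texDataSSBO {
  TexData texData[];
};

void main() {
  // One dynamically uniform fetch selects the whole texture bundle
  TexData bundle = texData[in_drawID];
  // A bindless uvec2 handle converts straight back to a sampler
  outputColor = texture( sampler2D( bundle.u_ColorMap ), texCoords );
}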

This would result in the following structs for generic and lightMapping material shaders:

genericMaterial:
// 32->12 (128->48)
struct Material {
  vec4 u_ColorModulate;
  vec4 u_Color;
  float u_AlphaThreshold;
  float u_DepthScale;
  int material_padding0;
  int material_padding1;
};

lightMappingMaterial:
// 48->20 (192->80)
struct Material {
  vec4 u_ColorModulate;
  vec4 u_Color;
  vec3 u_NormalScale;
  int u_NormalScale_padding;
  vec2 u_SpecularExponent;
  float u_AlphaThreshold;
  float u_LightFactor;
  float u_ReliefDepthScale;
  float u_ReliefOffsetBias;
  int material_padding0;
  int material_padding1;
};

These uniforms can then be quantised or simply use a lower number of bits:

Note that a lot of these values end up being the same for many surfaces and shaders, since the values driven by RB_EvalExpression() don't seem to be used much.

Let's say that up to 8192 total stages would be supported: 13 bits to address them. There's a maximum of 256 lightmaps supported in the engine, so 8 bits; combined with the texture bundles it can be quite a bit more, however, and 4096 entries should be enough, so 12 more bits. Then another 4 bits for colour modulate, which leaves 3 bits to use for light factor. I'm not sure if more can really fit there; the closest candidate would be alpha threshold, but that would only leave 22 bits for addressing, which wouldn't be enough (the alternative, an extra buffer, seems worse).

This also means that light factor can then be represented in the value itself - 3 bits should be enough for any sane values.

All of these values would, of course, be the draw command's baseInstance value - which right now we're using to index into the material buffer.
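On the shader side, unpacking that layout would be a few bit operations. A sketch, assuming the 13/12/4/3 split above (field order and names are illustrative):

const uint stageIndex   = baseInstance & 0x1FFFu;           // 13 bits: material index
const uint texDataIndex = ( baseInstance >> 13u ) & 0xFFFu; // 12 bits: TexData index
const uint colorMod     = ( baseInstance >> 25u ) & 0xFu;   // 4 bits: colour modulate
const uint lightFactor  = baseInstance >> 29u;              // 3 bits: light factor value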

The material structures for generic and lightMapping shaders can then be reduced as follows:

genericMaterial:
// 32->8 (128->32)
struct Material {
  vec4 u_Color;
  float u_AlphaThreshold;
  float u_DepthScale;
  int material_padding0;
  int material_padding1;
};

lightMappingMaterial:
// 48->16 (192->64)
struct Material {
  vec4 u_Color;
  vec3 u_NormalScale;
  int u_NormalScale_padding;
  vec2 u_SpecularExponent;
  float u_AlphaThreshold;
  float u_ReliefDepthScale;
  float u_ReliefOffsetBias;
  int material_padding0;
  int material_padding1;
  int material_padding2;
};

Then, the quantisation:

genericMaterial:

// 32 (128 bytes->58 bits)
struct Material {
  u_Color; // 32 bits
  u_AlphaThreshold; // 10 bits
  u_DepthScale; // 16 bits
};

lightMappingMaterial:

// 48 (192 bytes->154 bits)
struct Material {
  u_Color; // 32 bits
  u_NormalScale; // 48 bits
  u_SpecularExponent; // 32 bits
  u_AlphaThreshold; // 10 bits
  u_ReliefDepthScale; // 16 bits
  u_ReliefOffsetBias; // 16 bits
};
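Decoding these on the GPU is cheap with the built-in unpacking functions. A sketch for the genericMaterial fields, assuming u_Color is 8-bit unorm RGBA, u_AlphaThreshold a 10-bit unorm, and u_DepthScale an fp16; the packedWords array is made up:

vec4  color          = unpackUnorm4x8( packedWords[0] );          // 32 bits
float alphaThreshold = float( packedWords[1] & 0x3FFu ) / 1023.0; // 10-bit unorm
float depthScale     = unpackHalf2x16( packedWords[1] >> 10u ).x; // fp16 in bits 10-25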

If the DarkPlaces u_ReliefDepthScale and u_ReliefOffsetBias are not used, then the lightMapping one can be 122 bits instead. From there, there are two possibilities:

  1. Use a bitstream. This will allow packing all the data with no "gaps" left (bit offsets for each different shader's data can even be set as uniforms, so there would be no padding between them). It would require custom packing and unpacking logic, but that shouldn't be difficult to implement. It should, however, fetch chunks of data efficiently, which would likely mean a uvec4 or uint array (the former for UBOs, for alignment); see the sketch after this list.
  2. Manually write the struct definitions and parsing for each shader. This would leave padding between data and take up a bit more space. It would also likely be more difficult to maintain, since any change to the uniforms might require manually changing the packing and unpacking of data. A bitstream would be closer to what we have now in that regard: just set whether the uniform is global or not, and the engine does the rest. For a bitstream it might also require providing a quantisation method per uniform, with some defaults such as packing a float into fp16.
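A rough sketch of the bitstream read path from option 1, assuming the data is stored as a uvec4 array in a UBO and that no field straddles a 32-bit word (the buffer name and size are made up):

layout(std140, binding = 7) uniform materialBitstreamUBO {
  uvec4 bitstream[512]; // 8 KB of packed stage data
};

uint ReadBits( in uint bitOffset, in uint width ) { // width < 32
  const uint word  = bitOffset >> 5u;                  // 32-bit word index
  const uint value = bitstream[word >> 2u][word & 3u]; // pick the uvec4 lane
  return ( value >> ( bitOffset & 31u ) ) & ( ( 1u << width ) - 1u );
}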

The latter would result in 8-byte genericMaterial structures and 20-byte lightMapping ones: 8/16 and 3/6 structs per 64/128-byte cacheline (4/8 without the DP depthRelief stuff; with depthRelief the lightMapping struct overfetches as well, since 64 / 20 = 3.2 means the fourth struct straddles the line). The bitstream would have 8/17 and 3/6 (4/8 without depthRelief).

Other than lightMapping + depthRelief, the struct-based solution would align well, but it would essentially be a bitstream with extra steps. Either way, at 20 bytes per stage we could fit 819 stages in a 16 KB UBO or 3276 in a 64 KB one. Most GPUs support the latter, but the former might be a limiting factor on Intel. It's possible that comparing the remaining stage values would reduce their total amount enough, or perhaps binding different ranges of the buffer for different drawcalls would work well if it goes beyond the limit.

illwieckz commented 6 days ago

u_ReliefDepthScale and u_ReliefOffsetBias - 16 bits each is probably enough, and it seems to only be used for DarkPlaces shaders, otherwise it can just be set as a uniform.

That's because the feature isn't merged yet:

When I fixed the relief mapping code back in the day, we didn't have relief mapping assets, so I used Xonotic ones to test the code; that's why the DarkPlaces compatibility mode was completed first.

VReaperV commented 6 days ago

Oh, I see. Well, I suppose an extra 4 bytes isn't too big of a deal.

VReaperV commented 1 day ago

After a quick preliminary test, it appears that I can reduce the total amount of shader stage data required by a lot.

VReaperV commented 1 day ago

In fact, it might fit in under 1 KB a lot of the time. Putting light factor and colour modulate into the baseInstance isn't viable though, as it precludes merging anyway, so disregard that part...

VReaperV commented 1 day ago

The latter isn't an issue though, as it will still be within the same 32-bit boundaries anyway.

VReaperV commented 1 day ago

Looks like eliminating duplicate stage data cuts it down to ~64 stages at most per map. So there's no need to pack every bit; aligning to 32-bit boundaries should work fine, still ~1.2 KB at most. Lightmaps and deluxemaps seem better placed in a different part of the texture handle buffer, because they tend to have a much different data frequency than the texture bundles (something like 6.3 KB vs 250 KB).

This way it would also save some of the instructions material shaders need to load data from the bitstream.