VReaperV opened this issue 6 days ago
u_ReliefDepthScale and u_ReliefOffsetBias - 16 bits each is probably enough, and they only seem to be used for DarkPlaces shaders; otherwise they can just be set as a uniform.
Because the feature isn't merged yet:
When I fixed the relief mapping code back in the day, we didn't have relief mapping assets, so I used Xonotic ones for testing the code; that's why the DarkPlaces compatibility mode was completed first.
Oh, I see. Well, I suppose an extra 4 bytes isn't too big of a deal.
After a quick preliminary test, it appears that I can reduce the total amount of shader stage data required by a lot.
In fact, it might fit in under 1kb a lot of the time. Putting light factor and color modulate into the baseInstance isn't viable though, as it precludes merging anyway, so disregard that part...
The latter isn't an issue though, as it will still be within the same 32-bit boundaries anyway.
Looks like eliminating duplicate stage data cuts it down to ~64 stages at most per map. So there's no need to pack every bit; aligning it to 32-bit boundaries should work fine, still ~1.2kb at most. Lightmaps and deluxemaps seem better put into a different part of the texture handle buffer, because they tend to have a much different data frequency than the texture bundles (something like 6.3kb vs 250kb).
This way it would also skip some instructions in the material shaders for loading data from the bitstream.
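Roughly something like this - a minimal sketch, with the binding point and names being placeholders rather than the actual layout - where the low-frequency lightmap/deluxemap handles are indexed per surface instead of being repeated in every stage's texture bundle:

```glsl
// Hypothetical layout: lightmap/deluxemap handles get their own storage block
// (which can be a separate range of the same GL buffer, bound via glBindBufferRange),
// so the ~6.3kb of lightmap data isn't duplicated inside the ~250kb of per-stage
// texture bundle data.
layout(std430, binding = 7) restrict readonly buffer lightMapDataSSBO {
	// Bindless handles: [2 * surface] = lightmap, [2 * surface + 1] = deluxemap.
	uvec2 lightDeluxeHandles[];
};
```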
(I wrote this for #1414, but it's too lengthy and out of scope for that issue)
Right now, the highest total number of shader stages I've seen on a map is 1551, which I believe can be decreased further since some of them will only differ in the textures that they use.
Currently, these are the non-global uniforms used by the material shaders:
I'll ignore the skybox, liquid and reflection ones for now since they're either not really working or unused.
Almost all of the textures and u_TextureMatrix would be moved to a different buffer: since textures are more or less addresses, they cannot be quantised, and because their values are unpredictable, lossless compression isn't all that useful either. The texture matrix is only used with those specific textures, as part of the bundle. Additionally, it can be changed from a mat4 to a mat3x2, thereby creating a struct with 20 components and 80 bytes in size. This is the only buffer that has to be accessed by a dynamically uniform expression; the rest of the data doesn't have to be.
This would result in the following structs for generic and lightMapping material shaders:
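Something along these lines - a rough sketch, assuming bindless handles stored as uvec2; the texture slot names and counts are placeholders rather than the final field list:

```glsl
// Sketch only. With a mat3x2 texture matrix (6 components) plus seven uvec2
// handles (14 components), the lightMapping variant matches the 20 components /
// 80 bytes figure mentioned above; the generic one is much smaller.
struct GenericTexData {
	mat3x2 textureMatrix;
	uvec2  colorMap;      // bindless texture handle
};

struct LightMappingTexData {
	mat3x2 textureMatrix;
	uvec2  diffuseMap;
	uvec2  normalMap;
	uvec2  heightMap;
	uvec2  materialMap;
	uvec2  glowMap;
	uvec2  lightMap;
	uvec2  deluxeMap;
};
```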
These uniforms can then be quantised or simply use a lower number of bits:
Note that a lot of these end up being the same for many surfaces and shaders, since the stuff used by RB_EvalExpression() doesn't seem to be used a lot.

Let's say that up to 8192 total stages would be supported - 13 bits to address. There's a maximum of 256 lightmaps supported in the engine, so 8 bits. Combined with textures it can be quite a bit more, however; 4096 should be enough, so 12 more bits. Then another 4 bits for colour modulate. That leaves 3 bits to use for light factor. I'm not sure if more can really fit there; the closest would be alpha threshold, which would only leave 22 bits for addressing, and that wouldn't be enough (the alternative of using an extra buffer seems worse).
This also means that light factor can then be represented in the value itself - 3 bits should be enough for any sane values.
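For illustration, decoding such a packed 32-bit ID could look like this - the bit positions are just an example of the layout described above, not a final assignment:

```glsl
// 13 bits stage index, 12 bits lightmap/texture index, 4 bits colour modulate,
// 3 bits light factor - 32 bits total.
void UnpackStageID( in uint packedID, out uint stageIndex, out uint texIndex,
                    out uint colorModulate, out uint lightFactor ) {
	stageIndex    = bitfieldExtract( packedID, 0, 13 );
	texIndex      = bitfieldExtract( packedID, 13, 12 );
	colorModulate = bitfieldExtract( packedID, 25, 4 );
	lightFactor   = bitfieldExtract( packedID, 29, 3 );
}
```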
All of these values would, of course, be packed into the draw command's baseInstance value - which right now we're using to index into the material buffer.

The material structures for generic and lightMapping shaders can then be reduced as follows:
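Roughly like this - the exact field breakdown is TBD, only the overall sizes follow the numbers further down:

```glsl
// Hypothetical packed stage data - actual fields and bit widths still to be decided.
struct GenericMaterial {
	uvec2 packedData;                  // 64 bits of quantised stage values (8 bytes)
};

struct LightMappingMaterial {
	uint packedData[4];                // quantised stage values (122 bits used)
	uint reliefDepthScale_OffsetBias;  // two 16-bit values, DarkPlaces compat only
};
```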
Then, the quantisation:
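For example, using the hypothetical packed structs from the sketch above - the format choices here are assumptions (8-bit unorm for colour-like values, 16-bit halves for the DP relief parameters, per the earlier comment), not the actual bit assignment:

```glsl
// Illustrative decode of a few quantised values; the real layout may differ.
void DecodeStageValues( in LightMappingMaterial material, out vec4 color,
                        out float alphaThreshold,
                        out float reliefDepthScale, out float reliefOffsetBias ) {
	color          = unpackUnorm4x8( material.packedData[0] );             // 4 x 8-bit unorm
	alphaThreshold = unpackUnorm4x8( material.packedData[1] ).x;           // 8-bit unorm
	vec2 relief    = unpackHalf2x16( material.reliefDepthScale_OffsetBias ); // 2 x 16-bit half
	reliefDepthScale = relief.x;
	reliefOffsetBias = relief.y;
}
```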
If the DarkPlaces u_ReliefDepthScale and u_ReliefOffsetBias are not used, then the lightMapping one can be 122 bits instead. Then, there are 2 possibilities:

The latter would result in 8 byte genericMaterial structures and 20 byte lightMapping ones - 8/16 and 3/6 (4/8 without the DP depthRelief stuff) in a typical cacheline (the latter overfetching as well with depthRelief). The bitstream would have 8/17 and 3/6 (4/8 without depthRelief) in a cacheline.
Other than lightMapping + depthRelief, the first solution would align well, but it would essentially be a bitstream with extra steps. Either way, with 20 bytes per stage we could have 819 stages in a 16kb UBO or 3276 in a 64kb one. Most GPUs support the latter, but the former might be a limiting factor on Intel. It's possible that comparing the remaining stage values would reduce their total amount enough, or, if it goes beyond the limit, binding different ranges of the buffer for different drawcalls might work well.