bevyengine / bevy

A refreshingly simple data-driven game engine built in Rust
https://bevyengine.org
Apache License 2.0

Move MeshUniform to an instance vertex buffer for tight packing and much better performance #4288

Closed · superdump closed this issue 1 year ago

superdump commented 2 years ago

While hacking on optimisations for the many_cubes -- sphere example, I looked into optimising the object-to-world matrices of the mesh entities. With frustum culling, the example results in extraction and preparation of ~11k MeshUniforms per frame.

MeshUniforms are currently stored in a DynamicUniformVec, that is, a uniform buffer bound with dynamic offsets. We currently use crevice, and depending on the platform, the dynamic offsets impose an alignment requirement of 256 bytes. A MeshUniform is only 132 bytes, so a lot of space, bandwidth, and time is wasted padding it up to 256 bytes.

If we instead use an instance-rate vertex buffer, we only need 132 bytes per mesh.
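For illustration, here's a minimal sketch in plain wgpu of what such an instance-rate vertex buffer layout could look like; the struct name, shader locations, and attribute split are just placeholders, not the exact code on the branch:

```rust
use std::mem;

// Per-instance data, tightly packed: model mat4 (64 B) + inverse transpose
// model mat4 (64 B) + flags u32 (4 B) = 132 B, instead of 256 B of padded
// uniform per mesh.
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct MeshInstance {
    model: [[f32; 4]; 4],
    inverse_transpose_model: [[f32; 4]; 4],
    flags: u32,
}

// Shader locations 3..=11 are placeholders; real code must pick locations
// that do not clash with the mesh's own vertex attributes.
const INSTANCE_ATTRIBUTES: [wgpu::VertexAttribute; 9] = wgpu::vertex_attr_array![
    3 => Float32x4, 4 => Float32x4, 5 => Float32x4, 6 => Float32x4,  // model columns
    7 => Float32x4, 8 => Float32x4, 9 => Float32x4, 10 => Float32x4, // inverse transpose model columns
    11 => Uint32,                                                    // flags
];

fn mesh_instance_buffer_layout() -> wgpu::VertexBufferLayout<'static> {
    wgpu::VertexBufferLayout {
        array_stride: mem::size_of::<MeshInstance>() as wgpu::BufferAddress, // 132 bytes
        step_mode: wgpu::VertexStepMode::Instance, // advance once per instance, not per vertex
        attributes: &INSTANCE_ATTRIBUTES,
    }
}
```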

We can even go further with another optimisation: object-to-world matrices are affine, so the bottom row is always 0,0,0,1 and the matrix can be represented as a 4x3. The inverse model matrix, the transpose of which is used to transform the vertex normals, can be reduced even more: its translation part can be reconstructed from the forward transform, so we only need to store the inverse 3x3. This brings the total per-mesh data down to 88 bytes - a saving of about a third!
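Roughly, the compaction looks like this (sketched with glam directly, since Bevy's transforms are glam-backed; the struct and function names are just for illustration):

```rust
use glam::{Mat3, Mat4};

// 3 rows of the affine model matrix  = 48 bytes (the [0, 0, 0, 1] row is implicit)
// 3 rows of the inverse 3x3          = 36 bytes
// flags                              =  4 bytes
// total                              = 88 bytes per mesh
#[repr(C)]
struct CompactMeshUniform {
    model_rows: [[f32; 4]; 3],
    inverse_model_3x3_rows: [[f32; 3]; 3],
    flags: u32,
}

fn compact(model: Mat4, flags: u32) -> CompactMeshUniform {
    // Only the 3x3 part of the inverse is stored; the rest of the inverse can
    // be reconstructed from the forward transform in the shader if needed.
    let inverse_3x3 = Mat3::from_mat4(model).inverse();
    CompactMeshUniform {
        model_rows: [
            model.row(0).to_array(),
            model.row(1).to_array(),
            model.row(2).to_array(),
        ],
        inverse_model_3x3_rows: [
            inverse_3x3.row(0).to_array(),
            inverse_3x3.row(1).to_array(),
            inverse_3x3.row(2).to_array(),
        ],
        flags,
    }
}

// size_of::<CompactMeshUniform>() == 88
```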

These can then be reconstructed into 4x4 matrices in the shader as necessary by unpacking the stored rows and using numeric literals for the 0,0,0,1 row. transpose() is apparently free on all modern GPUs, and interestingly the shader compiler in the graphics driver will remove the unnecessary multiplications by 0 or 1, and then remove the registers that would have been used for them.
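The actual reconstruction would live in the vertex shader, but the arithmetic is the same as this glam sketch (the function name is just for illustration):

```rust
use glam::{Mat4, Vec4};

// Rebuild the full 4x4 model matrix from the three stored rows by appending
// the constant [0, 0, 0, 1] row. A driver's shader compiler can fold away the
// multiplies by the literal 0s and 1s and the registers they would have used.
fn model_from_rows(rows: [[f32; 4]; 3]) -> Mat4 {
    // glam is column-major, so build from the rows as columns and transpose.
    Mat4::from_cols(
        Vec4::from_array(rows[0]),
        Vec4::from_array(rows[1]),
        Vec4::from_array(rows[2]),
        Vec4::new(0.0, 0.0, 0.0, 1.0),
    )
    .transpose()
}
```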

So, while the maths in the shader remains identical because it still operates on 4x4 matrices, the combination of these changes reduces the number of registers used, the number of operations executed, the amount of VRAM used, the memory bandwidth spent transferring the data to VRAM, and the time spent each frame doing all of these things. It is an all-around win.

I have implemented it here https://github.com/superdump/bevy/tree/mesh-matrix-instance-buffer and will clean it up to make a PR soon.

james7132 commented 2 years ago

This frees up a bind group, lays some of the groundwork for automatic instancing, and enables pipelines to define secondary vertex buffers (something we need for morph target rendering). This is great!

superdump commented 2 years ago

Well, it’s pretty hard-coded on the branch. I don’t think I did anything to handle specialisation.

james7132 commented 2 years ago

One other alternative here is to use push constants: https://docs.rs/wgpu-types/0.12.0/wgpu_types/struct.Features.html#associatedconstant.PUSH_CONSTANTS. That would free up a uniform slot and bind group too. The typical size limit is 128 bytes, which is barely big enough for us to fit 2 Affine3x4s and a bitflag field, which is what currently resides in MeshUniform.

Unfortunately push constants are not supported on the web right now. There's an ongoing investigation into whether this is something that can be supported: https://github.com/gpuweb/gpuweb/issues/75
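For reference, a rough sketch of the push constant route in plain wgpu (the function names, bind group layouts, and the 100-byte range are assumptions, not existing Bevy code):

```rust
// Requires wgpu::Features::PUSH_CONSTANTS when requesting the device.
fn mesh_pipeline_layout(
    device: &wgpu::Device,
    view_layout: &wgpu::BindGroupLayout,
    material_layout: &wgpu::BindGroupLayout,
) -> wgpu::PipelineLayout {
    device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: Some("mesh_pipeline_layout"),
        bind_group_layouts: &[view_layout, material_layout],
        push_constant_ranges: &[wgpu::PushConstantRange {
            stages: wgpu::ShaderStages::VERTEX,
            // 2 x 48-byte Affine3x4 + 4-byte flags = 100 bytes, which fits the
            // typical 128-byte Limits::max_push_constant_size.
            range: 0..100,
        }],
    })
}

// When encoding draws, the per-mesh bytes are written straight into the pass
// before each draw call:
fn set_mesh_push_constants(pass: &mut wgpu::RenderPass<'_>, mesh_bytes: &[u8]) {
    pass.set_push_constants(wgpu::ShaderStages::VERTEX, 0, mesh_bytes);
}
```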

JMS55 commented 2 years ago

Related: https://github.com/bevyengine/bevy/issues/3607

superdump commented 1 year ago

Coming back to this after having learned a lot more about batching/instancing/otherwise merging draw commands and the road to bindless and GPU-driven rendering, I think using an instance-rate vertex buffer is actually too inflexible and restrictive compared to arrays in uniform/storage buffers. At some appropriate point I'll apply the same data compaction but using arrays in uniform/storage buffers instead.
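Something along these lines (sketched with plain wgpu rather than any Bevy abstraction; the struct and names are made up): all per-mesh data lives in one array in a storage buffer that the vertex shader indexes, e.g. by instance index, which keeps batching and GPU-driven paths open.

```rust
use wgpu::util::DeviceExt;

// Per-mesh data as array elements rather than per-instance vertex attributes.
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct MeshData {
    model: [[f32; 4]; 3],             // compacted affine transform (48 B)
    inverse_model_3x3: [[f32; 4]; 3], // rows padded to vec4 for a storage-buffer-friendly layout
    flags: u32,
    _pad: [u32; 3],
}

fn upload_mesh_data(device: &wgpu::Device, meshes: &[MeshData]) -> wgpu::Buffer {
    device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
        label: Some("mesh_data_array"),
        contents: bytemuck::cast_slice(meshes),
        usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_DST,
    })
}
```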

superdump commented 1 year ago

This has been implemented in 0.12 in GpuArrayBuffer.