buildaworldnet / IrrlichtBAW

Build A World fork of Irrlicht
http://www.buildaworld.net
Apache License 2.0

Add an Animated Mesh Instance Cacher #97

Open · devshgraphicsprogramming opened 6 years ago

devshgraphicsprogramming commented 6 years ago

Skinning is an expensive vertex shader operation.

When a skinned meshbuffer stays in the same pose across multiple draw calls because:

  1. The subsequent draw call renders it from another viewport (shadowmaps, z-prepass, etc.)
  2. The bones have not moved recently (very high FPS, animation update frequency lower than rendering frequency)
  3. The mesh is static most of the time (animated doors, switches, etc.)

It would be highly beneficial to first cache the results into a large amorphous buffer containing the vertices in world-space. This cache could be shared by ALL cacheable skinned mesh buffers, and it should be possible to enable/disable it globally and per scenenode (per instance, or even per meshbuffer instance).
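
For illustration, a minimal sketch of what such a shared cache could look like on the GLSL side; every name here (`SkinCache`, `InstanceOffsets`, `CachedVertex`) and the exact packing are assumptions rather than a committed layout:

```glsl
// one large amorphous SSBO shared by ALL cacheable skinned meshbuffers
struct CachedVertex
{
    vec3 wsPos;    // world-space position after skinning
    uint wsNormal; // world-space normal, e.g. RGB10A2- or octahedron-packed
};
layout(std430, binding = 0) buffer SkinCache
{
    CachedVertex cachedVerts[];
};
// where each cacheable (meshbuffer) instance's range begins in the cache;
// a sentinel like ~0u could encode "caching disabled for this instance",
// giving the per-scenenode/per-instance toggle
layout(std430, binding = 1) readonly buffer InstanceOffsets
{
    uint cacheBaseOffset[];
};
```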

NOTE: Max triangles/s appears to be 4.5 billion on high-end NVidia cards (9-11 billion theoretical) which have at least 4GB or 8GB of VRAM running at 200-300GB/s (500-1000GB/s for HBM2); obviously this triangle rate decreases roughly linearly with vertex size. The theoretical max bandwidth gives us 12.5-18.75 billion triangles for a 16-byte vertex (float4). Divide all values by 30 (minimum interactive FPS) and we obtain 150 Mil, 300 Mil, 450 Mil and 625 Mil triangles per frame respectively for the tested throughput, the theoretical datasheet throughput and the bandwidth-limited throughput.

In order to benefit from the cache, we'd have to draw the cached triangles at least 3 times: twice to even bother using the cache, plus one pass to fill it (although that pass is limited to 120Hz or less, at 30FPS it always happens). So the budgets become 50, 100, 150 and ~210 Million respectively; let's consider the cases separately. 50-100 Million triangles require between 50-300 Million vertices; the top tri/s performers have the smallest vertex sizes, so we can assume they used between 4-16 bytes, giving a max cache size of 0.2-4.8 GB, which is definitely practical. If our vertex size is N>16 bytes then we can resort to the bandwidth limit in the calculation, which caps vertices per frame at minimum 30Hz to [150*16/N, 625*16/N] Million, which in turn gives a cache consumption of between 2.4GB and 10GB.

However my measured max 30Hz triangle rate appears to be around 80 Mil (not the theoretical 110), and the 1080 boasts only a theoretical 6.9 billion tri/s rate. Let me also note that a very optimized and light skinning shader eats into compute shader time, requiring 100% more GPU time than a simple pass-through shader. So really the post-skinning cache memory pressure we are looking at is 0.15-1.8 GB on cards with 6 or 8GB of VRAM overall.
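
Restated as formulas (same numbers as above, just making the N>16-byte case explicit):

```latex
% bandwidth-limited vertices per frame at 30 Hz for an N-byte vertex, N > 16
V(N) = \frac{[150,\,625] \times 10^6 \cdot 16}{N}
% the cache holds one frame's worth of skinned vertices of N bytes each,
% so its size is independent of N:
V(N) \cdot N = [150,\,625] \times 10^6 \cdot 16\,\mathrm{B} \approx [2.4,\,10]\,\mathrm{GB}
```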

This approach is necessary for the really advanced, well-performing-in-4K prefiltered/cached deferred rendering+shading+texturing from Wolfgang Engel/ConfettiFX/Intel.

However, it's probably best to implement #96 first.

devshgraphicsprogramming commented 4 years ago

Need to know:

Need to skin if both are true:

devshgraphicsprogramming commented 3 years ago

A better idea is to use an SSBO as a "selective" Transform Feedback, but it requires a GPU-side GLSL allocator to grab free ranges from a cache.

Discord dump for posterity

the idea would be that an allocation only happens if an instance transitions LoD from low poly to high poly or becomes visible after being invalidated after becoming invisible (it's like transitioning from a 0-poly LoD I guess)
I guess I could have them as an LRU cache
<instance_id,lod_level,allocation>
evict the stuff that hasn't been used/visible recently
and when I get a failure to allocate a LoD, I could try allocating a lower poly LoD
(degrade LoD not due to AABB size or distance from camera, but due to memory pressure too)
nanokatze — Today at 14:26
well I don't fully understand the problem you have but it doesn't seem like you need any weirdo complex allocators, even something as simple as a free list would probably work more or less, provided you have atomics and don't mind living with multiple arenas (given that you're targeting gl with relatively low ssbo limits)
devsh — Today at 14:27
the problem is that I do all culling, skinning, etc. in compute
nanokatze — Today at 14:27
well I figured that, yes, and that you need to allocate on gpu, that's fine
it's just for example attaching extra piece of complexity in form of cache of any kind on top doesn't seem to be required
you could perhaps come up with an allocator that lets you mark your allocations "weak"
so that it's as if they were freed
so future allocation requests can be served from there
but you can also upgrade allocations to strong
need a big counter to avoid ABA here
so with this you don't need any cache LRU or not of any kind
devsh — Today at 14:52
anyway
I did some napkin math
I can't do this as long as I'm limited to 128MB SSBOs
because even with 18k vertex LoDs
I can't have 100k units visible
tbf I don't even need to skin in compute
I could skin in vertex shader
and "cache" the results (write out skinned modelspace vertex positions and normals to SSBO)
then evict the allocations
then the next time the vertex shader runs for that particular instance, it can check whether its skinned vertices are in the cache (using the objectID+LoD_ID as the key)
probably a better system tbf
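
To make the allocator idea concrete, here is a minimal GLSL sketch of the weak/strong free list nanokatze describes, under loud assumptions: allocations, demotions and frees each happen in their own compute dispatch (so the plain counter-based stack never races with pushes), and all names (`freeCount`, `freeList`, `slotState`) and bit layouts are made up for illustration. The generation counter is the "big counter to avoid ABA" guarding stale <instance_id, lod_level> handles.

```glsl
#version 460 core
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer FreeList
{
    int  freeCount;  // number of live entries on the stack below
    uint freeList[]; // LIFO stack of free cache slot indices
};
layout(std430, binding = 1) buffer SlotStates
{
    // per-slot word: bits [0,2) = 0 free / 1 weak ("as if freed") / 2 strong,
    // bits [2,32) = generation counter, bumped on reuse to dodge ABA
    uint slotState[];
};

const uint STATE_MASK   = 0x3u;
const uint GEN_SHIFT    = 2u;
const uint STATE_WEAK   = 1u;
const uint STATE_STRONG = 2u;
const uint INVALID_SLOT = 0xffffffffu;

// pop a fresh slot; runs in the "allocate" dispatch only
uint allocStrong()
{
    const int idx = atomicAdd(freeCount, -1) - 1;
    if (idx < 0)
    {
        atomicAdd(freeCount, 1); // underflowed an empty stack, undo
        return INVALID_SLOT;
    }
    const uint slot = freeList[idx];
    // we own `slot` exclusively now, so a plain read-modify-write is fine:
    // bump the generation so stale <instance,LoD> handles stop matching
    const uint gen = (slotState[slot] >> GEN_SHIFT) + 1u;
    slotState[slot] = (gen << GEN_SHIFT) | STATE_STRONG;
    return slot;
}

// instance went invisible/changed LoD: don't free, just mark evictable
// (its own dispatch, hence no atomic needed)
void demoteToWeak(uint slot)
{
    slotState[slot] = (slotState[slot] & ~STATE_MASK) | STATE_WEAK;
}

// instance became visible again: resurrect the old allocation, but only if
// the generation still matches, i.e. nobody reused the slot in the meantime
bool upgradeToStrong(uint slot, uint expectedGen)
{
    const uint expected = (expectedGen << GEN_SHIFT) | STATE_WEAK;
    const uint desired  = (expectedGen << GEN_SHIFT) | STATE_STRONG;
    return atomicCompSwap(slotState[slot], expected, desired) == expected;
}

void main()
{
    // usage sketch: one invocation per instance that needs a cache slot;
    // on failure the caller would retry with a lower-poly LoD (degrade due
    // to memory pressure, as discussed above)
    const uint slot = allocStrong();
    if (slot == INVALID_SLOT)
    {
        // out of slots: a separate scavenging dispatch would sweep weak
        // slots (bumping their generation) back onto freeList, then retry
    }
}
```

The missing piece is that scavenging sweep; running it only when allocations start failing is roughly the LRU-style eviction from the chat above, without maintaining an actual LRU list.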
devshgraphicsprogramming commented 2 years ago
```glsl
vec3 mPos, mNormal;

// assumed layout of lock[slot] (a coherent uint SSBO): MSB = write-in-progress
// bit, middle bits = tag of the key that currently owns the slot, low bits =
// reader count (wide enough that concurrent readers can't carry into the tag)
const uint WRITE_BIT = 0x80000000u;

// slot and tag both derive from the same (skin,drawcall,vertex) key, see below
const uint slot = cacheHash(skinGUID, drawcallGUID, gl_VertexID);
const uint expected_tag = cacheTag(skinGUID, drawcallGUID, gl_VertexID);

const uint prevLockState = atomicAdd(lock[slot], 1u); // register as a reader
uint unlockVal = uint(-1); // default: just drop our reader count at the end
const bool beingWrittenTo = (prevLockState & WRITE_BIT) != 0u;
memoryBarrierBuffer();
if (!beingWrittenTo && extract_tag(prevLockState) == expected_tag)
{
    // cache hit: an earlier invocation already skinned this vertex
    decode(cache[slot], mPos, mNormal); // decode via out-params (assumed)
}
else
{
    // cache miss: skin as usual, up to 4 influences, weights sorted so a
    // zero weight ends the loop early
    mat4x3 skinTform = boneMatrices[vBoneID[0]] * vBoneWeight[0];
    for (int i = 1; i < 4 && vBoneWeight[i] != 0.f; i++)
        skinTform += boneMatrices[vBoneID[i]] * vBoneWeight[i];
    mPos = nbl_glsl_transform(skinTform, vPos);
    mNormal = nbl_glsl_fastInverse(mat3(skinTform), vNormal);

    // try to save the results
    if (!beingWrittenTo)
    {
        // set the write bit and drop our reader count in a single atomic;
        // an old value of exactly 1 means we were the sole reader with no
        // writer present, i.e. we got exclusivity
        if (atomicAdd(lock[slot], WRITE_BIT - 1u) == 1u) // got exclusive
        {
            cache[slot] = encode(mPos, mNormal);
            // the tag bits of lock[slot] would also need updating to
            // expected_tag here, part of the open question below
        }
        // we added WRITE_BIT-1u either way, so all that's left to undo is
        // the write bit itself (unsigned wraparound clears it)
        unlockVal = WRITE_BIT;
    }
}
memoryBarrierBuffer();
atomicAdd(lock[slot], unlockVal);
```

Just need to figure out the hash function, or some other cache allocation scheme
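
One candidate, purely as a sketch: mix the three IDs with a known-good 32-bit integer hash (the `pcg_hash` mixer from Jarzynski & Olano's "Hash Functions for GPU Rendering"), take the slot from the low bits and the tag from the leftover high bits, so a slot collision only aliases when the tags collide too. `CACHE_SLOT_COUNT_LOG2`, `keyMix` and the `cacheHash`/`cacheTag` split are made up here, and the collision rate would need measuring:

```glsl
// 32-bit mixer from Jarzynski & Olano, "Hash Functions for GPU Rendering"
uint pcg_hash(uint v)
{
    const uint state = v * 747796405u + 2891336453u;
    const uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

const uint CACHE_SLOT_COUNT_LOG2 = 22u; // hypothetical: ~4M slots

uint keyMix(in uint skinGUID, in uint drawcallGUID, in uint vertexID)
{
    return pcg_hash(skinGUID ^ pcg_hash(drawcallGUID ^ pcg_hash(vertexID)));
}

// slot index = low bits of the mix
uint cacheHash(in uint skinGUID, in uint drawcallGUID, in uint vertexID)
{
    return keyMix(skinGUID, drawcallGUID, vertexID) & ((1u << CACHE_SLOT_COUNT_LOG2) - 1u);
}

// tag = the bits that didn't pick the slot (they'd have to fit the tag field
// of the lock word alongside the reader count and the write bit)
uint cacheTag(in uint skinGUID, in uint drawcallGUID, in uint vertexID)
{
    return keyMix(skinGUID, drawcallGUID, vertexID) >> CACHE_SLOT_COUNT_LOG2;
}
```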