Devsh-Graphics-Programming / Nabla

Vulkan, OptiX and CUDA Interoperation Modular Rendering Library and Framework for PC/Linux/Android

Clustered Lighting #258

Open devshgraphicsprogramming opened 2 years ago

devshgraphicsprogramming commented 2 years ago

Description

System must take as input:

Output must be:

Example:

Testing:

Light Struct

struct nbl_glsl_ext_ClusteredLighting_SpotLight
{
    vec3 position;
    float outerCosineOverCosineRange; // cos(outerHalfAngle)/(cos(innerHalfAngle)-cos(outerHalfAngle))
    uvec2 intensity; // rgb19e7 encoded
    uvec2 direction; // xyz encoded as packSnorm2x16, w used for storing `cosineRange`
};

All lights shall be spotlights (if you want a point light, make it have an outer and inner cosine of -1).
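For illustration, a hedged sketch of decoding the intensity field; the exact bit layout of the rgb19e7 encoding is an assumption here (analogous to the standard RGB9E5 shared-exponent format, scaled up to 19-bit mantissas and a 7-bit exponent), so treat the constants as placeholders:

// Hypothetical decode of the rgb19e7-encoded intensity; assumes mantissas in bits
// [0,19), [19,38), [38,57) of the 64-bit value and a shared 7-bit exponent with bias 63.
vec3 decodeIntensity(in uvec2 enc)
{
    const uint mantissaR = enc[0]&0x7FFFFu;
    const uint mantissaG = (enc[0]>>19u)|((enc[1]&0x3Fu)<<13u);
    const uint mantissaB = (enc[1]>>6u)&0x7FFFFu;
    const float scale = exp2(float(int(enc[1]>>25u)-63-19));
    return vec3(mantissaR,mantissaG,mantissaB)*scale;
}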

Description of the related problem

Attenuation

Computed as

const float radiusSq = ...; // will be constant for all lights in the scene and settable globally via a define
const nbl_glsl_ext_ClusteredLighting_SpotLight light = ...;

const vec3 fromLight = wpos-light.position;
const float lenSq = dot(fromLight,fromLight);
const float attenuation = 0.5f*radiusSq*(1.f-inversesqrt(1.f+radiusSq/lenSq));

// at some point I'll encode the direction better, allowing me to squeeze a windowing coefficient into the 16 bits freed up
const vec2 dir0 = unpackSnorm2x16(light.direction[0]);
const vec2 dir1 = unpackSnorm2x16(light.direction[1]);
const vec3 spotDir = vec3(dir0.xy,dir1.x);
// cosine of the angle between the spot direction and the vector from the light to the shaded point
const float spotCosine = dot(spotDir,fromLight)*inversesqrt(dot(spotDir,spotDir)*dot(fromLight,fromLight));

const float innerCosine = dir1.y;
const float spotPenumbra = clamp(spotCosine*innerCosine-light.outerCosineOverCosineRange,0.f,1.f);

return attenuation*spotPenumbra;

as per https://www.youtube.com/watch?v=wzIcjzKQ2BE

Solution proposal

Shading budget

All we care about is keeping the number of ALU and memory ops, summed across all subgroups, below a global threshold.

This translates to keeping the sum of light index list lengths across all tiles/clusters below a soft (performance) as well as a hard (storage) threshold.

Use R32_UINT format for the cluster pointer texture and partition the bits between offset and count.
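As a sketch of that partitioning (the 22/10 split matches the worst case discussed under Light Rejection below; the helper names are illustrative):

// One texel of the R32_UINT cluster pointer texture: 22 bits of offset into the
// global light index list SSBO and 10 bits of light count (so at most L=1023 lights per cluster).
uint packClusterPointer(in uint offset, in uint count)
{
    return (offset<<10u)|(count&0x3FFu);
}
void unpackClusterPointer(in uint texel, out uint offset, out uint count)
{
    offset = texel>>10u;
    count = texel&0x3FFu;
}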

Tiling/Clustering granularity

In Deferred Shading, a whole workgroup (usually 16x16) maps to a screen tile; it is therefore extremely efficient to bin lights at the granularity of tiles and preload all the necessary lights into shared memory.

In Forward+ the story is similar, as primitives' fragments must be blended in draw order, which enforces an execution order at the macro scale, even on tilers. This means that subgroups will try to stick to the same triangle, or at least the same drawcall. Groups of 4 invocations are going to rasterize 2x2 quads, which are themselves aligned to even pixel coordinates.

We're mostly going to be targeting Forward+.

This is why fine tiles/clusters covering 16x16 pixel footprints make the most sense. You can however give a choice of 32x32 or even 64x64 tiles.

To promote stable cluster rejection of lights, do the clustering in world space (not frustum space).

Light-Cluster Assignment

The main focus is stability; for this reason, do not slice up the view frustum into clusters.

Instead, create a 3D clipmap (nested grids) around the camera; this way, if lights are dropped because of the shading budget, the light popping will be stable.

If we make a 3D texture and atlas it with bricks of size KxKxK (suggestion, K=4), we can run compute dispatches to have 1 workgroup test 1 light against every single voxel (map invocation to voxel).

It would be a good idea to warp the voxels logarithmically:

linear = mix(a,b,percentage);
logarithmic = exp2(mix(log2(a),log2(b),percentage));

Then we can take the innermost 12.5% of the volume (half the extent along each axis), subdivide it, and repeat the process.
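A minimal sketch of how a world-space position could be mapped to a clipmap level under these assumptions (nested grids centered on the camera, each level with twice the extent of the previous one; all names are illustrative):

// Chebyshev (max-axis) distance from the camera picks the clipmap level;
// level 0 is the innermost, finest grid with half-extent `innerHalfExtent`.
uint clipmapLevel(in vec3 wpos, in vec3 cameraPos, in float innerHalfExtent, in uint levelCount)
{
    const vec3 d = abs(wpos-cameraPos);
    const float chebyshev = max(max(d.x,d.y),d.z);
    // each successive level doubles the half-extent, hence the log2
    const float level = ceil(log2(max(chebyshev/innerHalfExtent,1.f)));
    return min(uint(level),levelCount-1u);
}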

Initial assignment as a linked list could work nicely to keep worst-case memory budgets down and to avoid doing two culling passes over the same list of lights and grid, interrupted by a prefix sum (which we do not want). However, at the very end a compaction and linearization pass would be required (turning the linked lists into flat-packed arrays).

So instead of having each cluster traverse a light BVH, we have each light traverse a cluster BVH.

Light Rejection (Budgeting)

To store the light lists in one SSBO, with a 3D R32_UINT texture pointing into it, we require a fundamental limit on the "length" of a cell's light list. Let's call this limit L; we can set it even lower to improve performance.

A worst-case scenario would be 4M lights, which needs 22 bits for the list offset and leaves us with 10 bits to store the count of lights per cluster (L=1023).

If we know that a cell cannot store more than L light references, then we know that during the build (subdivision) process, no cell should have a linked list longer than L*2^subdivisionLevelsRemaining.
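For example, with L=1023 and 3 subdivision levels remaining, that bound works out to 1023*2^3 = 8184 linked list entries for the cell.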

Also, if we do not want to blow the linked list's reserved capacity (or the overall end capacity), we should probably be keeping a min-heap of all the lights.

Additional context

See: https://github.com/buildaworldnet/IrrlichtBAW/issues/216 for previous ideas.

https://www.zora.uzh.ch/id/eprint/107598/1/a11-olsson.pdf http://www.humus.name/Articles/PracticalClusteredShading.pdf

devshgraphicsprogramming commented 2 years ago

Spotlight Culling

Basically you need Cone vs. AABB and Sphere vs. AABB culling first. The cases split into two.

In both cases you first want to test the sphere.
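A minimal sketch of that sphere pre-test (the standard closest-point-on-AABB test; identifiers are illustrative):

// Returns true if the light's bounding sphere overlaps the cluster/tile AABB.
bool sphereVsAABB(in vec3 center, in float radiusSq, in vec3 aabbMin, in vec3 aabbMax)
{
    const vec3 closest = clamp(center,aabbMin,aabbMax);
    const vec3 diff = closest-center;
    return dot(diff,diff)<=radiusSq;
}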

Cone culling https://www.3dgep.com/forward-plus/#frustum-cone-culling

Convex

After testing the sphere, test that the cone is not fully outside the AABB.

Concave

When the spotlight's half angle is > 90 degrees it flips inside out, and then you need to test that the AABB is not fully inside the cone.

https://www.zora.uzh.ch/id/eprint/107598/1/a11-olsson.pdf

devshgraphicsprogramming commented 2 years ago

https://www.intel.com/content/www/us/en/developer/articles/technical/forward-clustered-shading.html

devshgraphicsprogramming commented 2 years ago

https://www.jonmanatee.com/blog/2015/1/20/improved-spotlight-culling-for-clustered-forward-shading.html

devshgraphicsprogramming commented 2 years ago

Old idea for constructing the Light AS

Start by dispatching NumLights invocations, then test each light against 2x2 frustums (each with its own custom far value) and add the light to a per-frustum bucket/list.

We can duplicate each light reference 0 to 4 times.

We can also keep track of how many light references overall we have, and do an indirect dispatch. If we tag each light reference with the frustum (i,j) coord, we can process all the light references for a lower level of the hierarchy in a single dispatch and we can rinse and repeat until we hit the lowest level. This should be pretty fast.

We could actually tag the light references with (i,j,level) [a unique Tile ID]; they could then go onto an append buffer with no special bucketing or linked lists necessary.

Then after we're all done, we can sort the light references by TileID, prefix sum, and write the (offset,lightcount) into a texture that matches the tile hierarchy.

So we end up with what we want: a contiguous list of lights that's easy to iterate in Forward+ and easy to cooperatively load into shared memory in Deferred/VisBuffer.

Old idea for limiting light References

Ooh, and if we want a limit per tile... we can pack (i,j,level) together with some importance_score (approximate attenuation, distance from camera, or projected area on screen) in the LSBs to form the sorting key 😉

One last problem is the duplication of references: if we allow K lights per node and we build top-down, each node needs to be able to add K*4^ancestry_level references to the global append buffer, but during the first pass we also cannot know the lowest importance_score that would let us throw away lights that will be culled because of the limit.

Okay, it's doable: when we do a global frustum cull, we only need a scratch buffer with as many uints as we have lights! If the total of unculled lights exceeds K*ChildrenTileCount, we can do some GPU (or approximate) version of quick_select to find the importance_score of the (K*ChildrenTileCount)-th element.
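A sketch of how such a sort key could be packed (the bit widths, the per-axis tile limit, and the quantization of importance_score are assumptions for illustration):

// 64-bit sort key as a (hi,lo) pair: the Tile ID (level,i,j) goes in the high word so
// references group by tile, the quantized importance in the low word so that within a
// tile the most important lights sort first. Compare .x first, then .y.
uvec2 makeSortKey(in uint level, in uint i, in uint j, in float importance)
{
    const uint tileID = (level<<20u)|(i<<10u)|j; // assumes at most 1024 tiles per axis
    // for non-negative floats the bit pattern is monotonic; invert so that a higher
    // importance yields a smaller key and therefore sorts earlier
    const uint quantizedImportance = 0xFFFFFFFFu-floatBitsToUint(importance);
    return uvec2(tileID,quantizedImportance);
}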

devshgraphicsprogramming commented 2 years ago

Newer idea for Light AS

We frustum cull and then sort the lights by importance; the scratch memory needed to do this is O(numLights). We then take the first min(K*TileCount,LightCount) most important lights.

Then we subdivide the frustum into 2x2 (or 4x4 or 8x8) sub-frustums and output a uint4/uint16/uint64 per light, with each bit enabled/disabled depending on whether the light intersects the sub-frustum.

Then we Scan (prefix sum courtesy of @achal) these per-sub-frustum bits to obtain an offset table, and we scatter lightID[i] into memory location subFrustumID*K*subFrustumTileCount+prefixSum[i], but only if prefixSum[i] < K*subFrustumTileCount. This lets us fill the per-sub-frustum lightID buckets with the lights that intersect them, but in a stable ordering (most important first!).

The only question that remains now, given that we'll use this with Forward+, is the choice of the frustum hierarchy shape.
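A rough sketch of that scatter step under these assumptions (prefix is the exclusive prefix sum of the intersection bits for sub-frustum s, i.e. how many more-important lights already intersect it; the buffer names are illustrative):

layout(set=0,binding=0) buffer SortedLights { uint lightID[]; };     // sorted most important first
layout(set=0,binding=1) buffer Buckets { uint lightIDBuckets[]; };

// Bucket capacity per sub-frustum is K*subFrustumTileCount; lights past that are dropped.
void scatterLight(in uint i, in uint s, in bool intersects, in uint prefix, in uint K, in uint subFrustumTileCount)
{
    const uint capacity = K*subFrustumTileCount;
    if (intersects && prefix<capacity)
        lightIDBuckets[s*capacity+prefix] = lightID[i];
}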

devshgraphicsprogramming commented 2 years ago

How to produce a compact list with a fixed-size memory pool without having to re-cull the lights

Linked List Rewrite

Use the pool to give you linked list nodes (or segmented array).

In your grid cells keep an offset to the forward linked list head, as well as a counter of total nodes in the list.

Exclusive Prefix Sum the cells' light counts; these are now your light list offsets.

Go through the original linked list and rewrite it linearly to the buckets.

P.S. You can use atomicCompSwap on the head pointer and late-record the next offset into your newly allocated linked list node, to allow for multiple invocations to append nodes to the same cell's linked list (probably suboptimal though).
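A sketch of that append, using the exchange-the-head pattern the P.S. alludes to (shown with atomicExchange instead of a compare-swap loop; the buffer layout is illustrative):

layout(set=0,binding=0) buffer NodePool { uvec2 nodes[]; };   // .x = lightID, .y = next node index
layout(set=0,binding=1) buffer CellHeads { uint heads[]; };   // head node index per grid cell
layout(set=0,binding=2) buffer CellCounts { uint counts[]; }; // total nodes per cell's list
layout(set=0,binding=3) buffer Alloc { uint nodeAllocator; };

void appendLightToCell(in uint cell, in uint lightID)
{
    const uint node = atomicAdd(nodeAllocator,1u);          // allocate a node from the pool
    const uint oldHead = atomicExchange(heads[cell],node);  // swap in the new head
    nodes[node] = uvec2(lightID,oldHead);                   // late-record the next pointer
    atomicAdd(counts[cell],1u);
}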

Sort

Output each lightID paired with an (x,y,z) position key (maybe importance too) translated to a space-filling curve.

Sort.

Perform a binary search to find the offsets into the array.

Difference against the next offset along the space-filling curve to get the light count.

devshgraphicsprogramming commented 2 years ago

3 ways of Light to Cluster Assignment

There are two ways of doing a traversal.

Local (stack based)

We assign one invocation to one testee, then traverse all tree nodes to test against.

Global (partition based)

We start with a list of testees and a tree node, 1 invocation per testee.

We filter this list into a separate list for each of the node's children.

Repeat the process. Multiple nodes not sharing subtrees can be processed in parallel.

Light BVH

A BVH of lights is constructed from their AABBs (preferably BVH8 or BVH16).

An invocation is fired out for each cluster and we traverse the BVH to find lights which intersect the cluster AABB.

The intersection is recorded according to the previous post.

Because lights are written to Clusters, it's best to traverse locally via stacks (avoiding atomics and multiple invocations writing to the same cluster).
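A skeleton of that stack-based traversal, one invocation per cluster (a binary BVH for brevity even though wider BVH8/16 nodes are preferred above; the node layout and recordLightForCluster are placeholders):

// Illustrative node: internal nodes store two child indices, leaves store a light ID.
struct BVHNode { vec3 aabbMin; uint leftOrLightID; vec3 aabbMax; uint rightOrInvalid; };
layout(set=0,binding=0) buffer LightBVH { BVHNode bvh[]; };

void recordLightForCluster(in uint lightID); // placeholder: record the hit as in the previous post

bool aabbVsAABB(in vec3 minA, in vec3 maxA, in vec3 minB, in vec3 maxB)
{
    return all(lessThanEqual(minA,maxB)) && all(lessThanEqual(minB,maxA));
}

void traverseLightBVH(in vec3 clusterMin, in vec3 clusterMax)
{
    uint stack[32];
    uint stackSize = 0u;
    stack[stackSize++] = 0u; // root node index
    while (stackSize!=0u)
    {
        const BVHNode node = bvh[stack[--stackSize]];
        if (!aabbVsAABB(clusterMin,clusterMax,node.aabbMin,node.aabbMax))
            continue;
        if (node.rightOrInvalid==0xFFFFFFFFu)
            recordLightForCluster(node.leftOrLightID); // leaf
        else
        {
            stack[stackSize++] = node.leftOrLightID;   // internal: push both children
            stack[stackSize++] = node.rightOrInvalid;
        }
    }
}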

Cluster BVH (Preferred)

Really an octree or a clipmap/cascade in our case of world space clustering.

A flat list of lights is taken, then pruned & partitioned amongst the clusters.

The intersection is recorded according to the previous post.

Because lights are written to Clusters, it's best to traverse globally via partitioning (avoiding atomics and multiple invocations writing to the same cluster).

Assigning invocations to lights and traversing the cluster BVH would be a nightmare of work balancing and atomic updates.

I prefer this option because lights are very likely to move and would need their BVH updated frequently, whereas the world-space cluster BVH just follows the camera position and keeps a constant topology.

Light and Cluster BVH

Light BVH is constructed as in the first option.

Then we traverse both trees at the same time.

Instead of a cluster tree node containing a direct list of light IDs, it would contain a list of light BVH nodes which are roots of disjoint branches.

Every traversal step, you either split the light BVH (turn a small list into a bigger list while retesting) or split the cluster BVH (partition the list while culling against smaller nodes). In both cases you'd map an invocation to a light BVH node ID.

A good heuristic would be to split whichever BVH node is larger.