devshgraphicsprogramming commented 5 years ago

We will work on getting a serialized Mitsuba scene to display and be rendered with:

High Performance
Correct BRDF + Material
Global sunlight
~Reinhard Tonemapping Operator (ACES optional extra)~
~AA / Denoising (TSSAA, SVGF, or something else)~
~Raytraced Shadows (RR 2.0, or own solution for OpenGL)~
Second Bounce (Importance sampled specular)

Division of Work (very rough, missing issues):

devshgraphicsprogramming commented 5 years ago

@devshgraphicsprogramming : Supervision, ~#206, #165, #7, #162, #148~
@Crisspl : #351, ~#130, #119,~ #173, ~#151,~ #227, #280, #298
@Przemog1 : ~#241,~ #215, ~#207, #53, #208, #85~

devshgraphicsprogramming commented 4 years ago

New division of labour:

@devshgraphicsprogramming : Supervision~, #162, #148, #119~
@Crisspl : ~get the video namespace operational, create BSDFs for loaded Mitsuba Meshes,~ #280, #173, #298
@AnastaZIuk : #351, ~#241,~ #215, #207, #85, #384

devshgraphicsprogramming commented 4 years ago

The transforms will be on the GPU [DONE]

An allocator will give each IDummyTransformationSceneNode a UUID (offset) into a buffer holding the values updated by a compute shader.
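A minimal sketch of what such an allocator could look like, assuming a simple free list of recycled slots. `TransformSlotAllocator` and its methods are hypothetical names for illustration, not existing engine API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical allocator: hands out slot indices (offsets) into the transform buffers.
class TransformSlotAllocator
{
public:
    static constexpr uint32_t invalid = 0xffffffffu;

    explicit TransformSlotAllocator(uint32_t capacity) : m_capacity(capacity) {}

    // Returns a free slot, preferring recycled ones, or `invalid` when the buffer is full.
    uint32_t allocate()
    {
        if (!m_free.empty())
        {
            const uint32_t slot = m_free.back();
            m_free.pop_back();
            return slot;
        }
        if (m_next < m_capacity)
            return m_next++;
        return invalid;
    }

    // Returns a slot to the pool; no other slot has to move.
    void release(uint32_t slot)
    {
        assert(slot < m_next);
        m_free.push_back(slot);
    }

private:
    uint32_t m_capacity;
    uint32_t m_next = 0u;
    std::vector<uint32_t> m_free;
};
```

Because released slots are recycled rather than compacted, a node's offset stays valid for its whole lifetime, which is what lets other buffers refer to parents by offset.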

Input

Double-buffered (CPU and GPU) dynamic read_only buffer of relative transforms, each stored as three 4-component vectors.
Separate Buffer of uint (or uvec2) containing absolute offsets to parents (parent ID), with special dead value indicating no parent. Perhaps some flags (like the ready flag).

Output

GPU-only buffer that contains view-independent resultant values + maybe a ready_flag

Related to #119

Special Considerations

Use Timothy Lottes' method of hierarchical raytracing: instead of launching N dispatches for N hierarchy levels, each invocation just recomputes the parent's transform itself if the parent is not ready. Pseudocode:

buffer read_only restrict RelTransformSSBO
{
   mat4x3 rel_tform[];
};
buffer restrict coherent ParentReadySSBO
{
   uint parent_array[];
};
buffer restrict coherent AbsTransformSSBO
{
   mat4x3 abs_tform[];
};
buffer restrict write_only AuxSSBO
{
   AuxOutputStruct aux[];
};
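const uint READY_FLAG = 0x80000000u; // bit 31, see the parent_array layout below
const uint STRIP_READY_FLAG = ~READY_FLAG;
const uint NO_PARENT = 0xdeadbeefu&STRIP_READY_FLAG; // the dead value has to fit in bits 0:30
// pseudocode: `*` on mat4x3 means affine concatenation (implicit last row 0,0,0,1)

// method number 1 - save on registers, do duplicate work
uint parent_ready = parent_array[selfID]; // parent_array holds {0:30 parent_abs_offset,31: self_ready}
if ((parent_ready&READY_FLAG) == 0u) // actually checking self, == binds tighter than & so parenthesize
{
   uint parentID = parent_ready&STRIP_READY_FLAG;
   mat4x3 absoluteTForm = rel_tform[selfID];
   // walk up until we run out of parents or hit an ancestor whose absolute transform is ready
   while (parentID!=NO_PARENT)
   {
      parent_ready = parent_array[parentID];
      if ((parent_ready&READY_FLAG)!=0u)
         break;
      absoluteTForm = rel_tform[parentID]*absoluteTForm;
      parentID = parent_ready&STRIP_READY_FLAG;
   }
   if (parentID!=NO_PARENT)
      absoluteTForm = abs_tform[parentID]*absoluteTForm; // parent goes on the left, same as in the loop
   abs_tform[selfID] = absoluteTForm;
   memoryBarrierBuffer();
   parent_array[selfID] |= READY_FLAG;

   aux[selfID] = computeAuxData(absoluteTForm);
}

// method number 2 - spend registers
// explore up to first ready node
// assumes the chain of not-ready ancestors is shorter than MAX_HIERARCHY_LEN
assert(MAX_HIERARCHY_LEN>1);
uint nodeID[MAX_HIERARCHY_LEN];
nodeID[0] = selfID;
int firstReady = 0;
bool hasReadyAncestor = true;
while (firstReady<MAX_HIERARCHY_LEN-1)
{
   uint parent_ready = parent_array[nodeID[firstReady]];
   if ((parent_ready&READY_FLAG)!=0u)
      break;
   uint parentID = parent_ready&STRIP_READY_FLAG;
   nodeID[++firstReady] = parentID;
   if (parentID==NO_PARENT) // ran off the top of the hierarchy, nodeID[firstReady] is not a real node
   {
      hasReadyAncestor = false;
      break;
   }
}
// compute transforms
if (firstReady!=0)
{
   // pseudocode: identity() stands for the affine identity transform
   mat4x3 absoluteTForm = hasReadyAncestor ? abs_tform[nodeID[firstReady]]:identity();
   for (int i=firstReady-1; i>=0; i--)
   {
      uint thisID = nodeID[i];
      absoluteTForm = absoluteTForm*rel_tform[thisID];
      abs_tform[thisID] = absoluteTForm;
   }
   memoryBarrierBuffer();
   for (int i=firstReady-1; i>=0; i--)
      set_ready(nodeID[i]);

   aux[nodeID[0]] = computeAuxData(absoluteTForm);
}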

To make adding and removing nodes fast, we can drop the requirement that each hierarchy level sit in a contiguous area of memory (no need to shift everything when adding and removing instances, like the Skinning Manager and unlike Instanced Mesh). However, everything would be faster if higher-level nodes generally had lower indices.

There shall be a scene::IGPUTransformManager which shall run once per frame, after all relative transforms have been updated.

For CPU operations that need to know absolute transforms, a similar approach to method 2 could be used on the CPU (with a separate backing buffer; a NaN absolute transform could signal unreadiness).
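A rough CPU-side sketch of that idea, assuming a row-major 3x4 affine type and a NaN in the first element as the "not ready" marker. `Affine`, `CPUTransformCache`, `CPU_NO_PARENT` and the helper names are all made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// 3 rows x 4 columns, row-major affine transform (implicit last row 0,0,0,1).
struct Affine
{
    float m[3][4];
};

inline Affine identity()
{
    return {{{1.f, 0.f, 0.f, 0.f}, {0.f, 1.f, 0.f, 0.f}, {0.f, 0.f, 1.f, 0.f}}};
}

// Returns parent*child, i.e. applies `child` first and `parent` second.
inline Affine concat(const Affine& parent, const Affine& child)
{
    Affine r;
    for (int i = 0; i < 3; i++)
    for (int j = 0; j < 4; j++)
    {
        r.m[i][j] = (j == 3) ? parent.m[i][3] : 0.f;
        for (int k = 0; k < 3; k++)
            r.m[i][j] += parent.m[i][k] * child.m[k][j];
    }
    return r;
}

constexpr uint32_t CPU_NO_PARENT = 0xffffffffu; // assumed dead value for the CPU copy

struct CPUTransformCache
{
    std::vector<uint32_t> parent;
    std::vector<Affine> relative;
    std::vector<Affine> absolute; // m[0][0] is NaN while not ready

    bool ready(uint32_t id) const { return !std::isnan(absolute[id].m[0][0]); }
    void invalidate(uint32_t id) { absolute[id].m[0][0] = std::numeric_limits<float>::quiet_NaN(); }

    // Walks up to the first ready ancestor (or past the root), then fills the chain top-down.
    const Affine& resolve(uint32_t id)
    {
        std::vector<uint32_t> chain;
        uint32_t cursor = id;
        while (cursor != CPU_NO_PARENT && !ready(cursor))
        {
            chain.push_back(cursor);
            cursor = parent[cursor];
        }
        Affine accum = (cursor != CPU_NO_PARENT) ? absolute[cursor] : identity();
        for (auto it = chain.rbegin(); it != chain.rend(); ++it)
        {
            accum = concat(accum, relative[*it]);
            absolute[*it] = accum;
        }
        return absolute[id];
    }
};
```

Like method 2, every node on the walked chain ends up ready, so a later query for a sibling only has to walk up to the shared ancestor.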

When relative transforms update, the change needs to be propagated to the GPU-side buffer, then all child transforms must be marked as "not-ready". Ideally a compute shader should scatter the updated values (avoiding a buffer copy of redundant ranges), as we emphasise the performance of small updates rather than whole-scene resets.
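One way the "mark all children not-ready" step could be sketched on the CPU, assuming the ordering mentioned earlier holds strictly (every parent has a lower index than its children) and the parent word uses bit 31 as the ready flag. `markNotReady` is a hypothetical helper, and the dead value is masked here so it fits in bits 0:30.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t READY_FLAG = 0x80000000u;              // bit 31: self ready
constexpr uint32_t PARENT_MASK = ~READY_FLAG;             // bits 0:30: parent offset
constexpr uint32_t NO_PARENT = 0xdeadbeefu & PARENT_MASK; // dead value, masked to 31 bits

// Clears the ready bit of every node in `dirty` and of all their descendants.
// Assumes every parent index is lower than the indices of its children.
inline void markNotReady(std::vector<uint32_t>& parentReady, const std::vector<uint32_t>& dirty)
{
    for (uint32_t id : dirty)
        parentReady[id] &= PARENT_MASK;
    // a single forward pass is enough because parents are always visited before children
    for (uint32_t id = 0u; id < parentReady.size(); id++)
    {
        const uint32_t parent = parentReady[id] & PARENT_MASK;
        if (parent != NO_PARENT && (parentReady[parent] & READY_FLAG) == 0u)
            parentReady[id] &= PARENT_MASK;
    }
}
```

This touches every node, so it is only the simplest possible form; for small updates the pass would probably want to be restricted to the affected index range or done by the same compute shader that scatters the new values.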

Questions

Do the flag buffer and the absolute transform buffer need to be volatile or coherent? Coherent, with memoryBarrierBuffer().
3x vec4 SoA or 1x mat4x3 AoS SSBO for best bandwidth? AoS, as long as every data field actually needs to be read.
TBO vs read_only SSBO for best perf? SSBO, because of size.

devshgraphicsprogramming commented 4 years ago

Ideal GPU Driven Rendering

CPU side [DONE]

Without any culling at all the command buffer would look like this

bindPipeline(pipeline[0]);
bindIndirectBuffer(indirectdrawbuffer[0]);
if (supported)
   bindIndirectBuffer(indirectparameterbuffer[0]);
bindDescriptors(N,descriptors[0]);
if (supported)
   multiDrawIndirect(MODE,TYPE,drawBufferOffset[0],indirectParameterOffset[0],maxDraws,stride[0]);
else
   multiDrawIndirect(MODE,TYPE,drawBufferOffset[0],maxDraws,stride[0]);
...

Each MDI call is either:

typedef  struct {
        uint  count;
        uint  instanceCount;
        uint  first;
        uint  baseInstance;
    } DrawArraysIndirectCommand;

or

typedef  struct {
        uint  count;
        uint  instanceCount;
        uint  firstIndex;
        uint  baseVertex;
        uint  baseInstance;
    } DrawElementsIndirectCommand;

WIP: Ideal GPU Culling [DONE]

We want a good distribution of work, so each instance should cull itself independently. So we need to know:

Total number of instances
Knowing cumulative instance ID, how to map it to indirect draw command offset
Knowing global or local instance ID, how to get its data
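For the second point, one possible mapping is an exclusive prefix sum over the per-draw instance counts plus a binary search. This is only a sketch; `buildInstanceOffsets`, `lookup` and `DrawInstance` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

struct DrawInstance
{
    uint32_t drawIndex;     // which indirect draw command
    uint32_t localInstance; // instance index inside that command
};

// Exclusive prefix sum over per-draw instance counts, with the total appended at the end.
inline std::vector<uint32_t> buildInstanceOffsets(const std::vector<uint32_t>& instanceCounts)
{
    std::vector<uint32_t> offsets(instanceCounts.size() + 1u, 0u);
    std::partial_sum(instanceCounts.begin(), instanceCounts.end(), offsets.begin() + 1);
    return offsets;
}

// Binary search for the draw whose [offset, nextOffset) range contains the cumulative ID.
inline DrawInstance lookup(const std::vector<uint32_t>& offsets, uint32_t cumulativeID)
{
    assert(cumulativeID < offsets.back()); // offsets.back() is the total number of instances
    const auto it = std::upper_bound(offsets.begin(), offsets.end(), cumulativeID);
    const uint32_t draw = static_cast<uint32_t>(it - offsets.begin()) - 1u;
    return {draw, cumulativeID - offsets[draw]};
}
```

The last element of the offsets array doubles as the total number of instances, and draws with zero instances are skipped naturally because their range is empty.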

struct MultiDraw
{
   uint drawOffset;
   uint stride;
   uint countOffset;
};
restrict read_only buffer DLUT // runtime buffer
{
   MultiDraw multidraw[];
} DrawLookUpTable;

restrict coherent buffer MDIB_Out
{
   uint data[];
} MultiDrawIndirectBufferOut;
restrict coherent buffer MDIB_In
{
   uint data[];
} MultiDrawIndirectBufferIn;

{
   uint multiDrawID; // the index in the total number of MDI commands
   uint localDrawID; // the index in the MDI command
   uint instanceID; // the index in the DI command

   MultiDraw multidraw = DrawLookUpTable.multidraw[multiDrawID];
   uint mdiStructOffset = (multidraw.drawOffset&DRAW_OFFSET);
   bool isDrawElements = (multidraw.drawOffset&DRAW_TYPE)==DRAW_ELEMENTS; // == binds tighter than &, so parenthesize the mask

   bool isCulled = cull(multiDrawID,localDrawID,instanceID);
   if (!isCulled)
   {
      // index into data[] of this draw's command struct (mdiStructOffset is the start of the MDI list)
      uint inOffset = mdiStructOffset+multidraw.stride*localDrawID;
      uint outOffset = inOffset; // the output buffer mirrors the input layout
      // count
      MultiDrawIndirectBufferOut.data[outOffset+0u] = MultiDrawIndirectBufferIn.data[inOffset+0u];
      // instancecount
      uint newInstanceID = atomicAdd(MultiDrawIndirectBufferOut.data[outOffset+1u],1u);
      // first / firstIndex
      MultiDrawIndirectBufferOut.data[outOffset+2u] = MultiDrawIndirectBufferIn.data[inOffset+2u];
      // baseInstance / baseVertex
      MultiDrawIndirectBufferOut.data[outOffset+3u] = MultiDrawIndirectBufferIn.data[inOffset+3u];
      if (isDrawElements)
         MultiDrawIndirectBufferOut.data[outOffset+4u] = MultiDrawIndirectBufferIn.data[inOffset+4u];

      // move instance data to proper offsets
   }
}
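A sequential CPU reference of the same idea can help when checking the shader: survivors get consecutive slots and their per-instance data is moved to match. Here `cullDraw` and the `visible` callback are hypothetical stand-ins for the shader body and `cull()`, and the per-instance payload is reduced to a single uint.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

struct DrawArraysIndirectCommand
{
    uint32_t count, instanceCount, first, baseInstance;
};

// Culls the instances of one draw; survivors get consecutive slots starting at baseInstance.
inline DrawArraysIndirectCommand cullDraw(const DrawArraysIndirectCommand& in,
    const std::vector<uint32_t>& perInstanceIn, std::vector<uint32_t>& perInstanceOut,
    const std::function<bool(uint32_t)>& visible)
{
    DrawArraysIndirectCommand out = in; // count, first and baseInstance are copied through
    out.instanceCount = 0u;             // the output instance count starts cleared
    for (uint32_t i = 0u; i < in.instanceCount; i++)
    {
        const uint32_t src = in.baseInstance + i;
        if (!visible(perInstanceIn[src]))
            continue;
        const uint32_t newInstanceID = out.instanceCount++; // an atomicAdd on the GPU
        // move instance data to its compacted slot
        perInstanceOut[in.baseInstance + newInstanceID] = perInstanceIn[src];
    }
    return out;
}
```

On the GPU the order in which survivors claim slots is not deterministic, so only the set of surviving instances can be compared against this reference, not their order.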

devshgraphicsprogramming commented 4 years ago

WIP: MultiDraw, Bindless and Deferred Friendly shading

#ifdef GL_ARB_shader_draw_parameters
restrict read_only buffer IDBuffer
{
   uvec2 globalInstanceID[];
};
#else
in uvec2 instance_mesh_IDs; // fake with gl_BaseInstance = gl_DrawID+pushConstants.BaseDraw
#endif

void main()
{
...
#ifdef GL_ARB_shader_draw_parameters
   uvec2 instance_mesh_IDs = globalInstanceID[gl_DrawID+pushConstants.BaseDraw];
#endif
...
}

Can we settle on a single uint for instance_mesh_IDs ?

If we reserve N (N==meshCount) mesh-instance datas together with the per-instance data when allocating, it might be fine.

devshgraphicsprogramming commented 4 years ago

Buffers

(restrict coherent) Node Parent link list + last and current animation frame + global armature keyframe list ID
(restrict read_only) Node Relative Transform + Per-Node Data for Transform Computation
(restrict coherent) Node Absolute Transform
(restrict read_only) Armature bone keyframe lists struct {uint count; uint offsetToKeyframeData; float timestamp[];};
(restrict read_only) Skinning Animation for all Armature Parts (can have separate keyframe lists per bone)
(optional restrict write_only) Node Auxiliary Data that can be computed in the transform shader

The first and second buffers can probably be merged.

(restrict read_only) MultiDrawIndirect Input buffer for all command buffers from all renderpasses
(restrict coherent) MultiDrawIndirect Output buffer for all command buffers from all renderpasses
(restrict write_only) MultiDrawIndirect Output buffer to clear
(restrict coherent) Indirect Parameter Buffer to keep count of MultiDrawIndirect counts
(restrict read_only) Buffer that maps gl_GlobalInvocationIndex to (cameraIndex/viewIndex,parameterIndex,drawIndex,instanceIndex,objectID,meshBufferID,uniqueDrawInstanceID) implicitly or explicitly
(restrict write_only) Buffer that holds important per-uniqueDrawInstanceID data such as the fully concatenated projection, view and inverse matrices (at least 192 bytes per draw).

CameraID is to get the view and projection matrices (plus other global stuff).

ParameterIndex can retrieve per-pipeline things.

ObjectID (pseudo-nodeID) is to get the absolute world transformation.

MeshBufferID is to get the bounding box and any other data necessary for culling.

uniqueDrawInstanceID obtains the output index for successfully computed data.

Observations

Need following flags for our Culling Manager:

USE_INDIRECT_PARAMETERS whether to use gl_DrawID and a parameter buffer
DYNAMIC_COMMAND_BUFFER whether to re-record the command buffer
DOWNLOAD_RESULTS (requires DYNAMIC_COMMAND_BUFFER) whether to download the culling results and skip some pipeline binds and draws

This shall be a 1 million draw-call-equivalent capable system.

Extensions

We could set up an extra parameter/query to skip updating absolute transforms and animations when we know for certain that the node itself and all of its children will not be visible.

devshgraphicsprogramming commented 4 years ago

Overview and Order of New Rendering

Animate and BoundingBox Update

Scene Node hierarchies are updated, nodes are animated from armatures if necessary.

Per View Draw Cull

This runs after cameras/views have enqueued the meshbuffers to their command buffers to be drawn later.

First pass updates the per-instance-per-meshbuffer bounding boxes for skinned meshes (could be folded into the first stage).

Second pass modifies multi-draw indirect argument and parameter buffers by culling per-instance-per-meshbuffer-per-view.

Skinning and Triangle Culling

Skinning is performed only if an update is required (determined by movement of bones) and some drawcall of the instance is visible (in any view at all). The skinning is performed in world space and per-vertex.

Then primitive culling is performed per index tuple (depends on primitive type). Triangle broadcast is followed by per-view transform and culling.

It must be possible to disable triangle culling for some draw calls.

Extension

Turn the primitive culling into a DispatchIndirect, run a cluster cull (view cone, frustum, hiZ) against groups of 256 primitives.

Compaction

Need to remove zero instance or zero-index-count draw calls from MDI lists, optionally remove entire MDI calls and pipeline binds from the command buffer if DOWNLOAD_RESULTS present.
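The first part (dropping empty draws from one MDI list) could look roughly like this on the CPU; a GPU version would need a parallel stream compaction instead. `compactDraws` is a hypothetical helper and the struct follows the layout listed earlier in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct DrawElementsIndirectCommand
{
    uint32_t count, instanceCount, firstIndex, baseVertex, baseInstance;
};

// Removes draws that would render nothing, keeping the order of the rest.
// Returns the new draw count, i.e. the value that would go into the indirect parameter buffer.
inline uint32_t compactDraws(std::vector<DrawElementsIndirectCommand>& draws)
{
    const auto newEnd = std::remove_if(draws.begin(), draws.end(),
        [](const DrawElementsIndirectCommand& d) { return d.count == 0u || d.instanceCount == 0u; });
    draws.erase(newEnd, draws.end());
    return static_cast<uint32_t>(draws.size());
}
```

If the returned count is zero, the whole MDI call and its pipeline bind are candidates for removal, which is the optional part that depends on DOWNLOAD_RESULTS.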

Dispatch

Command Buffer containing the render commands after compaction is fired off.

devshgraphicsprogramming commented 4 years ago

Assign InvocationIDs so that whole workgroup (or subgroup) has the same Coarse ID (LDS or Subgroup Intrinsic optimization)
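One way to get that property is to round each coarse item's invocation range up to whole workgroups, so no workgroup straddles two items. A sketch under that assumption; `assignWorkgroups` is a hypothetical helper and the workgroup size of 256 is assumed, not fixed anywhere in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t WORKGROUP_SIZE = 256u; // assumed size

// For each coarse item (e.g. a draw) returns the first workgroup assigned to it, with the
// total workgroup count appended as the last element. Each item's range of fine invocations
// is rounded up to whole workgroups, so all invocations in a workgroup share one coarse ID.
inline std::vector<uint32_t> assignWorkgroups(const std::vector<uint32_t>& fineCounts)
{
    std::vector<uint32_t> firstWorkgroup;
    firstWorkgroup.reserve(fineCounts.size() + 1u);
    uint32_t next = 0u;
    for (uint32_t count : fineCounts)
    {
        firstWorkgroup.push_back(next);
        next += (count + WORKGROUP_SIZE - 1u) / WORKGROUP_SIZE;
    }
    firstWorkgroup.push_back(next);
    return firstWorkgroup;
}
```

The cost is some idle invocations in the last workgroup of each item; in exchange the coarse data can be fetched once per workgroup (into LDS) or once per subgroup.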

devshgraphicsprogramming commented 4 years ago

Triangle and cluster culling should be outside the scope of the raytracing example.

devshgraphicsprogramming commented 4 years ago

Allow for Per-View-Draw cull etc. for multiple views, but not needing all the drawcalls for a view (opens up better scheduling).

devshgraphicsprogramming commented 4 years ago

see https://github.com/buildaworldnet/IrrlichtBAW/issues/119#issuecomment-598330860

devshgraphicsprogramming commented 4 years ago

We have something that works nicely.

buildaworldnet / IrrlichtBAW

Raytracing Example #303
