buildaworldnet / IrrlichtBAW

Build A World fork of Irrlicht
http://www.buildaworld.net
Apache License 2.0

Raytracing Example #303

Closed devshgraphicsprogramming closed 4 years ago

devshgraphicsprogramming commented 5 years ago

We will work on getting a serialized Mitsuba scene to display and be rendered with:

  1. High Performance
  2. Correct BRDF + Material
  3. Global sunlight
  4. ~Reinhard Tonemapping Operator (ACES optional extra)~
  5. ~AA / Denoising (TSSAA, SVGF, or something else)~
  6. ~Raytraced Shadows (RR 2.0, or own solution for OpenGL)~
  7. Second Bounce (Importance sampled specular)

Division of Work (very rough, missing issues):

devshgraphicsprogramming commented 5 years ago

@devshgraphicsprogramming : Supervision, ~#206, #165, #7, #162, #148~
@Crisspl : #351, ~#130, #119,~ #173, ~#151,~ #227, #280, #298
@Przemog1 : ~#241,~ #215, ~#207, #53, #208, #85~

devshgraphicsprogramming commented 4 years ago

New division of labour:

@devshgraphicsprogramming : Supervision~, #162, #148, #119~
@Crisspl : ~get the video namespace operational, create BSDFs for loaded Mitsuba Meshes,~ #280, #173, #298
@AnastaZIuk : #351, ~#241,~ #215, #207, #85, #384

devshgraphicsprogramming commented 4 years ago

The transforms will be on the GPU [DONE]

An allocator will give each IDummyTransformationSceneNode a UUID (offset) into a buffer holding the values updated by a compute shader.

Input

  1. Double Buffered (CPU and GPU) dynamic read_only buffer of relative transforms as 3x 4 component vector.
  2. Separate Buffer of uint (or uvec2) containing absolute offsets to parents (parent ID), with special dead value indicating no parent. Perhaps some flags (like the ready flag).

Output

  1. GPU-only buffer that contains view-independent resultant values + maybe a ready_flag

Related to #119

Special Considerations

Use Timothy Lottes' method of hierarchical raytracing: instead of launching N dispatches for N hierarchy levels, just recompute if the parent is not ready. Pseudocode:

buffer read_only restrict RelTransformSSBO
{
   mat4x3 rel_tform[];
};
buffer restrict coherent ParentReadySSBO
{
   uint parent_array[];
};
buffer restrict coherent AbsTransformSSBO
{
   mat4x3 abs_tform[];
};
buffer restrict write_only AuxSSBO
{
   AuxOutputStruct aux[];
};
const uint STRIP_READY_FLAG = ~READY_FLAG;

// method number 1 - save on registers, do duplicate work
uint parent_ready = parent_array[selfID]; // parent_array holds {0:30 parent_abs_offset,31: self_ready}
uint parentID = parent_ready&STRIP_READY_FLAG;
if ((parent_ready&READY_FLAG) == 0u) // actually checking self; parentheses needed because == binds tighter than &
{
   mat4x3 absoluteTForm = rel_tform[selfID];
   for (parent_ready=parent_array[parentID]; (parent_ready&READY_FLAG)==0u; parent_ready=parent_array[parentID])
   {
      absoluteTForm = rel_tform[parentID]*absoluteTForm;
      parentID = parent_ready&STRIP_READY_FLAG;
   }
   if (parentID!=0xdeadbeefu)
      absoluteTForm = abs_tform[parentID]*absoluteTForm; // parent on the left, same order as in the loop above
   abs_tform[selfID] = absoluteTForm;
   memoryBarrierBuffer();
   parent_array[selfID] |= READY_FLAG;

   aux[selfID] = computeAuxData(absoluteTForm);
}

// method number 2 - spend registers
// explore up to first ready node
assert(MAX_HIERARCHY_LEN>1);
uint nodeID[MAX_HIERARCHY_LEN];
nodeID[0] = selfID;
int firstReady=0;
while (firstReady<MAX_HIERARCHY_LEN-1)
{
   uint parent_ready = parent_array[nodeID[firstReady]];
   if ((parent_ready&READY_FLAG)!=0u)
      break;
   nodeID[++firstReady] = parent_ready&STRIP_READY_FLAG;
}
// compute transforms
if (firstReady!=0)
{
   mat4x3 absoluteTForm = abs_tform[nodeID[firstReady]];
   for (int i=firstReady-1; i>=0; i--)
   {
      uint thisID = nodeID[i];
      abs_tform[thisID] = absoluteTForm *= rel_tform[thisID];
   }
   memoryBarrierBuffer();
   for (int i=firstReady-1; i>=0; i--)
      set_ready(nodeID[i]);

   aux[nodeID[0]] = computeAuxData(absoluteTForm);
}

To make adding and removing nodes fast, we can let go of the requirement that each hierarchy level sit in a contiguous area of memory (we don't have to shift everything when adding and removing instances; like the Skinning Manager, unlike Instanced Mesh). However, it would make everything faster if higher-level nodes generally had lower indices.

There shall be a scene::IGPUTransformManager which shall run once per frame, after all relative transforms have been updated.

For CPU operations that need to know absolute transforms, a similar approach to method 2 could be used on the CPU (with a separate backing buffer, a NaN absolute transform could signal unreadiness).
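A minimal CPU sketch of that idea, under stated assumptions: the `Affine` type, the `CPUTransformCache` struct, and every helper name below are hypothetical and not part of the engine. The cache walks up to the first ready ancestor, then fills in absolute transforms on the way back down, with a NaN in the first matrix element marking an entry as not ready.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical 3x4 affine transform, row-major, implicit last row (0,0,0,1).
struct Affine
{
    float m[12];
};

constexpr uint32_t NO_PARENT = 0xdeadbeefu; // "dead" parent value, as in the pseudocode

inline Affine identity()
{
    return {{1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0}};
}

// A NaN in the first element marks the absolute transform as stale.
inline Affine notReady()
{
    Affine a = identity();
    a.m[0] = std::numeric_limits<float>::quiet_NaN();
    return a;
}

inline bool isReady(const Affine& a)
{
    return !std::isnan(a.m[0]);
}

// Returns parent * child, so child-local points end up in the parent's space.
inline Affine compose(const Affine& p, const Affine& c)
{
    Affine r;
    for (int row = 0; row < 3; ++row)
    {
        for (int col = 0; col < 4; ++col)
        {
            // the implicit (0,0,0,1) row only contributes to the translation column
            float sum = (col == 3) ? p.m[row * 4 + 3] : 0.f;
            for (int k = 0; k < 3; ++k)
                sum += p.m[row * 4 + k] * c.m[k * 4 + col];
            r.m[row * 4 + col] = sum;
        }
    }
    return r;
}

struct CPUTransformCache
{
    std::vector<Affine> rel;      // relative transforms
    std::vector<uint32_t> parent; // parent index or NO_PARENT
    std::vector<Affine> abs;      // cached absolute transforms, NaN when stale

    const Affine& getAbsolute(uint32_t self)
    {
        assert(self < abs.size());
        // explore up to the first ready node (or past the root)
        std::vector<uint32_t> chain;
        uint32_t node = self;
        while (node != NO_PARENT && !isReady(abs[node]))
        {
            chain.push_back(node);
            node = parent[node];
        }
        // recompute top-down, caching every intermediate result
        Affine current = (node == NO_PARENT) ? identity() : abs[node];
        for (auto it = chain.rbegin(); it != chain.rend(); ++it)
        {
            current = compose(current, rel[*it]);
            abs[*it] = current;
        }
        return abs[self];
    }
};
```

Querying a leaf also refreshes all of its stale ancestors, so later queries on siblings stop at the first cached parent.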

When relative transforms update, the change needs to be propagated to the GPU-side buffer, then all children transforms must be marked as "not-ready". Ideally a compute shader should scatter the updated values (avoiding a buffer copy of redundant ranges), as we emphasise the performance of small updates rather than whole-scene resets.
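The "not-ready" marking can be sketched on the CPU as a single forward pass, assuming parents always sit at lower indices than their children (a stricter version of the ordering preference mentioned earlier). `markNotReady` is a hypothetical helper, and `NO_PARENT` here is a hypothetical sentinel chosen so that it survives stripping the ready bit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t READY_FLAG = 0x80000000u; // bit 31: self_ready
constexpr uint32_t STRIP_READY_FLAG = ~READY_FLAG;
// Hypothetical "no parent" sentinel that fits in bits 0..30.
constexpr uint32_t NO_PARENT = STRIP_READY_FLAG;

// Clears the ready bit of every updated node and of all of their descendants.
// Assumes parentArray[i]'s parent index is always lower than i.
// Returns how many nodes are not ready afterwards.
inline uint32_t markNotReady(std::vector<uint32_t>& parentArray,
                             const std::vector<uint32_t>& updated)
{
    for (uint32_t id : updated)
    {
        assert(id < parentArray.size());
        parentArray[id] &= STRIP_READY_FLAG;
    }
    uint32_t dirtyCount = 0u;
    for (uint32_t i = 0u; i < parentArray.size(); ++i)
    {
        const uint32_t parentID = parentArray[i] & STRIP_READY_FLAG;
        // the parent was already visited, so its flag is final at this point
        const bool parentDirty = parentID != NO_PARENT &&
                                 (parentArray[parentID] & READY_FLAG) == 0u;
        if (parentDirty)
            parentArray[i] &= STRIP_READY_FLAG;
        if ((parentArray[i] & READY_FLAG) == 0u)
            ++dirtyCount;
    }
    return dirtyCount;
}
```

Without the ordering guarantee, the pass would have to repeat until no flag changes, which is one reason to keep higher-level nodes at lower indices.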

Questions

devshgraphicsprogramming commented 4 years ago

Ideal GPU Driven Rendering

CPU side [DONE]

Without any culling at all, the command buffer would look like this:

bindPipeline(pipeline[0]);
bindIndirectBuffer(indirectdrawbuffer[0]);
if (supported)
   bindIndirectBuffer(indirectparameterbuffer[0]);
bindDescriptors(N,descriptors[0]);
if (supported)
   multiDrawIndirect(MODE,TYPE,drawBufferOffset[0],indirectParameterOffset[0],maxDraws,stride[0]);
else
   multiDrawIndirect(MODE,TYPE,drawBufferOffset[0],maxDraws,stride[0]);
...

Each MDI call consumes an array of commands that are either:

typedef  struct {
        uint  count;
        uint  instanceCount;
        uint  first;
        uint  baseInstance;
    } DrawArraysIndirectCommand;

or

typedef  struct {
        uint  count;
        uint  instanceCount;
        uint  firstIndex;
        uint  baseVertex;
        uint  baseInstance;
    } DrawElementsIndirectCommand;
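For CPU-side buffer filling, the two layouts can be mirrored as plain structs. This is a sketch: `commandByteOffset` is a hypothetical helper, and it relies on the OpenGL convention that a stride of 0 means tightly packed commands and that a non-zero stride is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct DrawArraysIndirectCommand
{
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t baseInstance;
};

struct DrawElementsIndirectCommand
{
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    uint32_t baseVertex;
    uint32_t baseInstance;
};

// The GPU reads these as tightly packed 32-bit words.
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "expected 4 words");
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "expected 5 words");

// Byte offset of the localDrawID-th command inside the indirect buffer.
inline std::size_t commandByteOffset(std::size_t drawBufferOffset,
                                     std::size_t stride,
                                     uint32_t localDrawID,
                                     bool isDrawElements)
{
    assert(stride % 4u == 0u);
    const std::size_t tight = isDrawElements ? sizeof(DrawElementsIndirectCommand)
                                             : sizeof(DrawArraysIndirectCommand);
    const std::size_t effective = (stride != 0u) ? stride : tight;
    return drawBufferOffset + effective * localDrawID;
}
```

A stride larger than the struct size leaves room for per-draw padding or extra data between commands.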

WIP: Ideal GPU Culling [DONE]

We want a good distribution of work, so each instance should cull itself independently. So we need to know:

struct MultiDraw
{
   uint drawOffset;
   uint stride;
   uint countOffset;
};
restrict read_only buffer DLUT // runtime buffer
{
   MultiDraw multidraw[];
} DrawLookUpTable;

restrict coherent buffer MDIB_Out
{
   uint data[];
} MultiDrawIndirectBufferOut;
restrict coherent buffer MDIB_In
{
   uint data[];
} MultiDrawIndirectBufferIn;

{
   uint multiDrawID; // the index in the total number of MDI commands
   uint localDrawID; // the index in the MDI command
   uint instanceID; // the index in the DI command

   MultiDraw multidraw = DrawLookUpTable.multidraw[multiDrawID];
   uint mdiStructOffset = (multidraw.drawOffset&DRAW_OFFSET);
   bool isDrawElements = (multidraw.drawOffset&DRAW_TYPE)==DRAW_ELEMENTS;

   bool isCulled = cull(multiDrawID,localDrawID,instanceID);
   if (!isCulled)
   {
      // assumption: the draw structs start at mdiStructOffset (countOffset addresses the draw count instead)
      uint inOffset = mdiStructOffset+multidraw.stride*localDrawID;
      uint outOffset = mdiStructOffset+multidraw.stride*localDrawID;
      // count
      MultiDrawIndirectBufferOut.data[outOffset+0u] = MultiDrawIndirectBufferIn.data[inOffset+0u];
      // instancecount
      uint newInstanceID = atomicAdd(MultiDrawIndirectBufferOut.data[outOffset+1u],1u);
      // first / firstIndex
      MultiDrawIndirectBufferOut.data[outOffset+2u] = MultiDrawIndirectBufferIn.data[inOffset+2u];
      // baseInstance / baseVertex
      MultiDrawIndirectBufferOut.data[outOffset+3u] = MultiDrawIndirectBufferIn.data[inOffset+3u];
      if (isDrawElements)
         MultiDrawIndirectBufferOut.data[outOffset+4u] = MultiDrawIndirectBufferIn.data[inOffset+4u];

      // move instance data to proper offsets
   }
}
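
A serial CPU reference for the per-instance culling above can help when validating the shader. It is a sketch under assumptions: the `Instance` struct, its `visible` field (standing in for `cull()`), and the choice to write surviving instance data at `baseInstance + slot` are all hypothetical, since the pseudocode leaves the instance data move unspecified.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct DrawCommand
{
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t baseInstance;
};

// Hypothetical per-instance input: which draw it belongs to, its data, and the cull result.
struct Instance
{
    uint32_t localDrawID;
    uint32_t payload;
    bool visible;
};

inline void cullReference(const std::vector<DrawCommand>& in,
                          const std::vector<Instance>& instances,
                          std::vector<DrawCommand>& out,
                          std::vector<uint32_t>& outPayload)
{
    // copy everything except instanceCount, which is rebuilt from survivors
    out = in;
    uint32_t payloadSize = 0u;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i].instanceCount = 0u;
        const uint32_t end = in[i].baseInstance + in[i].instanceCount;
        if (end > payloadSize)
            payloadSize = end;
    }
    outPayload.assign(payloadSize, 0u);

    for (const Instance& inst : instances)
    {
        assert(inst.localDrawID < in.size());
        if (!inst.visible)
            continue;
        // serial equivalent of atomicAdd(instanceCount, 1u): the old value is the slot
        const uint32_t slot = out[inst.localDrawID].instanceCount++;
        assert(slot < in[inst.localDrawID].instanceCount);
        outPayload[in[inst.localDrawID].baseInstance + slot] = inst.payload;
    }
}
```

On the GPU the slot order within a draw is not deterministic, so a comparison against this reference should treat each draw's surviving payloads as a set.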

devshgraphicsprogramming commented 4 years ago

WIP: MultiDraw, Bindless and Deferred Friendly shading

#ifdef ARB_shader_draw_parameters
restrict read_only buffer IDBuffer
{
   uvec2 globalInstanceID[];
};
#else
in uvec2 instance_mesh_IDs; // fake with gl_BaseInstance = gl_DrawID+pushConstants.BaseDraw
#endif

void main()
{
...
#ifdef ARB_shader_draw_parameters
uvec2 instance_mesh_IDs = globalInstanceID[gl_DrawID+pushConstants.BaseDraw];
#endif
...
}

Can we settle on a single uint for instance_mesh_IDs?

If we reserve N (N==meshCount) mesh-instance data entries together with the per-instance data when allocating, it might be fine.
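One way to read that proposal, sketched with hypothetical names (`MeshInstanceRecord`, `InstanceMeshAllocator`): reserve N consecutive records per instance, so that a single uint identifies a mesh-instance and each record points back at its instance. This is only an illustration of the idea, not a settled design.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record reachable from a single uint ID.
struct MeshInstanceRecord
{
    uint32_t instanceBase; // ID of the first record of the owning instance
    uint32_t meshIndex;    // which of the instance's meshes this record is
};

class InstanceMeshAllocator
{
public:
    // Reserves meshCount consecutive records and returns the ID of the first one.
    uint32_t allocateInstance(uint32_t meshCount)
    {
        assert(meshCount > 0u);
        const uint32_t base = static_cast<uint32_t>(records.size());
        for (uint32_t i = 0u; i < meshCount; ++i)
            records.push_back({base, i});
        return base;
    }

    // A single uint is enough to recover both the instance and the mesh.
    const MeshInstanceRecord& lookup(uint32_t id) const
    {
        assert(id < records.size());
        return records[id];
    }

private:
    std::vector<MeshInstanceRecord> records;
};
```

The ID of mesh `i` of an instance is then `base + i`, at the cost of one extra indirection compared with a `uvec2`.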

devshgraphicsprogramming commented 4 years ago

Buffers

First and second buffer can be merged together probably.

CameraID is to get the view and projection matrices (plus other global stuff).

ParameterIndex can retrieve per-pipeline things.

ObjectID (pseudo-nodeID) is to get the absolute world transformation.

MeshBufferID is to get the bounding box and any other data necessary for culling.

uniqueDrawInstanceID obtains the output index for successfully computed data.

Observations

Need the following flags for our Culling Manager:

This shall be a 1 million draw-call-equivalent capable system.

Extensions

Could set-up an extra parameter/query to forego updating absolute transforms and animations if we know self and all children nodes will not be visible for certain.

devshgraphicsprogramming commented 4 years ago

Overview and Order of New Rendering

Animate and BoundingBox Update

Scene Node hierarchies are updated, nodes are animated from armatures if necessary.

Per View Draw Cull

Runs after cameras/views have enqueued the meshbuffers into their command buffers to be drawn later.

First pass updates the per-instance-per-meshbuffer bounding boxes for skinned meshes (could be folded into the first stage).

Second pass modifies multi-draw indirect argument and parameter buffers by culling per-instance-per-meshbuffer-per-view.

Skinning and Triangle Culling

Skinning is performed only if an update is required (determined by movement of bones) and some drawcall of the instance is visible (in any view at all). The skinning is performed in world space and per-vertex.

Then primitive culling is performed per index tuple (depends on primitive type). Triangle broadcast is followed by per-view transform and culling.

It must be possible to disable triangle culling for some draw-calls.

Extension

Turn the primitive culling into a DispatchIndirect, run a cluster cull (view cone, frustum, hiZ) against groups of 256 primitives.
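The dispatch arguments for such a cluster cull could be produced like this. It is a sketch assuming one workgroup per cluster of 256 primitives; `makeClusterCullDispatch` is a hypothetical helper, and on the GPU the same arithmetic would be done by the pass that writes the indirect buffer.

```cpp
#include <cassert>
#include <cstdint>

// Layout consumed by an indirect compute dispatch.
struct DispatchIndirectCommand
{
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
};

constexpr uint32_t CLUSTER_SIZE = 256u; // primitives per cluster

// Assumes one workgroup per cluster; the last cluster may be partially filled.
inline DispatchIndirectCommand makeClusterCullDispatch(uint32_t primitiveCount)
{
    DispatchIndirectCommand cmd;
    // overflow-free ceil(primitiveCount / CLUSTER_SIZE)
    cmd.num_groups_x = primitiveCount / CLUSTER_SIZE +
                       ((primitiveCount % CLUSTER_SIZE != 0u) ? 1u : 0u);
    cmd.num_groups_y = 1u;
    cmd.num_groups_z = 1u;
    return cmd;
}
```

The shader still has to bounds-check primitives in the last cluster, since the group count is rounded up.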

Compaction

Need to remove zero-instance or zero-index-count draw calls from MDI lists, and optionally remove entire MDI calls and pipeline binds from the command buffer if DOWNLOAD_RESULTS is present.
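A CPU reference for the first part of that compaction, as a sketch: `compactDraws` is a hypothetical helper that drops empty draws while keeping the order of the remaining ones. A GPU version would more likely use a prefix sum over a "keep" mask, which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawElementsIndirectCommand
{
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    uint32_t baseVertex;
    uint32_t baseInstance;
};

// Removes draws that would render nothing and returns the new draw count,
// which is the value an indirect parameter buffer would hold.
inline uint32_t compactDraws(std::vector<DrawElementsIndirectCommand>& draws)
{
    const std::size_t before = draws.size();
    auto newEnd = std::remove_if(draws.begin(), draws.end(),
        [](const DrawElementsIndirectCommand& d)
        {
            return d.instanceCount == 0u || d.count == 0u;
        });
    draws.erase(newEnd, draws.end());
    assert(draws.size() <= before);
    return static_cast<uint32_t>(draws.size());
}
```

`std::remove_if` is stable, so draws that survive keep their relative order.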

Dispatch

Command Buffer containing the render commands after compaction is fired off.

devshgraphicsprogramming commented 4 years ago

Assign InvocationIDs so that whole workgroup (or subgroup) has the same Coarse ID (LDS or Subgroup Intrinsic optimization)

devshgraphicsprogramming commented 4 years ago

Triangle and cluster culling should be outside the scope of the raytracing example.

devshgraphicsprogramming commented 4 years ago

Allow Per-View-Draw cull etc. to run for multiple views at once, without needing all the drawcalls of a view to be available first (this opens up better scheduling).

devshgraphicsprogramming commented 4 years ago

see https://github.com/buildaworldnet/IrrlichtBAW/issues/119#issuecomment-598330860

devshgraphicsprogramming commented 4 years ago

We have something that works nicely.