devshgraphicsprogramming closed this issue 4 years ago
@devshgraphicsprogramming : Supervision, ~#206, #165, #7, #162, #148~ @Crisspl : #351, ~#130, #119,~ #173, ~#151,~ #227, #280, #298 @Przemog1 : ~#241,~ #215, ~#207, #53, #208, #85~
New division of labour:
@devshgraphicsprogramming : Supervision~, #162, #148, #119~
@Crisspl : ~get the video namespace operational, create BSDFs for loaded Mitsuba Meshes,~ #280, #173, #298
@AnastaZIuk: #351, ~#241,~ #215, #207, #85, #384
An allocator will give each IDummyTransformationSceneNode a UUID (offset) into a buffer holding the values updated by a compute shader.
Related to #119
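A minimal sketch of how such an allocator could hand out offsets, assuming a fixed-capacity buffer and a free list for recycling. `TransformSlotAllocator`, `allocate` and `release` are hypothetical names for illustration, not existing engine API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical free-list allocator: hands out slot indices (offsets in units
// of one transform) into a fixed-capacity buffer and recycles released ones.
class TransformSlotAllocator
{
public:
    static constexpr std::uint32_t invalid = 0xffffffffu;

    explicit TransformSlotAllocator(std::uint32_t capacity) : m_capacity(capacity) {}

    // Returns a free slot, or `invalid` when the buffer is full.
    std::uint32_t allocate()
    {
        if (!m_freeList.empty())
        {
            const std::uint32_t slot = m_freeList.back();
            m_freeList.pop_back();
            return slot;
        }
        if (m_next < m_capacity)
            return m_next++;
        return invalid;
    }

    // Makes a previously allocated slot available again.
    void release(std::uint32_t slot)
    {
        assert(slot < m_next);
        m_freeList.push_back(slot);
    }

private:
    std::uint32_t m_capacity;
    std::uint32_t m_next = 0;
    std::vector<std::uint32_t> m_freeList;
};
```

Because released slots are reused before fresh ones, adding and removing nodes never shifts other nodes' data.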
Use Timothy Lottes' method of hierarchical raytracing: instead of launching N dispatches for N hierarchy levels, just recompute the chain if the parent is not ready. Pseudocode:
restrict readonly buffer RelTransformSSBO
{
    mat4x3 rel_tform[];
};
restrict coherent buffer ParentReadySSBO
{
    uint parent_array[];
};
restrict coherent buffer AbsTransformSSBO
{
    mat4x3 abs_tform[];
};
restrict writeonly buffer AuxSSBO
{
    AuxOutputStruct aux[];
};
const uint STRIP_READY_FLAG = ~READY_FLAG;
// method number 1 - save on registers, do duplicate work
uint parent_ready = parent_array[selfID]; // parent_array holds {bits 0-30: parent_abs_offset, bit 31: self_ready}
uint parentID = parent_ready&STRIP_READY_FLAG;
if ((parent_ready&READY_FLAG) == 0u) // actually checking self
{
    mat4x3 absoluteTForm = rel_tform[selfID];
    // walk up the hierarchy until an ancestor with a ready absolute transform is found
    for (parent_ready=parent_array[parentID]; (parent_ready&READY_FLAG)==0u; parent_ready=parent_array[parentID])
    {
        absoluteTForm = rel_tform[parentID]*absoluteTForm;
        parentID = parent_ready&STRIP_READY_FLAG;
    }
    if (parentID!=(0xdeadbeefu&STRIP_READY_FLAG)) // "no parent" sentinel, masked the same way as parentID
        absoluteTForm = abs_tform[parentID]*absoluteTForm; // parent on the left, as in the loop
    abs_tform[selfID] = absoluteTForm;
    memoryBarrierBuffer(); // make the transform visible before publishing the ready flag
    parent_array[selfID] |= READY_FLAG;
    aux[selfID] = computeAuxData(absoluteTForm);
}
// method number 2 - spend registers
// explore up to first ready node
assert(MAX_HIERARCHY_LEN>1);
uint nodeID[MAX_HIERARCHY_LEN];
nodeID[0] = selfID;
int firstReady = 0;
while (firstReady<MAX_HIERARCHY_LEN-1)
{
    uint parent_ready = parent_array[nodeID[firstReady]];
    if ((parent_ready&READY_FLAG)!=0u)
        break;
    nodeID[++firstReady] = parent_ready&STRIP_READY_FLAG;
}
// compute transforms
if (firstReady!=0)
{
    mat4x3 absoluteTForm = abs_tform[nodeID[firstReady]];
    // walk back down from the ready ancestor, storing every intermediate result
    for (int i=firstReady-1; i>=0; i--)
    {
        uint thisID = nodeID[i];
        absoluteTForm = absoluteTForm*rel_tform[thisID];
        abs_tform[thisID] = absoluteTForm;
    }
    memoryBarrierBuffer(); // make the transforms visible before publishing the ready flags
    for (int i=firstReady-1; i>=0; i--)
        set_ready(nodeID[i]);
    aux[nodeID[0]] = computeAuxData(absoluteTForm);
}
To make adding and removing nodes fast, we can let go of the requirement that each hierarchy level sit in a contiguous area of memory (we don't have to shift everything when adding and removing instances - like Skinning Manager, unlike Instanced Mesh). However, everything would run faster if higher-level nodes generally had lower indices.
There shall be a scene::IGPUTransformManager which shall run once per frame, after all relative transforms have been updated.
For CPU operations that need to know absolute transforms, a similar approach to method 2 could be used on the CPU (with a separate backing buffer; a NaN absolute transform could signal unreadiness).
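A rough CPU-side sketch of that idea, assuming a row-major 3x4 affine matrix and a NaN in the first element as the "not ready" marker. `Affine`, `CpuTransforms`, `kNoParent` and the helper functions are hypothetical names chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Row-major 3x4 affine transform; the bottom row (0,0,0,1) is implicit.
struct Affine
{
    std::array<float, 12> m;
};

inline Affine identity()
{
    return {{1.f, 0.f, 0.f, 0.f,  0.f, 1.f, 0.f, 0.f,  0.f, 0.f, 1.f, 0.f}};
}

// Returns a*b, i.e. the transform that applies b first and then a.
inline Affine concat(const Affine& a, const Affine& b)
{
    Affine r{};
    for (int row = 0; row < 3; ++row)
    {
        for (int col = 0; col < 4; ++col)
        {
            float sum = (col == 3) ? a.m[row * 4 + 3] : 0.f;
            for (int k = 0; k < 3; ++k)
                sum += a.m[row * 4 + k] * b.m[k * 4 + col];
            r.m[row * 4 + col] = sum;
        }
    }
    return r;
}

constexpr std::uint32_t kNoParent = 0xffffffffu; // assumed root marker

struct CpuTransforms
{
    std::vector<std::uint32_t> parent; // kNoParent for roots
    std::vector<Affine> rel;
    std::vector<Affine> abs; // abs[i].m[0] is NaN while node i is not ready

    bool ready(std::uint32_t id) const { return !std::isnan(abs[id].m[0]); }
    void invalidate(std::uint32_t id) { abs[id].m[0] = std::numeric_limits<float>::quiet_NaN(); }

    const Affine& absolute(std::uint32_t id)
    {
        // Walk up until a ready ancestor (or the top of the hierarchy) is found.
        std::vector<std::uint32_t> chain;
        std::uint32_t cur = id;
        while (cur != kNoParent && !ready(cur))
        {
            chain.push_back(cur);
            cur = parent[cur];
        }
        Affine acc = (cur == kNoParent) ? identity() : abs[cur];
        // Walk back down, caching every absolute transform on the way.
        for (auto it = chain.rbegin(); it != chain.rend(); ++it)
        {
            acc = concat(acc, rel[*it]);
            abs[*it] = acc;
        }
        return abs[id];
    }
};
```

Querying a leaf fills in the absolute transforms of all its unready ancestors as a side effect, so later queries along the same chain return immediately.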
When relative transforms update, the change needs to be propagated to the GPU-side buffer, then all child transforms must be marked as "not-ready". Ideally a compute shader should scatter the updated values (avoiding a buffer copy of redundant ranges), as we emphasise the performance of small updates rather than whole-scene resets.
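The "mark all children as not-ready" step can be sketched on the CPU as below. This assumes every parent has a lower index than its children, which the thread only states as a preference, so treat it as an assumption; `propagateDirty` and `kNoParent` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kNoParent = 0xffffffffu; // assumed root marker

// Marks every descendant of an already dirty node as dirty and returns the
// total number of dirty nodes. Assumes parent[i] < i for every non-root node,
// so a single ascending sweep visits parents before their children.
inline std::size_t propagateDirty(const std::vector<std::uint32_t>& parent, std::vector<bool>& dirty)
{
    assert(parent.size() == dirty.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < parent.size(); ++i)
    {
        const std::uint32_t p = parent[i];
        if (p != kNoParent)
        {
            assert(p < i);
            if (dirty[p])
                dirty[i] = true;
        }
        if (dirty[i])
            ++count;
    }
    return count;
}
```

Without the ordering assumption, the sweep would have to be repeated until no new node is marked, or replaced by a walk over explicit child lists.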
volatile or coherent? coherent and memoryBarrierBuffer
Without any culling at all, the command buffer would look like this:
bindPipeline(pipeline[0]);
bindIndirectBuffer(indirectdrawbuffer[0]);
if (supported)
    bindIndirectBuffer(indirectparameterbuffer[0]);
bindDescriptors(N,descriptors[0]);
if (supported)
    multiDrawIndirect(MODE,TYPE,drawBufferOffset[0],indirectParameterOffset[0],maxDraws,stride[0]);
else
    multiDrawIndirect(MODE,TYPE,drawBufferOffset[0],maxDraws,stride[0]);
...
Each command consumed by an MDI call is either:
typedef struct {
    uint count;
    uint instanceCount;
    uint first;
    uint baseInstance;
} DrawArraysIndirectCommand;
or
typedef struct {
    uint count;
    uint instanceCount;
    uint firstIndex;
    uint baseVertex;
    uint baseInstance;
} DrawElementsIndirectCommand;
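For host-side code, the two layouts can be mirrored as below. This is only a sketch with the field types taken from the listing above; `commandSizeInUints` is a hypothetical helper that gives the command size when the buffer is addressed as an array of 32-bit words.

```cpp
#include <cassert>
#include <cstdint>

struct DrawArraysIndirectCommand
{
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t first;
    std::uint32_t baseInstance;
};

struct DrawElementsIndirectCommand
{
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::uint32_t baseVertex;
    std::uint32_t baseInstance;
};

// The structs are tightly packed: 4 and 5 32-bit words respectively.
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "expected 4 packed uints");
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "expected 5 packed uints");

// Hypothetical helper: size of one command in 32-bit words.
constexpr std::uint32_t commandSizeInUints(bool isDrawElements)
{
    return isDrawElements ? 5u : 4u;
}
```

In both layouts `instanceCount` sits at word index 1, which is why a culling shader can increment it without knowing the draw type.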
We want a good distribution of work, so each instance should cull itself independently. So we need to know:
struct MultiDraw
{
    uint drawOffset;
    uint stride;
    uint countOffset;
};
restrict readonly buffer DLUT // runtime buffer
{
    MultiDraw multidraw[];
} DrawLookUpTable;
restrict coherent buffer MDIB_Out
{
    uint data[];
} MultiDrawIndirectBufferOut;
restrict coherent buffer MDIB_In
{
    uint data[];
} MultiDrawIndirectBufferIn;
void main()
{
    uint multiDrawID; // the index in the total number of MDI commands
    uint localDrawID; // the index in the MDI command
    uint instanceID; // the index in the DI command
    MultiDraw multidraw = DrawLookUpTable.multidraw[multiDrawID];
    uint mdiStructOffset = (multidraw.drawOffset&DRAW_OFFSET);
    bool isDrawElements = (multidraw.drawOffset&DRAW_TYPE)==DRAW_ELEMENTS;
    bool isCulled = cull(multiDrawID,localDrawID,instanceID);
    if (!isCulled)
    {
        // locate this draw's command struct in the input and output buffers
        uint inOffset = mdiStructOffset+multidraw.stride*localDrawID;
        uint outOffset = mdiStructOffset+multidraw.stride*localDrawID;
        // count
        MultiDrawIndirectBufferOut.data[outOffset+0u] = MultiDrawIndirectBufferIn.data[inOffset+0u];
        // instanceCount, every surviving instance claims the next output slot
        uint newInstanceID = atomicAdd(MultiDrawIndirectBufferOut.data[outOffset+1u],1u);
        // first / firstIndex
        MultiDrawIndirectBufferOut.data[outOffset+2u] = MultiDrawIndirectBufferIn.data[inOffset+2u];
        // baseInstance / baseVertex
        MultiDrawIndirectBufferOut.data[outOffset+3u] = MultiDrawIndirectBufferIn.data[inOffset+3u];
        // baseInstance (indexed draws only)
        if (isDrawElements)
            MultiDrawIndirectBufferOut.data[outOffset+4u] = MultiDrawIndirectBufferIn.data[inOffset+4u];
        // move instance data to proper offsets
    }
}
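The atomic instance-count trick above can be tried out on the CPU. The sketch below is a hypothetical stand-in for one draw: `CulledDraw`, `cullInstance` and the `survivors` array are illustrative names, and the final "move instance data" step is reduced to storing the original instance ID.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// State of one draw after culling. instanceCount is atomic because every
// surviving instance increments it independently, like atomicAdd in the shader.
struct CulledDraw
{
    std::atomic<std::uint32_t> instanceCount{0};
    std::vector<std::uint32_t> survivors; // compacted instance IDs, sized for the worst case

    explicit CulledDraw(std::uint32_t maxInstances) : survivors(maxInstances) {}
};

// Processes one instance: if it survives the test, it claims the next free
// output slot and stores its ID there.
template <typename CullFn>
void cullInstance(CulledDraw& out, std::uint32_t instanceID, CullFn&& isCulled)
{
    if (isCulled(instanceID))
        return;
    const std::uint32_t newInstanceID = out.instanceCount.fetch_add(1u);
    assert(newInstanceID < out.survivors.size());
    out.survivors[newInstanceID] = instanceID;
}
```

Run sequentially, the survivors keep their original order. On the GPU, or with several threads, the slot each instance receives is not deterministic, so anything indexed by the new instance ID has to be moved along with it.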
#ifdef ARB_shader_draw_parameters
restrict readonly buffer IDBuffer
{
    uvec2 globalInstanceID[];
};
#else
in uvec2 instance_mesh_IDs; // fake with gl_BaseInstance = gl_DrawID+pushConstants.BaseDraw
#endif
void main()
{
    ...
#ifdef ARB_shader_draw_parameters
    uvec2 instance_mesh_IDs = globalInstanceID[gl_DrawID+pushConstants.BaseDraw];
#endif
    ...
}
Can we settle on a single uint for instance_mesh_IDs?
If we reserve N (N==meshCount) mesh-instance data entries together with the per-instance data when allocating, it might be fine.
struct {uint count; uint offsetToKeyframeData; float timestamp[];};
The first and second buffers can probably be merged.
Map gl_GlobalInvocationIndex to (cameraIndex/viewIndex, parameterIndex, drawIndex, instanceIndex, objectID, meshBufferID, uniqueDrawInstanceID), implicitly or explicitly.
uniqueDrawInstanceID data, such as the fully concatenated projection, view and inverse matrices (at least 192 bytes per draw).
CameraID is to get the view and projection matrices (plus other global stuff).
ParameterIndex can retrieve per-pipeline things.
ObjectID (pseudo-nodeID) is to get the absolute world transformation.
MeshBufferID is to get the bounding box and any other data necessary for culling.
uniqueDrawInstanceID obtains the output index for successfully computed data.
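One possible explicit mapping, shown only for the (drawIndex, instanceIndex) part of the tuple, is a binary search over a prefix sum of per-draw instance counts. This is a sketch under that assumption; `DrawInstance`, `decode` and the `offsets` table are hypothetical and not taken from the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct DrawInstance
{
    std::uint32_t drawIndex;
    std::uint32_t instanceIndex;
};

// `offsets` is the exclusive prefix sum of the per-draw instance counts with
// the grand total appended, e.g. counts {2,0,3} give offsets {0,2,2,5}.
inline DrawInstance decode(const std::vector<std::uint32_t>& offsets, std::uint32_t flatIndex)
{
    assert(offsets.size() >= 2 && flatIndex < offsets.back());
    // First offset strictly greater than flatIndex; the draw owning flatIndex
    // is the one just before it. Draws with zero instances are skipped.
    const auto it = std::upper_bound(offsets.begin(), offsets.end(), flatIndex);
    const std::uint32_t drawIndex = static_cast<std::uint32_t>(it - offsets.begin()) - 1u;
    return {drawIndex, flatIndex - offsets[drawIndex]};
}
```

The remaining IDs (object, mesh buffer, camera) would then come from per-draw or per-instance lookup tables indexed by the decoded pair.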
We need the following flags for our Culling Manager:
USE_INDIRECT_PARAMETERS: whether to use gl_DrawID and a parameter buffer
DYNAMIC_COMMAND_BUFFER: whether to re-record the command buffer
DOWNLOAD_RESULTS (requires DYNAMIC_COMMAND_BUFFER): whether to download the culling results and skip some pipeline binds and draws
This shall be a 1 million draw-call-equivalent capable system.
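The flag dependency can be captured in a small validity check. The bit values below are arbitrary assumptions, and `flagsAreValid` is a hypothetical helper; only the flag names and the "requires" relation come from the list above.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions are assumptions made for this sketch.
enum CullingFlags : std::uint32_t
{
    USE_INDIRECT_PARAMETERS = 1u << 0,
    DYNAMIC_COMMAND_BUFFER  = 1u << 1,
    DOWNLOAD_RESULTS        = 1u << 2
};

// DOWNLOAD_RESULTS is only meaningful when the command buffer is re-recorded.
constexpr bool flagsAreValid(std::uint32_t flags)
{
    return (flags & DOWNLOAD_RESULTS) == 0u || (flags & DYNAMIC_COMMAND_BUFFER) != 0u;
}
```

A manager could run this check at creation time and reject invalid combinations early.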
Could set up an extra parameter/query to forego updating absolute transforms and animations if we know for certain that the node and all of its children will not be visible.
Scene Node hierarchies are updated, nodes are animated from armatures if necessary.
Afterwards, cameras/views enqueue the meshbuffers into their command buffers to be drawn later.
First pass updates the per-instance-per-meshbuffer bounding boxes for skinned meshes (could be folded into the first stage).
Second pass modifies multi-draw indirect argument and parameter buffers by culling per-instance-per-meshbuffer-per-view.
Skinning is performed only if an update is required (determined by movement of bones) and some drawcall of the instance is visible (in any view at all). The skinning is performed in world space and per-vertex.
Then primitive culling is performed per index tuple (depends on primitive type). Triangle broadcast is followed by per-view transform and culling.
It is imperative that triangle culling can be disabled for some draw calls.
Turn the primitive culling into a DispatchIndirect, run a cluster cull (view cone, frustum, hiZ) against groups of 256 primitives.
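As a small worked example of the arithmetic involved, the workgroup count for an indirect dispatch over clusters of 256 primitives could be computed as below. The struct mirrors the three-uint layout of an indirect dispatch command; `kClusterSize` and `clusterCullDispatch` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Three consecutive uints, as read by an indirect compute dispatch.
struct DispatchIndirectCommand
{
    std::uint32_t num_groups_x;
    std::uint32_t num_groups_y;
    std::uint32_t num_groups_z;
};

constexpr std::uint32_t kClusterSize = 256u; // primitives per cluster, from the text above

// One workgroup per cluster, rounding up so a partial last cluster is covered.
constexpr DispatchIndirectCommand clusterCullDispatch(std::uint32_t primitiveCount)
{
    return {primitiveCount / kClusterSize + (primitiveCount % kClusterSize != 0u ? 1u : 0u), 1u, 1u};
}
```

A draw with no surviving primitives yields zero workgroups, so the dispatch becomes a no-op without CPU involvement.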
Need to remove zero-instance or zero-index-count draw calls from MDI lists, and optionally remove entire MDI calls and pipeline binds from the command buffer if DOWNLOAD_RESULTS is present.
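The removal of empty draws can be sketched on the CPU as an order-preserving compaction. This only illustrates the intent; a GPU version would more likely use a prefix sum or an atomic counter. `compactDraws` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawElementsIndirectCommand
{
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::uint32_t baseVertex;
    std::uint32_t baseInstance;
};

// Drops every draw that would render nothing (no indices or no instances),
// keeps the order of the remaining draws and returns the new draw count.
inline std::size_t compactDraws(std::vector<DrawElementsIndirectCommand>& draws)
{
    const auto newEnd = std::remove_if(draws.begin(), draws.end(),
        [](const DrawElementsIndirectCommand& d) { return d.count == 0u || d.instanceCount == 0u; });
    draws.erase(newEnd, draws.end());
    return draws.size();
}
```

The returned count is what would be written to the parameter buffer, or used to decide whether the whole MDI call can be skipped.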
Command Buffer containing the render commands after compaction is fired off.
Triangle and cluster culling should be outside the scope of the raytracing example.
Allow for Per-View-Draw cull etc. for multiple views, but not needing all the drawcalls for a view (opens up better scheduling).
We have something that works nicely.
We will work on getting a serialized Mitsuba scene to display, and be rendered with:
Division of Work (very rough, missing issues):