RossNordby opened 12 months ago
Note: there are two single threaded paths for partitioning. One uses a pingpong buffer to avoid having to do in-place shuffles. It's marginally faster and is used by default when a `BufferPool` is provided to the build. The other works in place and avoids the need for additional allocations at a slight performance penalty. These two implementations result in different subtree orders. Users seeking determinism across multiple binned builds should pick one of the two and stick with it (probably just use the `BufferPool` path).
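For illustration, the pingpong approach can be sketched roughly like this (all names here are hypothetical, not the engine's actual internals; the in-place variant would instead shuffle within a single buffer):

```csharp
//Hypothetical sketch of single threaded pingpong partitioning: one scan writes
//left-child subtrees from the front and right-child subtrees from the back of a
//scratch buffer, avoiding in-place shuffles.
static int PartitionWithPingpong(ReadOnlySpan<Subtree> subtrees, Span<Subtree> scratch, int splitBin)
{
    int nextLeft = 0;
    int nextRight = scratch.Length;
    for (int i = 0; i < subtrees.Length; ++i)
    {
        if (subtrees[i].Bin < splitBin)
            scratch[nextLeft++] = subtrees[i];
        else
            scratch[--nextRight] = subtrees[i];
    }
    //nextLeft is the left child's subtree count. The right child's subtrees land
    //reversed relative to the in-place variant, but the order is fully determined
    //by the input order, so repeated single threaded builds match each other.
    return nextLeft;
}
```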
By default, internally multithreaded binned builds are nondeterministic. The first point where nondeterminism enters the process is multithreaded partitioning.
Partitioning is responsible for moving all pending subtrees into contiguous regions, one for each child of the current node. The multithreaded implementation uses interlocked operations to deposit blocks of subtrees into children. The final order of subtrees for each child is undefined.
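As a rough sketch of why the order is undefined (hypothetical names, not the engine's actual implementation): each worker reserves a contiguous slot range in the target child's region with an interlocked add, then copies its block in. Which worker wins each race varies run to run.

```csharp
//Hypothetical sketch of the interlocked block deposit.
//Each block stays contiguous, but the order of blocks depends on thread timing.
static int DepositBlock(ref int childCount, Span<Subtree> childRegion, ReadOnlySpan<Subtree> block)
{
    //Interlocked.Add returns the updated count; subtract to get this block's start index.
    var start = Interlocked.Add(ref childCount, block.Length) - block.Length;
    block.CopyTo(childRegion.Slice(start, block.Length));
    return start;
}
```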
This can lead to a different topology in two ways:

1. In the terminal two-subtree case (`subtreeCount == 2`), no binning process or sort is applied and the order of the children is defined by the order that the subtrees were provided. Since that order depends on partitioning, the result varies.
2. The microsweep's sort does not break ties between equal sort keys deterministically, so tied subtrees can land in an order that depends on the (nondeterministic) order they arrived in.

Addressing number 1 is pretty easy. For example, modify the terminal node sections to call a function like:
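(The snippet that originally followed isn't preserved here; a hypothetical sketch of the idea, ordering the two children by a deterministic key such as original subtree index, might look like:)

```csharp
//Hypothetical sketch; the actual function may differ.
//Gives subtreeCount == 2 nodes a canonical child order regardless of the
//(nondeterministic) order partitioning delivered the subtrees in.
static void CreateDeterministicTerminalNode(ref Subtree a, ref Subtree b)
{
    if (a.Index > b.Index)
    {
        var temp = a;
        a = b;
        b = temp;
    }
}
```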
Number 2 is more difficult. By default, microsweep uses a small vectorized counting sort. In order for it to take into account additional tiebreaker state (like the index), you need to include the index in the least significant bits of the sort keys. Bumping up to 64 bits cuts the performance of the vectorized sort by half and compressing into 32 bits would be lossy.
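For concreteness, embedding the index in the low bits of a widened key might look like this (a sketch, not the library's actual key layout):

```csharp
//Hypothetical sketch: widen a 32 bit sort key to 64 bits with the subtree index
//in the least significant bits acting as a deterministic tiebreaker.
static ulong CreateTiebrokenKey(uint sortKey, uint subtreeIndex)
{
    return ((ulong)sortKey << 32) | subtreeIndex;
}
```

Sorting these 64 bit keys produces the same order as sorting the original keys with the index as a tiebreaker, at the cost of the roughly half-speed wider sort described above.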
The nonvectorized comparison-based sort fallback can be easily modified with the tiebreaker, but it is a chunk slower. For example:
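(The example that originally followed isn't preserved here; the comparison would just fall back to the index when keys tie. A hypothetical sketch:)

```csharp
//Hypothetical sketch of a tiebroken comparison for the fallback comparison sort;
//equal sort keys fall back to subtree index so the final order is deterministic.
static int CompareWithTiebreaker(in SubtreeEntry a, in SubtreeEntry b)
{
    var keyComparison = a.Key.CompareTo(b.Key);
    return keyComparison != 0 ? keyComparison : a.Index.CompareTo(b.Index);
}
```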
To address the partitioning problem directly, the easiest solution is to disable multithreaded partitioning. Another option would be to synchronize the accumulation of per-thread partitioning results so that the final order matches the single threaded result; that would come with a pretty minor performance penalty.
I'm attempting to speedrun a bunch of stuff, so I'm not going to do that at the moment. Instead, I'll give `BinnedBuild` a `deterministic` parameter that will (for now) simply disable multithreaded partitioning.

The broad phase is the main relevant user of this codepath within the engine. While the broad phase's topology is not required to be deterministic for collision constraints to be deterministic (the narrow phase flush performs a sort on new constraints anyway, because the collision test was not built to deliver results in a deterministic order), nondeterministic topology could result in some unexpected result orders for queries like ray casts. So, when `Simulation.Deterministic` is set, the broad phase will force deterministic refinements.

I don't anticipate this being very noticeable in the broad phase use case. Typically, many refinements are being executed in parallel. It's actually pretty rare to benefit from more threads being dedicated to a single refinement since they're not very big. Also, refinements are just kinda cheap.