bepu / bepuphysics2

Pure C# 3D real time physics simulation library, now with a higher version number.
Apache License 2.0

`Tree.BinnedBuild` multithreaded determinism options #276

Open RossNordby opened 12 months ago

RossNordby commented 12 months ago

By default, internally multithreading a binned build is nondeterministic. The first instance of nondeterminism being introduced in the process is in multithreaded partitioning.

Partitioning is responsible for moving all pending subtrees into contiguous regions, one for each child of the current node. The multithreaded implementation uses interlocked operations to deposit blocks of subtrees into children. The final order of subtrees for each child is undefined.
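The effect is easy to reproduce in isolation. The sketch below (hypothetical names, not the engine's partitioning code) has workers atomically reserve non-overlapping slots in a shared child region with `Interlocked.Add`; the slots never collide, but which worker claims the first slot depends on thread scheduling, so the order of subtrees within the child varies run to run:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class InterlockedDepositSketch
{
    //Four workers each deposit a two-subtree block into a shared child region.
    //Interlocked.Add reserves a non-overlapping slot for each block, but says
    //nothing about which block lands first, so the final subtree order is undefined.
    public static int[] Deposit()
    {
        var child = new int[8];
        int childCount = 0;
        Parallel.For(0, 4, worker =>
        {
            var block = new[] { worker * 2, worker * 2 + 1 };
            int start = Interlocked.Add(ref childCount, block.Length) - block.Length;
            Array.Copy(block, 0, child, start, block.Length);
        });
        return child;
    }
}
```

The multiset of deposited subtrees is always the same; only their order within the child is scheduling-dependent.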

This can lead to a different topology in two ways:

  1. When creating terminal nodes (subtreeCount == 2), no binning process or sort is applied and the order of the children is defined by the order in which the subtrees were provided. Since that order depends on partitioning, the result varies.
  2. Microsweeps can choose partitions that depend on the order of subtrees when multiple subtrees have equal centroid positions along the measured axis. Subtrees with equal centroids may fall into one child or the other depending on the order.

Addressing number 1 is pretty easy. For example, modify the terminal node sections to call a function like:

        static void BuildTerminalNode<TLeaves>(Buffer<NodeChild> subtrees, Buffer<Node> nodes, Buffer<Metanode> metanodes, int nodeIndex, int parentNodeIndex, int childIndexInParent, ref TLeaves leaves)
            where TLeaves : unmanaged
        {
            //Multithreaded partitioning (or simply using a different partitioning path) can result in different orderings of subtrees passed down to each child node.
            //This *mostly* doesn't affect topology; the binning process is locally deterministic and the same set of children will be created.
            //BUT: if there are only two subtrees, we immediately build a node out of them. The order we received the children then affects the A/B positioning of the subtrees.
            //So, to avoid nondeterminism, we'll choose to sort the subtrees within this node by the subtree's index.
            ref var subtree0 = ref subtrees[0];
            ref var subtree1 = ref subtrees[1];
            var shouldSwap = subtree0.Index > subtree1.Index; //Note that there cannot be ties!
            ref var subtreeA = ref shouldSwap ? ref subtree1 : ref subtree0;
            ref var subtreeB = ref shouldSwap ? ref subtree0 : ref subtree1;
            BuildNode(
                Unsafe.As<NodeChild, BoundingBox4>(ref subtreeA), Unsafe.As<NodeChild, BoundingBox4>(ref subtreeB),
                subtreeA.LeafCount, subtreeB.LeafCount, subtrees, nodes, metanodes, nodeIndex, parentNodeIndex, childIndexInParent, 1, 1, ref leaves, out _, out _);
        }

Number 2 is more difficult. By default, the microsweep uses a small vectorized counting sort. For it to take additional tiebreaker state (like the index) into account, the index needs to be included in the least significant bits of the sort keys. Bumping the keys up to 64 bits cuts the performance of the vectorized sort in half, and compressing the centroid plus index into 32 bits would be lossy.
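For illustration, a standalone sketch of the 64-bit key widening described above (`PackKey` is a hypothetical helper, not engine code): an order-preserving 32-bit mapping of the float centroid goes in the high bits and the subtree index in the low bits, so equal centroids fall back to index order.

```csharp
using System;

static class KeyPackingSketch
{
    //Pack a float centroid into the high 32 bits of an unsigned key, with the
    //subtree index as a tiebreaker in the low 32 bits.
    public static ulong PackKey(float centroid, int subtreeIndex)
    {
        uint bits = (uint)BitConverter.SingleToInt32Bits(centroid);
        //Standard radix sort trick: flip all bits of negative floats, set the sign
        //bit on non-negative floats, so unsigned comparison matches float order.
        bits = (bits & 0x8000_0000u) != 0 ? ~bits : bits | 0x8000_0000u;
        return ((ulong)bits << 32) | (uint)subtreeIndex;
    }
}
```

Keys built this way sort identically regardless of input order, which is the property the vectorized sort would need; the cost is exactly the 64-bit widening penalty mentioned above.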

The nonvectorized comparison-based sort fallback can easily be modified with the tiebreaker, but it's a good chunk slower. For example:

        struct BoundsComparerX : IComparerRef<NodeChild>
        {
            public int Compare(ref NodeChild a, ref NodeChild b)
            {
                var aCentroid = a.Min.X + a.Max.X;
                var bCentroid = b.Min.X + b.Max.X;
                if (aCentroid == bCentroid)
                    return a.Index.CompareTo(b.Index);
                else
                    return aCentroid > bCentroid ? -1 : 1;
            }
        }
        struct BoundsComparerY : IComparerRef<NodeChild>
        {
            public int Compare(ref NodeChild a, ref NodeChild b)
            {
                var aCentroid = a.Min.Y + a.Max.Y;
                var bCentroid = b.Min.Y + b.Max.Y;
                if (aCentroid == bCentroid)
                    return a.Index.CompareTo(b.Index);
                else
                    return aCentroid > bCentroid ? -1 : 1;
            }
        }
        struct BoundsComparerZ : IComparerRef<NodeChild>
        {
            public int Compare(ref NodeChild a, ref NodeChild b)
            {
                var aCentroid = a.Min.Z + a.Max.Z;
                var bCentroid = b.Min.Z + b.Max.Z;
                if (aCentroid == bCentroid)
                    return a.Index.CompareTo(b.Index);
                else
                    return aCentroid > bCentroid ? -1 : 1;
            }
        }

To address the partitioning problem directly, the easiest solution is to disable multithreaded partitioning. Another option would be to synchronize the accumulation of per-thread partitioning efforts so that the order matches the single-threaded result. That would come with a pretty minor performance penalty.
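A minimal sketch of that second option (hypothetical names and work split, not the engine's code): each worker accumulates into its own local buffer over a contiguous slice, and a serial pass then concatenates the buffers in worker order, so the output order matches what a single-threaded scan would produce regardless of scheduling.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class OrderedPartitionSketch
{
    //Collect all values below the pivot. Each worker filters its own contiguous
    //slice into a local list; concatenating the lists in worker order reproduces
    //the single-threaded scan order, at the cost of a serial merge pass.
    public static int[] PartitionLeft(int[] subtrees, int pivot, int workerCount)
    {
        var perWorker = new List<int>[workerCount];
        Parallel.For(0, workerCount, worker =>
        {
            var local = new List<int>();
            int start = worker * subtrees.Length / workerCount;
            int end = (worker + 1) * subtrees.Length / workerCount;
            for (int i = start; i < end; ++i)
                if (subtrees[i] < pivot)
                    local.Add(subtrees[i]);
            perWorker[worker] = local;
        });
        var result = new List<int>();
        foreach (var block in perWorker)
            result.AddRange(block);
        return result.ToArray();
    }
}
```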

I'm attempting to speedrun a bunch of stuff, so I'm not going to do that at the moment. Instead, I'll give the BinnedBuild a deterministic parameter that will (for now) simply disable multithreaded partitioning.

The broad phase is the main relevant user of this codepath within the engine. The broad phase's topology is not required to be deterministic for collision constraints to be deterministic (the narrow phase flush performs a sort on new constraints anyway, because the collision tests were not built to deliver results in a deterministic order), but nondeterministic topology could result in some unexpected result orders for queries like ray casts. So, when Simulation.Deterministic is set, the broad phase will force deterministic refinements.

I don't anticipate this being very noticeable in the broad phase use case. Typically, many refinements are being executed in parallel. It's actually pretty rare to benefit from more threads being dedicated to a single refinement since they're not very big. Also, refinements are just kinda cheap.

RossNordby commented 12 months ago

Note: there are two single threaded paths for partitioning. One uses a pingpong buffer to avoid having to do in-place shuffles. It's marginally faster and is used by default when a BufferPool is provided to the build. The other works in-place and avoids the need for additional allocations for a slight penalty. These two implementations result in different subtree orders. Users seeking determinism across multiple binned builds should pick one of the two and stick with it (probably just use the BufferPool path).
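As a standalone illustration of why the two single threaded paths yield different orders (hypothetical names, not the engine's code): a pingpong-style partition can copy stably into a fresh buffer, preserving relative order within each side, while an in-place partition swaps elements from both ends and so permutes that order.

```csharp
using System;

static class PartitionOrderSketch
{
    //Stable "pingpong" style partition: copy values below the pivot, then the
    //rest, into a fresh buffer. Relative order within each side is preserved.
    public static int[] PingPongPartition(int[] values, int pivot)
    {
        var result = new int[values.Length];
        int count = 0;
        foreach (var v in values) if (v < pivot) result[count++] = v;
        foreach (var v in values) if (v >= pivot) result[count++] = v;
        return result;
    }

    //In-place two-pointer partition: swapping from both ends avoids the scratch
    //allocation, but the swaps permute the relative order within each side.
    public static int[] InPlacePartition(int[] values, int pivot)
    {
        var result = (int[])values.Clone();
        int i = 0, j = result.Length - 1;
        while (true)
        {
            while (i < j && result[i] < pivot) ++i;
            while (i < j && result[j] >= pivot) --j;
            if (i >= j) break;
            (result[i], result[j]) = (result[j], result[i]);
        }
        return result;
    }
}
```

Both produce the same left/right membership, but the sequences within each side differ, which is exactly why mixing the two paths across builds breaks cross-build determinism.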