bepu / bepuphysics2

Pure C# 3D real time physics simulation library, now with a higher version number.
Apache License 2.0
2.35k stars 272 forks source link

Fallback batch improvements #162

Open RossNordby opened 2 years ago

RossNordby commented 2 years ago

2.4 will swap the jacobi fallback batch out for a sequential fallback solver. The sequential fallback solver has better convergence per CPU cycle, but cannot benefit from multithreading. The end result is that having many constraints in the fallback batch has a larger performance impact but produces far more consistent results. Simulation results are no longer heavily dependent on constraint add order when the fallback batch is in use. (See here for more information: https://twitter.com/RossNordby/status/1450607193284390913)

Bundles are still internally vectorized; sequential execution for any individual body is guaranteed by preventing two constraints sharing the same body from being inserted into the same bundle. This results in partially filled bundles and constraints are no longer contiguous in memory within the fallback batch's type batches. Allocation/deallocation/sleeping/waking is significantly more complex as a result.

Some possible improvements:

  1. Pack bundles so that constraints are contiguous again, but execution repeats the same bundle multiple times with masking to guarantee sequential execution. Hard to say what the net performance impact would be- a bit more complexity on a per-bundle processing basis, but no need to load 7/8 unused lanes over and over. Memory bandwidth usually isn't a problem for the sequential fallback, but reducing bandwidth could be nice to other asynchronous processes outside of physics. Main benefit is that add/remove would have less special case work.
  2. As an extension to 1, if things turn out sufficiently shared, we could reintroduce the jacobi solve. We removed it just because of the complexity of maintaining two different codepaths with different memory layouts.
  3. Revisit the jacobi solve, but with more intelligent mass splitting heuristics. It would still unavoidably have worse convergence behavior per cycle, but it doesn't have to be quite as bad as the naive 1/N mass splitting version.
  4. Bundle jacobi solve. Like number 1, pack constraints contiguously, but rather than re-executing bundles with masks, treat each bundle as a parallel jacobi solve. This would resort to mass splitting again, but mass splitting over 4 to 16 constraints is way less of a problem than mass splitting over hundreds of constraints.

Leaning towards number 1 and 4. 4 would likely end up outperforming the old jacobi solver in raw throughput and convergence per cycle for most use cases (technically single threaded, but fallback batches rarely have enough bundles to benefit from multithreading). Given the same memory layout, would be easier to swap between them as options. Could possibly bring in less naive mass splitting in number 4 too.

Value of this change is fairly low. Very few simulations end up putting any constraints in the fallback batch at all at default configuration (64 synchronized batches). It could be valuable for corner cases where hundreds or thousands of constraints affect a single dynamic body, and there's no practical way to split the single body up.