bepu / bepuphysics2

Pure C# 3D real time physics simulation library, now with a higher version number.
Apache License 2.0

Consider hybrid solver parallelization #193

Open RossNordby opened 2 years ago

RossNordby commented 2 years ago

The solver currently batches all constraints into sets that share no bodies and executes them in parallel (with some finicky stuff for the fallback batch).
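As a rough conceptual sketch (not the library's actual C# implementation), the body-disjoint batching can be expressed as a greedy pass: each constraint goes into the first batch that doesn't already touch either of its bodies.

```python
# Illustrative greedy body-disjoint batching; names are hypothetical,
# not bepuphysics2 API.
def batch_constraints(constraints):
    """constraints: list of (body_a, body_b) pairs. Returns a list of
    batches, each a list of constraint indices sharing no bodies."""
    batches = []  # list of (constraint_indices, occupied_body_set)
    for i, (a, b) in enumerate(constraints):
        for indices, bodies in batches:
            if a not in bodies and b not in bodies:
                indices.append(i)
                bodies.update((a, b))
                break
        else:
            batches.append(([i], {a, b}))
    return [indices for indices, _ in batches]

# A chain 0-1-2-3 needs two batches: {0-1, 2-3} can run together, 1-2 can't.
print(batch_constraints([(0, 1), (1, 2), (2, 3)]))  # [[0, 2], [1]]
```

Constraints within a batch can then be solved in parallel (and bundled for SIMD) because no two of them write to the same body.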

A common alternative is to solve independent simulation islands in parallel. The naive implementation of this approach can't use multiple threads within a simulation island, so if there's one big pile, performance will suffer horribly compared to the current approach.

The advantage of island-based parallelism (assuming enough islands to actually saturate the machine) is that a single thread can complete the entirety of a solve for all bodies and constraints in an island, often resulting in more local data and less of a memory bandwidth bottleneck. Further, the number of iterations and substeps required can be configured on a per island basis. In other words, if an island is composed of only constraints with low iteration requirements, then the whole island can finish more quickly.
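For concreteness, island identification itself is just connected components over the constraint graph; a minimal union-find sketch (purely illustrative, not the library's sleep/island machinery):

```python
# Hypothetical sketch: finding independent islands with union-find,
# so each island could be handed to a single thread.
def find_islands(num_bodies, constraints):
    parent = list(range(num_bodies))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in constraints:
        parent[find(a)] = find(b)

    islands = {}
    for body in range(num_bodies):
        islands.setdefault(find(body), []).append(body)
    return list(islands.values())

# Two chains and a lone body yield three islands.
print(find_islands(5, [(0, 1), (2, 3)]))  # [[0, 1], [2, 3], [4]]
```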

In principle, the two approaches could be combined. The existing system could be used for all large islands combined, while any islands in the sweet spot could be handled locally. Vector bundles could be executed either across constraints within the island (which requires bookkeeping to avoid constraints in a bundle sharing bodies), or across islands. Either way, island topology changes would require changes in vector bundle layouts, which is not cheap or easy.

There's a lot of complexity involved here. The benefit could be large on a subset of simulations (tons of independent islands that don't tend to interact at all, running on memory bandwidth constrained hardware), but getting there would be very hard. Increases in bookkeeping costs could create new worst case spikes (think about island merging overhead).

Questionable value.

erincatto commented 6 months ago

I think it is worth considering the cell based solver using k-means. It is possible to keep determinism by using a target cell count regardless of thread count. No parallelism is needed within each cell, however you could use graph coloring in the cell to enable SIMD. The cell boundaries could also use graph coloring so they can be solved in parallel and/or SIMD.
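A hedged sketch of the idea: cluster body positions into a fixed target cell count with k-means, seeded deterministically so the result is independent of thread count. The seeding and iteration count here are assumptions for illustration only.

```python
# Illustrative k-means cell assignment over body positions.
# Deterministic regardless of thread count: fixed seeding, fixed cell count.
def kmeans_cells(positions, cell_count, iterations=8):
    # Deterministic seeding: evenly spaced bodies as initial centers.
    step = max(1, len(positions) // cell_count)
    centers = [positions[i * step] for i in range(cell_count)]
    assignment = [0] * len(positions)
    for _ in range(iterations):
        # Assign each body to the nearest center.
        for i, p in enumerate(positions):
            assignment[i] = min(
                range(cell_count),
                key=lambda c: sum((p[d] - centers[c][d]) ** 2 for d in range(3)))
        # Move each center to the mean of its members.
        for c in range(cell_count):
            members = [positions[i] for i in range(len(positions))
                       if assignment[i] == c]
            if members:
                centers[c] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(3))
    return assignment

# Two well-separated clumps split cleanly into two cells.
print(kmeans_cells([(0, 0, 0), (0.1, 0, 0), (10, 0, 0), (10.1, 0, 0)], 2))
```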

Sleeping would be a separate system based on persistent islands. When an island wakes, the bodies are distributed to the cells.

RossNordby commented 6 months ago

Definitely something I've been thinking more about lately.

Building cells

I think I'd probably go for some kind of incremental heuristic-guided graph update to identify subislands/cells. In other words, the boundary between two cells could be scooted body by body by asking whether the trade minimizes a cost metric. Similar concept to SAH-guided sweep builds, but across a graph instead of along spatial axes. This would require a lot more detailed work to fill out; a lot of possible local-greedy heuristics would e.g. collapse into mega-cells due to no single incremental trade away from that state appearing helpful.
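A toy version of that incremental trade, with an intentionally naive cost metric (just the count of boundary-crossing constraints); everything here is an assumption for illustration. Notably, this pure boundary-count metric is exactly the kind of heuristic that would collapse into mega-cells, since merging everything always looks locally helpful; a real metric would also have to penalize cell size.

```python
# Illustrative local-greedy refinement: scoot bodies across a cell boundary
# one at a time when the trade lowers the (naive) cost metric.
def refine_cells(cell_of, constraints, passes=4):
    def boundary_count():
        return sum(1 for a, b in constraints if cell_of[a] != cell_of[b])

    for _ in range(passes):
        improved = False
        for a, b in constraints:
            if cell_of[a] != cell_of[b]:
                before = boundary_count()
                original = cell_of[a]
                cell_of[a] = cell_of[b]  # try pulling a across the boundary
                if boundary_count() >= before:
                    cell_of[a] = original  # revert if not strictly better
                else:
                    improved = True
        if not improved:
            break
    return cell_of

# Body 0 starts alone in cell 0 with both constraints crossing; the trade
# pulls it into cell 1 and eliminates the boundary entirely.
print(refine_cells([0, 1, 1], [(0, 1), (0, 2)]))  # [1, 1, 1]
```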

Cells as generalization

Cells are basically a generalization of island-based solving: island-based solving is the special case that needs no boundary solves. Having no boundaries is pretty nice, and the cell construction heuristic could potentially capture that and opt to eat an entire island to reap those benefits.

Likewise, there's nothing stopping multiple islands from being incorporated into a single cell if there happens to be room. If all the bodies and constraints of those islands fit into cache anyway, might as well; it'll slightly help with vector occupancy.

Per-cell coloring

Maintaining colors on a per-cell basis gets a little bit trickier. The IndexSet-across-all-handles approach gets a bit more expensive; it works fine for ~64 total constraint batches since it's just a bit mask, but if you're targeting 256 cells, each with up to 64 constraint batches, and the simulation contains 2^18 constraints, that's 536 megabytes of bitmasks.
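The arithmetic behind that figure, for reference:

```python
# One bit per (cell, batch, constraint handle):
# 256 cells * 64 batches * 2^18 handles.
bits = 256 * 64 * 2**18
print(bits // 8)  # 536870912 bytes, ~536 megabytes
```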

Some options:

  1. Hash maps. Fairly heavyweight indirection, but would work. The fact that cells would intentionally target cache-compatible sizes makes hash maps a little less nasty than they would otherwise be.
  2. Handle remapping. When a body is allocated to a cell, give it a cell-local handle stored in the body's solver state (we have a few slots left around the inertia tensor). (May be wise to look at the access patterns more closely; being in the solver state may not be that useful.) A body is only ever contained in one cell, so one cell index/handle tuple suffices. Use the same IndexSet style approach as with current coloring, but on the cell-local indices. This drops the total use to maximumConstraintBatchCount \* maximumCellCount \* maximumCellBodyCount. For the previous example of 2^18 constraints, 64 batches, and 256 cells, if we assume there are 2^17 bodies, that brings the cost down to (2^17 / 256) \* 256 \* 64 bit slots, or ~1 megabyte.
  3. Keep using global coloring. While the cell-local colors would ignore the boundary constraints and so would potentially have fewer batches than the global coloring would imply, the difference may not actually matter in practice. On the other hand, as we've already experienced, global colors do get in the way of parallelization of insertion. Might actually be handy to have local colors for that reason, even though you do still have to handle interactions on the boundary.
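A rough sketch of option 2, with hypothetical names: each cell hands out dense local handles from its own pool, and each batch keeps a small bitmask over those local indices instead of over all global handles.

```python
# Illustrative cell-local IdPool + per-batch IndexSet, not bepuphysics2 API.
class Cell:
    def __init__(self, max_bodies, max_batches):
        self.free = list(range(max_bodies - 1, -1, -1))  # local IdPool
        self.local_of = {}                               # global -> local handle
        self.batch_masks = [0] * max_batches             # one IndexSet per batch

    def add_body(self, global_handle):
        local = self.free.pop()
        self.local_of[global_handle] = local
        return local

    def first_available_batch(self, body_handles):
        # Find the first batch where none of the constraint's bodies appear,
        # then mark those bodies as occupied in that batch.
        combined = 0
        for h in body_handles:
            combined |= 1 << self.local_of[h]
        for batch, mask in enumerate(self.batch_masks):
            if mask & combined == 0:
                self.batch_masks[batch] |= combined
                return batch
        raise RuntimeError("out of batches; needs a fallback")

cell = Cell(max_bodies=8, max_batches=4)
for handle in (100, 200, 300):
    cell.add_body(handle)
# A triangle of constraints needs three batches.
print(cell.first_available_batch([100, 200]))  # 0
print(cell.first_available_batch([200, 300]))  # 1
print(cell.first_available_batch([100, 300]))  # 2
```

Since the masks only span a cell's local indices, they stay proportional to cell size rather than to the global handle space, which is where the ~1 megabyte figure above comes from.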

Boundary solves

The boundaries are annoying. The performance argument for this scheme depends pretty much entirely on the ratio of boundary:non-boundary solves. In principle, if all the overhead associated with maintaining the cells can be avoided and all cells/boundaries have decent occupancy, any nonzero amount of non-boundary work is a win. The question is, how much of it do you actually need for it to be a win in practice?

One consideration is how large of a boundary you can have without evicting the cell-local constraints. This seems pretty forgiving; even being in L3 is enough to dramatically improve performance compared to having to roundtrip main memory, and I suspect staying within L2 wouldn't be that hard.

Scheduling boundary solves in a non-wasteful way is another pain point. The entire point is for them to be minimal, but if they're always held until after all cell-local solves are complete, you may end up with a pretty meh degree of boundary parallelism. This may not matter too much (we already pay for it in the globally colored solve, after all; the only additional cost is that now there's a separate local solve dispatch required).

It may be worth trying to get clever with the local solve scheduling. If you can solve all local constraints associated with bodies that have a boundary constraint first, the boundary solve can be dispatched immediately while the rest of the local solves proceed. The boundary solve would have fewer workers allocated to it, but that's exactly what we want; the overhead is proportional to the number of workers involved. Many simulations might get away with 1-4 boundary workers while the rest chew away on the fully sheltered bodies and constraints. This trades a sync point in the middle of the local solve for avoiding the sync point after all local solves, in the hope that it allows better scaling overall; I'm not sure it's a win.
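One way that ordering could be expressed, purely as an illustration: partition a cell's local constraints so that those touching a boundary body form a prefix, and dispatch the boundary solve once that prefix completes.

```python
# Illustrative ordering of a cell's local constraints: boundary-adjacent
# constraints first, sheltered ones after. Names are hypothetical.
def order_local_constraints(local_constraints, boundary_bodies):
    sheltered, exposed = [], []
    for c in local_constraints:
        (exposed if any(b in boundary_bodies for b in c)
         else sheltered).append(c)
    # prefix length = point at which the boundary solve can be dispatched
    return exposed + sheltered, len(exposed)

ordered, prefix = order_local_constraints(
    [(0, 1), (1, 2), (3, 4)], boundary_bodies={2})
print(ordered, prefix)  # [(1, 2), (0, 1), (3, 4)] 1
```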

Bunch more to think about, too.

erincatto commented 6 months ago

Yeah, I think a local IdPool per cell is the way to go. Keep the IndexSet small.

One thing appealing about this approach is that we can experiment with building cells and boundaries and visualize them without actually building the solver. So we can evaluate the performance of building them, the quality of the results, the graph coloring, etc. in stages before fully committing. This code could all live alongside the existing solver; it just wouldn't do anything yet.