The btSoftBody object from the BulletSDK includes an array of Nodes and Links.
These links appear to be first set up to connect a node to between 5 and 6 of
its neighbors [480 links], and then to the rest of the nodes after the
execution of the Floyd-Warshall graph algorithm [another 930 links]. The way
the links are stored by default, we have a number of cases where adjacent links
share a node in common - this leads to the creation of a data dependency
through memory. The PSolve_Links() function reads and writes nodes as it
iterates over each link. So, we now have the possibility of a data dependency
between iteration X that processes link L with iteration X+1 that processes
link L+1 because L and L+1 have one node in common, and iteration X updates the
positions of that node, and iteration X+1 reads in the position of that shared
node.
Such a memory dependency limits the ability of a modern CPU to speculate beyond
a certain point because it has to respect a possible dependency - this prevents
the CPU from making full use of its out-of-order resources. If we re-order the
links such that we minimize the cases where a link L and L+1 share a common
node, we create a temporal gap between when the node position is written, and
when it is subsequently read. This in turn allows the CPU to continue execution
without risking a dependency violation. Such a reordering would result in
significant speedups on modern CPUs with lots of execution resources. In our
testing, we see it have a tremendous impact not only on the A7, but also on all
x86 cores that ship with modern Macs. The attached source file includes a
single function (Reoptimize_Link_Order) which can be called on a btSoftBody
object in the solveConstraints() function before the actual solver is invoked,
or right after generateBendingConstraints() once we have all 1410 links.
Original issue reported on code.google.com by erwin.coumans on 11 Nov 2013 at 6:40
Original issue reported on code.google.com by
erwin.coumans
on 11 Nov 2013 at 6:40Attachments: