Ideas For a High-Performance System Design

Genbox / VelcroPhysics

High performance 2D collision detection system with realistic physics responses.

MIT License

662 stars 114 forks source link

Ideas For a High-Performance System Design #29

Open ilexp opened 7 years ago

ilexp commented 7 years ago

Hey there,

I've recently done a bit of work on physics-related tasks and in the process spent some thoughts on how Farseer performance could be improved.

Performance Observations

First, some assumptions on certain performance aspects that hopefully are uncontroversial enough to agree on as a precondition:

The fastest way to process data is to iterate over an array of structs.
It tends to be faster to bulk-process batches of data in a tight loop than to process one item at a time.
Parallelization has a synchronization overhead, except where no synchronization is necessary.
Executing user code has a guard overhead, except where no context-specific guard is necessary.
There is a non-zero overhead to abstractions like List<T> vs. T[] when it comes to bulk-processing large amounts of data.

Potential Performance Sinks

Next, here is a list of Farseer system design decisions that potentially drag performance below the theoretical maximum:

Shape, Joint and Body are all classes stored in lists or similar data structures.
Users can subscribe to events that are invoked from the middle of the physics simulation.
User events are invoked callback style, item-by-item.
Parallelizing physics simulation internally (i.e. not by putting Farseer into its own thread) is complicated because every object may alter the state of every other.

Design Decision Draft for Maximizing Efficiency

Finally, I'll just throw in some rather radical ideas on how to restructure the overall system to address the above issues:

Make Body a struct and store all bodies in a World-global (potentially sparse) array.
Make Joint a struct (generalized) and store all joints in a World-global (potentially sparse) array.
Make Shape a struct (discriminated union) and store all shapes in a Body-local (dense) array.
Double-buffer all of the above data within World, so a physics step / update can strictly read-only from buffer A (current frame) and write-only to buffer B (next frame). This enables internal parallelization without synchronization using Parallel.For and similar.
Grant users direct access to all of the above World-global arrays so they can perform efficient bulk operations.
References between Body, Joint and Shape take the form of int-based handles that represent the access index in their world.
Instead of event callbacks, gather all occurrences (structs!) to "subscribed" events in an array over an update step and deliver them in bulk once the update step is finished. This enables users to do efficient batch-processing and frees Velcro from the need for state guards and their overhead. See issue #8.
User code that absolutely requires to be executed within the physics frame (such as collision filters, but be really strict here) is required to consist of a pure function that gets all required data read-only via parameter and returns a value.

As a side effect to improved efficiency, less overhead and internal parallel execution, the above changes could also make serialization (see issue #6) a lot easier. It would also make pooling (see issue #5) obsolete.

On the downside, this would require users to adopt a new API and potentially be more careful due to less safety options being around. There will be a bigger need for good documentation so users are primed for certain aspects of the system. This might also generate a large amount of issues that need to be resolved in order to adapt to the new design, but if any of these changes should make it into Velcro, now would likely be a better opportunity to tackle them than later.

I'm totally aware that the above are quite radical suggestions and that I don't have any idea about most of the internal structures and design decisions of Farseer, so please read it as a collection of ideas by an interested observer :)

damian-666 commented 7 years ago

Something like this could be awesome and may open up the possibility of first- and even second-order speedup (N10x or N100x) factors using multicore , SIMD, or GPGPU, and maybe even future reentrant server-side execution for multi-player or cloud use. It would be best to do this on the implementation side without changing existing Body, Shape, Fixture, Joint classes so much as to alienate existing Farseer clients.

For a maintainable component, Callbacks such as CollidesWith and other are not dependent on the nuances of being called from within a tight loop and that this previous frame->next frame state model sounds like a good approach.

The old adage is "Restrain from premature optimizing", but this effort would not be like that as its going for big gains by reducing algorithmic complexity.

GPU acceleration using OpenCL.Net or similar as is done by the BulletPhysics engine which has very rigid and incredibly fast piling allowing for impressive building falling simulations.

With .net we value rapid prototyping, safe code, and steady APIs, and component software. Open source has the tendency to branch and explode in versions unless it is contained. I think for existential reasons Velcro physics should be positioned away from PhysX, Bullet in 2d constrained mode, Jitter-2d, and other physics engines. BulletPhysics piling performance has been shown to be much higher with GPU acceleration and there is a server component, but the cost of integration of any different physics engine to an existing client is high and the inclusion of 3d increases the system complexity by a whole dimension.

From what I have seen, Piling of bullet objects performance in Farseer is not good, issues seen with only 20 items, on a phone, less.

Not using CCD and allowing sticking, tunneling to happen can make physics games buggy enough to not be ship-able. As a physics engine client most of the heavy work I think I have to do is trying to work without CCD when possible, to flip the IsBullet.

As a heavy api user of Farseer, I would forward that Body and Joint and Fixture classes not be overly changed for this if practical. The state could be copied from the vectors. But since we did not upgrade to the very latest version of Farseer because the changes where not compelling enough vs the difficulty of merging, a different data structure or API would not be a major deterrent. My team decided to put DataContract persistence on the properties of Body that are not derivatives of other state so if body were a struct then that fits with this model.

If this is pursued I can produce some links of prior work if that might help.

Genbox commented 7 years ago

Great ideas @ilexp

So far I've tried to be really close to Box2D and only provide some much-needed tools and APIs to make it easier for beginners to use a physics engine. For a clean Box2D, there used to be Box2DX. I will be going in another direction with Velcro Physics and your proposal sounds interesting.

structs are definitely faster than classes for many reasons. However, an engine that uses only structs is really hard to design. Velcro could definitely use structs a lot of places where it currently uses classes, simply due to the fact that they are read-only "data-holding" classes.
Bulk operations on large arrays of structs is fast when we talk about a warm cache, instruction prefetching and memory-copying; but it is not always the way to go. Cache coherency is a big factor and having small structs in the inner loops can greatly speed up things due to cache hits on the CPU.
Branch prediction is yet another big factor. Branches should be moved outside of inner loops as much as possible.
Pipelining is yet another big factor. The CPU can automatically detect prefetch if calculations can be done in parallel

In an unrelated project, I've done some research into RyuJIT64 as well as the CLR runtime and optimizations. Let's just say that I'm deeply troubled by my findings, as it seems even trivial optimizations are not performed. One example is int i = 64 % 8 is slower than int i = 64 & (8 - 1). Usually, for hard coded constants that are power of 2, we can do this trick. The compiler should do it automatically, but it doesn't! Well, at least not according to my benchmarks. I still need to dive down into the actual IL code and perhaps the machine instructions to see what is going on. I tried searching both corefx and coreclr on Github to see if anybody mentioned these issues, but it seems nobody cares.

Genbox commented 7 years ago

@damian-666 I looked into using SIMD to speed up the calculations in the collision resolution code, as it does either 1, 2 or 3 points at the same time. The results were less than impressive on PC. Note that this is based on outdated SIMD instruction sets as I have an Intel 4th generation CPU - it might be very different on newer hardware due to larger registers.

I also tried to build a multi-threaded version of the engine which failed miserably. The engine restricts the number of points in polygons to 8, which keeps the workload low, so any optimization to the collision resolution code did not benefit in the inner loops. MT to the outer loops were also unsuccessful due read/write constraints. @ilexp's proposal to make a step read-only will help a lot with that though.

The CCD has always been very expensive as it substeps inside each step. This is really an algorithm problem rather than an optimization problem.

As for API backwards compatibility, I don't think I will have that with Velcro. Farseer Physics Engine is stable and if it does the job, you should stay with it. I might have to move shapes, fixtures etc. out of the API for safety reasons, and there is just no way I can keep it compatible with FPE's API while doing so.

Genbox commented 7 years ago

This proposal with require a redesign of most of the engine. Louis over at his fork has done a great job at simplifying the engine to accommodate refactorings like this, so I will be copying some of his changes first.

ilexp commented 7 years ago

In an unrelated project, I've done some research into RyuJIT64 as well as the CLR runtime and optimizations. Let's just say that I'm deeply troubled by my findings, as it seems even trivial optimizations are not performed. [...] I tried searching both corefx and coreclr on Github to see if anybody mentioned these issues, but it seems nobody cares.

There's the tiered JIT issue in coreclr. To me, it seems like the biggest blocker for more JIT optimizations is the requirement to be fast at compiling vs. fast at executing. A tiered JIT has the potential to work around this in a way. But in any case, that's kind of future / hypothetical stuff. The other part is extending C# to allow writing more efficient code in the first place.

Anyway - really glad you think some of the above ideas are worth pursuing. I'm currently a bit short on time, but I'll take an occasional look at this issue (and project) to see how things develop, or whether there's opportunity for me to chime in with an occasional comment. And from my side, as a long-term Farseer user, feel free to break as much as you want with Velcro as long as it's for the better. 👍

damian-666 commented 7 years ago

Seems like if significant redesign is going to be considered, I see that the batches that must be organised are not the same as existing Islands, which were originally designed to be done in parallel, according to its original designer Erin Catto. But that does not give incredible single pile performance like bullet shows .

From what I gather the key is to batch contacts that do not share a body as described here on slide 6 https://www.slideshare.net/takahiroharada/solver-34909157

There are other tricks to parallelizing in there.

That way no sync is needed since the position or velocity state of a body is not being pushed by another contact on a parallel thread, so no race condition or lock is needed.

here they discuss the parallelizing issues around box2d / Farseer Islands , some of which may be relevant even in the many-batch-in-one-pile scheme: http://box2d.org/forum/viewtopic.php?t=9350

Looks like the box2d-MT has the box2d islands but its not super-impressive on the tests. Erin chimed in somewhere else on this that the Islands would need to have separate piles for testing.

Using AABB in world coordinates from the last frame I am guessing.

The reference code in C++ I think is in Bullet by Coumans he has made incredible huge piles of boxes and very stiff joints and contacts for robot and AI researchers who might also be interested in 2d for certain problems.

If parallelizing with Parallel.For or something works then maybe GPU could using OpenCL.Net or something could be looked at. For 2d render without 3d shaders loaded, and for 2d physics the most GPUs would have quite alot of spare memory and could allow all the data to go into the GPU memory which is super important for that. If all of the bodies data, the lists of vertices in Fixtures, the AABBs in world ( i guess from the last frame), the tree, and the positions and velocities fit into GPU memory then they can be batched in there.

I know this is a tall order but parallel physics is almost standard as even $50 phones have 4 cores now and cheap AMD processors now use 6 or 12.

One must be wary of little-gain optimizations if they are not thread-safe or away from the functional programming methodology of passing in the parameters that might not be the same for all callers in a frame, as they may turn up as issues in making it threadsafe. Not to sweat the little stuff if 4, 12, or even 60-factor gains are possible

Genbox commented 7 years ago

Very interesting read in relation to performance: https://blogs.msdn.microsoft.com/dotnet/2017/06/29/performance-improvements-in-ryujit-in-net-core-and-net-framework/

damian-666 commented 7 years ago

its interesting but seems like bounds-checking type optimization is never going to give the possible performance boosts of 2x, 3x, 8x when algorithmic complexity is pretty high and there is an opportunity for parallel or batch processing. given 4 core is on budget phones and xbox one is 8 core, and now AMD is selling 8 core CPUs for 118$, isnt there any sort of parallel processing that can be pursued without doing a massive overhaul, in the CCD or the collision detection and response?

Although we havent had much improvement with Parallel.For in spots here and there, My team had wild success with running ray casts ( uses broad and narrow phase) in parallel with farseer , with just one or two fixes in our branch . With .net it was easy to find the race condition as it was crashing in a static variable which was a negligible optimization, and just making that a member viable was all that was needed to make our pseudocollier's ray casts just about 4 x faster on a quad. I know that collision and ccd is more complex that just collision detection.

according to Erin Cattos comment in one newsgroup, the existing box2d islands were originally meant to be run in parallel and i believe iforce2d gave it a try in c++ where threading conflicts are more mysterious to diagnose. In my experience a functional-like programming ( passing in the parameters you need) and elimination of static vars that are both read and write is usually enough to fix the threading conflicts.

I am not sure that c# is great for xbox one or UWP development at the moment since there are graphics performance issues, but allot of us, including unity developers, own allot of c# code and might want to put our physics games on xbox and not have 4 or even 6 idle cores

On Thu, Jun 29, 2017 at 3:39 PM, Ian Qvist notifications@github.com wrote:

Very interesting read in relation to performance: https://blogs.msdn.microsoft.com/dotnet/2017/06/29/performan ce-improvements-in-ryujit-in-net-core-and-net-framework/

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/VelcroPhysics/VelcroPhysics/issues/29#issuecomment-312127642, or mute the thread https://github.com/notifications/unsubscribe-auth/AP_LmQAIYnbOkFbD0it9-s2Am79_wJeXks5sJCefgaJpZM4NZqKu .

damian-666 commented 6 years ago

this is a big job, and is certainly not a pullrequest. Maybe we can establish a Git Bounty then fork this to try and assemble a pair of engineers who have done this before, to try and do this using a "try an work through the runtime issues" approach

A ton of work in the solid library ImageSharp has parrallel.For here and they just run it in parallel then fix the exceptions found. That is how my team did parallel Raycasts in Farseer.

C# is great for parallel for as it allows easy "try and see" approaches for batching. i suggest this with all the naiveté mabye Ian can stop me but, , before any massive rewrite, to just try it in the simplest case of two vertical piles of boxes ( box2d island), and then one vertical pile with two threads (Bullet lib- type batch, or "sets of contacts that do not share a body") collect the issues.

box2d islands means disparate piles, and in real game use this is not going to be as useful as the other batch of body states, those" sets of contacts that do not share a body"

Try a rough approach on boxes piling in 2d with a simple 4 body test. that will have piling issues to see if it runs without crashing. make two arbitrary lists of bodies and run them in parallel.
piling will likely have also sorts of issues but its just to see if it doesnt crash.
sort the bodies by batches that dont share a body as in the Bullet lib and the paper. From what I gather the key is to batch contacts that do not share a body as described here on slide 6 https://www.slideshare.net/takahiroharada/solver-34909157

This way no two threads are fighting over any body state since no two batches share a body

break the collision handlers outside the loop.

And / or I would recommend to copy the method by ecoumans used in bullet if there are some parallels. Their c++ lib can pile boxes with orders of magnitude faster and does have a 2d mode. But pure .net has all the deployment and development advantages Bullet will never have.

In parallel with this, try to run CCD in parallel, that is of equal or more importance.

Another thing is to research parallel techniques for the iterative constraint solver. Again, this was all achieved in Bullet and with GPU, with two orders magnitude faster piling in the case GPU is used, in the bullet lib. Once CPU parallelisation is done, a GPU method is about memory management and might be explorer with Alea Parallel.For

if this is done then a LiquidFun SPH type extension to Velcro and water with boats can be explorer from there. on 6 or 8 cores thats could be interesting and playable water.

Genbox commented 6 years ago

I've had the multi-threading discussion more times than I care to count :)

First of all, It is certainly possible to multi-thread the engine in specific scenarios, but years of experiments have taught us that a general solution hurts the common case. The things that have made a difference was cache-coherency and modern CPU features such as prefetch and pipelining. Keep small structures in CPU cache and you are suddenly running 2x speed. If I add just a bit of complexity such as keeping track of locks and have state objects for threads, then performance drops.

The most common case for this engine is not large games with incredibly advanced physics - it is small to medium simulations with 10-500 objects, and if I add on multithreading to speed up scenarios where you have 5000 objects, then we slow down the 10-500 objects case by x2. Not a great trade off.

In the 5000 objects case, you can certainly find a hotspot like ray-casts and multi-thread it in order to juggle more. We just have to remember that more threads do not equal faster simulation, it equals more simultaneous objects. In my tests, I had a ~20% increase in throughput by multi-threading with 4 cores; a terrible waste of resources gone to locking, state copying etc.

However, if you create a world and put that into a thread by itself, you suddenly have a 100% increase in throughput, and that scales linearly! So in larger games that actually need threading, it is much easier to copy state manually between the worlds on the macro level, than trying to squeeze anything out on the micro level.

With regards to islands - I did try to multithread (MT) them but keeping track of state across islands again degraded performance for the common case, while increasing performance for large simulations. I ended up with a version with A LOT of compiler conditions in order to get around it, and I was about to make a "MT edition" and a single thread edition, but I figured most people would choose the MT and then disregard the engine due to its bad performance in PoCs.

That being said, I'm very interested in what people can come up with. I'd love a multi-threaded version of the engine :)

damian-666 commented 6 years ago

Thats an interesting idea running entirely separate world in parallels, it can help for some ideas such as the current concept of "Islands", which are not the same as the batches or islands used in a successful MT piling implementation. Two piles could be solved in parallel by putting the bodies all in separate farseer worlds as you mentioned and running the whole worlds in parallel. But for a single pile of boxes, it would all be in one island and so not taking advantage of the 4, 6 or 8 cores that are now in almost all devices.

I wasn't proposing introducing any fancy synchronization, just something like presorting the workloads and batching them all without need for locks or sync inside the main loop(s), as mentioned by the thread originator. I haven't have much success with complex locks either, and I think @ilexp was proposing batching and running it all in parallel without sync except maybe a wait for all batch working threads to join, outside the outer loop(s).

From what i gather, the batches needed for MT within a pile or other set of interacting bodies are sets of constraints that do not share a body as described here. https://www.slideshare.net/takahiroharada/solver-34909157

As i wrote earlier, I got a 3.9x faster performance for 4 cores when i simply ran cay casts in parallel over sets of bodies big or small, in a real game scenario, to implement a wind or explosion effect with shadowing. This was just divide bodies in one batch per core, run workers and then wait.

ImageSharp uses Parallel.For throughout, I used a simple ThreadPool , run batches, and WaitAll for them to all finish, with zero locks or complex sync.

For Raycasts, its an embarrassingly parallel case since the collision detection geometry fixtures are organized in a tree which does not change inside the loop, but I know the contact and constraint solver is another matter.

For my simulation, the performance of Farseer is OK for now, except when I run CCD and things are in collision range, or when I use allot of constraint iterations ( setting.VelocityIterations and or positioniterations higher) or a small timestep. But for these cases, the batching before the loop would have to happen also.

Better would be some more specific propositions based on the existing Farseer code and internal datatypes , how to sort them by "sets which do not share a body or fixture" and which loops to address specifically. I did look at the Box2d MT discussion, but that is about solving disparate "Island" in parallel but that approach is not how Bullet does begin to successfully stack 100,000 boxes.

Genbox commented 6 years ago

Ray-casts has always been the low-hanging fruit since they are pretty much stateless. I'm worried that if it is multi-threaded, it will result in thread starvation or too much overhead due to crappy schedulers on mobile devices etc. It could definitely be configurable with low overhead, but as you said, the solvers are a completely different matter and much more interesting.

damian-666 commented 6 years ago

i looked deeper into the velcro code and have some promising things to try. But if this conjures up old failed or controversial attempts please direct me to those pre-Github branches if they are not lost.

I have added info below on the the engine fix we put for parallel ray casts in for farseer client in section 1) , then a partial proposition for implementing parallel collision and constraint solving work in 2) .

The parallel raycast is a very clear trouble-free 4x speedup we tested on quad MacOSX and PC that required at least two minor engine changes in the Farseer branch. For the client code that calls the raycast, the outer sync was just break the body list in to N sections, where using processorCount for N makes sense, and use ThreadPool.QueueUserWorkItem, then WaitHandle.WaitAll(waitHandles) after all the raycasts for all batches were done, then do the remainder. We didn't try Parallel.For but that is the new and simple to do it , and the way ImageSharp iterates nearly all of its batches of work. For my ray code the speedup is obvious but for the constraint loops in Velco, benchmarking timers around each Parellel.For loop would be needed. Then the power of managed .net is exploited that any runtime issues are easy to find than in c++.

Now that i look at some of the heavy loops, with that i'm an optimist to the absurd, I think we can do speed up Velcro incrementally and practially without any noticable tradeoffs or ifdefs, use of the structs, copying, or even drastic changes to API or objects. To realize the full benefit of multicore , by using Parallel.For on lists of Bodies, we would next need to look at how Bullet sorts contacts and or constraints by -those batches that do not share a body-. Then, all the loops that apply changes to body.AngularVelocityInternal. and body.LinearVelocityInternal, and the positions, will not be having conflicting write to the same body, since -no two batches share a body-. This genius idea is described and illustrated by Takahiro Harada in many of his publications, but he goes much further to the GPU methods. I see that GPU in .net is something much harder to be deferred for future, and would likely need this constraint and contact sorting step done as a prerequisite. No #ifdef would be required for Parallel.For, but the BatchSize could be a runtime parameter for testing, setting to 1 means don't bother with MT loops. This sorting would be done on the bodyLists inside an Island. As for running the current box2d "Island"concept in parallel, I do not recommend to even bother with that, since with the Takahiro-described batching method, each island will be sped up and the cores will all be loaded.

looking at Bullet where it might be implemented in the OpenCL version, it is hard to understand the batching code, I found some openCL code appearing to make batches but its really hard to read and undocumented. We could maybe use HashSets and put the bodies in there by keyed by ID, then try to organise the pairs such that pair.BodyA and pair.BodyB are all unique in each batch of pairs. Harada describes an interative method to make the batches (serial is slow, parallel is not straightforward, maybe more details in here: http://dl.acm.org/citation.cfm?id=2077406. There are some loops like SynchroniseFixtures that may not require batching, just a foreach.

I'm not proposing GPU rigidbodies as in Bullet3, it is a massive problem to implement and deploy, or stacking of 100,000 boxes as seen in the demos, but just maybe taking the batching technique if it can translate from that baremetal openCL code to .net

1.To make raycast multithreaded there are several changes to the RayCast method implementation, and make it use the parameters input and less of the tree state.

A crash was caused byissue with using the stack as a shared member in DynamicTree.RayCast method. For my team it was a one or two line change in older Farseer to allocate a new 256 item stack per Raycast, . I see the similar problem in Velcro, latest version.

rather than do try to pull request and then merge/ commit i am setting you the three important changes, i am using my own branch not on GitHub. but i'd be happy to commit them if someone will piont me to the unittest .

The change to _stack was to allocate a new stack as box2d did. Reusing this tiny 256-item stack in the tree really is negligible impact in the context of nearly any usecase. This is an example of a small optimization making a big one crash. Then i have shown a client can call RayCasts in parallel, reusing a big or small dynamictree. the offending code is in dynamictreeboadphase (and other Raycast implemenations ) line 341.
_raycastStack.Clear(); <---dont reuse a raycastStack member, remove it and make on on the stack. a) Stack stack = new Stack(256); <-- use that as box2d did.

then b) RayCastInput subInput; subInput.Point1 = input.Point1; subInput.Point2 = input.Point2; subInput.MaxFraction = maxFraction; subInput.Callback = input.Callback;//dh added this to pass the shape callback without using member var.. (thread safe)
subInput.TestInside = input.TestInside;

                float value = callback(ref subInput, nodeId);

then c) This might not crash but is faster and causes less GC

AABB segmentAABB; //shadowplay Mod use stack instead of heap for parallelism and less garbage { Vector2 t = p1 + maxFraction*(p2 - p1); Vector2.Min(ref p1, ref t, out segmentAABB.LowerBound); Vector2.Max(ref p1, ref t, out segmentAABB.UpperBound); }

For MT implementation inside Velco, BodyLists could be done in parallel if the bodies were batched, per Island, by those not sharing a body. I dont have time now but i think Bullet does this type of sort and probably very quickly

for example here is a heavy , expensive process that writes to body state, velocity and position. internal void SolveTOI(ref TimeStep subStep)

//add benchmark code here.. for (int i = 0; i < BodyCount; ++i) <--- after bodies are sorted to batches that do not share a body, then just try parallel.for here and hope for a

Parallel.Foreach batch... do this instead.. { for (int i = 0; i < batch.BodyCount; ++i) }

SynchroniseFixtures is all very expensive and trying that in parallel could be worth a try.

ilexp commented 6 years ago

I didn't read the entire conversation in detail since my last comment, but please be aware that replacing a for loop with a Parallel.For is not a trivial drop-in replacement.

If you're writing a game library, you need to be aware that one of the contexts in which it is used will be editor applications, which often have a special SynchronizationContext due to their strict requirements regarding message queue processing and blocking UI code. This means that Parallel.For and other TPL classes may actually process window messages while waiting for its threads to join / finish, which can lead to reentrancy problems in the editor application that is hosting this code, i.e. crashes and undefined behavior. There are ways around that, but it can get kind of tricky, and it's not the most pleasant of pitfalls to be in.

With this in mind and also regarding what @Genbox mentioned, it seems preferable to not do multithreading on a micro-level like individual for loops, but instead (continue to) provide the facilities to allow users of the library to do multithreading on a macro-level, e.g. a defined API that allows to perform certain operations in a thread-safe way. Give users the tools to multi-thread if they need or care to instead.

Edit: Another thing I'd like to mention is that from the perspective of me as an engine developer, I would very much prefer to just put the entire physics simulation in a different thread, which I can do already. Multithreading on a micro-level as a library implementation detail of Velcro seems like it would not add much value at the cost of increasing complexity.

damian-666 commented 6 years ago

Thanks for raising the possible issue of "This means that Parallel.For and other TPL classes may actually process window messages" . I have an editor and we put the physics outside of the UI thread, not sure if my setup would suffer issues like this . So in my system it is graphics and , UI in one thread, and physics on the other. We used means of trying to load the other cores that are normally there in a quadcore, but sometimes CCD or piles is the bottleneck. Any code proposed could use parallel for and batching only if proposed Settings.MaxBatches was >1.. , or some boolean to prevent it. so clients using UI thread also for physics could just set that to 1 for one thread, or it be the default.

ImageSharp is a very solid general-purpose .net imaging engine with many professional level contributors around the clock. They use parallel.For throughout since raster processes are parallelizable , and I never saw mention that clients didn't want it or how to get around it. But to shut it off clients can set paralleloptions.maxdegreeofparallelism to 1 i believe, it will be passed to the .net method parallel.For

Here is an example of a parallel.For inside a general purpose engine implementation: https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Processing/Processors/Convolution/Convolution2DProcessor.cs

ImageSharp may well could be used in editors with UI, not sure if the window message issue would come up for that. For my level editor, in my setup, visible objects are drawn while physics updates bodies, then the body positions are copied to the visible objects , then repeat.

btw batching as in Haradas paper is really non- trivial, so im not sure if this will happen. The batchin could be the existing "Islands" though, or maybe a parallel batching scheme testing the bodies in new constraints to those in existing batches, all in parallel. Again in MaxBatches was ==1 , then it would not need to incur this overhead

nkast commented 6 years ago

I used the TumblerTest with 400 bodies to stress the engine and see if it's possible to multithread some methods. SolveVelocityConstrains() took most of the time, about 23.4% overall. My approach was to lock the two Velocity structs with Interlocked.CompareExchange().

I found that Parallel.For() had way too much overhead. Even the documentation says that you will see benefit only when your Parallel loop has an inner loop. In other words Parallel.For() can't be uses as an in-place replacement for a for loop. So I split the loop in batches and had Parallel.For() dispatching batches to a regular for-loop. At the end I replaced Parallel.For() with tasks. The result was to drop SolveVelocityConstrains() down to ~7.6% (on a 4-core Athlon) and a design that can easily switch to regular for-loops at some body threshold. Still need to verify if it's 100% correct and maybe switch back to Parallel or use threads directly instead of Task.

nkast commented 6 years ago

SolvePositionConstrains() will be harder to optimize with threads, mainly because it removes Contacts from the list. I know that @Genbox plans to replace the List<> with a LinkedList<>, so here's an interesting article about LinkedLists and Multithreading: http://www.drdobbs.com/parallel/linked-lists-are-like-so-last-century/232400466?pgno=1

damian-666 commented 6 years ago

The proposition here in the original thread is to multithread without locks of sync inside the loops, only this approach can provide the performance multipliers per core that allow Bullet3 to stack 100,000 boxes ( but that is with GPU) . if body set was presorted into bins by "those contacts that do not share a body", then (c.BodyB.LinearVelocityInternal and such in SolveVelocityContrants and similar methods would not need to be accessed with Interlocked.CompareExchange if that is what you tested. But to sort "those contacts that do not share a body" in .net code, this is nontrivial and we would have to obtain and study the research paper or read the Bullet3 openCL code that does this presort.

nkast commented 6 years ago

I prefer lock-free algorithm myself but in this case I don't think it's possible. What you describe is the Graph coloring problem](en.wikipedia.org/wiki/Graph_coloring). You can either re-color a body whenever a new contact is created or use a greedy algorithm to recolor the entire graph on every step. Both approaches use a lot of CPU. So my reasoning goes like this: At some point you will think about using multithreading to speed up this step as well. Then you will have to use some form of barriers or sync mechanism between bodies to make it work. Unfortunately you can't avoid it. So let's skip the extra step and just use barriers inside the existing algorithms. In my test I tried to find if it's possible to sync threads on bodies with low overhead. CompareExchange seems to work pretty well. (Initially I tried Increment/Decrement but that couldn't go anywhere..) If you look at the benchmark you will see that the code runs x3 or more on a 4core cpu, which I think it's pretty good. Most of the lost pref is related to thread setup/sync and not the actual locks. If you have 4 or 8 threads and hundreds of bodies you will rarely get a thread blocked on a body.

damian-666 commented 6 years ago

Oh 3 or more times performance for 4 cores is a pretty good result, but when you have 3.95X for four core you know you are not wasting cycles in syncing that could be used for AI or render . However this might be the best that is practical. If you would share a github branch with this work it would be interesting, especially to try on a pile of boxes all marked bullet, with the threading and this type of lock in the Island.SolveTOI. But see, the compelling thing about the "upfront batching" is that it has been implemented that way and tested years ago for bullet3, albeit for GPU, and, there are a few loops that all could reuse the per frame body batches . Yes, the" graph coloring" batching is illustrated in the https://www.slideshare.net/takahiroharada/solver-34909157 with chess pieces , i am still trying to download the paper.

As you note it should be , batching itself lends itself to be a parallel process with locks. Here might be the batching code if anyone can understand it, in openCL, its treats the static bodies as readonly and so batches can share those. this is just an idea to explore and might not be as practical on CPU as for GPU. .https://github.com/bulletphysics/bullet3/blob/master/src/Bullet3OpenCL/RigidBody/kernels/batchingKernelsNew.cl

If I google "A parallel constraint solver for a rigid body simulation" Takahiro Harada, I get GPU Pro 5: Advanced Rendering Techniques section 4, which has the pseudo code for the batching and that appears in my google books preview. But I am not sure, If this algorithm, which itself contains locks, is practical to implement in .net for CPU. Its discussed at page 458 and we might be able to figure out if N batches , where N = number of cores, would be practical to implement in .Net.

So ,while bullet3 does stack 100,000 bodies in amazing videos, it is hard-to-deploy- gpu code and just a simpler multi -threaded stack solver of 2d "bulleted" boxes would be useful. Even stacking 50 items with CCD can get bogged... https://www.youtube.com/watch?v=7TkOi-6LCEU <--big stacks

I see a couple OpenCL and WebCL implementation of box2d with c++ but for not for .net. maybe the https://github.com/wolfviking0/webcl-box2d is worth a look. Also Box2d MT that does islands in parallel, that might be another topic to explore that might be more practical for 4 , 6, or 8 threads in parallel, rather than for GPU, which wants to do something like 64 or more parallel batches

louis-langholtz commented 6 years ago

Hi all. Hope my dropping in on this conversation is okay.

I really appreciate @Genbox 's mention, back on May 14, of my C++ work with Box2D. Incidentally, I'm still working on it but I've shifted my work over to a new repository with a new project name of PlayRho in case anyone is interested in following it.

As for speed optimizations, so far the thing I've seen in my experimenting that's most impressed me is what happened when I compared using various different data structures vs. the existing Box2D structure for detecting when a contact already existed in code for finding new contacts. I got inspired to look into the Add Pair Stress Test Testbed demo where related sources in Erin's Box2D code had commented about considering using a hash table (see b2ContactManager::AddPair ) . What I saw suggested to me that using an array of pointers (std::vector) was more performant than the existing intrusive linked-list data structure usage and significantly more performant than the hash table (std::unordered_map). At least in regards to the demo. I don't know for certain why the array of pointers was fastest but presumably it was because it was cache friendlier at least for the 400 or so small bodies that get impacted in the demo.

In regards to multi-threading, I haven't seen any performance improvement with the multi-threading that I've tried so far, at least on a pyramid like stacking test. Given that a pyramid is all one island, I wouldn't expect solving separate islands in separate threads to speed up a test with just one island but it was discouraging that multi-threading the contact updating didn't help. Admittedly, I haven't played around with this as much as I think it needs and realize that there's probably ways I could optimize what I've done. OTOH, I'm worried that cache misses may have more to do with performance losses than what multi-threading can gain and so more of my focus has been on refactoring code towards becoming cache friendlier.

In any case, this continues to be a very interesting problem to me.

Genbox commented 6 years ago

Hi @louis-langholtz - thanks for dropping by and writing about your experience. I suspect you are right regarding the cache coherence versus code running time.

Genbox commented 6 years ago

C# is soon going to have this: https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.2/readonly-ref.md

I remember back 10 years ago when they argued that Java/C's const parameters were too much for most developers and it should be something handled by the JIT compiler. Well, they seem to have reversed that decision completely and are now going with const parameters. In any case, I expect the compiler to be able to inline much more aggressively and do a lot more tricks for arithmetic reductions.

Genbox commented 6 years ago

A quick update on SIMD: https://github.com/dotnet/coreclr/pull/13576 It looks like we finally have a full implementation of platform-based instructions!

ilexp commented 6 years ago

As far as SIMD hardware acceleration goes, would it be possible to use System.Numerics.Vectors? Already out there since 2015, no need to wait for cutting edge CLR / FX changes to hit mainstream .NET.

Genbox commented 6 years ago

@ilexp sure, that's not really the news though :) The Vector classes have several shortcomings that I've hit in trying to use them and they are far from a full implementation of the SIMD instruction set. One of the largest limitations with the current Vector based SIMD is that you are locked into using their Vector classes. Not we can augment our own Vector classes and keep backwards compatability or optimize without the overhead of copying into different structs.

I've implemented part of the engine solver using System.Numerics.Vector and ran some tests - the results were not really what I expected. Much of it had to do with boxing, conversions and memory copying. With this API, it does not seem to be an issue.