dartsim / dart

DART: Dynamic Animation and Robotics Toolkit
http://dartsim.github.io/
BSD 2-Clause "Simplified" License

Bisection contact solver #1375

Open costashatz opened 5 years ago

costashatz commented 5 years ago

With this issue, I want to bring to your (DART developers') attention the release of the RaiSim physics engine. I am sure that you are already aware, but I performed a few tests that I think are worth mentioning.

The overall conclusion is that RaiSim is ~10-13x faster than DART when collisions are involved and ~1.5-2x faster when no collisions exist. My experiments used either their ANYmal robot or an iCub humanoid simulation in a static case: a joint-space PD controller keeping the robot standing. (If you want, I can post the code, but I would need to clean it up; it should be straightforward to reproduce the results though -- see also their benchmarks.)

There are two interesting thoughts for improvement in DART:

Is your feature request related to a problem? Please describe. This feature request is related to previous issues that I have reported (see for example #1183). In essence, I would be very happy if DART were much faster.

Many thanks!

mxgrey commented 5 years ago

I think having an optional constraint solver that's higher performance but maybe less feature-rich is a great idea. The biggest catch is that it might be difficult to communicate the limitations, advantages, and disadvantages of each constraint solver option, but I don't think that should stop us from moving forward on this.

I'm going to start profiling dartsim this week, so I'll see if I can identify any obvious waste that's happening. I might also look over the raiSim code to see if there are significant differences in their programming techniques. I think one plausible source of inefficiency could be our use of dynamic-sized Eigen objects. One idea I have would be to make the Skeleton class templated on a maximum number of degrees of freedom (set by the user), with -1 meaning unlimited degrees of freedom. Using -1 would be equivalent to the current implementation, but using any other value would result in fixed-size arrays, which may perform better.
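To make the idea concrete, here is a minimal sketch of what such a templated Skeleton could look like. This is purely illustrative (the class, member names, and the `std::array`/`std::vector` switch are assumptions, not DART's actual API; real DART stores Eigen vectors): `-1` keeps heap-allocated dynamic storage, while a positive bound selects a fixed-size array that lives inline in the object and avoids per-step heap allocations.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical sketch: a Skeleton templated on a maximum DoF count.
// MaxDofs == -1 falls back to dynamic (heap) storage, mirroring the
// current behavior; any positive value selects a fixed-size inline array.
template <int MaxDofs>
class SkeletonT {
public:
  using Storage = std::conditional_t<(MaxDofs < 0),
                                     std::vector<double>,
                                     std::array<double, (MaxDofs < 0 ? 1 : MaxDofs)>>;

  explicit SkeletonT(std::size_t numDofs) : mNumDofs(numDofs) {
    if constexpr (MaxDofs < 0)
      mPositions.resize(numDofs, 0.0);  // dynamic: allocate on the heap
    else
      mPositions.fill(0.0);             // fixed: storage is already inline
  }

  void setPosition(std::size_t i, double q) { mPositions[i] = q; }
  double getPosition(std::size_t i) const { return mPositions[i]; }
  std::size_t getNumDofs() const { return mNumDofs; }

private:
  std::size_t mNumDofs;
  Storage mPositions;  // inline array for fixed-size, heap vector for dynamic
};
```

Both instantiations expose the same interface, so user code would not need to change; only the storage strategy differs.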

jslee02 commented 5 years ago

One idea I have would be to make the Skeleton class templated on a maximum number of degrees of freedom (set by the user), with -1 meaning unlimited degrees of freedom.

Does this imply moving the joint values to the Skeleton? I believe one source of inefficiency could be the scattered memory for the properties and states of the joints and bodies in a skeleton (i.e., low data locality). Since we store those data in each BodyNode and Joint, updating them could cause cache misses, which happens frequently in the recursive kinematics and dynamics algorithms. I think it would be worth testing storing them in the Skeleton in contiguous memory, one block per data type (e.g., joint positions/velocities/accelerations and body transforms).

This could be achieved without changing the public API. The setters and getters of Joint and BodyNode will stay the same, but they will access the data stored in the Skeleton.
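The layout change described above can be sketched as follows. This is a hypothetical, heavily simplified structure (the names `Skeleton`, `Joint`, `createJoint`, and the member arrays are illustrative stand-ins, not DART's real classes): the Skeleton owns one contiguous array per data type, each Joint keeps only an index into those arrays, and the Joint's public getters/setters keep their signatures while now touching memory laid out consecutively for the whole skeleton.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

class Skeleton;

class Joint {
public:
  Joint(Skeleton* skel, std::size_t index) : mSkel(skel), mIndex(index) {}
  // Same public interface as before; the data now lives in the Skeleton.
  void setPosition(double q);
  double getPosition() const;

private:
  Skeleton* mSkel;
  std::size_t mIndex;  // offset into the skeleton-owned arrays
};

class Skeleton {
public:
  Joint* createJoint() {
    mPositions.push_back(0.0);
    mVelocities.push_back(0.0);
    // std::deque keeps earlier Joint pointers valid as joints are added.
    mJoints.emplace_back(this, mPositions.size() - 1);
    return &mJoints.back();
  }

  // One contiguous block per data type: recursive dynamics passes can
  // sweep the whole skeleton's state linearly, which is cache-friendly.
  std::vector<double> mPositions;
  std::vector<double> mVelocities;

private:
  std::deque<Joint> mJoints;
};

void Joint::setPosition(double q) { mSkel->mPositions[mIndex] = q; }
double Joint::getPosition() const { return mSkel->mPositions[mIndex]; }
```

In a real integration the extensible Aspect state would complicate this considerably, as discussed below, but the accessor-level indirection is the core of the idea.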

jslee02 commented 5 years ago

@costashatz Thanks for investigating this issue. The benchmark code would be very helpful for testing and improving DART's performance. It'd be appreciated if you could share the code once it's ready.

I might also look over the raiSim code to see if there are significant differences in their programming techniques.

@mxgrey Unfortunately, we may not be able to look at the code. As far as I know, they haven't made the source public, only headers and binaries for the implementation.

mxgrey commented 5 years ago

Does this imply moving the joint values to the Skeleton?

I was mostly thinking of all the cached Eigen::VectorXd data structures. It will be significantly more difficult to centralize all the state information, because the extensible state concept expects state information to be distributed rather than centralized. It can certainly be done by having the state of the Joint aspects refer to their parent Skeleton, but the implementation will make things even more convoluted than they already are.

I'm certainly on board with trying this; I'm just warning that the implementation could be kind of dizzying.

jslee02 commented 5 years ago

That's a good point. I might first try moving the built-in data in Joint and BodyNode, except for the state and properties in Aspect. If this shows a meaningful result, we may want to consider redesigning the extensible data structure to store the data in a centralized location for the sake of performance. But, yeah, not now, since it would be a challenge to test quickly.

mxgrey commented 5 years ago

I think rather than trying to "centralize" things into Skeleton, we'd probably be better off devising a custom allocator that can keep the memory together.

mxgrey commented 5 years ago

Well not so much "devising" a custom allocator, but enabling the dartsim API to accept a custom allocator.

costashatz commented 5 years ago

I think having an optional constraint solver that's higher performance but maybe less feature-rich is a great idea. The biggest catch is that it might be difficult to communicate the limitations, advantages, and disadvantages of each constraint solver option, but I don't think that should stop us from moving forward on this.

What I meant is to have it in the pipeline, that is:

Is this doable or am I missing something from DART's way of handling things?

I think one plausible source of inefficiency could be our use of dynamic-sized Eigen objects.

I do not think this is an issue. They are using dynamically-sized Eigen as well. Also, in their paper they state (copy-pasting): "We believe there is enough room to improve the computation of the dynamic properties and the Delassus matrix using a faster implementation and vectorization. This will make the performance gap between the two solvers more significant. Although it is not within the scope of this paper, we found that a proper implementation of Featherstone's algorithms outperforms the RBDL by a factor of 4." I have never looked at Featherstone's algorithm, but I guess you could produce a better implementation (if you haven't already).

The benchmark code would be very helpful for testing and improving DART's performance. It'd be appreciated if you could share the code once it's ready.

OK! Do not expect this before August 15th, though. Too many things to do!

I believe one of the sources of inefficiency would be the scattered memory in terms of the properties and states in the joints and bodies in a skeleton (so low data locality).

This might be one reason why DART is losing time. I would first look at Featherstone's algorithm, as I said above, though. We maintain a library for Gaussian Processes (https://github.com/resibots/limbo) and, no matter how we organized things, only two factors mattered for performance:

  • However, I would need some help understanding DART's architecture around the contact/constraint solver (my knowledge of it is very naive) to know where I should add or alter things.

Could you please give me a quick sketch of how things work? The ODE LCP solver in the middle complicates reading the code a bit.

costashatz commented 4 years ago

@mxgrey @jslee02 any news on this one? I might have a bit more time this period to work on this.

I am preparing the benchmark code as a gist to help in this direction.

costashatz commented 4 years ago

@mxgrey @jslee02 any updates on this one?

costashatz commented 4 years ago

@mxgrey @jslee02 I finally managed to find some time to write up a clean version of my benchmarking code comparing RaiSim to DART. Here's a standalone repo that I made for this purpose: https://github.com/costashatz/raisim_dart_benchmarks

Here's a quick description:

Benchmarks

All simulations are run for 10 seconds of simulated time with an integration time step of 0.001 seconds. We replicated each experiment 10 times on a laptop CPU.
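The protocol above can be sketched as a small, engine-agnostic timing harness. This is an illustrative sketch, not the code from the linked repo: `stepWorld` is an assumed stand-in for one call to the engine's step function (e.g. `World::step()` in DART, `integrate()` in RaiSim).

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs the described protocol: simulatedTime / dt steps per replicate
// (10 s at dt = 0.001 → 10000 steps), repeated `replicates` times,
// returning wall-clock seconds per replicate.
std::vector<double> runBenchmark(const std::function<void()>& stepWorld,
                                 double simulatedTime = 10.0,
                                 double dt = 0.001,
                                 std::size_t replicates = 10) {
  const std::size_t steps = static_cast<std::size_t>(simulatedTime / dt);
  std::vector<double> secondsPerReplicate;
  secondsPerReplicate.reserve(replicates);

  for (std::size_t r = 0; r < replicates; ++r) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
      stepWorld();  // one engine step of dt simulated seconds
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    secondsPerReplicate.push_back(elapsed.count());
  }
  return secondsPerReplicate;
}
```

A replicate running faster than `simulatedTime` wall-clock seconds means the engine is faster than real time for that scene.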

Quick Analysis

The exact timings are not important, but the relative timings show that RaiSim is consistently faster than DART (ranging from 5-30x faster depending on the number of contacts and the scene). It is expected that RaiSim is faster than DART when there are contacts, since RaiSim implements a novel contact-dynamics solver that is supposed to be faster. It is interesting to note, though, that RaiSim is much faster than DART even in contact-free scenarios, by at least 5x. The results more or less replicate the findings of the RaiSim Benchmarks.

I have implemented the following benchmarks:

I am also planning to add a humanoid, to compare on a more complicated robot, and a moving six-legged robot to account for dynamic movements. I do not believe the timings will change much, though. So my conclusion is two-fold:

Hope this helps to improve the library.

costashatz commented 4 years ago

@mxgrey @jslee02 did you have a look at this? Any hints on how we can implement this and improve the library? Imho, this should be top priority for DART: this would be one of the top reasons for a user to switch to a different library.

jslee02 commented 4 years ago

@costashatz Thanks for the provided benchmarks. I haven't had a chance to take a look at it. Let me go over the benchmarks, check what we can do, and get back to you.

costashatz commented 4 years ago

@costashatz Thanks for the provided benchmarks. I haven't had a chance to take a look at it. Let me go over the benchmarks, check what we can do, and get back to you.

@jslee02 did you get the chance to have a look?

jslee02 commented 3 years ago

Sorry, no update yet. Let me work on this after releasing v6.10 addressing a couple of issues, which might be next week or so.

costashatz commented 2 years ago

@jslee02 @mxgrey I am coming back to this because the performance issues of DART are really something that could make me switch physics engines, and I do not want to do that (I like DART and all my code is based on it, but I am seriously considering adapting all my code to use MuJoCo once it is open-sourced [around April, as promised by DeepMind]).

Did you have a look at why DART is so much slower than RaiSim (or even MuJoCo) even without collisions? Also, why does each contact point make DART so much slower? I tried to dig into this, and it seems that DART is using ODE's LCP solver; yet ODE itself is much faster than DART at handling multiple contact points (at least judging from the demos, where we can insert many objects without losing fps). I really do not want to change my codebase (and I really want to keep supporting this nice project), but we need to do something about speed. I can accept being slightly slower without contacts (that is not crucial for me), but running slower than real time for 25 colliding spheres is too limiting.

Thanks and sorry for the long text.