artofnothingness / mppic

MIT License

Optimize trajectories integration and generation #74

Closed artofnothingness closed 2 years ago

SteveMacenski commented 2 years ago

I'll profile on Monday, though breaking these into separate files seems odd to me. For the noise generator, we already had vx_noises_ objects in the optimizer class (which you didn't remove), so I don't see that we're storing any more or less data that would need to be broken out into a separate class - what was already there just needed to be tweaked in how it's represented. The code is pretty much exactly the same, but now with the overhead of the indices.

In the TrajectoryIntegrator you are storing some legitimately new state that is pretty big, so I can see why that is broken out into a new class - although it's not clear to me that you need to store that new state (?). It was just the concatenation that was heavy, not allocating the yaw / etc. method variables.
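For reference, a rough plain-Python sketch of what integrating state velocities amounts to for a single trajectory, writing into preallocated buffers with no concatenation (the function name and the simple unicycle model are my illustration here, not the project's actual xtensor code):

```python
import math

def integrate_trajectory(vx, wz, dt, x0=0.0, y0=0.0, yaw0=0.0):
    """Integrate per-timestep (vx, wz) velocities into poses.

    vx, wz: lists of linear / angular velocities, one entry per timestep.
    Writes results into preallocated lists rather than concatenating
    arrays, which is the expensive part being discussed above.
    """
    n = len(vx)
    xs, ys, yaws = [0.0] * n, [0.0] * n, [0.0] * n
    x, y, yaw = x0, y0, yaw0
    for t in range(n):
        yaw += wz[t] * dt
        x += vx[t] * math.cos(yaw) * dt
        y += vx[t] * math.sin(yaw) * dt
        xs[t], ys[t], yaws[t] = x, y, yaw
    return xs, ys, yaws
```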

artofnothingness commented 2 years ago

removed redundant classes

artofnothingness commented 2 years ago

I removed prefer forward from defaults. We can just clip velocity from the bottom if we need more forward / only forward moves

SteveMacenski commented 2 years ago

It looks like this reduced the integrate-state-velocities CPU usage by about half (29% down to 15%) but didn't make much of a change in the generate-noised-controls function (~15-16%). It looks like, of that, about 11% is spent on the random sampling process plus related container-y stuff from xtensor to align things / stride stepping. I'm not sure there is much we can do about that?
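For anyone following along, a hedged plain-Python sketch of what "generating noised controls" means here - Gaussian sampling around the current nominal control sequence (names and signature are my own illustration, not the project's API; the real code does this for vx/wz/vy batches in xtensor):

```python
import random

def generate_noised_controls(nominal_vx, batch_size, vx_std, rng=None):
    """Sample batch_size noised copies of a nominal vx control sequence.

    Each sampled trajectory is the nominal controls plus zero-mean
    Gaussian noise with stddev vx_std - this sampling step is the part
    the profile above attributes most of the remaining cost to.
    """
    rng = rng or random.Random()
    return [
        [v + rng.gauss(0.0, vx_std) for v in nominal_vx]
        for _ in range(batch_size)
    ]
```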

Most of the time I can account for is spent in xtensor. So if there's anywhere we can minimize changes / strides / hook up a GPU / generally optimize the use of xtensor, this would be the time to do it, so we can push batch_size up as high as is effectively helpful.

We now spend about 1/3 of the time evaluating trajectories and about 2/3 of the time generating them. Of the controller time (i.e. where 100% is made up of the components within MPPIC, not other things), 22% is spent generating noised controls and another 22% integrating state velocities. There are a number of <8% tasks that I don't think are the big fish to fry.

The critics change on a per-run basis depending on what's going on in a particular test run, but right now none of them looks like a bottleneck comparatively - so there's no real need to try to minimize the number of critics at all. That's a drop in the bucket of the compute time.

Anyway, this tells us our settings scale with the number of points and batches, not with the critics or their complexity. The number of batches, and the points within those batches, are the important things to push up to improve performance.

With a set of parameters I am playing with (doesn't really matter what they are, just for comparison of runtime improvement):

> I removed prefer forward from defaults. We can just clip velocity from the bottom if we need more forward / only forward moves

I don't think it's best to remove PreferForward, although I totally agree we should have a separate minimum vx, set up the way you did it. We could then still have a situation where the goal is behind us and the robot won't turn around to drive forward towards it - it would just move backward, only at a slower pace. I showed as much in simulation. Also, the analysis above shows that it really doesn't matter whether we include PreferForward from a compute standpoint: it doesn't change anything for run-time, but it changes a lot in expected / good behavior.
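To make the distinction concrete, a toy sketch of a PreferForward-style critic: unlike clipping the minimum velocity (which only slows reversing down), it adds cost to any reversing at all, so trajectories that turn around and drive forward win out. The function name and weight value are my own illustration, not the project's defaults:

```python
def prefer_forward_penalty(vx_samples, weight=5.0):
    """Toy PreferForward-style critic: cost proportional to how much a
    trajectory's sampled velocities are negative (i.e. reversing).

    weight is a made-up tuning constant for illustration only.
    """
    return weight * sum(-v for v in vx_samples if v < 0.0)
```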

SteveMacenski commented 2 years ago

In pushing up numbers, I can get to about ~500 samples with 3s forward simulation time (currently 30 timesteps @ .1s increments, but still tuning and playing with those numbers wildly) at 40hz. I could do more, but that's as much as I would feel confident in at steady state, with the current performance. If we can improve things a bit, I'd love to have ~1000 which doesn't seem to need that much more optimization (10-20%).

The autorally folks used 2560 batches over a 2s horizon (100 timesteps of 0.02s each). I don't think we need to go to that level, but even with my 30 timesteps at .1s using 2560, the behavior is really nice. The differences between 500, 1000, and 2560 are hard to gauge since I can't visualize the trajectories without making rviz get choppy. That's a hardware testing need. Though given that this is random-sampling-based optimization, we need as many samples as possible so we can model our system as well as we can - especially if we can't iterate multiple times per control frequency.
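To illustrate the "more samples is better" point in isolation (a hedged toy - nothing to do with the real dynamics, just a 1-D quadratic cost sampled uniformly), the expected best cost found by pure random sampling keeps improving as the sample count grows:

```python
import random

def avg_best_cost(n_samples, trials, rng):
    """Average (over trials) of the best cost found by randomly sampling
    a toy 1-D quadratic cost on [-1, 1] n_samples times. More samples
    per iteration -> a better expected optimum, which is why batch size
    matters so much for sampling-based optimizers like MPPI.
    """
    return sum(
        min(rng.uniform(-1.0, 1.0) ** 2 for _ in range(n_samples))
        for _ in range(trials)
    ) / trials
```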

SteveMacenski commented 2 years ago

I got down to a tuned temperature value of around 0.35 as a good trade-off, and a simulation time of 3-3.5 seconds broken down into 0.05 - 0.1s increments. I'm currently testing with 3s @ 0.075s (i.e. 40 time steps) as my middle ground. I found that below 2.5s it had a hard time with back-out maneuvers because it didn't think far enough ahead. Anything above 4.5s thought too far ahead and had problems being responsive.
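For context on what the temperature tunes, a minimal sketch of the standard MPPI exponentiated-cost weighting (a plain-Python illustration of mine, not the project's code; 0.35 is just the value discussed above) - a small temperature makes the average nearly pick the single best sample, a large one averages samples near-uniformly:

```python
import math

def softmax_weights(costs, temperature=0.35):
    """MPPI-style weights for averaging sampled trajectories: lower-cost
    samples get exponentially more weight; temperature sets how peaked
    the weighting is.
    """
    best = min(costs)  # subtract the min cost for numerical stability
    w = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(w)
    return [wi / total for wi in w]
```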

In terms of batches, a minimum of 500 is necessary for smoother behavior. With the visualization stuff fixed, I'm trying to understand how much 1000 or 2000 helps / doesn't, to find the sweet spot. One major reason I want to squeeze a bit more performance out of this, if possible, is to support a 1,000-batch system at 40hz on a "regular" Intel CPU (Raspberry Pis or similar will obviously need less). Anything above 50hz I think is just excessive, so we're closing in on that number and I couldn't be happier about that. If we can support 50hz @ 2000 batches with the previous settings, I can't imagine we'll ever need more. Though realistically I'd make the defaults less than that.