Optimize tensors further

SteveMacenski commented 2 years ago

well I don't see that this helps much with the jitter, but it also did another big drop in CPU - which will also correspondingly drop the jitters' impacts. I see this running well under 10ms now (we started the week at 20ms!) with maximum peaks mid-25ms, which is below the 30 hz rate. 6-11ms is typical.

I'd say its worth the time to finish up this PR. The general performance is more stable now and average slightly lowers.

SteveMacenski commented 2 years ago

I cleaned things up & handled the merge conflicts. The remaining stuff is w.r.t. visualization getOptimizedTrajectory and getGeneratedTrajectories based on the updated Trajectories wrapped tensors. Could you work your xtensor magic? You can merge this one that's added, I've tested everything else and very happy with the stabilization improvement this adds.

SteveMacenski commented 2 years ago

After this change, I re-benchmarked where the jitter is coming from and see numbers where the PathAlign is the biggest contributor now. The entire spike of the trajectory generation process is going from 2-3ms to 5-6ms (2-3ms deltas are hardly worth doing much about, I doubt we even could). PathAlign can go from 4-6ms steady state up to 11-12ms.

I think if we can improve PathAlign, that would be the point where we can/should stop hunting for optimizations and pivot to #79/#80 to complete the first iteration of this software.

By the way, I'm going to be on vacation from July 24 - Aug 17 so you won't hear from me much over the next couple of weeks while I'm traveling. But this remains a priority for me! I'd like to close out the optimization ticket before I leave, but I fear I will not get the Path Align optimizations done tomorrow :cry:

SteveMacenski commented 2 years ago

I tested with TBB parallel for loops for path align, it doesn't appear to be any faster and the jitter still has the same consistent range. I also made an attempt to use xtensor's xaxis_iterator but after about an hour, I couldn't get it to compile with the types and whatnot. Might be worth a shot on your end, since using iterators will compile some xstrides which might be more optimized than manual looping over the data.

besides that, we could try to reduce the steps, but otherwise I think we'd be reaching the end of what we can do with it as currently implemented and might be worth rethinking how that's implemented to see if there's a lighter weight alternative.

Perhaps if we established a distance potential field to the path, we could use that to reference the distance from the path for each trajectory point without needing to iterate through the path for each trajectory point and compute the distance. The discrete nature of a grid potential field though would reduce the quality of its exact alignment, however.

We could also try to fit a spline to the path in a localized region to have an analytical equation to work with. That would buy us some simplification of the math if the spline's order was analytically differentiable to program in (or wolfram alpha-able).

but I'll have a few hours tomorrow before I leave to think about this more and see if there's solutions with what we have.

SteveMacenski commented 2 years ago

I'm quite surprised by the degree at which increasing the point steps for both trajectories and path points seems to not really impact performance of the critic much. Making them larger than would be good (5, 10 each) reduces by only 0.5ms but the peaks still hit 11ms. There's no amount of tuning those numbers higher than currently set which will meaningfully impact performance (though lower obviously has alot more compute).

I removed the contents of the loops, so that they just loop emptily and I can still actually see spikes occurring upwards of 10ms. The CPU time drops about in half to ~2ms steady state, so half of the compute time isn't even the main loop.

Given those 3 pieces of evidence (multithreading the loops doesn't help; reducing the resolution of the loop doesn't help; removing the loop content doesn't help), I don't think the reason that this critic is expensive is actually to do with the core logic itself, which is interesting - or perhaps the lambdas.

artofnothingness commented 2 years ago

Now that works with 35 time steps and 2000 batch on my 4 gen notebook within 20 hz

SteveMacenski commented 2 years ago

@artofnothingness I did some testing on the current state of dev on runtime. I never see anything > 20ms anymore! Really great work on finishing this up :1st_place_medal: That puts is at 50hz at the settings I've been playing with.

artofnothingness / mppic

Optimize tensors further #88