artofnothingness / mppic


Optimize Code to run real-time by removing or reducing CPU jitter #68

Closed SteveMacenski closed 1 year ago

SteveMacenski commented 2 years ago

The highest-compute portions of the program sit at roughly 20%, 40%, 12%, and 20%, respectively (with different base measurements, but illustrative of their size relative to the others).

SteveMacenski commented 2 years ago

Obstacle footprint https://github.com/artofnothingness/mppic/pull/72

SteveMacenski commented 2 years ago

I did some more looking at the valgrind results, and it looks like the concatenate calls are the root of a lot of the compute time. The xt::concatenate in integrateStateVelocities accounts for nearly 18% of compute time and the xt::concatenate in generateNoisedControls accounts for about 13%. Together that's a third of the controller's total time!

Is there another way to do this that doesn't involve a concatenation operation? Maybe just having that shape reserved from the start?
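As a rough illustration of the "reserve the shape from the start" idea, here is a minimal sketch (names and shapes are illustrative, not the actual mppic members): each noised component is written into a view of a preallocated tensor, so no temporary concatenated result is ever built.

#include <cstddef>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>
#include <xtensor/xview.hpp>

void fillNoisedControls(
  xt::xtensor<double, 3> & noised,                   // preallocated: batch x time x 2
  const xt::xtensor<double, 2> & control_sequence,   // time x 2 (vx, wz)
  double vx_std, double wz_std)
{
  const std::size_t batch = noised.shape(0);
  const std::size_t steps = noised.shape(1);

  // One in-place write per component; no temporary concatenated tensor.
  xt::view(noised, xt::all(), xt::all(), 0) =
    xt::view(control_sequence, xt::all(), 0) +
    xt::random::randn<double>({batch, steps}, 0.0, vx_std);
  xt::view(noised, xt::all(), xt::all(), 1) =
    xt::view(control_sequence, xt::all(), 1) +
    xt::random::randn<double>({batch, steps}, 0.0, wz_std);
}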

Looking at the path align critic, about 11% of the time is spent just evaluating std::hypot.

SteveMacenski commented 2 years ago

https://github.com/artofnothingness/mppic/pull/73 addresses path align

SteveMacenski commented 2 years ago

Path align buys us a 14% improvement. If we can make a real dent in the concatenates, that could get us nearly a 50% speed-up without any crazy changes to the algorithms or getting really into the weeds.

I also sat down baffled about what else is happening in the controller server, and I totally forgot for a moment that the controller server also processes the costmaps. That's a non-trivial amount of time as well, so the process isn't just the controller. With that said, before the changes proposed above, the controller took up 70% of the compute time of the process.

artofnothingness commented 2 years ago

I'll try to optimize those two, integrateStateVelocities and generateNoisedControls. Really great job profiling this stuff with valgrind.

SteveMacenski commented 2 years ago

Within the prefer forward critic, utils::normalize_angles takes about 9% CPU time now, since we're normalizing thetas to yaws.

Otherwise, the random noised control generation process and integrateStateVelocities continue to be the biggest contributors to CPU time. Noised control generation is mostly in the random sampling process, while integrateStateVelocities has its time spread out all over the function (2% here, 5% there, 3% over there), adding up to 20%. That seems like the lowest-hanging fruit if we can get creative and combine or optimize operations to streamline it (or use xsimd calls for parallel processing?).

SteveMacenski commented 1 year ago

@artofnothingness I don't suppose you looked at ways to optimize this performance?

SteveMacenski commented 1 year ago

I'm going to start prototyping a concurrent version where the noised trajectories for the next timestep are generated in parallel with the execution of the current iteration, using an extra thread. My thought is to use https://en.cppreference.com/w/cpp/thread/condition_variable with a dedicated thread to coordinate requests to create the noised trajectories. Adding them to the current states / controls will need to be done in the main control thread at the next iteration, though.

But I have a feeling I might only get ~10% out of this (context switching isn't free), which might not be worth the additional threading complexity if we can improve performance in the actual operations instead. But it could help up to about 25%.

artofnothingness commented 1 year ago

I don't suppose you looked at ways to optimize this performance?

I don't have many ideas for now.

By the way, a separate thread for noise generation is a good idea. I would say that even if the thread doesn't manage to evaluate new noises by the point we need them, we could take the old ones. So we request that the NoiseGenerator generate new noises, and if it hasn't finished evaluating them by then, we take the old ones. That would remove blocking noise generation from our pipeline entirely. We just need to synchronize the noise assignment with retrieving the noises.
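A minimal sketch of that pattern, assuming a hypothetical NoiseGenerator class (this is not the mppic implementation; the names, shapes, and exact synchronization would differ): a worker thread refills the noise buffer on request, and the control loop never blocks because it just takes whatever buffer is currently ready.

#include <array>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

class NoiseGenerator
{
public:
  NoiseGenerator(std::size_t batch, std::size_t steps, float std_dev)
  : std_dev_(std_dev), shape_{batch, steps}
  {
    noises_ = xt::random::randn<float>(shape_, 0.0f, std_dev_);
    worker_ = std::thread([this]() {workerLoop();});
  }

  ~NoiseGenerator()
  {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      stop_ = true;
      request_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called once per control iteration: take whatever buffer is ready (old or
  // new) and immediately request generation for the next iteration.
  xt::xtensor<float, 2> getNoises()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    xt::xtensor<float, 2> out = noises_;
    request_ = true;
    cv_.notify_one();
    return out;
  }

private:
  void workerLoop()
  {
    while (true) {
      std::unique_lock<std::mutex> lock(mtx_);
      cv_.wait(lock, [this]() {return request_ || stop_;});
      if (stop_) {return;}
      request_ = false;
      lock.unlock();

      // Sample outside the lock so the control thread is never blocked on it.
      xt::xtensor<float, 2> fresh = xt::random::randn<float>(shape_, 0.0f, std_dev_);

      lock.lock();
      noises_ = std::move(fresh);
    }
  }

  float std_dev_;
  std::array<std::size_t, 2> shape_;
  xt::xtensor<float, 2> noises_;
  std::thread worker_;
  std::mutex mtx_;
  std::condition_variable cv_;
  bool request_{false};
  bool stop_{false};
};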

SteveMacenski commented 1 year ago

I'm starting off by just seeing whether collapsing some functions improves things. Caches and scope are weird, and I've found that sometimes you can get an easy 5-10% back in complex classes just by reorganizing the code a bit.

I just basically collapsed everything possible into a few functions and saved about 5-10%. It's not great architecturally, but it's more performant. I'm going to pull back out of this and see where I can make the fewest changes for the most impact. Most iterations prior took ~22ms; now it's ~20ms.

Edit: even so, I don't think this is enough; moving on. The acceptable changes aren't the ones that really help.

artofnothingness commented 1 year ago

I just basically collapsed everything possible into a few functions and saved about 5-10%

You mean make fewer function calls? Maybe just mark them force-inline?

SteveMacenski commented 1 year ago

Perhaps. I'm trying the threading now. I think it could help a little bit, but I don't think we're desperate enough yet to be measuring / optimizing for +/- 1-2%. I think we still have a few big swings we can take first.

SteveMacenski commented 1 year ago

Ugh, actually now that I look at this, I don't think it's possible to break this out into another thread. generateNoisedControls depends on the optimal control_sequence_.data, which I was going to delay adding until the main thread, but we do actually need it for the next step, applyControlConstraints, to know if it exceeds the limit. We could add that to the main thread as well (it wasn't a huge CPU hog), but then we can't integrate the velocities at all without the control constraints applied.

The most we could do is generate the noised controls. That's not nothing, but it's about 10-15%. With context-switching overhead, we might capture 5-8% of that. It's an improvement, but not as much as I would have hoped. I wanted to get both integration and noise generation into the new thread, since together that's about 25%.

artofnothingness commented 1 year ago

I thought the main bottleneck was random noise generation and not the control_sequence_.data assignment in generateNoisedControls.

SteveMacenski commented 1 year ago

No, that's one of many bottlenecks in the generateNoisedTrajectories process. Generating the noised controls is one, integrating the velocities into poses is another, etc. The callgrind file calls this out if you open it with kcachegrind. The actual normal distribution is only about 6-8%, but the process of generating the noised controls is about 13%. So 5-6% is the addition / views / iterating to populate the normal distribution, I suppose.

Question: https://github.com/artofnothingness/mppic/blob/develop/src/optimizer.cpp#L187-L198 If I modify noises_ after state_.getControls() = control_sequence_.data + noises_;, is that going to mess with the state_.getControls() value (e.g. is there a reference there or is that a copy)? To parallel process, I'd need to be able to modify noises_ after it's added to the control sequence from the last iteration, to start preparing for the next. Given this is random noise, it's kind of hard to tell, so I figured I'd just ask.

It really looks more and more like it might be better to try another tensor library, if there's a faster one.

artofnothingness commented 1 year ago

state_.getControls() gives a reference to the part of the tensor in state; it doesn't relate to noises_, so you can change noises_ after the assignment.
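A quick standalone check of that copy-vs-reference question (a toy example, not project code): assigning an expression to a view evaluates it and writes the values into the destination tensor, so mutating noises_ afterwards leaves the stored result untouched.

#include <iostream>
#include <xtensor/xtensor.hpp>
#include <xtensor/xview.hpp>

int main()
{
  xt::xtensor<float, 2> state = xt::zeros<float>({2, 3});
  xt::xtensor<float, 2> control = xt::ones<float>({2, 3});
  xt::xtensor<float, 2> noises = xt::ones<float>({2, 3});

  // Analogous to state_.getControls() = control_sequence_.data + noises_;
  xt::view(state, xt::all(), xt::all()) = control + noises;

  noises.fill(100.0f);                    // modify the noises afterwards
  std::cout << state(0, 0) << std::endl;  // prints 2, not 101
}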

SteveMacenski commented 1 year ago

I haven't looked into it yet, but most of the time that we miss our loop rate, it's due to one odd cycle that runs long. I suspect that's the path handler when a new path is given, so that might be an area to consider for optimization.

SteveMacenski commented 1 year ago

Rather than randomly sampling from a normal distribution each iteration, what if we used shuffle to reorder existing samples? https://xtensor.readthedocs.io/en/latest/api/xrandom.html#random-shuffle-function-reference

I think we'd need a larger sample size to choose from, but I'm not sure if that would be realistically any faster. Worth a shot though.
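A minimal sketch of how that might look (a hypothetical NoisePool, not anything in the repo; it assumes xt::random::shuffle reorders along the first axis and that reusing reshuffled samples is statistically acceptable, which would need to be validated):

#include <cstddef>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>
#include <xtensor/xview.hpp>

struct NoisePool
{
  xt::xtensor<float, 3> pool_;  // pool_size x time_steps x 2, with pool_size >> batch_size

  // Generate a pool several times larger than one batch, up front.
  void refill(std::size_t pool_size, std::size_t time_steps, float std_dev)
  {
    pool_ = xt::random::randn<float>({pool_size, time_steps, std::size_t{2}}, 0.0f, std_dev);
  }

  // Each iteration: shuffle the rows (cheap) instead of redrawing (expensive),
  // then hand back the first batch_size rows as a view into the pool.
  auto sample(std::size_t batch_size)
  {
    xt::random::shuffle(pool_);
    return xt::view(pool_, xt::range(std::size_t{0}, batch_size), xt::all(), xt::all());
  }
};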

SteveMacenski commented 1 year ago

Adding a new thread for generating the noised controls makes a real impact. I haven't formally characterized it yet, but from the initial results it's good enough that I should. It's something in the ballpark of 15%. There's some extra work I need to do to make this reliable and to handle spawning / destroying the extra thread cleanly, but it's all pretty straightforward. The numbers hover below 20ms for the most part; from a small sampling it was ~16ms on average.

I'm still seeing periodic spikes, though, which make the controller miss the control rate. I'm not sure what that is right now. I've tracked it down to the optimize() function at the highest level so far (unsurprisingly). The spike occurs in both generateNoisedTrajectories and evalTrajectoriesScores simultaneously. For example, when it occurs:

Time distribution: generate noised trajectories / evaluate trajectories scores / update control sequence
Time distribution: 6.593988 / 7.205724 / 1.760187
Time distribution: 19.716556 / 15.716074 / 1.630075

I don't necessarily think yet that it's just process-related; I don't see it correlated with red-lining the CPU. When I run Nav2 without composition, to separate this server from the others into its own process, it only takes about 75% of a core, with only about half of my total CPU in use. So there's plenty of room. That makes me think this is not just random process noise; something is going on with either our logic or xtensor.

When I break down the spike within generateNoisedTrajectories, I see that the multithreading I added is definitely not the cause; its time to trigger and synchronize is uniform. It's applyControlConstraints / updateStateVelocities / integrateStateVelocities that all spike, though integrateStateVelocities spikes the most. To be fair, that's roughly proportional to it being the heaviest operation; everything spikes by about 3-5x evenly. Since integrateStateVelocities takes the most time, it makes sense that whatever is happening would hit it the hardest. I expect that if I analyzed evalTrajectoriesScores, I'd see the same phenomenon. I profiled Prepare, the path handler, the fallback function, and a few other things -- none of them spike at all, so it's not consistent across everything in the process; it's specifically the things that use xtensor heavily.

Any thoughts on this?

At this point, the main thing for performance is this spike. If we can remove that, then we're done once #82 is ready. The run-time with the multithreading and the compiler optimizations is enough to call it good. At steady state we could run at 40hz @ 2000 batches if it weren't for the spike. That's pretty exciting @padhupradheep!

Do you guys see with the current settings missing control loop rates relatively frequently (e.g., every few seconds)? It's not just me, right?

artofnothingness commented 1 year ago

I noticed the same.

SteveMacenski commented 1 year ago

I'd appreciate it if you could do some digging in parallel with me. It's a subtle issue, and more eyes on it could make the difference.

artofnothingness commented 1 year ago

@SteveMacenski Are you sure they always spike together? Did you try profiling the controller without costmap layer updates?

SteveMacenski commented 1 year ago

I'm only adding timers within MPPI itself, never around the costmaps or controller server code. The times I gave you are embedded within the Optimizer code so they only represent those changes. Plus, I've shown that only some parts of MPPI spike when a spike occurs; many just stay the same. It seems to be only within operations that use xtensor heavily - but that's not causation, just a general trend. I haven't been able to get deep enough into it to see which lines specifically are causing it.

For reference, my current configuration has only the static layer and inflation layer, so there's no active sensor processing. While the local costmap is a rolling costmap, so it's moving data around a little, that is a very efficient process, and when I did profile the system, MPPI took the vast majority of the CPU time, so I highly doubt that's the cause. The costmap updates also happen on a different thread. But I cannot dismiss it outright.

artofnothingness commented 1 year ago

[component_container_isolated-5] [INFO] [1657911129.172699758] [controller_server]: Noises / scores: 10 10 [ms]
[component_container_isolated-5] [INFO] [1657911129.272031240] [controller_server]: Noises / scores: 32 14 [ms]

From my observations, they don't always spike together.

artofnothingness commented 1 year ago

Gen noises 4 / Apply Constrs 0 / Update State Vels 12 / Integrate vels 3
Gen noises 3 / Apply Constrs 0 / Update State Vels 1 / Integrate vels 12
Gen noises 12 / Apply Constrs 0 / Update State Vels 1 / Integrate vels 3
Gen noises 4 / Apply Constrs 14 / Update State Vels 2 / Integrate vels 3

Absolutely random stuff is blowing up.

SteveMacenski commented 1 year ago

Huh, that's not what I notice with the #82 branch, but I haven't been trying it with anything else.

SteveMacenski commented 1 year ago

I did a test with the local costmap never updating, at a fixed size, without any non-initialized layers; it didn't change anything. I don't think the costmap is the cause. I think it's the controller.

I also tried removing the noise sampling entirely (e.g. sampling once and then reusing it for the rest of the controller iteration) and the CPU jitter still happens. So I'm pretty certain it's not from the normal distribution sampling / thread / noise generation process. I also removed the control constraints entirely, and still had the jitter.

It's really odd, because nearly every one of these operations runs on exactly the same size of data each time.

SteveMacenski commented 1 year ago

I'm wondering if part of the problem here is that we have one massive tensor for all the controls / states / etc. If we break the L2 cache, things get super slow, and that might only happen sometimes. I'm also thinking that using floats instead of doubles could help (speculative). I have a really hacky, incomplete prototype with float that seems to both speed things up and behave slightly better when the loop rate is exceeded. I'm seeing a few 10ms iterations for the first time, so it definitely seems good in general, whether or not it actually helps with the spiking issue (hey, I'll take performance improvements!?). That might point to the cache being the issue.

It might be worth trying to separate our data out into several smaller tensors in State. That's a little out of my depth. Is that something you can prototype @artofnothingness and see if it helps?

I pushed to branch optimization3 (view the changes relative to optimizations) with these changes, which also adds a few more xt::eval()s; that seems to help with the trig functions too. Take a look if you want, but keep in mind it's a work in progress and not everything that should be moved over to floats has been. It's 7:30pm on Friday and I need to walk away from my computer before my eyes melt into it :laughing:

But real progress this week. We started in the 20-22ms range and are ending in the mid-teens (16-ish, ignoring the spikes).

SteveMacenski commented 1 year ago

To chronicle what I tried today:

If you reduce the batch size, the spiking drops (e.g. a batch size of 500 takes 2-3ms regularly but occasionally pops up to 10-15ms). That is particularly interesting: the spiking seems to occur regardless of size, so it looks like something within xtensor, or our use of it, rather than the sheer size of things causing the issue.

Something I want to try is moving the state data into the optimizer class, removing as many of the class members as possible, and keeping as class members only the minimal amount required between runs.

I also read somewhere that xtensor's trig functions aren't greatly optimized in xsimd, so it might be worth implementing our own xfunctions for where we use trig and applying those instead to see if that helps. The current heaviest functions use trig, and I suppose that could explain things if there were something odd going on with the numerical evaluation of the trig functions. For instance, what if we don't handle wrap-around correctly in our angles? In particular, I'm wondering right now about:

  yaw = xt::cumsum(w * settings_.model_dt, 1) + initial_yaw;                           // current
  yaw = utils::normalize_angles(xt::cumsum(w * settings_.model_dt, 1) + initial_yaw);  // with normalization

Where we don't actually enforce any kind of standard on the data (e.g. this isn't -PI to PI). Maybe there are other places this happens and the un-normalized angles are problematic inside the trig functions? It seems not super likely, but it is a bug worth investigating. Edit: I added it; it didn't help or hurt. But it would be good to add anyway, since it's more correct.

If that doesn't work, I'm starting to run out of ideas beyond ripping out xtensor and trying another library, unless I'm missing something super simple. If the spike were happening in a particular part of the code, I could narrow in on that, but it appears basically everywhere, and if it's not caching / context switching / thread resource access, it seems like it might be a trait of xtensor. It's worth trying to figure out which APIs it's happening in (or whether it's really all of them) to see what options we have. But I can't seem to narrow it down to any particular kind of operation.

artofnothingness commented 1 year ago

It might be worth trying to separate our data out into several smaller tensors in State. That's a little out of my depth. Is that something you can prototype @artofnothingness and see if it helps?

Yeah, I did this. I thought I'd answer after the PR.

If that doesn't work, I'm starting to run out of ideas beyond ripping out xtensor and trying another library, unless I'm missing something super simple. If the spike were happening in a particular part of the code, I could narrow in on that, but it appears basically everywhere, and if it's not caching / context switching / thread resource access, it seems like it might be a trait of xtensor. It's worth trying to figure out which APIs it's happening in (or whether it's really all of them) to see what options we have. But I can't seem to narrow it down to any particular kind of operation.

Yeah, that looks strange to me.

SteveMacenski commented 1 year ago

Today I tried eliminating the noise, generated-trajectory, plan, and cost tensors between iterations - leaving just the control sequence (technically the only thing that needs to persist between iterations) and the state. The idea is to keep only the memory strictly needed within each scope, with the exception of the state, which is so embedded that it's very difficult to fully remove.

It may have helped, but it certainly didn't resolve the issue. It's hard to tell what "helps" given that it doesn't happen in a regular pattern. I'm going to assume it didn't really help.


So I went back to basics later in the afternoon and logged times after all the changes we've made recently. I see that the major spikes that result in missing the control time usually happen while the critics are being evaluated. The generation times vary from 2.5-10ms, but the evaluation times vary from 6-25ms. So on the iterations where we break the loop rate, it's probably from the evaluation of the critics, although both can contribute to the perfect storm.

I've spent a lot of time / energy on the generation times, but I'm now shifting to the critic times. Varying by 4x seems a bit extreme in either case.

The 2 critics that really go up are the Prefer Forward and Path Align critics (no shock there). The other critics' variance is so low, or their compute time such a small fraction of a millisecond, that it's hardly worth looking at them.

In Prefer Forward, the computation of yaws_local and data.costs are the 2 places where the compute time shows up. I assume that's because that's where some of the expressions are being evaluated. It's very odd, because everything prior to and including yaws_local is very similar to the path angle and goal angle critics, and neither of those uses much CPU at all, close to 0.1ms or less. Do you look at that and see anything obvious we could do? Something else that jumps out at me is the hypot, which we replaced with sqrt(x*x + y*y) elsewhere in the code because profiling showed it was super heavy. I'm also wondering if this relates to the fact that dx and dy are very small.

Maybe we should think of another way to evaluate Prefer Forward. For all I know, if we used the control velocities instead of subtracting the poses, that could be it :shrug:. Or we could find a way to evaluate less "stuff".
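For concreteness, a hypothetical sketch of a velocity-based Prefer Forward score (this is not the current critic; the function name, shapes, and the weight/power folding are assumptions): it penalizes commanded reverse motion directly, with no pose differencing and no trig.

#include <xtensor/xmath.hpp>
#include <xtensor/xtensor.hpp>

void scorePreferForwardFromVelocities(
  const xt::xtensor<float, 2> & vx,    // batch_size x time_steps
  xt::xtensor<float, 1> & costs,       // batch_size
  float weight, unsigned int power)
{
  // Magnitude of reverse motion at each (batch, time) sample; zero whenever
  // the command is already moving forward.
  auto reversing = xt::maximum(-vx, 0.0f);

  // Sum along the time axis and fold into the running critic costs.
  costs += xt::pow(weight * xt::sum(reversing, {1}), power);
}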

The 2 major conclusions I can draw:

SteveMacenski commented 1 year ago

I was able to vectorize the shortest angular distance function via

float shortest_angular_distance(float from, float to)
{
  auto theta = std::fmod(to - from + M_PI, 2.0 * M_PI);
  return theta <= 0.0 ? theta + M_PI : theta - M_PI;
}
...
auto vec_ang_dist = xt::vectorize(shortest_angular_distance);
auto yaws_local = xt::abs(vec_ang_dist(thetas, yaws));

Which then drops the time to evaluate yaws_local to near zero. But I think everything is now just evaluated at the final line, since there are no more explicit xt::eval() calls nor xtensor-typed returns. I still see jitter on the output. The critic usually takes 3-6ms, with spikes to 10-12ms, with the vectorized and utils version - not much of a difference.

SteveMacenski commented 1 year ago

What I'm thinking about right now is a special case for PreferForward. If we know that both the trajectory angles and the output of atan2 are in the range -PI to PI, then we can actually subtract them in PreferForward raw, without normalization, because the difference is only used in the cosine function, which does not care about wrap-around.

The issue I see, though, is that we have xt::abs(shortest_angular_distance(thetas, yaws)), where the abs() creates a bit of an issue. If we're feeding the result into the cosine function, is that abs() really necessary? I think not - cosine is an even function, so cos(|x|) == cos(x). That would actually save us a lot of headache if we could get around it.

Thoughts? That would non-trivially improve performance and remove one of the two jitter sources from the PreferForward critic. Still not sure where the other one is...

If we can get rid of as many trig functions in the critics as possible, that would be great, especially in Path Align and Prefer Forward; they are by far the heaviest operations. Each time we call a cos or atan2, that's about 1ms alone. Considering we're talking about run-times of ~15ms, that's by itself a worthy optimization point.

Edit: @artofnothingness oh! oh! I just put evals and prints after every single line in PreferForward, and when the spikes occur, it is deterministically during the evaluation of the trig functions. It goes from ~1ms to evaluate to 3-5ms (or more), per instance. That explains why we see it in both the optimizer and the critic evaluations. If we can get rid of some of them, that would be great, especially in PreferForward. It's possible that it's not due to the trig functions themselves but to their application across the entire tensor; it's impossible to say. I can say, though, that the only 2 code blocks that, when immediately evaluated, increased in CPU during spikes were the atan2 and the cos, and both increased largely. I suppose it could still be from cache-breaking with the size of the tensor, but that seems less likely given the previous experiments. Either way, priority (1) would be to remove anything we can, and priority (2) would be to approximate the ones we keep to reduce compute time.

All of the trig used in the project:

I wonder if we could replace the xt:: trig calls with vectorized versions of some faster approximate functions. It's not like we need more than, say, 3-4 decimal places, if that. Approximate values would be perfectly fine for the critics. I might opt not to do that in the Optimizer, if possible.
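As a rough sketch of that idea (nothing here is in the repo; the accuracy target and the specific approximation are assumptions), a cheap cosine approximation wrapped with xt::vectorize() so it can be applied to whole tensors like xt::cos:

#include <cmath>
#include <xtensor/xtensor.hpp>
#include <xtensor/xvectorize.hpp>

inline float approx_cos(float x)
{
  // Reuse the classic parabola approximation of sine: cos(x) = sin(x + pi/2).
  x += static_cast<float>(M_PI_2);
  // Wrap into [-pi, pi), where the approximation below is valid.
  x = std::fmod(x + static_cast<float>(M_PI), 2.0f * static_cast<float>(M_PI));
  if (x < 0.0f) {x += 2.0f * static_cast<float>(M_PI);}
  x -= static_cast<float>(M_PI);

  const float B = 4.0f / static_cast<float>(M_PI);
  const float C = -4.0f / (static_cast<float>(M_PI) * static_cast<float>(M_PI));
  float y = B * x + C * x * std::fabs(x);
  // Extra precision pass; max error is on the order of 1e-3.
  return 0.225f * (y * std::fabs(y) - y) + y;
}

// Usage inside a critic, as a drop-in for xt::cos(yaws):
//   auto fast_cos = xt::vectorize(approx_cos);
//   auto cos_yaws = fast_cos(yaws);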

Do you think it's possible to rewrite PreferForward without at least one of the trig functions, using other data we have available like the control velocities (e.g. if we know vx for an iteration is negative, that's obviously a reversing move)?

SteveMacenski commented 1 year ago

Sigh. I also sometimes see even updateControlSequence increase (not by a whole lot, but it still isn't independent of the effect), which contains no trig functions but does have other functions that iterate through the tensor to apply some operation.

I think this is a general issue with xtensor and iterating through its contents to apply a function efficiently. I don't see any difference if I remove the xsimd or OpenMP stuff from the CMakeLists, which is concerning to me - so who knows if that's being applied at all. I was thinking that maybe on the iterations where we see odd artifacts, xtensor is spinning up background threads via OpenMP, which would add some delay. But if there's no background thread being set up by xtensor, I have no clue why even applying functions over small tensors like costs_ would occasionally vary from 0.7ms to 4ms.

Working on minimizing the number of these functions, including the time trig spends per call, would still help, since that's still a major source of the jitter.

Any thoughts on this? Maybe someone else's debugging skills would lead to another conclusion or a route I didn't think to go down. I'm pretty much out of things to try to find the root cause. At this point, I have characterized some options to mitigate the worst impacts by reducing, as much as possible, the compute time spent in these kinds of operations, or by eliminating as many of them as possible, but I don't know how to outright solve it. This feels like a quirk of xtensor's vectorization of functions applied to a tensor, regardless of what they are: the heavier the function, the more substantial the jitter, potentially from extremely low-level operations / prioritization. Tensor assignments / views / simple applied functions don't seem to have a problem.

Interestingly, if I change the controller server to run while-true, without loop-rate sleeps for a consistent rate, the overall run time of the controller is more stable, within a narrower band of execution times - and when it does spike, it only does so by about 2x, not the usual 3-5x we see. In fact, I never see anything above 25ms, whereas 30-40ms is pretty typical for the current spike numbers. The runtime is pretty well clustered in the 9-13ms range with occasional jumps to around 16-20ms (subtract 2ms from those numbers when noises are generated in another thread). It never misses the loop rate once - totally acceptable behavior if we could get that out of the box.

I have no idea why that would be, but maybe that points to the fact that I've been looking in the wrong place.

artofnothingness commented 1 year ago

As I said before, absolutely random stuff can blow up. Something we could do is take some function that definitely spikes and implement it in pure C++ / Eigen and see if that helps.

artofnothingness commented 1 year ago

Do you think it's possible to rewrite PreferForward without at least one of the trig functions, using other data we have available like the control velocities (e.g. if we know vx for an iteration is negative, that's obviously a reversing move)?

I took it almost as-is from the FastSense implementation. I think we could optimize it.

SteveMacenski commented 1 year ago

You were right; I was just convinced that it was something in our control. I still think there's something we could do, but this is starting to feel like a rabbit hole, and perhaps pivoting to reducing the computation for the "type" of function that causes the problem is the better move.

I have 2 last things I want to try:

Otherwise, I think we're on the same page to optimize things. The areas of major help would be:

artofnothingness commented 1 year ago

I've just implemented the integration part in pure C++:

void Optimizer::integrateStateVelocities(
  xt::xtensor<float, 3> & trajectories,
  const models::State & state) const
{
  using matrix = std::vector<std::vector<float>>;

  auto start = std::chrono::high_resolution_clock::now();
  using namespace xt::placeholders;  // NOLINT

  auto & w = state.wz;
  double initial_yaw = tf2::getYaw(state_.pose.pose.orientation);

  auto yaws = matrix(settings_.batch_size, std::vector<float>(settings_.time_steps));

  for (size_t i = 0; i < settings_.batch_size; ++i) {
    yaws[i][0] = initial_yaw;
    for (size_t j = 1; j < settings_.time_steps; ++j) {
      yaws[i][j] = yaws[i][j - 1]  + w(i, j - 1) * settings_.model_dt;
    }
  }

  matrix yaw_cos = yaws;
  matrix yaw_sin = yaws;

  for (size_t i = 0; i < settings_.batch_size; ++i) {
    std::transform(yaw_cos[i].begin(), yaw_cos[i].end(), yaw_cos[i].begin(),
                    [](float angle) { return std::cos(angle); });

    std::transform(yaw_sin[i].begin(), yaw_sin[i].end(), yaw_sin[i].begin(),
                    [](float angle) { return std::sin(angle); });
  }

  auto & vx = state.vx;

  auto dx = matrix(settings_.batch_size, std::vector<float>(settings_.time_steps));
  auto dy = matrix(settings_.batch_size, std::vector<float>(settings_.time_steps));

  for (size_t i = 0; i < settings_.batch_size; ++i) {
    for (size_t j = 0; j < settings_.time_steps; ++j) {
      dx[i][j] = vx(i, j) * yaw_cos[i][j];
      dy[i][j] = vx(i, j) * yaw_sin[i][j];
    }
  }

  for (size_t i = 0; i < settings_.batch_size; ++i) {
    trajectories(i, 0, 0) = state_.pose.pose.position.x + dx[i][0] * settings_.model_dt;
    trajectories(i, 0, 1) = state_.pose.pose.position.y + dy[i][0] * settings_.model_dt;
    for (size_t j = 1; j < settings_.time_steps; ++j) {
      trajectories(i, j, 0)= trajectories(i, j - 1, 0) + dx[i][j] * settings_.model_dt;
      trajectories(i, j, 1) = trajectories(i, j - 1, 1) + dy[i][j] * settings_.model_dt;
    }
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  RCLCPP_INFO(logger_, "Integrate inside loop execution time: %ld [ms]", duration);
}

Mostly it's 1-2 ms:
Integrate inside loop execution time: 1 [ms]

But sometimes it blows up too:
Integrate inside loop execution time: 11 [ms]

I came to the conclusion that it's not an xtensor issue. Maybe we just allocate too much big stuff too frequently and break the L2 cache, as you said.

artofnothingness commented 1 year ago

But it blows up very rarely tbh

SteveMacenski commented 1 year ago

Got it - I'd be curious what happens if you separate out the tensors for the trajectory into its components x, y, w, like we do for the velocities and controls. I'm wondering whether applying the operations to separate data structures, potentially touching only the dimensions of interest, would avoid busting the cache.

It could be that we're not actually busting the L2 cache due to the size of the objects, but instead busting the data-structure caches due to the amount of stuff that has to be in memory to run an operation while hopping around the tensors. That would make sense of why we still see jitter when I reduce the batch size to a mere 500 - it is just reduced proportionally to the number of memory-cache breaks for the data structure.

I'd be really curious whether, if we separated them and used your snippet above, we'd still see it in that portion of the code. At that point, those tensors really wouldn't be all that large (2000 x 30), so busting the L2 cache would seem an unlikely cause.
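A minimal sketch of what that separation might look like (names are illustrative, not the actual mppic types): one 2D float tensor per component, so the integration loops above only ever touch the component they need.

#include <cstddef>
#include <xtensor/xtensor.hpp>

namespace models
{
struct TrajectoriesSeparated
{
  xt::xtensor<float, 2> x;     // batch_size x time_steps
  xt::xtensor<float, 2> y;     // batch_size x time_steps
  xt::xtensor<float, 2> yaws;  // batch_size x time_steps

  void reset(std::size_t batch_size, std::size_t time_steps)
  {
    x = xt::zeros<float>({batch_size, time_steps});
    y = xt::zeros<float>({batch_size, time_steps});
    yaws = xt::zeros<float>({batch_size, time_steps});
  }
};
}  // namespace models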

It was my plan to do that today, but I've gotten deeply derailed due to updates in other projects I need to attend to. That's my last silver-bullet solution.

SteveMacenski commented 1 year ago

@artofnothingness I just pushed a branch https://github.com/artofnothingness/mppic/tree/sep_ten_traj which sets the foundations for separating out the tensor for the trajectories (see diff w.r.t. opt4 PR branch).

I left some TODOs where some more xtensor specific work needed to be done, but I wanted to give you my starting point. Could you finish this off?

SteveMacenski commented 1 year ago

Another optimization point would be to make the body of Path Align vectorized, since we apply a double for loop for each of the trajectories in the batch. It would be another case where multithreading could help if we maintained some threads for it, but wouldn't help with the actual jitter since we're not reducing the computation. It would just make that particular critic ~30% lower I'd estimate.
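For reference, a hedged sketch of what the multithreaded alternative could look like (the function names and the trivial scoring stub are placeholders, not mppic APIs): stripe the batch across a few workers so each trajectory's score is computed independently.

#include <cstddef>
#include <thread>
#include <vector>
#include <xtensor/xtensor.hpp>
#include <xtensor/xview.hpp>

// Stand-in for the real per-trajectory math (e.g. summed distances to the path).
template <typename Trajectory>
float scoreTrajectory(const Trajectory & /*trajectory*/)
{
  return 0.0f;
}

void scoreBatchInParallel(
  const xt::xtensor<float, 3> & trajectories, xt::xtensor<float, 1> & costs,
  unsigned int n_threads = 4)
{
  const std::size_t batch = trajectories.shape(0);
  std::vector<std::thread> workers;

  for (unsigned int t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t]() {
        // Strided partition: each worker writes to disjoint cost indices,
        // so no synchronization is required.
        for (std::size_t i = t; i < batch; i += n_threads) {
          costs(i) += scoreTrajectory(xt::view(trajectories, i, xt::all(), xt::all()));
        }
      });
  }
  for (auto & worker : workers) {worker.join();}
}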

Let me know what you think. I can start on a multithreading solution tomorrow, unless you think this is better served as an xtensor vectorized function. I think vectorization is a better place to start, personally; fewer threads is always good.

That and the Prefer Forward critic (fixed in #86) are the 2 critics that have a lot of jitter. Otherwise, it's in the trajectory generation itself, where I'm not sure I see many opportunities to improve things other than implementing approximate functions for sin/cos. It's very possible that just improving the 2 critics would be 'enough' (if separating the trajectory axes doesn't help).

artofnothingness commented 1 year ago

I left some TODOs where some more xtensor specific work needed to be done, but I wanted to give you my starting point. Could you finish this off?

Yeah, I could do this.

Another optimization point would be to make the body of Path Align vectorized, since we apply a double for loop for each of the trajectories in the batch. It would be another case where multithreading could help if we maintained some threads for it, but wouldn't help with the actual jitter since we're not reducing the computation. It would just make that particular critic ~30% lower I'd estimate.

I think we should stick to single-threaded solutions until we reduce the jitter problem. More threads, less predictable behavior, IMHO.

artofnothingness commented 1 year ago

https://github.com/artofnothingness/mppic/pull/88