Improve Beluga vs Nav2 benchmark

hidmic commented 1 year ago

Proposed changes

This patch augments beluga_vs_nav2 benchmarks to be a superset of those offered by beluga_benchmark. Depends on #75 (and up the PR dependency chain).

Type of change

[ ] 🐛 Bugfix (change which fixes an issue)
[x] 🚀 Feature (change which adds functionality)
[ ] 📚 Documentation (change which fixes or extends documentation)

Checklist

[x] Lint and unit tests (if any) pass locally with my changes
[x] I have added tests that prove my fix is effective or that my feature works
[x] I have added necessary documentation (if appropriate)

Additional comments

hidmic commented 1 year ago

@glpuga @nahueespinosa check report.pdf for the first draft of this work's output. This was generated with a single iteration per parameter variation (thus why some plots are incomplete). I also have to figure out what's going with Nav2 AMCL, results are unusually bad.

hidmic commented 1 year ago

Well, this is interesting. No matter what I do to AMCL configuration, with current Magazino datasets and maps (ie. the ones I regenerated using Cartographer), Beluga AMCL works just fine but Nav2 AMCL either diverges horribly (when using the beam model) or flat out crashes (when using the likelihood field model):

all_trajectories

It's the maps, clearly. I wonder why though.

hidmic commented 1 year ago

Late night update: it's not the maps. Crashes are yet another instance of https://github.com/Ekumen-OS/beluga/issues/253, so I just disabled recovery. Issues with Nav2 AMCL beam model, on the other hand, are real. Estimates go out the window very quickly. We had not seen this before because https://github.com/ekumenlabs/lambkin/pull/66 used likelihood fields for all datasets. Beluga AMCL has none of these issues, for the exact same configuration.

glpuga commented 1 year ago

Late night update: it's not the maps. Crashes are yet another instance of Ekumen-OS/beluga#253, so I just disabled recovery. Issues with Nav2 AMCL beam model, on the other hand, are real. Estimates go out the window very quickly. We had not seen this before because #66 used likelihood fields for all datasets. Beluga AMCL has none of these issues, for the exact same configuration.

I had the same issue getting the track plot. In my case I added the HEAD nav2_amcl source to the workspace to get the measurements going without changing the default configuration.

hidmic commented 1 year ago

I added the HEAD nav2_amcl source

HEAD of humble or HEAD of main?

glpuga commented 1 year ago

I added the HEAD nav2_amcl source

HEAD of humble or HEAD of main?

Head of main. Head of humble does not have the fix for this issue.

glpuga commented 1 year ago

Of course, the best LT solution would be for us to offer the missing PR for humble. It's about the third time we get bitten by this issue.

For easier future reference:

This is the original issue: https://github.com/ros-planning/navigation2/issues/3311
And this is the only PR I've seen about this: https://github.com/ros-planning/navigation2/pull/3315

PS: There was also one for Foxy, but it was reject on proper processing grounds and never re-made: https://github.com/ros-planning/navigation2/pull/3314

hidmic commented 1 year ago

Head of main. Head of humble does not have the fix for this issue.

Brought in HEAD of main, Crashes when adaptive recovery is enabled are gone. Beam model is still unstable. I'll give it a try with another dataset, just to make sure it isn't something peculiar about Magazino data.

glpuga commented 1 year ago

Head of main. Head of humble does not have the fix for this issue.

Brought in HEAD of main, Crashes when adaptive recovery is enabled are gone. Beam model is still unstable. I'll give it a try with another dataset, just to make sure it isn't something peculiar about Magazino data.

That's strange. I noticed no instability with the beam model in my own runs.

hidmic commented 1 year ago

Yeah. Tried nav2_amcl beam model on the HQ4 mapping capture dataset using the gonbuki stack. No issues, despite the initial pose being off. Something in the Magazino data? Or maybe the nature of the environment (an alley)?

glpuga commented 1 year ago

Yeah. Tried nav2_amcl beam model on the HQ4 mapping capture dataset using the gonbuki stack. No issues, despite the initial pose being off. Something in the Magazino data? Or maybe the nature of the environment (an alley)?

If it were the likelihood map, I'd tell you to check that the occupancy grid does not have wide walls. The map should have been created through an actual mapping process. "Synthetic" maps such as this one render very lousy results, highly dependent on the actual implementation.

hidmic commented 1 year ago

The map should have been created through an actual mapping process.

It was, using Cartographer ROS offline. I'll continue investigating.

hidmic commented 1 year ago

Problem found. Laser scans in Magazino datasets carry some NaN ranges. All sensor models except Nav2 AMCL beam sensor model handle this by ignoring NaN ranges. Nav2 AMCL beam model ends up with NaN weights :roll_eyes:

Handling NaNs in Nav2 AMCL fixes the issue.

hidmic commented 1 year ago

Second report.pdf. It doesn't look as good as the perfect odometry case (for us). The sweep over particle count is odd though. There may be something wrong with the rig setup.

glpuga commented 1 year ago

Problem found. Laser scans in Magazino datasets carry some NaN ranges. All sensor models except Nav2 AMCL beam sensor model handle this by ignoring NaN ranges. Nav2 AMCL beam model ends up with NaN weights 🙄

Handling NaNs in Nav2 AMCL fixes the issue.

Great finding! Good opportunity to create a ticket back in amcl and patch a few bridges, too.

glpuga commented 1 year ago

Second report.pdf. It doesn't look as good as the perfect odometry case (for us). The sweep over particle count is odd though. There may be something wrong with the rig setup.

Looks great to me. I don't really think the perfect odometry dataset with fixed particle counts is really representative of more general behavior.

A few comments:

For the paper I restricted the particle counts to the 250-5000 range, with new datapoints at 2k, 3k, 4k on top of the previous ones in that range. The 3k+ range does not really speak of the performance of a 2D localization algorithm. Performance becomes flat after 1k.
I'm also not including any data about the parallel mode. It has the same error performance; latency, the only difference, cannot be compared against amcl.
(Figure numbers in the resport would make discussing them in more detail easier)
(Markers (dots, cross, triangle, etc) at samples too, since we are not plotting cont data)
(in the APE and RPE plots, error bars are maybe more clear instead of bands; also, I'm assuming those are 1 std)
(vertical scales should start at 0; both APE plots are more similar than they look at first sight)

TBC later.

glpuga commented 1 year ago

Second report.pdf. It doesn't look as good as the perfect odometry case (for us). The sweep over particle count is odd though. There may be something wrong with the rig setup.

A few more comments and questions:

In section 3.1, is the particle count fixed in 2k, or is it allowed to adapt? For the last section of the paper I think we are mostly interested in the case where the configuration is default (including adaptive particle count), and only the sensor model in use is allowed to change.
For "Localization performance w.r.t. map" (page 8) plots, it's better to substract the reference and to plot the error over time to provent the deviations from being drown in mean value.
The plots in page 9 have some problem with the "likelihood_field, seq" plot; it's flat in all plots.
For the cpu plots in page 10, are we using the very latest version of beluga for this comparison? I'd expect the likelihood config to have beluga > nav2 cpu usage, and for beam about the same with intervals where one tops the other and viceversa.
It's strange that cpu usage and RSS profiles remain flat while the particle count is changinng 2 orders of magnitude. See the plots at https://github.com/Ekumen-OS/beluga/tree/main/beluga_amcl#performance for reference.
Similarly, I'd expect a marked difference in the APE performance below 500 particles and above 1000. Is it possible we are not really changing the yaml configuration? This happened in the repo when the launchfiles were incorrectly providing the path to the file, and the node was silently just loading the default values always. https://github.com/Ekumen-OS/beluga/pull/255/files#diff-9c5e25038d1febb8b2f312b037706b152594d78abea7732f31fd7d14ba52d0b0R2

hidmic commented 1 year ago

is the particle count fixed in 2k, or is it allowed to adapt?

Fixed.

are we using the very latest version of beluga for this comparison?

We are.

It's strange that cpu usage and RSS profiles remain flat while the particle count is changing 2 orders of magnitude.

Indeed. There are several numbers that I can't make any sense of. Have to dig in. As for the report, I plan to stick to the comprehensive one (already written, less work upfront). We can build another custom benchmark to get the exact data and plots we need for the paper right after.

hidmic commented 1 year ago

Found the problem deep in our RobotFramework machinery for parameterized benchmarks. I need morning enlightenment to fix it.

hidmic commented 1 year ago

Third try report.pdf. Numbers do make some sense now. I had to cap the number of particles to 5k because single-threaded beam models would take seconds to run per scan for larger filters.

I will adapt the report to fit review above, then merge PRs, then extract the data we need for the paper. @glpuga what did you have in mind for the paper exactly? I recall there isn't that much space left.

glpuga commented 1 year ago

I will adapt the report to fit review above, then merge PRs, then extract the data we need for the paper. @glpuga what did you have in mind for the paper exactly? I recall there isn't that much space left.

With the space left I think what we can aim at is a table like table 3 in the most recent version.

The experimental conditions would be similar to those in section 3.2: default configuration unless needed to adapt to the dataset (please specify the changes in that case) including adaptive particle count in the default range.

The data to gather ideally, would be for:

both datasets (hallway_return and hallway_localization)
both sensor models (likelihood and beam).

And the data itself would be APE rmse, mean, std and max (notice that the Table 3 currently has median instead of std, but I'll change that).

hidmic commented 1 year ago

(Figure numbers in the resport would make discussing them in more detail easier)

Tried and failed to do this with Sphinx. A good argument in favor of raw Latex reports support (someday). Something like pythontex can be used to embed executable Python code.

For "Localization performance w.r.t. map" (page 8) plots, it's better to substract the reference and to plot the error over time to provent the deviations from being drown in mean value.

That is what APE is (well, there is a modulus operation in there too), so it's right above in plots. We could remove the mean for the plot right after, the one that depicts x, y, yaw coordinates separately. It's not trivial to do that using Pandas though. We would need to resample to ensure time alignment. I'll add a utility to lambkin-shepherd to do just that, but I don't want to block this on it now.

Everything else I think I addressed in report.pdf. If you can take a look at #79 and this, that'd be great. If not, that's cool too. I'll merge tomorrow.

On to ERF paper updates.

hidmic commented 1 year ago

And the data itself would be APE rmse, mean, std and max (notice that the Table 3 currently has median instead of std, but I'll change that).

~Point estimates? Or is it OK to add confidence intervals?~ Nevermind, I'll recompute stats across multiple runs.

glpuga commented 1 year ago

~Point estimates? Or is it OK to add confidence intervals?~ Nevermind, I'll recompute stats across multiple runs.

For Table 3 I used the std provided by evo. It's the sample std (fully describes the sample distribution along with the mean), so it's not the same as the confidence interval, since evo does not give me that value out of the box.

Notice that we don't need multiple runs to estimate confidence intervals; we can use https://en.wikipedia.org/wiki/Bootstrapping_(statistics) instead using the individual APE samples.

glpuga commented 1 year ago

If we use confidence intervals, we should use them in previous sections as well.

glpuga commented 1 year ago

If we use confidence intervals, we should use them in previous sections as well.

thinking this a bit more, I think we are ok with the std provided by evo (sample std) for the table we can fit in the paper (see latest version)

Screenshot from 2023-10-29 11-54-39

Confidence intervals are better reported in plots, which we don't really have a lot of space for. I'll more clearly state that Figure 1 reports APE mean and std and not mean and confidence interval for the mean, though.

PS: the fact that 95% conf intervals are quite narrow around the central value in the report probably also make it make sense to only report the central value for the sake of space.

glpuga commented 1 year ago

It would be a nice feature from evo to provide the bootstrap confidence range, though, but it's for another day.

https://github.com/MichaelGrupp/evo/blob/master/evo/core/metrics.py#L138

glpuga commented 1 year ago

~Point estimates? Or is it OK to add confidence intervals?~ Nevermind, I'll recompute stats across multiple runs.

For Table 3 I used the std provided by evo. It's the sample std (fully describes the sample distribution along with the mean), so it's not the same as the confidence interval, since evo does not give me that value out of the box.

Notice that we don't need multiple runs to estimate confidence intervals; we can use https://en.wikipedia.org/wiki/Bootstrapping_(statistics) instead using the individual APE samples.

Nevermind, I see you used bootstrapped confidence.

Report LGTM!

glpuga commented 1 year ago

nit:

In the following, this report sweeps over parameterizations to compare the resulting performance. Error bars represent bootstrapped 95% confidence intervals.

What's the statistic being reported with confidence interval in 3.2? is it the mean APE or the root mean squared value APE?

hidmic commented 1 year ago

Seaborn says:

By default, the plot aggregates over multiple y values at each value of x and shows an estimate of the central tendency and a confidence interval for that estimate.

so I'd assume mean value.

hidmic commented 1 year ago

Problem found. Laser scans in Magazino datasets carry some NaN ranges. All sensor models except Nav2 AMCL beam sensor model handle this by ignoring NaN ranges. Nav2 AMCL beam model ends up with NaN weights 🙄

https://github.com/ros-planning/navigation2/pull/3929

hidmic commented 1 year ago

Other than that, package README remains accurate so I'll go ahead and merge.

Ekumen-OS / lambkin