Closed: nahueespinosa closed this issue 1 year ago.
Current results with timemory after https://github.com/ekumenlabs/beluga/commit/17e95ba3f688c4297b77826580a4d78cb72ece77:
$ ros2 launch beluga_example example_rosbag_launch.py prefix:='timem --'
...
[timem-5] [/ws/install/beluga_amcl/lib/beluga_amcl/amcl_node]> Measurement totals:
[timem-5] 33.124544 sec wall
[timem-5] 48.830000 sec user
[timem-5] 0.690000 sec sys
[timem-5] 49.520000 sec cpu
[timem-5] 149.495704 % cpu_util
[timem-5] 33.700000 MB peak_rss
[timem-5] 33.968128 MB page_rss
[timem-5] 1364.967424 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4593 minor_page_flts
[timem-5] 39839 prio_cxt_swch
[timem-5] 72632 vol_cxt_swch
[timem-5] 0.310466 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.036408 MB char_written
[timem-5] 0.016384 MB bytes_written
$ ros2 launch beluga_example example_rosbag_launch.py package:=nav2_amcl node:=amcl prefix:='timem --'
...
[timem-5] [/opt/ros/humble/lib/nav2_amcl/amcl]> Measurement totals:
[timem-5] 33.120430 sec wall
[timem-5] 2.190000 sec user
[timem-5] 0.540000 sec sys
[timem-5] 2.730000 sec cpu
[timem-5] 8.242598 % cpu_util
[timem-5] 35.164000 MB peak_rss
[timem-5] 35.176448 MB page_rss
[timem-5] 901.443584 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4125 minor_page_flts
[timem-5] 237 prio_cxt_swch
[timem-5] 70541 vol_cxt_swch
[timem-5] 0.185864 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.003188 MB char_written
[timem-5] 0.000000 MB bytes_written
CPU usage is a bit high :sweat_smile:. I think some of this is expected since we're using multiple threads to update the filter. This is a good starting point for a discussion and definitely worth investigating.
CPU usage is a bit high
One of the main differences between the packages is that nav2_amcl is decimating the laser scan beams before processing them. #84 will add that feature to beluga in order to continue our analysis.
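For reference, a minimal sketch of the kind of decimation being discussed (keeping roughly every N-th range reading); the function name and types are only illustrative, not nav2_amcl's or Beluga's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: keep roughly `max_beams` evenly spaced readings from a
// full scan and skip the rest, similar in spirit to what nav2_amcl does
// before evaluating the measurement model.
std::vector<float> decimate_scan(const std::vector<float>& ranges, std::size_t max_beams) {
  std::vector<float> decimated;
  if (ranges.empty() || max_beams == 0) {
    return decimated;
  }
  const std::size_t step = std::max<std::size_t>(1, ranges.size() / max_beams);
  for (std::size_t i = 0; i < ranges.size(); i += step) {
    decimated.push_back(ranges[i]);
  }
  return decimated;
}
```

In nav2_amcl the step comes from its max_beams parameter; the exact indexing differs, but the effect is the same idea: fewer beams per measurement update, hence less CPU.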
@nahueespinosa did you pass --cmake-args -DCMAKE_BUILD_TYPE=Release to colcon? I notice a significant difference between using it and not.
The default is the CMake None build type, which is different from Debug, Release, and RelWithDebInfo. I'm not sure why it's the default, as it's never a good one to use.
I also tested with https://github.com/ekumenlabs/beluga/pull/84 merged, the performance is now similar to amcl:
Beluga:
[timem-5] [/ws/install/beluga_amcl/lib/beluga_amcl/amcl_node]> Measurement totals:
[timem-5] 33.231538 sec wall
[timem-5] 6.670000 sec user
[timem-5] 1.170000 sec sys
[timem-5] 7.840000 sec cpu
[timem-5] 23.591892 % cpu_util
[timem-5] 32.788000 MB peak_rss
[timem-5] 33.140736 MB page_rss
[timem-5] 1060.876288 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4464 minor_page_flts
[timem-5] 91538 prio_cxt_swch
[timem-5] 77126 vol_cxt_swch
[timem-5] 0.312198 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.026644 MB char_written
[timem-5] 0.016384 MB bytes_written
[timem-5]
[INFO] [timem-5]: process has finished cleanly [pid 12155]
Nav2:
[timem-5] [/opt/ros/humble/lib/nav2_amcl/amcl]> Measurement totals:
[timem-5] 33.241450 sec wall
[timem-5] 4.180000 sec user
[timem-5] 0.670000 sec sys
[timem-5] 4.850000 sec cpu
[timem-5] 14.590110 % cpu_util
[timem-5] 34.856000 MB peak_rss
[timem-5] 34.787328 MB page_rss
[timem-5] 901.599232 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4142 minor_page_flts
[timem-5] 475 prio_cxt_swch
[timem-5] 69618 vol_cxt_swch
[timem-5] 0.185900 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.006208 MB char_written
[timem-5] 0.008192 MB bytes_written
[timem-5]
[INFO] [timem-5]: process has finished cleanly [pid 12259]
@ivanpauno We try to set the build type in beluga if it's not set externally:
But beluga_amcl doesn't have those lines, so that might explain things :thinking:.
@ivanpauno We try to set the build type in beluga if it's not set externally:
That seems fine. Locally or for CI, a colcon defaults file can be used for the same purpose.
But beluga_amcl doesn't have those lines, so that might explain things :thinking:.
Yes, and beluga is completely templated ...
I'm working on setting up perf in the docker container to run profiling.
I have done that in the past; I don't remember exactly, but there are some tricks to make it work with Docker.
I will share instructions when it's ready.
BTW, if anyone has a different profiling tool to suggest that's also fine.
I ran a few tests on the latest commit https://github.com/ekumenlabs/beluga/commit/15d4caed30bfaa726324917a1e88205f26aabaf4 changing the particle count (min_particles = max_particles).
nav2_amcl uses less CPU for now, but beluga_amcl scales really well. We can handle 100,000 particles with enough CPU cores.
Disclaimer: This does not count as a proper benchmark yet.
I was able to record CPU events using perf and generate a flamegraph.
I will create a PR with instructions to reproduce soon.
Basically, I'm running the beluga_amcl bag example and profiling that.
The resulting flamegraph is here. To see the interactive version (zoom in/out, etc.), download the SVG (open the link and save it locally) and re-open it in a web browser (GitHub disables running the SVG's scripts).
Some comments to make it easier to read: the libtbb.so call stack is the offloading caused by using std::execution::par. I haven't processed this part in detail yet; it's probably easier to analyze using std::execution::seq.
Another thing I want to test in more detail is how much it's worth using std::execution::par.
For the default settings, my testing shows that the sequential executor is faster.
Though that's expected for a relatively low number of particles, I would expect the parallel executor to scale better for larger numbers. BTW, I'm not sure how much that's worth, i.e. whether that scalability is actually ever used in practice. I don't have an actual answer to that.
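To make the comparison concrete, here's a self-contained sketch of the policy switch in question; the Particle struct and the weight update are placeholders, not Beluga's actual types:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Placeholder particle type, not Beluga's actual representation.
struct Particle {
  double x{0.0}, y{0.0}, theta{0.0};
  double weight{1.0};
};

// Hypothetical importance-weight update, shown only to illustrate the
// std::execution::seq vs std::execution::par switch.
void reweight(std::vector<Particle>& particles, bool parallel) {
  auto update = [](Particle& p) {
    // Stand-in for the measurement model likelihood evaluation.
    p.weight *= 1.0;
  };
  if (parallel) {
    std::for_each(std::execution::par, particles.begin(), particles.end(), update);
  } else {
    std::for_each(std::execution::seq, particles.begin(), particles.end(), update);
  }
}
```

With libstdc++, the parallel policy is implemented on top of TBB, which is why libtbb.so dominates that part of the call stack in the flamegraph.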
To compare the resulting trajectories, I saw that evo was being used in lambkin. It seems that that library doesn't check the pose covariance, so it doesn't seem to be a good idea to start comparing poses until covariance values are below some threshold (a rough sketch is at the end of this comment). Was lambkin considering something like that?
We could also start by giving an initial pose to amcl to avoid that problem.
It may also be a good idea to compare how much time it takes to get correctly localized without a hint, but I guess that varies a lot, so we would need to repeat the experiment many times.
@nahueespinosa any other thing you see valuable as metrics to compare localization "accuracy"?
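As a rough sketch of the covariance gating mentioned above (thresholds, the helper name and the choice of the standard geometry_msgs message are all just illustrative):

```cpp
#include <geometry_msgs/msg/pose_with_covariance_stamped.hpp>

// The 6x6 row-major covariance uses index 0 for var(x), 7 for var(y) and
// 35 for var(yaw). Thresholds here are arbitrary placeholders.
bool is_converged(
  const geometry_msgs::msg::PoseWithCovarianceStamped & estimate,
  double position_variance_threshold = 0.25,
  double yaw_variance_threshold = 0.1)
{
  const auto & cov = estimate.pose.covariance;
  return cov[0] < position_variance_threshold &&
         cov[7] < position_variance_threshold &&
         cov[35] < yaw_variance_threshold;
}
```

The idea would be to only start feeding pose pairs to evo once this returns true.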
@ivanpauno Those are all good points.
It seems that that library doesn't check the pose covariance, so it doesn't seem to be a good idea to start comparing poses until covariance values are below some threshold. Was lambkin considering something like that?
Not really, I think that giving an initial pose would be the best solution. Loading the initial pose from a parameter is the last step to complete #41; I've left it open in case a new contributor wanted to take on an easy task, but it looks like it's now needed for proper comparison.
It may also be a good idea to compare how much time it takes to get correctly localized without a hint, but I guess that varies a lot so we would need to repeat many times.
Yeah, that'd be interesting to see; it would also be nice to know how that time changes with the number of particles.
Note that nav2_amcl doesn't publish the estimate unless it was initialized with a known pose, so there's not much to compare there. See nav2_amcl/src/amcl_node.cpp#L922-L931.
any other thing you see valuable as metrics to compare localization "accuracy"?
evo's APE and RPE are the only ones that come to mind.
It seems that that library doesn't check the pose covariance, so it doesn't seem to be a good idea to start comparing poses until covariance values are below some threshold.
That's a good point nonetheless. I think both APE and RPE are defined in terms of point estimates (e.g. distribution mean for a particle filter) because it's simple and many systems downstream will just go along with those point estimates (e.g. classical planning algorithms ignore pose estimate covariance).
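For reference, the usual point-estimate formulations (going from memory of evo's documentation, so take the exact alignment term as an assumption): with $P_i$ the estimated pose, $Q_i$ the reference pose, $S$ an optional alignment transform and $\Delta$ a fixed index offset,

$$E^{\mathrm{APE}}_i = Q_i^{-1} S P_i, \qquad E^{\mathrm{RPE}}_i = \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1} \left(P_i^{-1} P_{i+\Delta}\right)$$

The reported metric is then a statistic (e.g. RMSE) over the translational or rotational parts of these errors, so the covariance never enters.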
That doesn't mean we can't do better, e.g. we could do probability density estimation for those errors and have measures of dispersion and confidence.
I think the only missing thing is to create a nice report from the data we have. I'm not sure if we want to leave this ticket open for that end or to create a new one.
I'd prefer to keep this open until we get that report. I just updated the definition of done.
I'd prefer to keep this open until we get that report. I just updated the definition of done.
Sounds good! What are we expecting the report to look like? Is a markdown file in the repo linked in the readme a good idea or are we expecting something else?
Is a markdown file in the repo linked in the readme a good idea?
That sounds good to me!
I will wait for https://github.com/ekumenlabs/beluga/pull/126 and https://github.com/ekumenlabs/beluga/pull/119 to be merged before creating the report, as they could change results significantly (I would expect #126 to marginally increase CPU usage; #119 shouldn't change performance, as "on motion" resampling was implemented manually in the amcl node).
I don't see a discussion here on the actual localization quality / improvements that Beluga could offer which seem important. Are there any metrics on this yet?
@SteveMacenski there are! But the reports have not been made properly public (because time). Our benchmarking toolkit has to go open source (soon enough) and beluga_benchmark is due for an update (a merge, actually, with our private, soon-to-be-public codebase). We are getting there...
OK! Let me know if I can be useful. The things I'm looking for (for reference):
With that, I'm perfectly happy to dump nav2_amcl into its own repository on a deprecation track, add in Beluga as the default, and update the documentation / tutorials to fit.
- S
Equal or better localization quality over several realistic spaces (warehouse, office, etc) over long periods of time
Noted. Warehouses are the trickiest for us right now. We would need a collaborator in that line of business (which we haven't found yet). Speaking of large spaces though, I wonder if any Willow Garage datasets (e.g. the one described here) survived to this day. Do you happen to know, @mikeferguson? The ones on https://google-cartographer-ros.readthedocs.io/en/latest/data.html#pr2-willow-garage are gone, and 18 hours over 1600 $m^2$ was pretty decent.
Shown to resolve AMCL's delocalization jumping issues, or at least the potential path for it in the new framework
This one's perhaps the largest stretch. Posteriors may naturally go multi-modal on a particle filter. Maybe the key here is to do so in a controlled manner, if it comes to that :thinking:. CC @glpuga @nahueespinosa.
@hidmic I do not know of anywhere those still exist - perhaps reaching out to some of the authors of Cartographer?
Warehouses are the trickiest for us right now.
Let me inquire to see if I can get a dataset or two; no promises though. I have a few that I have private NDA access to, but nothing I can share.
Description
Now that we have a functional ROS2 node beluga_amcl, we should start thinking about a comparison with nav2_amcl. It would be ideal to have a reproducible environment to compare performance and accuracy whenever we want as we improve the software (looking at you lambkin). QuickMCL has already done this for the ROS1 case, leaving behind a repository we could use as a reference.
Definition of Done