Closed: nahueespinosa closed this issue 1 year ago.
Current results with timemory after https://github.com/ekumenlabs/beluga/commit/17e95ba3f688c4297b77826580a4d78cb72ece77:
$ ros2 launch beluga_example example_rosbag_launch.py prefix:='timem --'
...
[timem-5] [/ws/install/beluga_amcl/lib/beluga_amcl/amcl_node]> Measurement totals:
[timem-5] 33.124544 sec wall
[timem-5] 48.830000 sec user
[timem-5] 0.690000 sec sys
[timem-5] 49.520000 sec cpu
[timem-5] 149.495704 % cpu_util
[timem-5] 33.700000 MB peak_rss
[timem-5] 33.968128 MB page_rss
[timem-5] 1364.967424 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4593 minor_page_flts
[timem-5] 39839 prio_cxt_swch
[timem-5] 72632 vol_cxt_swch
[timem-5] 0.310466 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.036408 MB char_written
[timem-5] 0.016384 MB bytes_written
$ ros2 launch beluga_example example_rosbag_launch.py package:=nav2_amcl node:=amcl prefix:='timem --'
...
[timem-5] [/opt/ros/humble/lib/nav2_amcl/amcl]> Measurement totals:
[timem-5] 33.120430 sec wall
[timem-5] 2.190000 sec user
[timem-5] 0.540000 sec sys
[timem-5] 2.730000 sec cpu
[timem-5] 8.242598 % cpu_util
[timem-5] 35.164000 MB peak_rss
[timem-5] 35.176448 MB page_rss
[timem-5] 901.443584 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4125 minor_page_flts
[timem-5] 237 prio_cxt_swch
[timem-5] 70541 vol_cxt_swch
[timem-5] 0.185864 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.003188 MB char_written
[timem-5] 0.000000 MB bytes_written
CPU usage is a bit high :sweat_smile:. I think some of this is expected since we're using multiple threads to update the filter. This is a good starting point for a discussion and definitely worth investigating.
CPU usage is a bit high
One of the main differences between the packages is that nav2_amcl is decimating the laser scan beams before processing them. #84 will add that feature to beluga in order to continue our analysis.
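For reference, a minimal sketch of the kind of decimation being discussed (keeping roughly every N-th range reading); the function name and types are only illustrative, not nav2_amcl's or Beluga's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: keep roughly `max_beams` evenly spaced readings from a
// full scan and skip the rest, similar in spirit to what nav2_amcl does
// before evaluating the measurement model.
std::vector<float> decimate_scan(const std::vector<float>& ranges, std::size_t max_beams) {
  std::vector<float> decimated;
  if (ranges.empty() || max_beams == 0) {
    return decimated;
  }
  const std::size_t step = std::max<std::size_t>(1, ranges.size() / max_beams);
  for (std::size_t i = 0; i < ranges.size(); i += step) {
    decimated.push_back(ranges[i]);
  }
  return decimated;
}
```

In nav2_amcl the step comes from its max_beams parameter; the exact indexing differs, but the effect is the same idea: fewer beams per measurement update, hence less CPU.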
@nahueespinosa did you pass --cmake-args -DCMAKE_BUILD_TYPE=Release to colcon? I notice a significant difference between using it and not.
The default is the CMake None build type, which is different from Debug, Release, and RelWithDebInfo. I'm not sure why it's the default, as it's never a good one to use.
I also tested with https://github.com/ekumenlabs/beluga/pull/84 merged, the performance is now similar to amcl:
Beluga:
[timem-5] [/ws/install/beluga_amcl/lib/beluga_amcl/amcl_node]> Measurement totals:
[timem-5] 33.231538 sec wall
[timem-5] 6.670000 sec user
[timem-5] 1.170000 sec sys
[timem-5] 7.840000 sec cpu
[timem-5] 23.591892 % cpu_util
[timem-5] 32.788000 MB peak_rss
[timem-5] 33.140736 MB page_rss
[timem-5] 1060.876288 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4464 minor_page_flts
[timem-5] 91538 prio_cxt_swch
[timem-5] 77126 vol_cxt_swch
[timem-5] 0.312198 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.026644 MB char_written
[timem-5] 0.016384 MB bytes_written
[timem-5]
[INFO] [timem-5]: process has finished cleanly [pid 12155]
Nav2:
[timem-5] [/opt/ros/humble/lib/nav2_amcl/amcl]> Measurement totals:
[timem-5] 33.241450 sec wall
[timem-5] 4.180000 sec user
[timem-5] 0.670000 sec sys
[timem-5] 4.850000 sec cpu
[timem-5] 14.590110 % cpu_util
[timem-5] 34.856000 MB peak_rss
[timem-5] 34.787328 MB page_rss
[timem-5] 901.599232 MB virtual_memory
[timem-5] 0 major_page_flts
[timem-5] 4142 minor_page_flts
[timem-5] 475 prio_cxt_swch
[timem-5] 69618 vol_cxt_swch
[timem-5] 0.185900 MB char_read
[timem-5] 0.000000 MB bytes_read
[timem-5] 0.006208 MB char_written
[timem-5] 0.008192 MB bytes_written
[timem-5]
[INFO] [timem-5]: process has finished cleanly [pid 12259]
@ivanpauno We try to set the build type in beluga if it's not set externally:
But beluga_amcl doesn't have those lines, so that might explain things :thinking:.
@ivanpauno We try to set the build type in beluga if it's not set externally:
That seems fine. Locally or for CI, a colcon defaults file can be used for the same purpose.
But beluga_amcl doesn't have those lines, so that might explain things :thinking:.
Yes, and beluga is completely templated ...
I'm working on setting up perf in the docker container to run profiling.
I have done that in the past; I don't remember exactly, but there are some tricks to make it work with Docker.
I will share instructions when it's ready.
BTW, if anyone has a different profiling tool to suggest that's also fine.
I ran a few tests on the latest commit https://github.com/ekumenlabs/beluga/commit/15d4caed30bfaa726324917a1e88205f26aabaf4 changing the particle count (min_particles = max_particles).
nav2_amcl uses less CPU for now, but beluga_amcl scales really well. We can handle 100,000 particles with enough CPU cores.
Disclaimer: This does not count as a proper benchmark yet.
I was able to record CPU events using perf and generate a flamegraph.
I will create a PR with instructions to reproduce soon.
Basically, I'm running the beluga_amcl bag example and profiling that.
The resulting flamegraph is here. To see the interactive version (zoom in/out, etc.), download the SVG (open the link and save it locally) and re-open it in a web browser (GitHub disables running the SVG's scripts).
Some comments to make it easier to read: the libtbb.so call stack is the offloading caused by using std::execution::par. I haven't processed this part in detail yet; it's probably easier to analyze using std::execution::seq.
Another thing I want to test in more detail is how much it's worth using std::execution::par.
For the default settings, my testing shows that the sequential executor is faster.
Though that's expected for a relatively low number of particles, I would expect the parallel executor to scale better for larger numbers. BTW, I'm not sure how much that's worth, i.e. whether that scalability is actually ever used in practice. I don't have an actual answer to that.
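To make the comparison concrete, here's a self-contained sketch of the policy switch in question; the Particle struct and the weight update are placeholders, not Beluga's actual types:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Placeholder particle type, not Beluga's actual representation.
struct Particle {
  double x{0.0}, y{0.0}, theta{0.0};
  double weight{1.0};
};

// Hypothetical importance-weight update, shown only to illustrate the
// std::execution::seq vs std::execution::par switch.
void reweight(std::vector<Particle>& particles, bool parallel) {
  auto update = [](Particle& p) {
    // Stand-in for the measurement model likelihood evaluation.
    p.weight *= 1.0;
  };
  if (parallel) {
    std::for_each(std::execution::par, particles.begin(), particles.end(), update);
  } else {
    std::for_each(std::execution::seq, particles.begin(), particles.end(), update);
  }
}
```

With libstdc++, the parallel policy is implemented on top of TBB, which is why libtbb.so dominates that part of the call stack in the flamegraph.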
To compare the resulting trajectories, I saw that evo was being used in lambkin. It seems that that library doesn't check the pose covariance, so it doesn't seem to be a good idea to start comparing poses until covariance values are below some threshold (a rough sketch is at the end of this comment). Was lambkin considering something like that?
We could also start by giving an initial pose to amcl to avoid that problem.
It may also be a good idea to compare how much time it takes to get correctly localized without a hint, but I guess that varies a lot, so we would need to repeat the experiment many times.
@nahueespinosa any other thing you see valuable as metrics to compare localization "accuracy"?
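As a rough sketch of the covariance gating mentioned above (thresholds, the helper name and the choice of the standard geometry_msgs message are all just illustrative):

```cpp
#include <geometry_msgs/msg/pose_with_covariance_stamped.hpp>

// The 6x6 row-major covariance uses index 0 for var(x), 7 for var(y) and
// 35 for var(yaw). Thresholds here are arbitrary placeholders.
bool is_converged(
  const geometry_msgs::msg::PoseWithCovarianceStamped & estimate,
  double position_variance_threshold = 0.25,
  double yaw_variance_threshold = 0.1)
{
  const auto & cov = estimate.pose.covariance;
  return cov[0] < position_variance_threshold &&
         cov[7] < position_variance_threshold &&
         cov[35] < yaw_variance_threshold;
}
```

The idea would be to only start feeding pose pairs to evo once this returns true.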
@ivanpauno Those are all good points.
It seems that that library doesn't check the pose covariance, so it doesn't seem to be a good idea to start comparing poses until covariance values are below some threshold. Was lambkin considering something like that?
Not really, I think that giving an initial pose would be the best solution. Loading the initial pose from a parameter is the last step to complete #41; I've left it open in case a new contributor wanted to take on an easy task, but it looks like it's now needed for proper comparison.
It may also be a good idea to compare how much time it takes to get correctly localized without a hint, but I guess that varies a lot so we would need to repeat many times.
Yeah, that'd be interesting to see; it would also be nice to know how that time changes with the number of particles.
Note that nav2_amcl doesn't publish the estimate unless it was initialized with a known pose, so there's not much to compare there. See nav2_amcl/src/amcl_node.cpp#L922-L931.
any other thing you see valuable as metrics to compare localization "accuracy"?
evo's APE and RPE are the only ones that come to mind.
It seems that that library doesn't check the pose covariance, so it doesn't seem to be a good idea to start comparing poses until covariance values are below some threshold.
That's a good point nonetheless. I think both APE and RPE are defined in terms of point estimates (e.g. distribution mean for a particle filter) because it's simple and many systems downstream will just go along with those point estimates (e.g. classical planning algorithms ignore pose estimate covariance).
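For reference, the usual point-estimate formulations (going from memory of evo's documentation, so take the exact alignment term as an assumption): with $P_i$ the estimated pose, $Q_i$ the reference pose, $S$ an optional alignment transform and $\Delta$ a fixed index offset,

$$E^{\mathrm{APE}}_i = Q_i^{-1} S P_i, \qquad E^{\mathrm{RPE}}_i = \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1} \left(P_i^{-1} P_{i+\Delta}\right)$$

The reported metric is then a statistic (e.g. RMSE) over the translational or rotational parts of these errors, so the covariance never enters.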
That doesn't mean we can't do better, e.g. we could do probability density estimation for those errors and have measures of dispersion and confidence.
I think the only missing thing is to create a nice report from the data we have. I'm not sure if we want to leave this ticket open for that end or to create a new one.
I'd prefer to keep this open until we get that report. I just updated the definition of done.
I'd prefer to keep this open until we get that report. I just updated the definition of done.
Sounds good! What are we expecting the report to look like? Is a markdown file in the repo linked in the readme a good idea or are we expecting something else?
Is a markdown file in the repo linked in the readme a good idea?
That sounds good to me!
I will wait for https://github.com/ekumenlabs/beluga/pull/126 and https://github.com/ekumenlabs/beluga/pull/119 to be merged before creating the report, as they could change results significantly (I would expect #126 to marginally increase CPU usage; #119 shouldn't change performance, as "on motion" resampling was implemented manually in the amcl node).
I don't see a discussion here on the actual localization quality / improvements that Beluga could offer which seem important. Are there any metrics on this yet?
@SteveMacenski there are! But the reports have not been made properly public (because time). Our benchmarking toolkit has to go open source (soon enough) and beluga_benchmark is due for an update (a merge, actually, with our private, soon-to-be-public codebase). We are getting there...
OK! Let me know if I can be useful. The things I'm looking for (for reference):
With that, I'm perfectly happy to dump nav2_amcl into its own repository on a deprecation track, add in Beluga as the default, and update the documentation / tutorials to fit.
- S
Equal or better localization quality over several realistic spaces (warehouse, office, etc) over long periods of time
Noted. Warehouses are the trickiest for us right now. We would need a collaborator in that line of business (which we haven't found yet). Speaking of large spaces though, I wonder if any Willow Garage datasets (e.g. the one described here) survived to this day. Do you happen to know, @mikeferguson? The ones on https://google-cartographer-ros.readthedocs.io/en/latest/data.html#pr2-willow-garage are gone, and 18 hours over 1600 $m^2$ was pretty decent.
Shown to resolve AMCL's delocalization jumping issues, or at least the potential path for it in the new framework
This one's perhaps the largest stretch. Posteriors may naturally go multi-modal on a particle filter. Maybe the key here is to do so in a controlled manner, if it comes to that :thinking:. CC @glpuga @nahueespinosa.
@hidmic I do not know of anywhere those still exist - perhaps reaching out to some of the authors of Cartographer?
Warehouses are the trickiest for us right now.
Let me inquire to see if I can get a dataset or two; no promises though. I have a few that I have private NDA access to, but nothing I can share.
Description
Now that we have a functional ROS2 node beluga_amcl, we should start thinking about a comparison with nav2_amcl. It would be ideal to have a reproducible environment to compare performance and accuracy whenever we want as we improve the software (looking at you lambkin). QuickMCL has already done this for the ROS1 case, leaving behind a repository we could use as a reference.
Definition of Done