Yeah, trimmed submaps should be removed from memory.
Note that pure localization only trims a submap after the global SLAM work queue has processed it, so a possible cause is that global SLAM is falling behind. (For instance, when you run offline_node, local SLAM is so fast that global SLAM builds up a long work queue, so you can end up with a large number of submaps.) Please describe how many submaps you have and how you run Cartographer.
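For context, here is a minimal sketch (illustrative names only, not the actual Cartographer classes) of what the pure localization trimmer is meant to do: keep only the most recent N submaps of the localization trajectory, and only once the global SLAM work queue has caught up.

```cpp
// Illustrative sketch, not Cartographer's real trimmer implementation.
#include <deque>
#include <memory>

struct Submap {
  // Voxel grids, poses, metadata, ...
};

class KeepLastNSubmapsTrimmer {
 public:
  explicit KeepLastNSubmapsTrimmer(int max_submaps_to_keep)
      : max_submaps_to_keep_(max_submaps_to_keep) {}

  // In Cartographer, trimming only runs after the global SLAM work queue has
  // processed its backlog; a long queue therefore delays this call and the
  // old submaps stay in memory in the meantime.
  void Trim(std::deque<std::shared_ptr<Submap>>* submaps) const {
    while (static_cast<int>(submaps->size()) > max_submaps_to_keep_) {
      submaps->pop_front();  // Releasing the last reference frees the grid.
    }
  }

 private:
  const int max_submaps_to_keep_;
};
```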
I run cartographer demo_backpack_3d_localization.launch with a saved state file and a bag file, and there are 3 ongoing submaps in rviz for the new trajectory.
Well, from your explanation it is hard to guess what the cause is. Visualization is likely to use a lot of memory, also on the cartographer_ros server side; however, without visualization it is not easy to inspect. Before digging deeper, can you give more detail: how you run it, how large the different kinds of input are, how long you run it, and how much memory the individual processes use after 1, 10, 30, and 60 minutes? Please measure memory without running rviz, because its service calls can block cartographer_node for quite some time.
I ran cartographer_ros without rviz this time. Memory is still growing.
Here is my input:
1) a bag file with about one hour of laser data (20 Hz) and IMU data (200 Hz);
2) a pbstream file with about 350 submaps.
I ran pure localization with the bag file and pbstream file for ~1 hour; below is the detailed memory consumption reported by pmap at one-minute intervals.
minute | mapped | writeable/private | shared
1 | 3010972K | 2385184K | 28K
2 | 4223436K | 2696016K | 28K
3 | 4485580K | 2926372K | 28K
4 | 4616652K | 3095820K | 28K
5 | 4616652K | 3173640K | 28K
6 | 4878796K | 3371196K | 28K
7 | 4944332K | 3476224K | 28K
8 | 5140940K | 3635644K | 28K
9 | 5272012K | 3745908K | 28K
10 | 5534156K | 3950992K | 28K
11 | 5534156K | 4019028K | 28K
12 | 5796300K | 4247116K | 28K
13 | 5861836K | 4307352K | 28K
14 | 5927372K | 4333872K | 28K
15 | 5927372K | 4388956K | 28K
16 | 5992908K | 4472476K | 28K
17 | 6058444K | 4513844K | 28K
18 | 6058444K | 4526160K | 28K
19 | 5992908K | 4534888K | 28K
20 | 6058444K | 4538608K | 28K
21 | 6058444K | 4539744K | 28K
22 | 6058444K | 4540088K | 28K
23 | 6058444K | 4551592K | 28K
24 | 5992908K | 4541688K | 28K
25 | 6123980K | 4570076K | 28K
26 | 5992908K | 4557952K | 28K
27 | 6189516K | 4583192K | 28K
28 | 6255052K | 4646852K | 28K
29 | 6451660K | 4909636K | 28K
30 | 6451660K | 4970292K | 28K
31 | 6517196K | 5031644K | 28K
32 | 6517196K | 5045264K | 28K
33 | 6713804K | 5154940K | 28K
34 | 6779340K | 5221976K | 28K
35 | 6975948K | 5422568K | 28K
36 | 6910412K | 5462496K | 28K
37 | 7041484K | 5484216K | 28K
38 | 7172556K | 5642168K | 28K
39 | 7303628K | 5737952K | 28K
40 | 7303628K | 5775204K | 28K
41 | 7303628K | 5799468K | 28K
42 | 7303628K | 5813936K | 28K
43 | 7303628K | 5826624K | 28K
44 | 7303628K | 5829936K | 28K
45 | 7303628K | 5829936K | 28K
46 | 7303628K | 5830284K | 28K
47 | 7303628K | 5830284K | 28K
48 | 7303628K | 5830284K | 28K
49 | 7303628K | 5830584K | 28K
50 | 7303628K | 5831772K | 28K
51 | 7304660K | 5843020K | 28K
52 | 7312856K | 5852048K | 28K
53 | 7312856K | 5854732K | 28K
54 | 7312856K | 5857416K | 28K
55 | 7312856K | 5860100K | 28K
56 | 7312856K | 5862784K | 28K
Thanks for the evaluation. Memory usage should really stabilize after a few minutes (if global SLAM can keep up and the pure localization trimmer is on), so this sounds like a potential bug. I think the only way to track this down is to evaluate in depth. Can you share your config and data (or reproduce the issue on one of our public datasets)? The easiest way to reproduce something is to fork cartographer_ros and change exactly the config files you need. (Or just post terminal commands if no change from master is needed.)
There are many data structures that need to be trimmed to limit memory. Most of them can be exported by serialization and then later inspected in rviz or with other tools. Perhaps a smaller component such as OptimizationProblem3D::imu_data_ is not cleaned up correctly; that would show up there.
Can you please evaluate as follows: call
rosservice call /write_state filenameINCREMENT.pbstream
after 10, 20, 30 minutes. Of course, the resulting files may be large. If the issue reproduces without a pbstream input, that would be great, because the shared files would be smaller. I hope the pbstream file sizes grow accordingly; then it would be easy to find the culprit.
Of course, there are a couple of run-time data structures that also need to be trimmed (like ConstraintBuilder::submap_scan_matchers_) or deleted (like work_queue_ or Task). But I would look into the serialized files first.
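(To illustrate what I mean by the run-time structures, here is a hypothetical sketch, not the actual ConstraintBuilder code: caches keyed by submap id have to be erased when the corresponding submap is trimmed, otherwise they keep the precomputation grids alive even after the submap itself is gone.)

```cpp
// Illustrative sketch only; the real ConstraintBuilder3D differs.
#include <map>
#include <memory>
#include <tuple>

struct SubmapId {
  int trajectory_id;
  int submap_index;
  bool operator<(const SubmapId& other) const {
    return std::tie(trajectory_id, submap_index) <
           std::tie(other.trajectory_id, other.submap_index);
  }
};

struct SubmapScanMatcher {
  // FastCorrelativeScanMatcher3D plus its precomputation grids.
};

class ConstraintBuilderCache {
 public:
  // Must be called whenever the pose graph trims a submap, otherwise the
  // cached scan matcher (and its grids) stays alive for the node's lifetime.
  void OnSubmapTrimmed(const SubmapId& submap_id) {
    submap_scan_matchers_.erase(submap_id);
  }

 private:
  std::map<SubmapId, std::unique_ptr<SubmapScanMatcher>> submap_scan_matchers_;
};
```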
@gaschler
I used gperftools to dump the heap and then compared the dumps. Most of the increased memory is related to "cartographer::mapping::DynamicGrid::mutable_value". Does this information help?
robot@r01> ./pprof --text --base=/tmp/mybin.hprof_15413.0052.heap /home/robot/catkin_ws/install_isolated/bin/cartographer_node /tmp/mybin.hprof_15413.0091.heap
Using local file /home/robot/catkin_ws/install_isolated/bin/cartographer_node.
Using local file /tmp/mybin.hprof_15413.0091.heap.
Total: 335.0 MB
   332.3  99.2%  99.2%    332.8  99.3% cartographer::mapping::DynamicGrid::mutable_value
     0.9   0.3%  99.5%      0.9   0.3% ros::Connection::read
     0.6   0.2%  99.7%      0.6   0.2% std::vector::vector
     0.5   0.1%  99.8%      0.5   0.1% cartographer::mapping::DynamicGrid::Grow
     0.4   0.1%  99.9%      0.4   0.1% tf2::TimeCache::insertData
     0.2   0.1% 100.0%      0.2   0.1% std::_Rb_tree::_M_emplace_unique
     0.1   0.0% 100.0%      0.1   0.0% Eigen::Matrix::Matrix
     0.0   0.0% 100.0%      9.0   2.7% cartographer::mapping::LocalTrajectoryBuilder3D::InsertIntoSubmap
     0.0   0.0% 100.0%      0.0   0.0% cartographer::mapping::optimization::OptimizationProblem3D::AddTrajectoryNode
     0.0   0.0% 100.0%     -0.0  -0.0% cartographer::mapping::PoseGraph3D::AddNode
     0.0   0.0% 100.0%      0.0   0.0% cartographer::mapping::scan_matching::RotationalScanMatcher::RotationalScanMatcher
     0.0   0.0% 100.0%      0.0   0.0% std::vector::reserve
     0.0   0.0% 100.0%      0.0   0.0% boost::make_shared
     0.0   0.0% 100.0%    245.1  73.2% cartographer::mapping::scan_matching::PrecomputeGrid
     0.0   0.0% 100.0%    333.6  99.6% std::_Function_handler::_M_invoke
     0.0   0.0% 100.0%      0.0   0.0% std::deque::_M_push_back_aux
     0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_hint_unique
     0.0   0.0% 100.0%      0.0   0.0% ros::CallbackQueue::addCallback
     0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
I will do the suggested experiment later.
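For anyone who wants to reproduce this kind of measurement: besides the HEAPPROFILE environment variable, gperftools heap dumps can also be triggered programmatically when the binary is linked against tcmalloc. A standalone sketch (the prefix and interval are arbitrary choices, not something from cartographer_node):

```cpp
// Sketch: periodic heap dumps via the gperftools API. The resulting numbered
// .heap files can then be compared with `pprof --text --base=...` as above.
// Requires linking against tcmalloc (e.g. -ltcmalloc).
#include <gperftools/heap-profiler.h>

#include <chrono>
#include <thread>

int main() {
  HeapProfilerStart("/tmp/cartographer_node");  // Prefix for the dump files.
  for (int i = 0; i < 60; ++i) {
    std::this_thread::sleep_for(std::chrono::minutes(1));
    HeapProfilerDump("periodic");  // Writes numbered .heap files under the prefix.
  }
  HeapProfilerStop();
  return 0;
}
```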
I just ran the cartographer_paper_deutsches_museum.bag and here is the output of pmap writeable/private as a function of time.
Note that I went to lunch before the bag was finished, so the last part of the graph tracks memory after rosbag play had already finished and cartographer_node was spinning idly.
We might also have a Cartographer problem (we need to clean up IMU and odometry data when trimming), but for sure we have a ROS problem :)
@robinsking, thanks for the measurements. HybridGrid (= DynamicGrid) is used in several places: both in Submap3D and in ConstraintBuilder3D, which contains FastCorrelativeScanMatcher3D, which in turn contains PrecomputationGrid. I also think there are multiple issues: the easy-to-reproduce cartographer_ros one found by @cschuet, and a Cartographer one. It would be very helpful if you could provide a reproducible case. (I believe it should already be reproducible with much smaller input data than what you currently use.)
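Roughly, the ownership chain looks like this (simplified declarations, not the real Cartographer types), which is why grid memory in the profile can belong either to live submaps or to cached scan matchers built for constraint search:

```cpp
// Illustrative declarations only.
#include <memory>
#include <vector>

struct HybridGrid {
  // DynamicGrid-backed voxel storage.
};

struct Submap3D {
  std::unique_ptr<HybridGrid> high_resolution_grid;
  std::unique_ptr<HybridGrid> low_resolution_grid;
};

struct PrecomputationGridStack3D {
  std::vector<HybridGrid> grids;  // One grid per search resolution.
};

struct FastCorrelativeScanMatcher3D {
  std::unique_ptr<PrecomputationGridStack3D> precomputation_grids;
};

// ConstraintBuilder3D keeps one FastCorrelativeScanMatcher3D per submap it
// searches against, so both live submaps and these cached matchers hold grids.
```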
@gaschler
After going through the code and the dump file, I found it's not a memory leak issue.
We have a minor local code change in the global constraint computation: we use distance to exclude submaps that are far away from the search. While the robot is moving, it keeps bringing new submaps from the loaded state into the global constraint computation, and the growing memory is used for the PrecomputationGrid of each newly included submap.
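Rough sketch of the kind of local modification described above (illustrative only, not upstream Cartographer code):

```cpp
// Skip submaps whose origin is far from the current node before attempting a
// global constraint. Every submap that passes this filter for the first time
// still triggers construction of its scan matcher / precomputation grids,
// which is where the observed memory growth comes from.
#include <Eigen/Core>

bool ShouldSearchSubmapForGlobalConstraint(
    const Eigen::Vector3d& node_translation,
    const Eigen::Vector3d& submap_origin_translation,
    double max_constraint_distance_meters) {
  return (node_translation - submap_origin_translation).norm() <
         max_constraint_distance_meters;
}
```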
Thanks anyway for your friendly response.
@cschuet
Regarding "need to clean up IMU and odometry data when trimming": this should happen, as the MapByTime data structures are trimmed here: https://github.com/googlecartographer/cartographer/blob/2f9ee597d06523cc4a851ef100b3b655feff8491/cartographer/mapping/internal/optimization/optimization_problem_2d.cc#L170
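As a simplified illustration of that trimming (hypothetical container and helper, not the real cartographer::sensor::MapByTime): once nodes are trimmed, sensor samples older than the earliest remaining node are no longer needed and can be dropped.

```cpp
// Illustrative only; the real MapByTime is keyed per trajectory and uses
// cartographer::common::Time.
#include <map>

struct ImuData {
  // Timestamp, linear acceleration, angular velocity, ...
};

// Drops all IMU samples with a timestamp before 'earliest_needed_time'.
void TrimImuData(std::map<double, ImuData>* imu_data_by_time,
                 double earliest_needed_time) {
  imu_data_by_time->erase(imu_data_by_time->begin(),
                          imu_data_by_time->lower_bound(earliest_needed_time));
}
```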
Regarding "but for sure we have a ROS problem": not sure if it is related, but I noticed that we allow infinite subscription queues in cartographer_ros, as here (the kInfinite... constant is zero).
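For reference (my own minimal example, not the cartographer_ros code): in roscpp a queue size of 0 means "unbounded", so messages pile up in memory whenever the callback cannot keep up, whereas a finite queue drops old messages instead.

```cpp
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>

void HandlePointCloud2(const sensor_msgs::PointCloud2::ConstPtr& msg) {
  // Slow processing here means messages queue up on the subscriber side.
}

int main(int argc, char** argv) {
  ros::init(argc, argv, "queue_size_example");
  ros::NodeHandle node_handle;
  // queue_size == 0: unbounded queue, memory can grow without limit.
  ros::Subscriber unbounded =
      node_handle.subscribe("points2", 0, HandlePointCloud2);
  // queue_size == 10: bounded queue, old messages are dropped when full.
  ros::Subscriber bounded =
      node_handle.subscribe("points2", 10, HandlePointCloud2);
  ros::spin();
  return 0;
}
```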
@MichaelGrupp It seems the ros::TransformBroadcaster is leaking; it has its own NodeHandle, so it should not be affected by the infinite subscription queues.
Renaming the topic on which the TransformBroadcaster publishes the tfs to something other than "/tf" removes the memory leak (thanks @pifon2a!), so it seems that our TransformListener is accumulating our own tf updates in its buffer.
I thought https://github.com/googlecartographer/cartographer_ros/commit/d53d3395626404cdb6c9501570def358fd90fd95 (thanks @MichaelGrupp) addressed this by setting the buffer size, but I see memory increasing well beyond the 10 seconds that the buffer cache time is set to.
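For comparison, this is roughly how a listener with a bounded cache is set up (standalone example with an assumed 10 s cache time, not the cartographer_ros code); transforms older than the cache time, measured in ROS time, are supposed to be pruned from the buffer:

```cpp
#include <ros/ros.h>
#include <tf2_ros/buffer.h>
#include <tf2_ros/transform_listener.h>

int main(int argc, char** argv) {
  ros::init(argc, argv, "tf_buffer_cache_example");
  // Keep only the last 10 seconds of transforms (in ROS time).
  tf2_ros::Buffer tf_buffer(ros::Duration(10.0));
  tf2_ros::TransformListener tf_listener(tf_buffer);
  ros::spin();
  return 0;
}
```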
:thinking:
Also, shutting down the subscribers of the node, i.e. calling /finish_trajectory after rosbag play finishes, stops the memory accumulation.
Here's another valgrind-massif log for this scenario. Same as [here]( ), except I let the node keep running to record that leak.
Could it be that this is related to using simulated time in the demo? Because if the bag player stops, we also have no clock publisher anymore. This might lead to strange behavior with the transform buffer cache time.
valgrind_massif_carto_1e1_buffer_hallway_localization_leak_after_bagplay.log
Yep, that's the problem :) And I have a solution for the demos, will prepare PRs.
Hello all,
While I am running 3D pure localization, memory keeps growing. The growth rate is similar to when building a map. Are trimmed submaps released from memory?