27Apoorva commented 4 years ago

Required Info:

Operating System: Ubuntu 16.04)
Installation type: From source
ROS Version ROS1 -> Melodic on Kinetic
Version or commit hash:
- Commit hash: 4265f6ced69bf05e8b46c41ca515e504983440ba

Steps to reproduce issue

Update the mapper_params_localization.yaml with map file and map pose and change mode to localization. Play the rosbag with tf, lidar scan, odometry. Config file:

# Plugin params
solver_plugin: solver_plugins::CeresSolver
ceres_linear_solver: SPARSE_NORMAL_CHOLESKY
ceres_preconditioner: SCHUR_JACOBI
ceres_trust_strategy: LEVENBERG_MARQUARDT
ceres_dogleg_type: TRADITIONAL_DOGLEG
ceres_loss_function: HuberLoss

# ROS Parameters
odom_frame: odom
map_frame: map
base_frame: base_link
scan_topic: /scan
mode: localization #localization

# if you'd like to start localizing on bringup in a map and pose
map_file_name: /tmp/file-name
map_start_pose: [-24.94, 59.70, -1.50]

debug_logging: true
throttle_scans: 1
transform_publish_period: 0.02 #if 0 never publishes odometry
map_update_interval: 5.0
resolution: 0.05
max_laser_range: 20.0 #for rastering images
minimum_time_interval: 0.5
transform_timeout: 0.2
tf_buffer_duration: 30.
stack_size_to_use: 40000000 #// program needs a larger stack size to serialize large maps

# General Parameters
use_scan_matching: true
use_scan_barycenter: true
minimum_travel_distance: 0.5
minimum_travel_heading: 0.5
scan_buffer_size: 3
scan_buffer_maximum_scan_distance: 10
link_match_minimum_response_fine: 0.1  
link_scan_maximum_distance: 1.5
do_loop_closing: true 
loop_match_minimum_chain_size: 3
loop_match_maximum_variance_coarse: 3.0  
loop_match_minimum_response_coarse: 0.35    
loop_match_minimum_response_fine: 0.45

# Correlation Parameters - Correlation Parameters
correlation_search_space_dimension: 0.5
correlation_search_space_resolution: 0.01
correlation_search_space_smear_deviation: 0.1 

# Correlation Parameters - Loop Closure Parameters
loop_search_space_dimension: 8.0
loop_search_space_resolution: 0.05
loop_search_space_smear_deviation: 0.03
loop_search_maximum_distance: 5.0

# Scan Matcher Parameters
distance_variance_penalty: 0.5      
angle_variance_penalty: 1.0    

fine_search_angle_offset: 0.00349     
coarse_search_angle_offset: 0.349   
coarse_angle_resolution: 0.0349        
minimum_angle_penalty: 0.9
minimum_distance_penalty: 0.5
use_response_expansion: true

Expected behavior

Output should be localized pose

Actual behavior

Crash

[ INFO] [1583777043.814883930, 1557546271.995225210]: Using plugin solver_plugins::CeresSolver
Load From File /tmp/stagfull271.posegraph 
Mapper <- Module
Mapper <- m_pSequentialScanMatcher
Mapper <- m_pGraph
MapperGraph <- Graph; Graph <- m_Edges; Graph <- m_Vertices
MapperGraph <- m_pMapper; MapperGraph <- m_pLoopScanMatcher; MapperGraph <- m_pTraversal
Mapper <- m_pMapperSensorManager
MapperSensorManager <- m_ScanManagers; MapperSensorManager <- m_Scans
Mapper <- m_Listeners
**Finished serializing Mapper**
Load From File
**Serializing Dataset**
Dataset <- m_SensorNameLookup
Dataset <- m_Data
Dataset <- m_Lasers
Dataset <- m_pDatasetInfo
**Finished serializing Dataset**
[DEBUG] [1583777046.818415515, 1557546274.674728258]: DeserializePoseGraph: Successfully read file.
Registering sensor: [Custom Described Lidar]
[ INFO] [1583777046.975197145, 1557546274.674728258]: Waiting for incoming scan to get metadata...
[ INFO] [1583777048.350155972, 1557546274.731866044]: Got scan!
[DEBUG] [1583777048.350361532, 1557546274.731866044]: laser sick_lidar's pose wrt base: 0.460 0.000 0.012 0.000
[DEBUG] [1583777048.350414613, 1557546274.731866044]: laser is mounted upside-down
[DEBUG] [1583777048.354462457, 1557546274.741934691]: CeresSolver: Setting first node as a constant pose:-24.94, 59.70, -1.50.

Solver Summary (v 1.14.0-eigen-(3.2.92)-lapack-suitesparse-(4.4.6)-cxsparse-(3.1.4)-eigensparse-openmp-no_tbb)

                                     Original                  Reduced
Parameter blocks                         4149                     4146
Parameters                               4149                     4146
Residual blocks                          1536                     1536
Residuals                                4608                     4608

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver          SPARSE_NORMAL_CHOLESKY   SPARSE_NORMAL_CHOLESKY
Threads                                    50                       50
Linear solver ordering              AUTOMATIC                     4146

Cost:
Initial                          1.425837e+04
Final                            3.398056e+03
Change                           1.086031e+04

Minimizer iterations                        7
Successful steps                            7
Unsuccessful steps                          0

Time (in seconds):
Preprocessor                         0.011812

  Residual only evaluation           0.110768 (7)
  Jacobian & residual evaluation     0.404967 (7)
  Linear solver                      0.028407 (7)
Minimizer                            0.557027

Postprocessor                        0.008198
Total                                0.577038

Termination:                      CONVERGENCE (Parameter tolerance reached. Relative step_norm: 3.367929e-04 <= 1.000000e-03.)

[DEBUG] [1583777067.221121851, 1557546281.080660694]: MessageFilter [target=odom ]: Added message in frame sick_lidar at time 1557546276.427, count now 5
[DEBUG] [1583777068.263961215, 1557546282.136009386]: Graph size: 1384
[DEBUG] [1583777068.634091732, 1557546282.499694135]: MessageFilter [target=odom ]: Added message in frame sick_lidar at time 1557546282.161, count now 5

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
  what():  boost: mutex lock failed in pthread_mutex_lock: Invalid argument

Additional information

Traceback file:

#0  0x00007ff4f48a8428 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ff4f48aa02a in __GI_abort () at abort.c:89
#2  0x00007ff4f51eb84d in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ff4f51e96b6 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ff4f51e9701 in std::terminate() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ff4f51e9919 in __cxa_throw ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ff4fe7df6f1 in boost::throw_exception<boost::lock_error> (
    e=...) at /usr/include/boost/throw_exception.hpp:69
#7  0x00007ff4fe7df830 in boost::mutex::lock (this=0x15b1670)
    at /usr/include/boost/thread/pthread/mutex.hpp:119
#8  0x00007ff4fe7df650 in boost::unique_lock<boost::mutex>::lock (
    this=0x7ffc626726d0)
    at /usr/include/boost/thread/lock_types.hpp:346
#9  0x00007ff4fe7d89a7 in boost::unique_lock<boost::mutex>::unique_lock (this=0x7ffc626726d0, m_=...)
    at /usr/include/boost/thread/lock_types.hpp:124
#10 0x00007ff4fe420744 in boost::shared_mutex::lock_shared (
    this=0x15b1668)
    at /usr/include/boost/thread/pthread/shared_mutex.hpp:188
#11 0x00007ff4fe4206d0 in boost::shared_lock<boost::shared_mutex>::lock (this=0x7ffc62672818)
    at /usr/include/boost/thread/lock_types.hpp:645
#12 0x00007ff4fe420497 in boost::shared_lock<boost::shared_mutex>::shared_lock (this=0x7ffc62672818, m_=...)
    at /usr/include/boost/thread/lock_types.hpp:520
#13 0x00007ff4fe422e05 in karto::LocalizedRangeScan::GetPointReadings
    (this=0x15b1590, wantFiltered=false)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/lib/karto_sdk/incl---Type <return> to continue, or q <return> to quit---
ude/karto_sdk/Karto.h:5688
#14 0x00007ff4fdcd00d6 in karto::ScanMatcher::FindValidPoints (this=0x17bfc30, pScan=0x15b1590, 
    rViewPoint=...)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/lib/karto_sdk/src/Mapper.cpp:1075
#15 0x00007ff4fdccfd28 in karto::ScanMatcher::AddScan (this=0x17bfc30, pScan=0x15b1590, 
    rViewPoint=..., doSmear=true)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/lib/karto_sdk/src/Mapper.cpp:1035
#16 0x00007ff4fdccfcc5 in karto::ScanMatcher::AddScans (this=0x17bfc30, 
    rScans=std::vector of length 7, capacity 10 = {...}, viewPoint=...)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/lib/karto_sdk/src/Mapper.cpp:1002
#17 0x00007ff4fdcde381 in karto::ScanMatcher::MatchScan<std::vector<karto::LocalizedRangeScan*, std::allocator<karto::LocalizedRangeScan*> > > (this=0x17bfc30, pScan=0x536bb60, 
    rBaseScans=std::vector of length 7, capacity 10 = {...}, rMean=..., rCovariance=..., 
    doPenalize=true, doRefineMatch=true)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/lib/karto_sdk/src/Mapper.cpp:515
#18 0x00007ff4fdcd9afc in karto::Mapper::ProcessLocalization (this=0x17b6c70, pScan=0x536bb60)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/lib/karto_sdk/src/Mapper.cpp:2824
#19 0x00007ff4fe7d7840 in slam_toolbox::LocalizationSlamToolbox::addScan (this=0x7ffc62673e58, 
    laser=0x15af380, scan=..., karto_pose=...)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/src/slam_toolbox_localization.cpp:158
#20 0x00007ff4fe7d73c9 in slam_toolbox::LocalizationSlamToolbox::laserCallback (
    this=0x7ffc62673e58, scan=...)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/src/slam_toolbox_localization.cpp:111
#21 0x00007ff4fe3e6641 in boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>::operator() (this=0x15b2568, 
    p=0x7ffc62673e58, a1=...) at /usr/include/boost/bind/mem_fn_template.hpp:165
#22 0x00007ff4fe3e657b in boost::_bi::list2<boost::_bi::value<slam_toolbox::SlamToolbox*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>, boost::_bi::list1<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&> > (this=0x15b2578, f=..., a=...)
    at /usr/include/boost/bind/bind.hpp:313
#23 0x00007ff4fe3e64cd in boost::_bi::bind_t<void, boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>, boost::_bi::list2<boost::_bi::value<slam_toolbox::SlamToolbox*>, boost::arg<1> > >::operator()<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&> (this=0x15b2568, a1=...)
    at /usr/include/boost/bind/bind.hpp:905
#24 0x00007ff4fe3e62a8 in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>, boost::_bi::list2<boost::_bi::value<slam_toolbox::SlamToolbox*>, boost::arg<1> > >, void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>::invoke (function_obj_ptr=..., a0=...)
    at /usr/include/boost/function/function_template.hpp:159
#25 0x00007ff4fe3e5768 in boost::function1<void, boost::shared_ptr<sensor_msgs::LaserScan_<std::a---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to q---Type <return> to continue, or q <return> to quit---
llocator<void> > const> const&>::operator() (this=0x15b2560, a0=...)
    at /usr/include/boost/function/function_template.hpp:772
#26 0x00007ff4fe3e5473 in boost::detail::function::void_function_obj_invoker1<boost::function<void (boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&)>, void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const>) (function_obj_ptr=..., a0=...)
    at /usr/include/boost/function/function_template.hpp:159
#27 0x00007ff4fe3e5c8f in boost::function1<void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> >::operator() (this=0x16be988, a0=...) at /usr/include/boost/function/function_template.hpp:772
#28 0x00007ff4fe3e5a73 in message_filters::CallbackHelper1T<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&, sensor_msgs::LaserScan_<std::allocator<void> > >::call (this=0x16be980, event=..., 
    nonconst_force_copy=false) at /opt/ros/kinetic/include/message_filters/signal1.h:76
#29 0x00007ff4fe3c84bd in message_filters::Signal1<sensor_msgs::LaserScan_<std::allocator<void> > >::call (
    this=0x17a1d48, event=...) at /opt/ros/kinetic/include/message_filters/signal1.h:119
#30 0x00007ff4fe3c83ed in message_filters::SimpleFilter<sensor_msgs::LaserScan_<std::allocator<void> > >::signalMessage (this=0x17a1d48, event=...) at /opt/ros/kinetic/include/message_filters/simple_filter.h:136
#31 0x00007ff4fe3d3979 in tf2_ros::MessageFilter<sensor_msgs::LaserScan_<std::allocator<void> > >::CBQueueCallback::call (this=0x17d8160) at /opt/ros/kinetic/include/tf2_ros/message_filter.h:596
#32 0x00007ff4f84b3838 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) ()
   from /opt/ros/kinetic/lib/libroscpp.so
#33 0x00007ff4f84b523b in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/kinetic/lib/libroscpp.so
#34 0x00007ff4f8511e39 in ros::SingleThreadedSpinner::spin(ros::CallbackQueue*) ()
   from /opt/ros/kinetic/lib/libroscpp.so
#35 0x00007ff4f84f6e9b in ros::spin() () from /opt/ros/kinetic/lib/libroscpp.so
#36 0x000000000049ee08 in main (argc=1, argv=0x7ffc62674758)
    at /mfp_workspace/src/slam_toolbox/slam_toolbox/src/slam_toolbox_localization_node.cpp:43

SteveMacenski commented 4 years ago

Please properly fill out the issue template.

Also provide your parameters and bag file.

27Apoorva commented 4 years ago

@SteveMacenski Sorry about that. I updated the issue template. Hope that is better. I will get you a traceback file if my copy pasting from gdb session is not helpful. Thanks

SteveMacenski commented 4 years ago

This issue is complex, I’ll need the bag too. Please fully describe your setup and the issue. When does it happen. How often. Have you tried other maps. Etc. The template should walk you through providing that information- I just recently added it so if there’s tweeks I can make to nudge folks like yourself to fully describe their issue let me know :-)

But given no one else has reported this, I suspect it has to do with your setup, available memory, etc. so what’s this being run on? CPU/memory/static memory left. It could be that localization is trying to access a deleted scan or something’s corrupted in the serialized map itself.

When does this happen? Directly on deserialization or sometime after? How was the map that was serialized used/generated/treated?

27Apoorva commented 4 years ago

It would be great if you could add sections in your template for:

Bag File
Config file
Relevant screenshots
Computer's config if not using snap

I am using Ubuntu 16.04 in VM inside macbook. Memory - 9.4gb Processor - Intel core i7-8750H CPU @ 2.20GHz × 4 Disk - 811GB I can see the map in RViz. The deserialization doesn't give any errors. Robot moves for some time about 30 seconds maybe and then it happens. I just am giving the path to map in the config file. I cannot post the bag file online because of confidentiality issue. I haven't tried it on any other map yet. Things I will try:

Get a serialized map from another data.
Try localization again

If not I will email you the bag file.

SteveMacenski commented 4 years ago

Robot moves for some time about 30 seconds maybe and then it happens

Things aren't random - so something most have triggered this. Is this ~30 timeframe happen to align with the first time that the loop closure system is triggered? Or perhaps the first time that Ceres fails to converge (which is odd in your log above, I've never seen a properly setup system fail to converge)?

9.4gb

Your memory is a bit low, you sure there's actually memory available? I assume that's OK, but just checking.

Please try with another run generated from the sync or async nodes and try again. I think one of your colleagues posted a ticket earlier about generating serialized files from LifeLong mapping mode and had issues (which, don't do). Lets try to isolate this to:

is it this set of data?
is it this serialized file?
is it the localization mode?
is it this computer?

I cannot post the bag file online because of confidentiality issue.

OK, then I can't help with that. I don't want to be responsible for your companies IP or data if you're unwilling to share for debug.

27Apoorva commented 4 years ago

I used another dataset. It still has the same error. When I used another dataset, I created the posegraph using online_sync mode. and seems like graph loads fine when I opened it in rviz.

Not just this dataset
Serialized map loads fine in RViz. Not sure if there is another way to check it.

I can send the bag if you can take a look at it. Thank you. Here is the google drive link with bag since bag was quite big. https://drive.google.com/drive/folders/1Z-lzib9k2RlorWhBBa1SKL3WK7wAy18P?usp=sharing

SteveMacenski commented 4 years ago

Is this ~30 timeframe happen to align with the first time that the loop closure system is triggered? Or perhaps the first time that Ceres fails to converge (which is odd in your log above, I've never seen a properly setup system fail to converge)?

Please answer my question above. Also what happens if you disable loop closure in localization mode?

27Apoorva commented 4 years ago

so i am sure it doesn't happen when loop closure is happening for the first time. I tried disabling loop closure but it still happens. I am gonna look into the memory stuff since I have debug_logging as true which might be making it crash

157 looks the same as my issue.

SteveMacenski commented 4 years ago

@27Apoorva any update?

27Apoorva commented 4 years ago

last I know it still was the issue so we have dropped to work on this for now. Thank you for all your help/

Tobias-Fischer commented 4 years ago

Hiya - I experience the same problem, as described in #233. I think it is more beneficial to continue the discussion here. It would be great if we find a solution. Please let me know how I can contribute.

To answer some questions from above:

I serialized the map using online_sync.launch and the rviz plugin
The machine has 32GB of RAM so I don't think it's a memory issue
I am using localization.launch when the crash happens and set a map_file_name

Tobias-Fischer commented 4 years ago

I actually found that the crash does not happen when I set mode: localization in the mapper_params_localization.yaml. Not sure why it is set to mapping by default in localization mode? I will run more tests tomorrow.

SteveMacenski commented 4 years ago

What's the crash? Get a gdb traceback please.

See read me on the mode setting. It just changes how ceres handles removing values from the graph (speed vs memory - default to mapping optimizes for memory; localization is speed)

Tobias-Fischer commented 4 years ago

The traceback seems to be the same as that posted by @27Apoorva; for completeness:

#0  0x00007ffff553f438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff554103a in __GI_abort () at abort.c:89
#2  0x00007ffff5b7984d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff5b776b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff5b77701 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff5b77919 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7b51508 in void boost::throw_exception<boost::lock_error>(boost::lock_error const&) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/liblocalization_slam_toolbox.so
#7  0x00007ffff7b4d1e9 in boost::mutex::lock() () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/liblocalization_slam_toolbox.so
#8  0x00007ffff7b51e7b in boost::unique_lock<boost::mutex>::lock() () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/liblocalization_slam_toolbox.so
#9  0x00007ffff7b51575 in boost::unique_lock<boost::mutex>::unique_lock(boost::mutex&) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/liblocalization_slam_toolbox.so
#10 0x00007ffff775854f in boost::shared_mutex::lock_shared() () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#11 0x00007ffff775b79b in boost::shared_lock<boost::shared_mutex>::lock() () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#12 0x00007ffff775a257 in boost::shared_lock<boost::shared_mutex>::shared_lock(boost::shared_mutex&) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#13 0x00007ffff7758b7f in karto::LocalizedRangeScan::GetPointReadings(bool) const () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#14 0x00007ffff6f9a489 in karto::ScanMatcher::FindValidPoints(karto::LocalizedRangeScan*, karto::Vector2<double> const&) const () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libkartoSlamToolbox.so
#15 0x00007ffff6f9a22e in karto::ScanMatcher::AddScan(karto::LocalizedRangeScan*, karto::Vector2<double> const&, bool) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libkartoSlamToolbox.so
#16 0x00007ffff6f9a0ca in karto::ScanMatcher::AddScans(std::vector<karto::LocalizedRangeScan*, std::allocator<karto::LocalizedRangeScan*> > const&, karto::Vector2<double>) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libkartoSlamToolbox.so
#17 0x00007ffff6fb06e5 in double karto::ScanMatcher::MatchScan<std::vector<karto::LocalizedRangeScan*, std::allocator<karto::LocalizedRangeScan*> > >(karto::LocalizedRangeScan*, std::vector<karto::LocalizedRangeScan*, std::allocator<karto::LocalizedRangeScan*> > const&, karto::Pose2&, karto::Matrix3&, bool, bool) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libkartoSlamToolbox.so
#18 0x00007ffff6fa29fa in karto::Mapper::ProcessLocalization(karto::LocalizedRangeScan*) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libkartoSlamToolbox.so
#19 0x00007ffff7b4aebc in slam_toolbox::LocalizationSlamToolbox::addScan(karto::LaserRangeFinder*, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&, karto::Pose2&) ()
   from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/liblocalization_slam_toolbox.so
#20 0x00007ffff7b4ab30 in slam_toolbox::LocalizationSlamToolbox::laserCallback(boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/liblocalization_slam_toolbox.so
#21 0x00007ffff76d862e in boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>::operator()(slam_toolbox::SlamToolbox*, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) const
    () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#22 0x00007ffff76d21c6 in void boost::_bi::list2<boost::_bi::value<slam_toolbox::SlamToolbox*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>, boost::_bi::list1<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&> >(boost::_bi::type<void>, boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>&, boost::_bi::list1<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>&, int) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#23 0x00007ffff76cb1ab in void boost::_bi::bind_t<void, boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>, boost::_bi::list2<boost::_bi::value<slam_toolbox::SlamToolbox*>, boost::arg<1> > >::operator()<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>(boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#24 0x00007ffff76c2f62 in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, slam_toolbox::SlamToolbox, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>, boost::_bi::list2<boost::_bi::value<slam_toolbox::SlamToolbox*>, boost::arg<1> > >, void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) ()
   from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#25 0x00007ffff76d22ef in boost::function1<void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&>::operator()(boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) const ()
   from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#26 0x00007ffff76cb323 in boost::detail::function::void_function_obj_invoker1<boost::function<void (boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&)>, void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const>) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#27 0x00007ffff771f6c2 in boost::function1<void, boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> >::operator()(boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const>) const ()
   from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#28 0x00007ffff771afb1 in message_filters::CallbackHelper1T<boost::shared_ptr<sensor_msgs::LaserScan_<std::allocator<void> > const> const&, sensor_msgs::LaserScan_<std::allocator<void> > >::call(ros::MessageEvent<sensor_msgs::LaserScan_<std::allocator<void> > const> const&, bool)
    () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#29 0x00007ffff76c9185 in message_filters::Signal1<sensor_msgs::LaserScan_<std::allocator<void> > >::call(ros::MessageEvent<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#30 0x00007ffff76c047d in message_filters::SimpleFilter<sensor_msgs::LaserScan_<std::allocator<void> > >::signalMessage(ros::MessageEvent<sensor_msgs::LaserScan_<std::allocator<void> > const> const&) ()
   from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#31 0x00007ffff771a383 in tf2_ros::MessageFilter<sensor_msgs::LaserScan_<std::allocator<void> > >::CBQueueCallback::call() () from /home/administrator/catkin_ws/devel/.private/slam_toolbox/lib/libtoolbox_common.so
#32 0x00007ffff6805838 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/kinetic/lib/libroscpp.so
#33 0x00007ffff680723b in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/kinetic/lib/libroscpp.so
#34 0x00007ffff6863e39 in ros::SingleThreadedSpinner::spin(ros::CallbackQueue*) () from /opt/ros/kinetic/lib/libroscpp.so
#35 0x00007ffff6848e9b in ros::spin() () from /opt/ros/kinetic/lib/libroscpp.so
#36 0x00000000004c8943 in main ()

SteveMacenski commented 4 years ago

Looks like the same issue - its trying to access some point data for a scan that is either deleted, invalid, or for some reason not able to access the mutex. Google backs this theory up:

https://stackoverflow.com/questions/30090108/what-does-mutex-lock-fail-with-invalid-argument-mean

Is there some deterministic (or even pseudo-deterministic) way you can get this to occur? The main reason I couldn't help @27Apoorva is because I couldn't replicate. I ran over many of my datasets and this never occurred. If you have a dataset that this happens in somewhat regularly, we can look and see what's being deleted its unhappy about and track back to whatever algorithmic error there is.

Tobias-Fischer commented 4 years ago

Alright, here we go: https://www.dropbox.com/sh/oncqjr76857kxyv/AADuL3p1cJLE5cXKEEMRUJ-ca?dl=0

To replicate, run rosbag play --clock bags_2020-07-21-20-43-01.bag and roslaunch slam_toolbox localization.launch. You will need to modify the path to the serialized map in the mapper_params_localization.yaml (note that also the base_frame differs from the default). It seems to crash every ~5th time or so. It also seems to crash more often if I launch the slam_toolbox a couple of seconds after the rosbag play. I also attach another bag, bags_2020-07-21-20-57-42.bag, which is longer and I also observed the crash.

I hope this helps - let me know if there is something else I can provide.

Tobias-Fischer commented 4 years ago

~~As before, setting mode: localization leads to fewer/no crashes (I haven't observed any yet in that mode)~~ just had one ...

SteveMacenski commented 4 years ago

OK, this is on my queue to look at (I'm busy with something else but its item #3 on my queue). How long is the bag that fails every 5th time? I don't want to sit around for 15 hours waiting for a crash and spamming my console with debug info.

My guess from what we're looking at is one of the two following:

that there's some mutex that's not locked somewhere it should be so when we delete an object, its mid-read while that's happening
The localization measurement deletion logic is removing something it shouldn't have of or some residual of a deleted scan is still stored that's being tried to access. Given it happens only 1/5 of the time, I feel this is less likely the case but seems to me logically like the most likely place for me to have screwed up.

Either way, if you want to do some debugging in the meantime, more power to you. The more we can isolate where this is happening, the better. Maybe try running it without the scan matcher (Ceres) plugin so there's no loop closure. If it goes away, that narrows down to that subsystem. If it doesn't, then we can at least rule that out (I'm hoping its something Ceres related, that subsystem is far more easy to debug than the graph side of things). LocalizedRangeScan::GetPointReadings is the function where your tracebacks tell me the failure is happening when it tries to lock the mutex while trying to AddScan.

27Apoorva commented 4 years ago

Hi, If I remember correctly, the reason this crash was happening had something to do with ceres and running kinetic branch. When I switched to melodic on Ubuntu 16.04, the crash stopped happening. I am currently not working on slam toolbox actively but this is what I remember on top of my head.

SteveMacenski commented 4 years ago

Ah, I don't support Kinetic branch anymore (its a wasteland without structure from prototyping as a research hobby project). It wouldn't surprise me at all if there were serious issues in the Kinetic branch. @Tobias-Fischer are you also using the kinetic branch? This could be an easy fix (and maybe I just delete the kinetic branch so that no one else runs into this) :-)

Tobias-Fischer commented 4 years ago

Hiya, The bag files are ~30 seconds (crashing every 5th time that one) and ~2 minutes respectively, so not too long to sit around. I will run it without Ceres and let you know. The error happens both on the melodic and noetic branches. The rosbag is recorded on ROS Kinetic though (unfortunately the robot still runs Kinetic), but that should not matter. I agree with your suggestion to delete the kinetic branch. The noetic branch compiles fine on kinetic/melodic, so IMO there is not even a need for the melodic branch.

Many thanks for looking into this!

Tobias-Fischer commented 4 years ago

So the crash happens even after setting use_scan_matching: false and do_loop_closing: false. It crashes quite frequently now actually which should be good for debugging.

Tobias-Fischer commented 4 years ago

Seems like I found the culprit: In https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/lib/karto_sdk/src/Mapper.cpp#L2860-L2875 the variable oldLSV.vertex is a nullptr. The assignment in https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/lib/karto_sdk/src/Mapper.cpp#L2879 is executed even if scan_vertex is a nullptr.

Not sure what the best fix for this is, but I hope you know what to do from there @SteveMacenski.

Many thanks!

SteveMacenski commented 4 years ago

oldLSV is from another place from scan_vertex. scan_vertex is from the current scan's processing that's replacing the old one in the front of the buffer (oldLSV). The only way scan_vertex can be nullptr is if https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/lib/karto_sdk/src/Mapper.cpp#L1388 this function returns nullptr which I don't see how that could happen. The assert should make things crash if pscan isn't valid. Then once you get into the if statement, you'll return a real-allocated pointer.

You could test if I'm wrong though easily by adding a breakpoint or print statement on L1402 and see if that for some reason is at some point triggered.

It doesn't surprise me at all though that that part of the code is probably where the issue lies. I did spend a bunch of time white boarding this and testing to make sure it all worked out OK but given that was all more 'art' than 'engineering', I recognize that's probably where the mistake is.

I finished one of my tasks, this is now next up on my task list.

Tobias-Fischer commented 4 years ago

So if you set useScanMatching to false, then AddVertex is never called, so scan_vertex is always a nullptr. This explains why it crashes more often (always) without scan matching enabled.

Also, I think there is a missing break after https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/lib/karto_sdk/src/Mapper.cpp#L2917 which could potentially lead to the same edge being removed multiple times and yet another nullptr.

SteveMacenski commented 4 years ago

I don't think there should be a break. A vertex can have many edges connecting to it from loop closures. Just because you found 1 edge doesn't mean there's not another or many others. The found is to make sure you found at minimum one because you know in fact there is at least one. You couldn't remove the same edge multiple times because it would look different. The source and target would be switched relative to both edge connected nodes. Other nodes don't have that same edge info from that single node so there's no risk of duplication there. If we really did delete the same edge multiple times, L2915's delete would be what's crashing, and crashing super regularly.

Agree? Or am I missing something potentially?

That is true about what you say about turning off scan matching. Frankly it would never occur to me someone would ever turn that off (you're not really using SLAM or localization anymore at that point)

Tobias-Fischer commented 4 years ago

Agreed re the break, haven't properly thought that true.

Also agreed re the scan matching, I just did that because you suggested it. Maybe that option should not be exposed ..

SteveMacenski commented 4 years ago

Maybe that option should not be exposed ..

I don't disagree

Tobias-Fischer commented 4 years ago

So I played with it a little but didn't very far. I thought that maybe the shared lock / unique lock stuff might be the culprit (race condition between the unlock of the shared lock and getting the unique lock) and played with upgrade lock / upgrade to unique lock as replacements, but that led either to deadlocks or crashes.

Did you have any luck?

SteveMacenski commented 4 years ago

I haven't looked, like I said, I have other things first in my queue. I hope to be able start looking myself midway through next week.

SteveMacenski commented 4 years ago

Hi, I finished by last task, this is next up on deck to start looking at tomorrow

Tobias-Fischer commented 4 years ago

That's great, thanks!

SteveMacenski commented 4 years ago

@Tobias-Fischer While I can't reproduce locally (tons of TF errors in Noetic that aren't related just spam me endlessly and I dont actually see any transformation ?) I think I know what's happening.

ProcessLocalization calls MatchScan using the current scan (probably OK) and a set of running scans in the scan buffer GetRunningScans. The running scans are a list of the most recent N scans in the buffer of size parameter set. These running scans are then given to AddScans with the current scan's position. Given the thing failing is accessing a scan, we can now safely ignore the current scan of being a problem since that's not even in the AddScans input. So AddScans does:

    const_forEach(LocalizedRangeScanVector, &rScans)
    {
      if (*iter == NULL)
      {
        continue;
      }

      AddScan(*iter, viewPoint);
    }

The AddScan part is the next step in the traceback. So each scan is iterated over. What's odd here is that you can see I intentionall added a check to see if it was an invalid scan with the NULL check before processing. So there's 3 options:

A: The scan is deleted but the memory wasn't set to null so even though its gone, where ever it was deleted from needs a set to nullptr afterwards.
B: At the time of processing, the scan is still actually valid but by the time it gets to accessing the scan, its been deleted.
C: The data was deleted but the scan wasn't

We then call the FindValidPoints function and this is where we call GetPointReadings in Karto.h:5686. After that we have the mutex and its crash-central.

Now some good questions that I don't run into often: if you've deleted an object but not set it to null, if you access it (e.g. pScan->GetPointReadings() as in FindValidPoints), what happens? Immediate crash? tries to do stuff until it accesses a member variable which dies (but member functions OK)? If its the former, then the issue isn't that it was already deleted because we see in the traceback it gets into member functions. If its the latter, then it probably was because that mutex is the very first member object it tries to access.

So a couple things I might try:

In Karto.h:5688 (in the noetic branch, should be same for melodic) where we have the boost lock, add a line beforehand to do some erroneous check on the m_IsDirty parameter and see if it crashes at that line now and not the mutex. If so, that means just accessing any member fails and therefore the mutex isn't special. Or add a new parameter so that its the only place its used so the crash couldn't be mistaken with an un-locked member variable being corrupted from multi-threaded access.
In Mapper.cpp:1075 where we do the pScan->GetPointReadings(), maybe do another nullptr check - though I don't think that's likely to help

What might be a problem?

Scans are deleted but not set to nullptr
Your localization buffer of scans might be longer than the running scan length size, or maybe the same size but some occasional issue results in one being deleted right around the time when another requires it. they're both based on the scan_buffer_size parameter, so maybe make the localization length -1 of that or directly just using the running scan buffer?
the measurements within a laser scan are deleted but the scan is still alive / non-null
In the slam_toolbox_localization.cpp and slam_toolbox_common.cpp we delete scans that were not successfully processed. Maybe you have an issue where the laser scans aren't being fully processed so they're externally deleted but openKarto isn't aware of it. in _localization.cpp:171 and _common.cpp:534 you could add print statements to see if that's called before a crash.

I'll link to a branch shortly where I go over all the deletes I think might be problematic and add null afterwards if missing and you can see if that fixes it for you since I can't seem to trigger it

SteveMacenski commented 4 years ago

tl;dr: the problem is in the running scans - somehow there's 1 in the running scan buffer that's either been deleted, elements of it have been deleted, or something else odd like deleted but not nulled out. It does not have to do with the current scan at all. Given we only see this in localization mode, I suspect is that we delete a scan that's still in the rolling scan buffer from the m_LocalizationScanVertices buffer.

~Or I suppose the opposite in AddRunningScan where it erases a scan that the m_LocalizationScanVertices is still storing when its distance is saying to. I'd have to think about it more, but that doesn't seem as likely.~

      while (m_RunningScans.size() > m_RunningBufferMaximumSize ||
             squaredDistance > math::Square(m_RunningBufferMaximumDistance) - KT_TOLERANCE)
      {
        // remove front of running scans
        m_RunningScans.erase(m_RunningScans.begin());

        // recompute stats of running scans
        frontScanPose = m_RunningScans.front()->GetSensorPose();
        backScanPose = m_RunningScans.back()->GetSensorPose();
        squaredDistance = frontScanPose.GetPosition().SquaredDistance(backScanPose.GetPosition());
      }

edit: probably not this. If the running scan stuff deleted a scan, then we wouldnt see a crash trying to access it because that list would be up to date. We would have seen the crash in the m_LocalizationScanVertices part of the code if this was the case. Disregard.

Another thing we could do to debug is to print the unique IDs of the laser scans every time we do some operation and see who is trying to access or delete a scan. That would tell us what scan is causing the crash and then who's the last to play with it and what they did. Since the bag is only 30 seconds long it shouldn't be too bad

Branch: https://github.com/SteveMacenski/slam_toolbox/tree/180 - I just did a -l on the bag file and I'll leave it running for an hour on this branch and see if I can get it to crash after... ~120 runs. Are you saying you see the crash after 5 runs when you re-launch Slam Toolbox each time or just playing the bag 5 times over the same running localization session?

Please test the branch and get back to me with an update and we can take it from there.

Tobias-Fischer commented 4 years ago

I haven't observed this crash anymore in the 180 branch. Thanks a lot, I highly appreciate your efforts in this project!

SteveMacenski commented 4 years ago

Really? That easy.... OK....@27Apoorva can you confirm? If I were to venture from that analysis, it was https://github.com/SteveMacenski/slam_toolbox/pull/246/files#diff-82245f7e389553498ddb2d01272b8b0aR5330 that fixed it. So the object was deleted but the readings didn't get set to null

@Tobias-Fischer are you sufficiently happy with that if I merged it in to close out this issue?

SteveMacenski commented 4 years ago

Whoops, merge closed this - waiting for confirmation that we're all done here. FYI - see a bunch of PRs I just added, if you weren't building with release builds, this will give you about a 50% speed up

SteveMacenski commented 4 years ago

Closing from merged PRs. @Tobias-Fischer @27Apoorva let me know if this comes back in any form. Crashing is no beuno, thanks for the help debugging and isolating the problem. That made my life much easier to find the issue.

Tobias-Fischer commented 4 years ago

Unfortunately this still happens - much more rarely though, rare enough for me not to get a stacktrace :(.

Error is still the same:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
  what():  boost: mutex lock failed in pthread_mutex_lock: Invalid argument

SteveMacenski commented 4 years ago

If you can get a new traceback, please file a new ticket and I can take a look. Might be more things I need to add nullptr to, but that surprises me a little. I'd have to see specifically where this one is being called from since its likely going to be different than before.

SteveMacenski / slam_toolbox

Melodic localization mode crash pthread lock #180

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

so i am sure it doesn't happen when loop closure is happening for the first time. I tried disabling loop closure but it still happens. I am gonna look into the memory stuff since I have debug_logging as true which might be making it crash

157 looks the same as my issue.