SteveMacenski / slam_toolbox

Slam Toolbox for lifelong mapping and localization in potentially massive maps with ROS
GNU Lesser General Public License v2.1
1.5k stars 497 forks source link

TF timestamp problem #717

Open Renbago opened 5 days ago

Renbago commented 5 days ago

Required Info: Tested on SAHA-ROBOTIK Speedy

Problem description

With large amount(the total amount of edges and nodes in current graph is 2190) with pose_graph data the https://github.com/SteveMacenski/slam_toolbox/blob/19a66904c648a174220be1a1b98e81a9ef0e6758/src/slam_toolbox_common.cpp#L264 time difference between time.now and the last data coming from scan_timestamp increases (6-8 seconds) while publishing the info. It's synchronized with time.now when reached goal or pose. The LOG:

[INFO] [1719387024.208723337] [SR64S1.global_costmap.global_costmap]: Message Filter dropping message: frame 'main_lidar_link' at time 1719387023.839 for reason 'the timestamp on the message is earlier than all the data in the transform cache'
[INFO] [1719387024.227065683] [SR64S1.global_costmap.global_costmap]: Message Filter dropping message: frame 'obstacle_lidar_link' at time 1719387023.926 for reason 'the timestamp on the message is earlier than all the data in the transform cache'
[INFO] [1719387024.272666888] [SR64S1.global_costmap.global_costmap]: Message Filter dropping message: frame 'base_link' at time 1719387023.802 for reason 'the timestamp on the message is earlier than all the data in the transform cache' 
[ERROR] [1720007812.847318626] [transformPoseInTargetFrame]: Extrapolation Error looking up target frame: Lookup would require extrapolation into the past.  Requested time 1720007802.415600 but the earliest data is at time 1720007802.817890, when looking up transform from frame [base_footprint] to frame [map]

After ERROR robot stops, then matching pose_graph with blue link and robot creating path again. By using same map with less pose_graph, I get less INFO and ERROR,(if its per seconds with less maps like 10 sec)

From past issues i found this merge: https://github.com/SteveMacenski/slam_toolbox/pull/377 , also after searched ros.org I found the same issue: https://answers.ros.org/question/393581/for-nav2-lidar-timestamp-on-the-message-is-earlier-than-all-the-data-in-the-transform-cache/

His comment:

ok, this issue and my [other one] 
([https://answers.ros.org/question/395470/slam_toolbox-map-odom-transform-stuck-at-time-0200/)]) are somehow linked. 
Under Galactic, the slam_toolbox timestamps the TF map-odom with the timestamp of the last 
laserscan + transform_timeout offset ([see here](1.).
Under Foxy, it was timestamped with now() + transform_timeout offset ([see here](2.)).

I replaced in Galactic the laser timestamp by now(), and my two issues got resolved...
 definitively something to look here.

the links in comment are :

  1. https://github.com/SteveMacenski/slam_toolbox/blob/f449abf039dd15cd0be719ceeb039533b487329c/src/slam_toolbox_common.cpp#L234
  2. https://github.com/SteveMacenski/slam_toolbox/blob/e2f0a71c5475b4e7d32085a18aa89c36f014e28b/src/slam_toolbox_common.cpp#L233

After changing the msg.header.stamp = scan_timestamp + transform_timeout_; with msg.header.stamp = this->now() + transform_timeout_;

Not getting INFO and ERROR. But this is not solving the problem, because of the matching blue link robot losing localisation.

The large pose_graph map large_map

With scan_timestamp :

Screencast from 07-03-2024 03:39:08 PM.webm

With this->now():

Screencast from 07-03-2024 03:59:13 PM.webm

Screencast from 07-03-2024 04:02:21 PM.webm

Additional information

tf2_monitor outputs:

RESULTS: for all Frames

Frames:
Frame: base_footprint, published by <no authority available>, Average Delay: 0.00701269, Max Delay: 0.0276206
Frame: base_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: bottom_shelf_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: charging_block_bottom_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: charging_block_top_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: chassis_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: cliff_sensor_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: docking_virtual_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: eyes_cyl_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: eyes_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: front_camera_bottom_screw_frame, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: front_camera_color_frame, published by <no authority available>, Average Delay: 880.685, Max Delay: 880.685
Frame: front_camera_color_optical_frame, published by <no authority available>, Average Delay: 880.685, Max Delay: 880.685
Frame: front_camera_depth_frame, published by <no authority available>, Average Delay: 880.714, Max Delay: 880.714
Frame: front_camera_depth_optical_frame, published by <no authority available>, Average Delay: 880.714, Max Delay: 880.714
Frame: front_camera_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: imu_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: imu_link_transform, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: left_mast_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: left_wheel_link, published by <no authority available>, Average Delay: 0.00311589, Max Delay: 0.135685
Frame: logo_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: main_lidar_dummy_cloud_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: main_lidar_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: middle_shelf_link, published by <no authority available>, Average Delay: 881.54, Max Delay: 881.54
Frame: obstacle_lidar_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: odom, published by <no authority available>, Average Delay: -0.898396, Max Delay: 0
Frame: right_mast_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: right_wheel_link, published by <no authority available>, Average Delay: 0.00311771, Max Delay: 0.135687
Frame: scu_v3_left_bottom, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: supp_wheel_bl_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: supp_wheel_br_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: supp_wheel_fl_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: supp_wheel_fr_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541
Frame: trays_head_link, published by <no authority available>, Average Delay: 881.541, Max Delay: 881.541

All Broadcasters:
Node: <no authority available> 117.914 Hz, Average Delay: -0.377333 Max Delay: 0.0276206
^C[INFO] [1720008636.915834597] [rclcpp]: signal_handler(signum=2)
SteveMacenski commented 5 days ago

time difference between time.now and the last data coming from scan_timestamp increases (6-8 seconds) while publishing the info

Is that in the sync mode? What about async mode? As the documentation describes, async is meant to take data as it comes rather than sync which buffer up all data and makes sure that you process every possible scan. If you have a small compute and a sufficiently large space, you can find the bounds of what you can do in real-time before your computer can't keep up with the process. async was developed specifically with that in mind so that it didn't fall behind - it only looks for new scans to process after the last scan was processed.

That's not a particularly large map, but I also suspect that you don't have a very powerful computer on there.

Another option is to live use async to get a good but not best map for use then-and-there and then use cloud services with bagged data with sync to get the best map with all the data for later use offline. For one reason or another, one way or another, most companies end up having some level of duality of cloud services for getting 'improved' maps.

But, this is all described in the docs and in the ROSCon talk.

After changing the msg.header.stamp = scan_timestamp + transformtimeout; with msg.header.stamp = this->now() + transformtimeout; Not getting INFO and ERROR. But this is not solving the problem, because of the matching blue link robot losing localisation.

That is mis-stamping your laser scans, that's not a real solution and will cause issues in time synchronization of your product.

Renbago commented 3 days ago

Hi, sorry for late reply. In issue we wre using async mode. The hardware is nuc11tnki5

For more to understand problem:

Screencast from 07-05-2024 03:10:27 PM.webm

/*****************************************************************************/
void SlamToolbox::publishTransformLoop(
  const double & transform_publish_period)
/*****************************************************************************/
{
  if (transform_publish_period == 0) {
    return;
  }

  rclcpp::Rate r(1.0 / transform_publish_period);
  while (rclcpp::ok()) {
    {
      boost::mutex::scoped_lock lock(map_to_odom_mutex_);
      auto current_time = this->now();
      rclcpp::Time scan_timestamp = scan_header.stamp;
      RCLCPP_INFO(get_logger(), "pasted_time_last_recieved_scan_data = %f", (current_time - scan_timestamp).seconds());
      // Avoid publishing tf with initial 0.0 scan timestamp
      if (scan_timestamp.seconds() > 0.0 && !scan_header.frame_id.empty()) {
        geometry_msgs::msg::TransformStamped msg;
        msg.transform = tf2::toMsg(map_to_odom_);
        msg.child_frame_id = odom_frame_;
        msg.header.frame_id = map_frame_;
        msg.header.stamp = scan_timestamp + transform_timeout_;
        tfB_->sendTransform(msg);
      }
    }
    r.sleep();
  }
}
SteveMacenski commented 3 days ago

I'm not entirely sure what you're showing or explaining, please provide some context for that video / code.

Are you sure that your laser scanner node is giving you properly stamped times? Is there any major latency in the system? Id be interested in seeing the time difference between the node->now() time and the scan message header in the main laser callback in async mode.

CihatAltiparmak commented 8 hours ago

When i saw this ticket related to my problem, i wanted to reply this ticket.

Required Info:

Steps to reproduce issue

just clone navigation2 and slam_toolbox into your workspace

mkdir slam_ws/src -p
cd slam_ws/src
git clone https://github.com/SteveMacenski/slam_toolbox
git clone https://github.com/ros-navigation/navigation2
cd ..
colcon build

And then download this pose graph from google drive and extract to your ws.

While running localization launch of slam_toolbox, use this config file.

mapper_params_localization.yaml

slam_toolbox:
  ros__parameters:
    solver_plugin: solver_plugins::CeresSolver
    ceres_linear_solver: SPARSE_NORMAL_CHOLESKY
    ceres_preconditioner: SCHUR_JACOBI
    ceres_trust_strategy: LEVENBERG_MARQUARDT
    ceres_dogleg_type: TRADITIONAL_DOGLEG
    ceres_loss_function: None

    # ROS Parameters
    odom_frame: odom
    map_frame: map
    base_frame: base_footprint
    scan_topic: /scan
    mode: localization

    # if you'd like to start localizing on bringup in a map and pose
    #map_file_name: test_steve
    #map_start_pose: [5.0, 1.0, 0.0]
    map_file_name: /home/<your_username>/slam_ws/big_pose_graph/my_pose
    map_start_pose: [-2.0, -0.5, 0.0]

    debug_logging: false
    throttle_scans: 1
    transform_publish_period: 0.02 #if 0 never publishes odometry
    map_update_interval: 5.0
    resolution: 0.05
    min_laser_range: 0.0 #for rastering images
    max_laser_range: 20.0 #for rastering images
    minimum_time_interval: 0.5
    transform_timeout: 0.2
    tf_buffer_duration: 30.
    stack_size_to_use: 40000000 #// program needs a larger stack size to serialize large maps

    # General Parameters
    use_scan_matching: true
    use_scan_barycenter: true
    minimum_travel_distance: 0.5
    minimum_travel_heading: 0.5
    scan_buffer_size: 3
    scan_buffer_maximum_scan_distance: 10.0
    link_match_minimum_response_fine: 0.1  
    link_scan_maximum_distance: 1.5
    do_loop_closing: true 
    loop_match_minimum_chain_size: 3
    loop_match_maximum_variance_coarse: 3.0  
    loop_match_minimum_response_coarse: 0.35    
    loop_match_minimum_response_fine: 0.45

    # Correlation Parameters - Correlation Parameters
    correlation_search_space_dimension: 0.5
    correlation_search_space_resolution: 0.01
    correlation_search_space_smear_deviation: 0.1 

    # Correlation Parameters - Loop Closure Parameters
    loop_search_space_dimension: 8.0
    loop_search_space_resolution: 0.05
    loop_search_space_smear_deviation: 0.03
    loop_search_maximum_distance: 3.0

    # Scan Matcher Parameters
    distance_variance_penalty: 0.5      
    angle_variance_penalty: 1.0    

    fine_search_angle_offset: 0.00349     
    coarse_search_angle_offset: 0.349   
    coarse_angle_resolution: 0.0349        
    minimum_angle_penalty: 0.9
    minimum_distance_penalty: 0.5
    use_response_expansion: true

and then just run this 4 launch in separate terminals (firstly go to workspace per each terminal)

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py use_sim_time:=True
ros2 launch nav2_bringup navigation_launch.py use_sim_time:=True
ros2 launch slam_toolbox localization_launch.py use_sim_time:=True
rviz2 -d src/navigation2/nav2_bringup/rviz/nav2_default_view.rviz

Finally send goal pose to upper-right corners of map like below.

Screenshot from 2024-07-08 16-12-14

Expected behavior

Localization should work well even with big pose graph

Actual behavior

ProcessLocalization() -> m_pGraph->AddEdges(pScan, cov) -> LinkNearChains(pScan, means, covariances) function chain in karto/Mapper.cpp takes 0.9 seconds when bigger pose is used for localization. This time consuming execution breaks navigation2 trajectory execution lifecycle because of time delay between map and odom frame.

Additional information

When i change link_scan_maximum_distance parameter as 0.5, it works. But with default configuration, below parts of code becomes mostly time-consuming.

https://github.com/SteveMacenski/slam_toolbox/blob/a108d26d190371ea60675935f8a216586d1d23dc/lib/karto_sdk/src/Mapper.cpp#L2889-L2892

https://github.com/SteveMacenski/slam_toolbox/blob/a108d26d190371ea60675935f8a216586d1d23dc/lib/karto_sdk/src/Mapper.cpp#L1493

https://github.com/SteveMacenski/slam_toolbox/blob/a108d26d190371ea60675935f8a216586d1d23dc/lib/karto_sdk/src/Mapper.cpp#L1643-L1660

In my case, FindNearChains gives 70 sample. ScanMacher does its own job in approximately 0.01~0.02 seconds. Total time approximately equals 0.8 seconds.

It would be great that you give advice for this problem. Thank you in advance.