eProsima / Micro-XRCE-DDS

An XRCE DDS implementation. Looking for commercial support? Contact info@eprosima.com
Apache License 2.0
156 stars 17 forks source link

Segmentation fault in microCDR #73

Closed TieneSabor closed 3 years ago

TieneSabor commented 3 years ago

I was recently trying to test micro ros client and agent using an app similar to rtt-test demo code, but written in rclc instead. (https://github.com/micro-ROS/micro-ROS-rtt) I found that after publish and subscribe certain amount of data, the client would failed and showed segmentation fault. My test case was that the client continuously publish a roughly 2k Byte data and subscribe the same topic in each round, and it failed at about 16k th round.

Core was generated by `install/micro_ros_demos_rclc/lib/micro_ros_demos_rclc/rtt_test_host/rtt_test_ho'.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f88af8d3187 in ucdr_deserialize_uint8_t (ub=ub@entry=0x7fff024f5380, value=value@entry=0x7fff024f52fa "\t") at /root/microros_ws/src/eProsima/Micro-CDR/src/c/types/basic.c:181
181         *value = (uint8_t)*ub->iterator;
[Current thread is 1 (Thread 0x7f88af3a7000 (LWP 79974))]
(gdb) bt
#0  0x00007f88af8d3187 in ucdr_deserialize_uint8_t (ub=ub@entry=0x7fff024f5380, value=value@entry=0x7fff024f52fa "\t") at /root/microros_ws/src/eProsima/Micro-CDR/src/c/types/basic.c:181
#1  0x00007f88af8f6a88 in uxr_deserialize_submessage_header (ub=ub@entry=0x7fff024f5380, id=id@entry=0x7fff024f52fa "\t", flags=flags@entry=0x7fff024f52fb "", length=length@entry=0x7fff024f52fc)
    at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/serialization/xrce_subheader.c:15
#2  0x00007f88af8f0891 in uxr_read_submessage_header (ub=ub@entry=0x7fff024f5380, submessage_id=submessage_id@entry=0x7fff024f52fa "\t", length=length@entry=0x7fff024f52fc, 
    flags=flags@entry=0x7fff024f52fb "") at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/submessage.c:23
#3  0x00007f88af8eea5b in read_submessage_list (session=session@entry=0x7f88afc0ca98 <custom_sessions+632>, submessages=submessages@entry=0x7fff024f5380, stream_id=..., stream_id@entry=...)
    at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/session.c:594
#4  0x00007f88af8eefa2 in read_stream (session=session@entry=0x7f88afc0ca98 <custom_sessions+632>, ub=ub@entry=0x7fff024f5430, stream_id=..., seq_num=<optimized out>)
    at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/session.c:580
#5  0x00007f88af8ef1ce in read_message (ub=0x7fff024f5430, session=0x7f88afc0ca98 <custom_sessions+632>) at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/session.c:543
#6  listen_message (session=session@entry=0x7f88afc0ca98 <custom_sessions+632>, poll_ms=poll_ms@entry=100) at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/session.c:411
#7  0x00007f88af8ef501 in listen_message_reliably (session=session@entry=0x7f88afc0ca98 <custom_sessions+632>, poll_ms=<optimized out>)
    at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/session.c:448
#8  0x00007f88af8ef720 in uxr_run_session_until_data (session=0x7f88afc0ca98 <custom_sessions+632>, timeout_ms=100)
    at /root/microros_ws/src/eProsima/Micro-XRCE-DDS-Client/src/c/core/session/session.c:218
#9  0x00007f88afbff38a in rmw_wait (subscriptions=0x556fdbbbcb48, guard_conditions=0x556fdbbbcb60, services=0x556fdbbbcb90, clients=0x556fdbbbcb78, events=0x556fdbbbcba8, wait_set=0x556fdbbbed90, 
    wait_timeout=0x7fff024f5670) at /root/microros_ws/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/rmw_wait.c:82
#10 0x00007f88afdf4d89 in rcl_wait () from /root/ros2_foxy/install/rcl/lib/librcl.so
#11 0x00007f88afd9aa3f in rclc_executor_spin_some (executor=0x7fff02505d00, timeout_ns=100000000) at /root/microros_ws/src/uros/rclc/rclc/src/rclc/executor.c:974
#12 0x00007f88afd9ac9e in rclc_executor_spin (executor=0x7fff02505d00) at /root/microros_ws/src/uros/rclc/rclc/src/rclc/executor.c:1000
#13 0x0000556fda6d6fc3 in main ()

here's a print screen for easy reading image

Please tell me what else I need to provide to solve this, thanks!

pablogs9 commented 3 years ago

Could you provide code and instructions for replicating this?

TieneSabor commented 3 years ago

Sorry I spend some time to make a simplified code to replicate the situation, here is the code:

#include <rcl/rcl.h>
#include <rcl/error_handling.h>
#include <rclc/rclc.h>
#include <rclc/executor.h>

#include <std_msgs/msg/string.h>

#include <stdio.h>
#include <unistd.h>

#define ARRAY_LEN 1900

#define RCCHECK(fn) { rcl_ret_t temp_rc = fn; if((temp_rc != RCL_RET_OK)){printf("Failed status on line %d: %d. Aborting.\n",__LINE__,(int)temp_rc); return 1;}}
#define RCSOFTCHECK(fn) { rcl_ret_t temp_rc = fn; if((temp_rc != RCL_RET_OK)){printf("Failed status on line %d: %d. Continuing.\n",__LINE__,(int)temp_rc);}}

rcl_publisher_t publisher;
rcl_subscription_t sub;
std_msgs__msg__String pub_msg; 
std_msgs__msg__String sub_msg;

int counter = 0;

void timer_callback(rcl_timer_t * timer, int64_t last_call_time)
{
    (void) last_call_time;
    if (timer != NULL) {
        counter++;
        pub_msg.data.size = 1900;
        RCSOFTCHECK(rcl_publish(&publisher, &pub_msg, NULL));
    }
}

void subscription_callback(const void * sub_msgin)
{
        const std_msgs__msg__String * sub_msg = (const std_msgs__msg__String *)sub_msgin;
        printf("received %dth data \r\n", counter);
}

int main(int argc, const char * const * argv)
{
    rcl_allocator_t allocator = rcl_get_default_allocator();
    rclc_support_t support;

    // create init_options
    RCCHECK(rclc_support_init(&support, argc, argv, &allocator));

    // create node
    rcl_node_t node = rcl_get_zero_initialized_node();
    RCCHECK(rclc_node_init_default(&node, "string_node", "", &support));

    // create publisher
    RCCHECK(rclc_publisher_init_default(
        &publisher,
        &node,
        ROSIDL_GET_MSG_TYPE_SUPPORT(std_msgs, msg, String),
        "/string_publisher"));
        // create subscriber
    RCCHECK(rclc_subscription_init_default(&sub, &node,
                ROSIDL_GET_MSG_TYPE_SUPPORT(std_msgs, msg, String), "/string_publisher"));

    // create timer,
    rcl_timer_t timer = rcl_get_zero_initialized_timer();
    const unsigned int timer_timeout = 1;
    RCCHECK(rclc_timer_init_default(
        &timer,
        &support,
        RCL_MS_TO_NS(timer_timeout),
        timer_callback));

    // create executor
    rclc_executor_t executor = rclc_executor_get_zero_initialized_executor();
    RCCHECK(rclc_executor_init(&executor, &support.context, 2, &allocator));
    RCCHECK(rclc_executor_add_timer(&executor, &timer));
    RCCHECK(rclc_executor_add_subscription(&executor, &sub, &sub_msg,
                &subscription_callback, ON_NEW_DATA));

    // Fill the array with a known sequence
    pub_msg.data.data = (char * ) malloc(ARRAY_LEN * sizeof(char));
    for(int i=0;i<1899;i++){
        pub_msg.data.data[i] = 'A';
    }
    pub_msg.data.data[1899] = '\0';
    pub_msg.data.size = 1900;
    pub_msg.data.capacity = ARRAY_LEN;

    sub_msg.data.data = (char * ) malloc(ARRAY_LEN * sizeof(char));
        sub_msg.data.size = 1900;
        sub_msg.data.capacity = ARRAY_LEN;

    rclc_executor_spin(&executor);

    RCCHECK(rcl_subscription_fini(&sub, &node))
    RCCHECK(rcl_publisher_fini(&publisher, &node))
    RCCHECK(rcl_node_fini(&node))
}
TieneSabor commented 3 years ago
  1. To run the code, please replace the "main.c" in microros_ws/src/uros/micro-ROS-demos/rclc/string_publisher(or subscriber)/.
  2. The code contains a timer that publish a 1.9kB string and a subscription that receive the data, both in a same node, and in my case, the fault always occurs when after 16383 th publish (Actually in executor wait as showed in the gdb result above).
  3. Such fault won't happen when timer and subscription is in separate file and executed individually.
  4. The environment is in a docker, ubuntu 20.04, and ros2 foxy( ros and micro ros should be the newest version because I just update both last week)
  5. Edit: both agent/client are on host, in case you don't know.
pablogs9 commented 3 years ago

I'm going to test this, I'll be back in a while

pablogs9 commented 3 years ago

I have it replicated and I have found a bug in the XRCE-DDS Client. Working on the solution I'll be back in a while.

pablogs9 commented 3 years ago

I have prepared a patch here so this segfault should not appear anymore. But it is still pending to determine the XRCE-DDS Client fragmentation corner case when the sub-message sequence number rollover.

Closing since solved.

Please reopen if the problem still happens when the PR is merged.

pablogs9 commented 3 years ago

Update, also we have found a problem with the middleware history, solved here: https://github.com/micro-ROS/micro_ros_setup/pull/292

Please notify when you can retry your code and everything works.

TieneSabor commented 3 years ago

Hello, Thanks for the help and I have checked out those changes. I will retry codes ASAP. However, I am wondering why this situation only happens when stream->base.history is not power of two?

TieneSabor commented 3 years ago

I had done the test with the new patch, the segmentation fault have disappeared. However, I tried to change the XRCE stream history size to 32 without add lines in input_reliable_stream.c, and the segmentation fault still happened.

pablogs9 commented 3 years ago

As far as I know, the library is implemented in order to handle the reliability of the streams based on these assumptions. Because each XRCE message has a sequence number from 0 to 2^16-1, so having a power of two buffer historic size allow us to assign sequence numbers to historic slots in an optimal way using a modulo operation.

This segmentation fault is happening only when XRCE is fragmenting and the sequence number rolls over. So:

Another solution would be to increase MTU in order to avoid XRCE fragmenting packages.

TieneSabor commented 3 years ago

Thanks for the explanation!