Closed ceccocats closed 1 year ago
In this repo you can find a replica of the bug: https://github.com/ceccocats/iceoryx_bug
@ceccocats this is an amazingly detailed bug report. Thanks a lot for the effort!
I will dig into it and let you know what I have found out.
Just for the others, to reproduce the bug use the iceoryx environment and install some missing packages:
cd iceoryx
./tools/scripts/ice_env.sh enter ros:galactic
apt install ros-galactic-desktop ros-galactic-ros-testing
and then proceed as it is described in the iceoryx_bug repo.
What I find curious is that the listener receives the first couple of hundred messages and then suddenly stalls and does nothing anymore. It is then also impossible to kill it with CTRL+C or by sending SIGTERM.
@MatthiasKillat could this maybe be linked to the DDS bug we encountered some time ago, where some hashes were off and this led to a segfault due to hash collisions?
Another interesting observation. When I start the setup with (in separate terminals)
/opt/ros/galactic/bin/iox-roudi -c roudi_config.toml
ros2 run demo_nodes_cpp talker_loaned_message
ros2 run demo_nodes_cpp listener
and then try to terminate RouDi while the listener is stalled, RouDi hangs. It seems there is some kind of deadlock in there.
@ceccocats I created an issue on the cyclone-dds side as well: https://github.com/eclipse-cyclonedds/cyclonedds/issues/1445 since I suspect there may be a couple of issues here. @MatthiasKillat maybe you have some insights.
It does not look like a CycloneDDS problem or a usage problem. It looks more like a problem in resource cleanup. We need to trace why the wrong (?) pointer is used in the cleanup. I am not too familiar with the cleanup strategy, but the subscriber is marked as dead and the chunks it still owns are cleaned up.
The problem appears to occur if some chunk that holds an invalid pointer is cleaned up. The question is: how can this happen if reference counting works properly? Was it already cleaned up? Was it never a valid pointer?
Was it maybe solved already in a later version (say master)? I am not aware of this specifically.
I've just tested the iceoryx_bug scenario with the latest iceoryx version on the master branch and ROS Rolling. Long story short: the crash and stall still occur; however, you need to terminate the listener before it stalls to reproduce the RouDi crash, and there are some additional crashes happening.
Ubuntu 22.04.1 LTS
ros-rolling-rclcpp: 17.0.0-1jammy.20220914.143855
rmw_cyclonedds: 5b67ae471ba81560b51bac6b1dacfbf271e0dd22
cyclonedds: 0.10.2
iceoryx: 5b67ae471ba81560b51bac6b1dacfbf271e0dd22
rmw_cyclonedds, cyclonedds and iceoryx have been cloned to ~/dds_ws/src.
cyclonedds has been slightly modified to work with the latest iceoryx (diff).
sudo apt remove ros-rolling-rmw-cyclonedds-cpp ros-rolling-cyclonedds; sudo apt remove "ros-rolling-iceoryx*" was executed before building dds_ws.
The iceoryx_bug repo was used (diff).
I observed hangs of the listener after a few hundred received messages, like @elfenpiff mentioned. However, in ~70% of the cases there were segfaults instead of hangs in the listener process. After killing the stalled listener using pkill -KILL listener, RouDi did not crash though.
When I managed to kill the listener using pkill listener (without -KILL) before it crashed/stalled itself, the iox-roudi process segfaulted with the same backtrace as reported by @ceccocats.
When I was alternating between pkill -KILL listener before the hang/crash, pkill -KILL listener after the hang, and letting the listener segfault itself, after 10 to 20 runs the listener crashed with the following message:
2022-10-26 13:58:47.356 [Warn ]: ICEORYX error! POPO__CHUNK_RECEIVER_INVALID_CHUNK_TO_RELEASE_FROM_USER
listener: /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_hoofs/source/error_handling/error_handler.cpp:43: static void iox::ErrorHandler::reactOnErrorLevel(iox::ErrorLevel, const char*): Assertion `false' failed.
[ros2run]: Aborted
In the case of hanging, the listener process loops infinitely here, alternating between two values of the current variable.
72 for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
(gdb) n
74 if (!m_listData[current].isLogicalNullptr())
(gdb) n
77 if (m_listData[current].getChunkHeader() == chunkHeader)
(gdb) n
100 previous = current;
(gdb) p previous
$10 = 28
(gdb) p current
$11 = 26
(gdb) n
72 for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
(gdb) n
74 if (!m_listData[current].isLogicalNullptr())
(gdb) n
77 if (m_listData[current].getChunkHeader() == chunkHeader)
(gdb) n
100 previous = current;
(gdb) p previous
$12 = 26
(gdb) p current
$13 = 28
(gdb) bt
#0 iox::popo::UsedChunkList<2049u>::remove (this=0x7fffe1e783d0, chunkHeader=0x7fffc7c253f0, chunk=...)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:100
#1 0x00007ffff6c72992 in iox::popo::ChunkReceiver<iox::popo::ChunkReceiverData<2048u, iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > >::release (this=0x7fffffffa8f0,
chunkHeader=0x7fffc7c253f0)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver.inl:104
#2 0x00007ffff6c72407 in iox::popo::SubscriberPortUser::releaseChunk (this=0x7fffffffa8e0,
chunkHeader=0x7fffc7c253f0)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/source/popo/ports/subscriber_port_user.cpp:73
#3 0x00007ffff724ad1e in iox_sub_release_chunk (self=0x555555605db8, userPayload=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_binding_c/source/c_subscriber.cpp:166
#4 0x00007ffff74a2d45 in dds_data_allocator_free (data_allocator=0x55555560c0e8, ptr=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/cyclonedds/src/core/ddsc/src/dds_data_allocator.c:155
#5 0x00007ffff7569c58 in fini_and_free_sample<CddsSubscription*> (entity=@0x7fffffffa988: 0x55555560c0a0,
loaned_message=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:1392
#6 0x00007ffff755ef4c in return_loaned_message_from_subscription_int (subscription=0x555555605a10,
loaned_message=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3547
#7 0x00007ffff755efae in rmw_return_loaned_message_from_subscription (subscription=0x555555605a10,
loaned_message=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3566
#8 0x00007ffff78b7274 in rcl_return_loaned_message_from_subscription (subscription=0x5555556051a0,
loaned_message=0x7fffc7c25458) at ./src/rcl/subscription.c:656
#9 0x00007ffff7eab134 in rclcpp::Executor::execute_subscription (
subscription=std::shared_ptr<rclcpp::SubscriptionBase> (use count 3, weak count 2) = {...})
at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#10 0x00007ffff7eab7cf in rclcpp::Executor::execute_any_executable (this=0x7fffffffbe30, any_exec=...)
at ./src/rclcpp/executor.cpp:518
#11 0x00007ffff7eb3e30 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffffbe30) at ./src/rclcpp/executors/single_threaded_executor.cpp:37
#12 0x00005555555593a9 in main (argc=1, argv=0x7fffffffcc48) at /home/matej/rtv/iceoryx_bug/build/demo_nodes_cpp/rclcpp_components/node_main_listener.cpp:65
(gdb)
In the case of a SEGFAULT, some invalid pointers appear here.
[INFO] [1666780094.301991786] [listener]: I heard
[INFO] [1666780094.302168835] [listener]: I heard
[INFO] [1666780094.302351658] [listener]: I heard
[INFO] [1666780094.302448285] [listener]: I heard
Thread 1 "listener" received signal SIGSEGV, Segmentation fault.
0x00007ffff6c6495c in iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::computeRawPtr (this=0x2d10000) at /home/matej/rtv/dds_ws/install/iceoryx_hoofs/include/iceoryx/v2.90.0/iceoryx_hoofs/memory/relative_pointer.inl:239
239 return getPtr(segment_id_t{m_id}, m_offset);
(gdb) bt
#0 0x00007ffff6c6495c in iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::computeRawPtr (
this=0x2d10000)
at /home/matej/rtv/dds_ws/install/iceoryx_hoofs/include/iceoryx/v2.90.0/iceoryx_hoofs/memory/relative_pointer.inl:239
#1 0x00007ffff6c648b0 in iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::get (this=0x2d10000)
at /home/matej/rtv/dds_ws/install/iceoryx_hoofs/include/iceoryx/v2.90.0/iceoryx_hoofs/memory/relative_pointer.inl:112
#2 0x00007ffff6c64ea6 in iox::mepoo::ShmSafeUnmanagedChunk::getChunkHeader (this=0x7fffe1f2c220)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/source/mepoo/shm_safe_unmanaged_chunk.cpp:93
#3 0x00007ffff6c7334e in iox::popo::UsedChunkList<2049u>::remove (this=0x7fffe1f2a210,
chunkHeader=0x7fffc7c25bd0, chunk=...)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:77
#4 0x00007ffff6c72992 in iox::popo::ChunkReceiver<iox::popo::ChunkReceiverData<2048u, iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > >::release (this=0x7fffffffa8f0,
chunkHeader=0x7fffc7c25bd0)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver.inl:104
#5 0x00007ffff6c72407 in iox::popo::SubscriberPortUser::releaseChunk (this=0x7fffffffa8e0, chunkHeader=0x7fffc7c25bd0) at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/source/popo/ports/subscriber_port_user.cpp:73
#6 0x00007ffff724ad1e in iox_sub_release_chunk (self=0x55555560a468, userPayload=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_binding_c/source/c_subscriber.cpp:166
#7 0x00007ffff74a2d45 in dds_data_allocator_free (data_allocator=0x555555610498, ptr=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/cyclonedds/src/core/ddsc/src/dds_data_allocator.c:155
#8 0x00007ffff7569c58 in fini_and_free_sample<CddsSubscription*> (entity=@0x7fffffffa988: 0x555555610450, loaned_message=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:1392
#9 0x00007ffff755ef4c in return_loaned_message_from_subscription_int (subscription=0x555555609dc0, loaned_message=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3547
#10 0x00007ffff755efae in rmw_return_loaned_message_from_subscription (subscription=0x555555609dc0, loaned_message=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3566
#11 0x00007ffff78b7274 in rcl_return_loaned_message_from_subscription (subscription=0x555555609550, loaned_message=0x7fffc7c25c38) at ./src/rcl/subscription.c:656
#12 0x00007ffff7eab134 in rclcpp::Executor::execute_subscription (subscription=std::shared_ptr<rclcpp::SubscriptionBase> (use count 3, weak count 2) = {...}) at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#13 0x00007ffff7eab7cf in rclcpp::Executor::execute_any_executable (this=0x7fffffffbe30, any_exec=...) at ./src/rclcpp/executor.cpp:518
#14 0x00007ffff7eb3e30 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffffbe30) at ./src/rclcpp/executors/single_threaded_executor.cpp:37
#15 0x00005555555593a9 in main (argc=1, argv=0x7fffffffcc48) at /home/matej/rtv/iceoryx_bug/build/demo_nodes_cpp/rclcpp_components/node_main_listener.cpp:65
(gdb)
(gdb) p m_offset
Cannot access memory at address 0x2d10008
(gdb) p &m_offset
$1 = (iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::offset_t *) 0x2d10008
(gdb) p m_id
Cannot access memory at address 0x2d10000
(gdb) p &m_id
$2 = (iox::memory::segment_id_underlying_t *) 0x2d10000
@elBoberido you are most familiar with the used chunk list, and since there may be an infinite loop in there, I tagged you here.
@afrixs thank you for the very detailed report! I will look into it.
I've just tested the bug with the versions used by the ROS 2 packages (rmw_cyclonedds: 1.4.1, cyclonedds: 0.9.0, iceoryx: v2.0.2) and segfaults seem to occasionally happen on the listener side too. The only difference was that the new iox-roudi needs an extra argument -m on to remove killed applications; with this argument the behavior is exactly the same on master and v2.0.2.
@afrixs @ceccocats Thank you for your support and help so far. I think the next step is to create an iceoryx-only example to reproduce the bug (and to ensure that it really originates from iceoryx, so we can exclude cyclone-dds and ROS).
I would write one in the next few days, but if one of you would like to do it I would be more than happy to let you; just let me know.
A good approach would be to create a branch from the current master, take the iceoryx_examples/icedelivery example, and adjust it so that it behaves in the same way. We also have a listener example in iceoryx_examples/callback.
I can take it, thanks for pointing me to the next step
@afrixs cool, thank you very much :smile:
If you need support just let me know!
The example can be found here.
Currently I'm failing to reproduce the crashes/stalls using pure iceoryx (icedelivery/iox-cpp-publisher + icedelivery/iox-cpp-subscriber). However, I've succeeded in reproducing them with a pure iceoryx publisher and a ROS 2 subscriber (icedelivery/iox-cpp-publisher + ros2 run demo_nodes_cpp listener).
Tomorrow I'm going to make the iceoryx subscriber replicate the CycloneDDS reader behavior more closely, to see where the crashes start to occur.
@elfenpiff @afrixs my gut tells me this is a synchronization issue. After having a closer look, it seems there is a small window where the subscriber has removed the sample from the UsedChunkList but the memory is not yet synchronized. When the subscriber is killed within this window and RouDi does the cleanup, we have a double free. The problem can be solved; the question is just how expensive it will be.
I'm currently investigating the problem and you're right about it being related to synchronization, but maybe the solution for this particular bug scenario (which would be much more difficult to reproduce with such a small window) is simpler. There was a lock missing around the call to iox_sub_release_chunk in cyclonedds (there is a lock, but it locks a different mutex than the one used for iox_sub_take_chunk). In my tests, calling these two methods from different threads at the same time caused crashes very similar to the ones that were reported.
After adding the correct lock the crashes are gone; however, the test subscriber keeps stalling (in a different way), which is what I'm currently analyzing. So it looks like we'll be moving to the cyclonedds repo with this issue, but first I want to make sure what's happening with the persisting stall...
Ok, the reason for the "stalling" was just my debugging clutter, so things work properly now: no crashes, no hangs. Since I assume there was no intention for iox_sub_take_chunk and iox_sub_release_chunk to be thread-safe, I think this issue can be closed here and I'll create a PR in cyclonedds.
Btw, iox_sub_release_chunk(subscriber, userPayload); is missing in the ice_c_callbacks_subscriber.c example; that's one flaw I found here. I'll create a PR for that.
@afrixs Great work and thank you for taking care of this!
@afrixs hahaha, I started my last comment with "my gut tells me the subscriber is accessed from multiple threads without synchronization" and then had another look at the UsedChunkList to be sure we do not have an issue there. You are right, the window is too small to reproduce it that reliably. Great that you found the root cause of your problem, but we also need to fix the synchronization issue in the UsedChunkList.
@afrixs Great find. I am not sure whether I am the culprit for missing the lock or whether the code was changed in the meantime. In C it is not possible to consistently enforce locking by, say, using lock guards and proper encapsulation (struct members are accessible).
In any case, as already mentioned, the subscriber is not thread-safe. In C++ this is trivial to solve in a default way with a wrapper class and a mutex. @elBoberido I still envision making the subscriber lock-free thread-safe, which includes solving the problem you mentioned. This is not trivial though :-(
In C, more discipline is required to lock the mutex manually, or one can create a wrapper struct with a mutex and some functions that take care of the locking. I was under the impression that this was done, but it may have been changed in the meantime. Unfortunately it is still easy to mess up by not locking and by using the subscriber field of this wrapper struct directly. As @afrixs correctly pointed out, the mutex that needs to be locked is the one locked with shm_lock_iox_sub, see.
All these problems would go away if the subscriber were thread-safe in iceoryx itself, but this would also change the contract and reduce performance.
@elBoberido One could argue for additionally providing thread-safe versions of the publisher and subscriber, but this is additional API to test in C++. It might be worth it to have it as the default in the C API to avoid these problems, though.
Say the C API subscriber uses a mutex-protected C++ subscriber under the hood. Of course, this reduces performance when it is not needed...
@MatthiasKillat @elfenpiff after creating #1771 I had an even closer look at the UsedChunkList and came to the conclusion that we do not have any issue there. It was just me being a little bit confused :smile:
cc @afrixs
Still not resolved, I believe (as of 2.0.3). Correct me if I'm wrong or if someone is still looking / has found a solution.
The problem turned out to be in CycloneDDS; I believe it has been resolved by this PR.
@sergmister I think the issue is solved for a dying subscriber. Did you still experience issues with the latest CycloneDDS? There is still an issue with a dying publisher, which uses a mutex due to the history feature for late subscribers. This mutex is usually uncontested and does not affect performance, but if the publisher dies at the wrong time and RouDi tries to clean up, it will encounter similar problems. #1710 would fix this issue for the publisher, and it is also related to #325 in case the monitoring is not active and there are multiple publishers.
If you do not encounter the problem with the latest CycloneDDS, I would close this issue for the subscriber.
@elBoberido This issue seems to be resolved on the latest releases/0.10.x branch of CycloneDDS, at least in my tests using ROS 2 Humble. For anyone else building ROS 2: just after you gather all the sources with vcs import, run:
rm -r ${ROS_ROOT}/src/cyclonedds
git -C ${ROS_ROOT}/src/ clone --depth 1 https://github.com/eclipse-cyclonedds/cyclonedds.git --branch releases/0.10.x
@sergmister nice to hear :)
Can you tell me, is it possible to compile cyclonedds as a workspace package to replace the cyclonedds shipped with ROS?
@sgf201 I'm not that familiar with how workspaces work in ROS, but I think it should be possible. You may get more help in the Cyclone DDS repo.
Required information
Operating system: Ubuntu 20.04 LTS
Compiler version: GCC 9.4.0
Observed result or behavior: All of this behavior occurred on the iceoryx version shipped with Cyclone DDS on ROS 2 Galactic. I see similar behavior with ROS 2 Humble as well.
RouDi segfaults when a subscriber dies. It happens especially when you have a publisher with a high publish rate. RouDi segfaults after it detects that the node is dead and tries to clean up the memory.
Tracking with GDB, this is the backtrace:
I inserted a debug print in the segfaulting function:
It seems that m_chunkManagement is a wrong pointer, compared with the previous ones:
It looks like m_chunkManagement is corrupted somewhere.