Closed ceccocats closed 1 year ago
In this repo you can find a replica of the bug: https://github.com/ceccocats/iceoryx_bug
@ceccocats this is an amazingly detailed bug report. Thanks a lot for the effort!
I will dig into it and let you know what I have found out.
Just for the others, to reproduce the bug use the iceoryx environment and install some missing packages:
cd iceoryx
./tools/scripts/ice_env.sh enter ros:galactic
apt install ros-galactic-desktop ros-galactic-ros-testing
and then proceed as it is described in the iceoryx_bug repo.
What I find curious is that the listener receives the first couple of hundred messages and then suddenly stalls and does nothing anymore. It is then also impossible to kill it with CTRL+C or by sending SIGTERM.
@MatthiasKillat could this maybe be linked to the DDS bug we encountered some time ago, where some hashes were off and this led to a segfault due to hash collisions?
Another interesting observation. When I start the setup with (in separate terminals)
/opt/ros/galactic/bin/iox-roudi -c roudi_config.toml
ros2 run demo_nodes_cpp talker_loaned_message
ros2 run demo_nodes_cpp listener
and then try to terminate RouDi while the listener is stalled, RouDi hangs. It seems there is some kind of deadlock in there.
@ceccocats I created an issue on the cyclone-dds side as well: https://github.com/eclipse-cyclonedds/cyclonedds/issues/1445 since I suspect there may be a couple of issues here. @MatthiasKillat maybe you have some insights.
It does not look like a CycloneDDS problem or a usage problem. It looks more like a problem in resource cleanup. We need to trace why the wrong (?) pointer is used in the cleanup. I am not too familiar with the cleanup strategy, but the subscriber is marked as dead and the chunks it still owns are cleaned up.
The problem appears to occur if some chunk that holds an invalid pointer is cleaned up. The question is: how can this happen if reference counting works properly? Was it already cleaned up? Was it never a valid pointer?
Was it maybe solved already in a later version (say master)? I am not aware of this specifically.
I've just tested the iceoryx_bug scenario with the latest iceoryx version on the master branch and ROS Rolling. Long story short: the crash and stall still occur; however, you need to terminate the listener before it stalls to reproduce the RouDi crash, and there are some additional crashes happening.
Ubuntu 22.04.1 LTS
ros-rolling-rclcpp: 17.0.0-1jammy.20220914.143855
rmw_cyclonedds: 5b67ae471ba81560b51bac6b1dacfbf271e0dd22
cyclonedds: 0.10.2
iceoryx: 5b67ae471ba81560b51bac6b1dacfbf271e0dd22
rmw_cyclonedds, cyclonedds and iceoryx have been cloned to ~/dds_ws/src.
cyclonedds has been slightly modified to work with the latest iceoryx (diff).
sudo apt remove ros-rolling-rmw-cyclonedds-cpp ros-rolling-cyclonedds; sudo apt remove "ros-rolling-iceoryx*" was executed before building dds_ws.
The iceoryx_bug repo was used (diff).
I observed hangs of the listener after a few hundred received messages, like @elfenpiff mentioned. However, in ~70% of the cases there were segfaults instead of hangs in the listener process. After killing the stalled listener using pkill -KILL listener, RouDi did not crash though.
When I managed to kill the listener using pkill listener (without -KILL) before it crashed/stalled itself, the iox-roudi process segfaulted with the same backtrace as reported by @ceccocats.
When I was alternating between pkill -KILL listener before the hang/crash, pkill -KILL listener after the hang, and letting the listener segfault itself, after 10 to 20 runs the listener crashed with the following message:
2022-10-26 13:58:47.356 [Warn ]: ICEORYX error! POPO__CHUNK_RECEIVER_INVALID_CHUNK_TO_RELEASE_FROM_USER
listener: /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_hoofs/source/error_handling/error_handler.cpp:43: static void iox::ErrorHandler::reactOnErrorLevel(iox::ErrorLevel, const char*): Assertion `false' failed.
[ros2run]: Aborted
In the case of hanging, the listener process loops infinitely here, alternating between two values of the current variable.
72 for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
(gdb) n
74 if (!m_listData[current].isLogicalNullptr())
(gdb) n
77 if (m_listData[current].getChunkHeader() == chunkHeader)
(gdb) n
100 previous = current;
(gdb) p previous
$10 = 28
(gdb) p current
$11 = 26
(gdb) n
72 for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
(gdb) n
74 if (!m_listData[current].isLogicalNullptr())
(gdb) n
77 if (m_listData[current].getChunkHeader() == chunkHeader)
(gdb) n
100 previous = current;
(gdb) p previous
$12 = 26
(gdb) p current
$13 = 28
(gdb) bt
#0 iox::popo::UsedChunkList<2049u>::remove (this=0x7fffe1e783d0, chunkHeader=0x7fffc7c253f0, chunk=...)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:100
#1 0x00007ffff6c72992 in iox::popo::ChunkReceiver<iox::popo::ChunkReceiverData<2048u, iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > >::release (this=0x7fffffffa8f0,
chunkHeader=0x7fffc7c253f0)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver.inl:104
#2 0x00007ffff6c72407 in iox::popo::SubscriberPortUser::releaseChunk (this=0x7fffffffa8e0,
chunkHeader=0x7fffc7c253f0)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/source/popo/ports/subscriber_port_user.cpp:73
#3 0x00007ffff724ad1e in iox_sub_release_chunk (self=0x555555605db8, userPayload=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_binding_c/source/c_subscriber.cpp:166
#4 0x00007ffff74a2d45 in dds_data_allocator_free (data_allocator=0x55555560c0e8, ptr=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/cyclonedds/src/core/ddsc/src/dds_data_allocator.c:155
#5 0x00007ffff7569c58 in fini_and_free_sample<CddsSubscription*> (entity=@0x7fffffffa988: 0x55555560c0a0,
loaned_message=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:1392
#6 0x00007ffff755ef4c in return_loaned_message_from_subscription_int (subscription=0x555555605a10,
loaned_message=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3547
#7 0x00007ffff755efae in rmw_return_loaned_message_from_subscription (subscription=0x555555605a10,
loaned_message=0x7fffc7c25458)
at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3566
#8 0x00007ffff78b7274 in rcl_return_loaned_message_from_subscription (subscription=0x5555556051a0,
loaned_message=0x7fffc7c25458) at ./src/rcl/subscription.c:656
#9 0x00007ffff7eab134 in rclcpp::Executor::execute_subscription (
subscription=std::shared_ptr<rclcpp::SubscriptionBase> (use count 3, weak count 2) = {...})
at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#10 0x00007ffff7eab7cf in rclcpp::Executor::execute_any_executable (this=0x7fffffffbe30, any_exec=...)
at ./src/rclcpp/executor.cpp:518
#11 0x00007ffff7eb3e30 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffffbe30) at ./src/rclcpp/executors/single_threaded_executor.cpp:37
#12 0x00005555555593a9 in main (argc=1, argv=0x7fffffffcc48) at /home/matej/rtv/iceoryx_bug/build/demo_nodes_cpp/rclcpp_components/node_main_listener.cpp:65
(gdb)
In the case of a SEGFAULT, some invalid pointers appear here.
[INFO] [1666780094.301991786] [listener]: I heard
[INFO] [1666780094.302168835] [listener]: I heard
[INFO] [1666780094.302351658] [listener]: I heard
[INFO] [1666780094.302448285] [listener]: I heard
Thread 1 "listener" received signal SIGSEGV, Segmentation fault.
0x00007ffff6c6495c in iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::computeRawPtr (this=0x2d10000) at /home/matej/rtv/dds_ws/install/iceoryx_hoofs/include/iceoryx/v2.90.0/iceoryx_hoofs/memory/relative_pointer.inl:239
239 return getPtr(segment_id_t{m_id}, m_offset);
(gdb) bt
#0 0x00007ffff6c6495c in iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::computeRawPtr (
this=0x2d10000)
at /home/matej/rtv/dds_ws/install/iceoryx_hoofs/include/iceoryx/v2.90.0/iceoryx_hoofs/memory/relative_pointer.inl:239
#1 0x00007ffff6c648b0 in iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::get (this=0x2d10000)
at /home/matej/rtv/dds_ws/install/iceoryx_hoofs/include/iceoryx/v2.90.0/iceoryx_hoofs/memory/relative_pointer.inl:112
#2 0x00007ffff6c64ea6 in iox::mepoo::ShmSafeUnmanagedChunk::getChunkHeader (this=0x7fffe1f2c220)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/source/mepoo/shm_safe_unmanaged_chunk.cpp:93
#3 0x00007ffff6c7334e in iox::popo::UsedChunkList<2049u>::remove (this=0x7fffe1f2a210,
chunkHeader=0x7fffc7c25bd0, chunk=...)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:77
#4 0x00007ffff6c72992 in iox::popo::ChunkReceiver<iox::popo::ChunkReceiverData<2048u, iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > >::release (this=0x7fffffffa8f0,
chunkHeader=0x7fffc7c25bd0)
at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver.inl:104
#5 0x00007ffff6c72407 in iox::popo::SubscriberPortUser::releaseChunk (this=0x7fffffffa8e0, chunkHeader=0x7fffc7c25bd0) at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_posh/source/popo/ports/subscriber_port_user.cpp:73
#6 0x00007ffff724ad1e in iox_sub_release_chunk (self=0x55555560a468, userPayload=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/iceoryx/iceoryx_binding_c/source/c_subscriber.cpp:166
#7 0x00007ffff74a2d45 in dds_data_allocator_free (data_allocator=0x555555610498, ptr=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/cyclonedds/src/core/ddsc/src/dds_data_allocator.c:155
#8 0x00007ffff7569c58 in fini_and_free_sample<CddsSubscription*> (entity=@0x7fffffffa988: 0x555555610450, loaned_message=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:1392
#9 0x00007ffff755ef4c in return_loaned_message_from_subscription_int (subscription=0x555555609dc0, loaned_message=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3547
#10 0x00007ffff755efae in rmw_return_loaned_message_from_subscription (subscription=0x555555609dc0, loaned_message=0x7fffc7c25c38) at /home/matej/rtv/dds_ws/src/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:3566
#11 0x00007ffff78b7274 in rcl_return_loaned_message_from_subscription (subscription=0x555555609550, loaned_message=0x7fffc7c25c38) at ./src/rcl/subscription.c:656
#12 0x00007ffff7eab134 in rclcpp::Executor::execute_subscription (subscription=std::shared_ptr<rclcpp::SubscriptionBase> (use count 3, weak count 2) = {...}) at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#13 0x00007ffff7eab7cf in rclcpp::Executor::execute_any_executable (this=0x7fffffffbe30, any_exec=...) at ./src/rclcpp/executor.cpp:518
#14 0x00007ffff7eb3e30 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffffbe30) at ./src/rclcpp/executors/single_threaded_executor.cpp:37
#15 0x00005555555593a9 in main (argc=1, argv=0x7fffffffcc48) at /home/matej/rtv/iceoryx_bug/build/demo_nodes_cpp/rclcpp_components/node_main_listener.cpp:65
(gdb)
(gdb) p m_offset
Cannot access memory at address 0x2d10008
(gdb) p &m_offset
$1 = (iox::memory::RelativePointer<iox::mepoo::ChunkHeader>::offset_t *) 0x2d10008
(gdb) p m_id
Cannot access memory at address 0x2d10000
(gdb) p &m_id
$2 = (iox::memory::segment_id_underlying_t *) 0x2d10000
@elBoberido you are most familiar with the used chunk list, and since there may be an infinite loop in there, I tagged you here.
@afrixs thank you for the very detailed report! I will look into it.
I've just tested the bug with the versions used by the ROS 2 packages (rmw_cyclonedds: 1.4.1, cyclonedds: 0.9.0, iceoryx: v2.0.2) and segfaults seem to occasionally happen on the listener side too. The only difference was that the new iox-roudi needs an extra argument -m on to remove killed applications; with this argument the behavior is exactly the same on master and v2.0.2.
@afrixs @ceccocats Thank you for your support and help so far. I think the next step is to create an iceoryx-only example to reproduce the bug (and to ensure that it really originates from iceoryx, so we can exclude cyclone-dds and ROS).
I would write one in the next few days, but if one of you would like to do it I would be more than happy to let you; just let me know.
A good approach would be to create a branch from the current master, take the iceoryx_examples/icedelivery example, and adjust it so that it behaves in the same way. We also have a listener example in iceoryx_examples/callback.
I can take it, thanks for pointing me to the next step
@afrixs cool, thank you very much :smile:
If you need support just let me know!
The example can be found here.
Currently I'm failing to reproduce the crashes/stalls using pure iceoryx (icedelivery/iox-cpp-publisher + icedelivery/iox-cpp-subscriber). However, I've succeeded in reproducing them with a pure iceoryx publisher and a ROS 2 subscriber (icedelivery/iox-cpp-publisher + ros2 run demo_nodes_cpp listener).
Tomorrow I'm going to make the iceoryx subscriber replicate the CycloneDDS reader behavior more closely, to see where the crashes start to occur.
@elfenpiff @afrixs my gut tells me this is a synchronization issue. After having a closer look, it seems there is a small window where the subscriber has removed the sample from the UsedChunkList but the memory is not yet synchronized. When the subscriber is killed within this window and RouDi does the cleanup, we have a double free. The problem can be solved; the question is just how expensive it will be.
I'm currently investigating the problem and you're right about it being related to synchronization, but maybe the solution for this particular bug scenario (which would be much more difficult to reproduce with such a small window) is simpler. There was a lock missing around the call to iox_sub_release_chunk in cyclonedds (there is a lock, but it locks a different mutex than the one used for iox_sub_take_chunk). In my tests, calling these two methods from different threads at the same time caused crashes very similar to the ones that were reported.
After adding the correct lock the crashes are gone; however, the test subscriber keeps stalling (in a different way), which is what I'm currently analyzing. So it looks like we'll be moving to the cyclonedds repo with this issue, but first I want to make sure what's happening with the persisting stall...
Ok, the reason for the "stalling" was just my debugging clutter, so things work properly now: no crashes, no hangs. Since I assume there was no intention for iox_sub_take_chunk and iox_sub_release_chunk to be thread-safe, I think this issue can be closed here and I'll create a PR in cyclonedds.
Btw, iox_sub_release_chunk(subscriber, userPayload); is missing in the ice_c_callbacks_subscriber.c example; that's one flaw I found here. I'll create a PR for that.
@afrixs Great work and thank you for taking care of this!
@afrixs hahaha, I started my last comment with "my gut tells me the subscriber is accessed from multiple threads without synchronization" and then had another look at the UsedChunkList to be sure we do not have an issue there. You are right, the window is too small to reproduce it that reliably. Great that you found the root cause of your problem, but we also need to fix the synchronization issue in the UsedChunkList.
@afrixs Great find. I am not sure whether I am the culprit for missing the lock or whether the code was changed in the meantime. In C it is not possible to consistently enforce locking by, say, using lock guards and proper encapsulation (struct members are accessible).
In any case, as already mentioned, the subscriber is not thread-safe. In C++ this is trivial to solve in a default way with a wrapper class and a mutex. @elBoberido I still envision making the subscriber lock-free thread-safe, which includes solving the problem you mentioned. This is not trivial though :-(
In C, more discipline is required to lock the mutex manually, or one can create a wrapper struct with a mutex and some functions that take care of the locking. I was under the impression that this was done, but it may have been changed in the meantime. Unfortunately it is still easy to mess up by not locking and by using the subscriber field of this wrapper struct directly. As @afrixs correctly pointed out, the mutex that needs to be locked is the one locked with shm_lock_iox_sub, see.
All these problems would go away if the subscriber were thread-safe in iceoryx itself, but this would also change the contract and reduce performance.
@elBoberido One could argue for additionally providing thread-safe versions of the publisher and subscriber, but this is additional API to test in C++. It might be worth it to have it as the default in the C API to avoid these problems, though.
Say the C API subscriber uses a mutex-protected C++ subscriber under the hood. Of course, this reduces performance when it is not needed...
@MatthiasKillat @elfenpiff after creating #1771 I had an even closer look at the UsedChunkList and came to the conclusion that we do not have any issue there. It was just me being a little bit confused :smile:
cc @afrixs
Still not resolved, I believe (as of 2.0.3). Correct me if I'm wrong or if someone is still looking / has found a solution.
The problem turned out to be in CycloneDDS; I believe it has been resolved by this PR.
@sergmister I think the issue is solved for a dying subscriber. Did you still experience issues with the latest CycloneDDS? There is still an issue with a dying publisher, which uses a mutex due to the history feature for late subscribers. This mutex is usually uncontested and does not affect performance, but if the publisher dies at the wrong time and RouDi tries to clean up, it will encounter similar problems. #1710 would fix this issue for the publisher, and it is also related to #325 in case the monitoring is not active and there are multiple publishers.
If you do not encounter the problem with the latest CycloneDDS, I would close this issue for the subscriber.
@elBoberido This issue seems to be resolved on the latest releases/0.10.x branch of CycloneDDS, at least in my tests using ROS 2 Humble. For anyone else building ROS 2: just after you gather all the sources with vcs import, run:
rm -r ${ROS_ROOT}/src/cyclonedds
git -C ${ROS_ROOT}/src/ clone --depth 1 https://github.com/eclipse-cyclonedds/cyclonedds.git --branch releases/0.10.x
@sergmister nice to hear :)
Can you tell me, is it possible to compile cyclonedds as a workspace package to replace the cyclonedds shipped with ROS?
@sgf201 I'm not that familiar with how workspaces work in ROS, but I think it should be possible. You may get more help in the Cyclone DDS repo.
Required information
Operating system: Ubuntu 20.04 LTS
Compiler version: GCC 9.4.0
Observed result or behavior: All of this behavior occurred on the iceoryx version shipped with Cyclone DDS on ROS 2 Galactic. I see similar behavior with ROS 2 Humble as well.
RouDi segfaults when a subscriber dies. It happens especially when you have a publisher with a high publish rate. RouDi segfaults after it detects that the node is dead and tries to clean up the memory.
Tracking with GDB, this is the backtrace:
I inserted a debug print in the segfaulting function:
It seems that m_chunkManagement is a wrong pointer, compared with the previous ones:
It looks like m_chunkManagement is corrupted somewhere.