tobiasstarkwayve opened this issue 1 week ago
The only way I can see this happening is if m_usedListHead is modified between the start of the loop and the later if check. Is that possible?

Are you accessing the endpoint (publisher, subscriber, client, server) from multiple threads? I think this is the only way for the scenario you describe. The endpoints are not thread safe and you need to use e.g. a smart_lock with them. You can use multiple endpoints with the same ServiceDescription in multiple threads, but not the same instance of an endpoint across multiple threads without a mutex.
@elBoberido I don't think so. We don't interact with the subscription directly at all, just through the ROS executor. And even the multi-threaded ROS executor seems to meticulously return the loaned message right after the callback ran (i.e., while still under the callback-group protection against concurrent runs).
Looking at it from that lens, do the two accesses really need to run in parallel? Or is it enough for them to run in quick succession on different CPU cores? In that case one load of m_usedListHead might read from a stale CPU cache while the next load reads the updated value from memory.

I don't fully understand the intricacies of the memory fences in this code, so maybe I'm wrong, but without any I think this would be possible, right? (edit: on further investigation I'm not convinced an acquire fence on the cleanup/insert path would even help here. Anyway, let's first identify the problem, then speculate about solutions)
@tobiasstarkwayve if the endpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized.

I didn't have a deep look into the code but was relying on your investigation: "modified between the start of the loop and the later if check. Is that possible?". This is not possible if the endpoint is used in the intended way, i.e. only single threaded or with all accesses guarded by the same mutex. The fence in the UsedChunkList is not intended to make it thread safe and accessible from multiple threads, but only to have the memory synchronized when the application dies and RouDi does the cleanup, i.e. no concurrent access but a transfer of the single-threaded ownership to another participant.
@tobiasstarkwayve oh, can you print the thread ID in the insert and release methods? One thing to also keep in mind is that the samples also access the endpoint when they are released. So if you run in a multi-threaded environment and release samples without holding the mutex, one will also observe the issue you describe.

@tobiasstarkwayve I had a closer look at the code and I agree with your assessment. The only way this could happen is when the else branch is taken in the initial loop iteration. The only way this can happen is if there is an unsynchronized access to the UsedChunkList. My bet would be the release of the sample. Do you use the typed or the untyped API?
@tobiasstarkwayve if the endpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized.
Yes, that's true. But that's not quite what I meant. What I meant is a subtly different situation: there are two threads, both access the endpoint at different times but are synchronised only by an atomic variable.
Here is the code for the multi-threaded ROS executor (https://github.com/ros2/rclcpp/blob/c743c173e68d92af872cf163e10721a8dbe51dd0/rclcpp/src/rclcpp/executor.cpp). The multiple threads of that multi-threaded executor basically run get_next_executable and execute_any_executable in an endless loop. When get_next_executable picks up a subscription it marks the can_be_taken_from atomic bool as false (preventing it from being picked up again), and when execute_any_executable returns it sets it back to true.

So the scenario I'm thinking of is: the UsedChunkList is modified on one thread, can_be_taken_from is set to true, and the next access happens from a different thread.

Unfortunately I'm not enough of an expert on memory models to know whether this is actually plausible. But I definitely don't see a clear synchronisation point between the two. Granted, the store to can_be_taken_from is sequentially consistent, but I'm not sure the modifications in the usedChunkList "happen before" the store to can_be_taken_from.
Do you use the typed or the untyped API?
This is the stacktrace we observed when the test assertion failed:
Condition: previous < INVALID_INDEX in bool iox::popo::UsedChunkList<257>::remove(const mepoo::ChunkHeader *, mepoo::SharedChunk &) [Capacity = 257] is violated. (iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:94)
2024-07-02 21:45:34.180 [ Error ]: ICEORYX error! EXPECTS_ENSURES_FAILED
Unhandled C++ exception:
librmw_cyclonedds.so(iox::errorHandler(iox::Error, std::function<void ()> const&, iox::ErrorLevel)+0x68) [0xfffef4cd401c]
librmw_cyclonedds.so(iox::cxx::internal::Require(bool, char const*, int, char const*, char const*)+0x1d4) [0xfffef4cd3ae8]
librmw_cyclonedds.so(iox::popo::UsedChunkList<257u>::remove(iox::mepoo::ChunkHeader const*, iox::mepoo::SharedChunk&)+0x268) [0xfffef4cb4ac0]
librmw_cyclonedds.so(iox::popo::SubscriberPortUser::releaseChunk(iox::mepoo::ChunkHeader const*)+0x3c) [0xfffef4cb3cc4]
librmw_cyclonedds.so(iox_sub_release_chunk+0x60) [0xfffef4ca8a54]
librmw_cyclonedds.so(dds_data_allocator_free+0xc4) [0xfffef4c976b0]
librmw_cyclonedds.so(+0x1b261c) [0xfffef4bd261c]
librmw_cyclonedds.so(+0x198280) [0xfffef4bb8280]
librmw_cyclonedds.so(rmw_return_loaned_message_from_subscription+0x20) [0xfffef4bb7f8c]
our_binary(rcl_return_loaned_message_from_subscription+0x80) [0xaaaaab8528ac]
our_binary(rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)+0x218) [0xaaaaab79a8cc]
our_binary(rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)+0x278) [0xaaaaab799e18]
our_binary(rclcpp::executors::MultiThreadedExecutor::run(unsigned long)+0x2dc) [0xaaaaab7a3820]
So I think that is the untyped API, but I'm not 100% sure.
The can_be_taken_from is used with std::memory_order_seq_cst in store as well as in load. Without digging too deep into the code, this should be sufficient to hand over ownership between threads. But I think the issue is not getting the samples but releasing them. Let's assume the executor moves the subscription to another thread but there are still samples somewhere which are accessed and released from the initial thread. Now one has concurrent access to a data structure which is not thread-safe and data races are introduced. You can test my theory by printing the thread ID and the chunkHeader pointer to the terminal. If I'm correct with my assessment, you will see e.g. an insert with tid 1 and chunkHeader 0x42, followed by an insert with tid 2 and chunkHeader 0x13, and then a remove with tid 1 and chunkHeader 0x42 ... the tid might even be something totally different.

If it is Cyclone DDS it is indeed the untyped API. This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS, and this is something I do not have much knowledge about.
Hmm, I think that can't really happen. The samples are always released when the user callback completes, and the user callback doesn't move threads while it runs. I will try your proposed experiment, though, let's see what comes up.
This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS and this is something where I do not have much knowledge about.
Do you have any recommendations who to ask instead? Is this a question for the CycloneDDS project?
@tobiasstarkwayve I have too little knowledge about the ROS executor, but everything points to an unsynchronized access.
@eboasson did you experience something similar to what Tobias describes with Cyclone DDS and the multi-threaded ROS executor?
Required information
Operating system:
Compiler version: Clang 16
Eclipse iceoryx version: v. 2.0.3 (but code is unchanged since then)
Observed result or behaviour: As we were tracking down a segmentation fault we added the following assertions into the used-chunks list:
Diff
```
$ git diff
diff --git a/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp b/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
index 01e1c9ac4..0bcb35975 100644
--- a/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
+++ b/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
@@ -44,6 +44,9 @@ struct ChunkReceiverData : public ChunkQueueDataType
     /// has to return one to not brake the contract. This is aligned with AUTOSAR Adaptive ara::com
     static constexpr uint32_t MAX_CHUNKS_IN_USE = MaxChunksHeldSimultaneously + 1U;
     UsedChunkList<MAX_CHUNKS_IN_USE> m_chunksInUse;
+
+private:
+ void self_check();
};
} // namespace popo
diff --git a/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl b/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
index eb867880c..310317610 100644
--- a/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
+++ b/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
@@ -38,9 +38,11 @@ UsedChunkList::UsedChunkList() noexcept
template <uint32_t Capacity>
bool UsedChunkList<Capacity>::insert(mepoo::SharedChunk chunk) noexcept
{
+ self_check();
auto hasFreeSpace = m_freeListHead != INVALID_INDEX;
if (hasFreeSpace)
{
+ iox::cxx::Ensures(m_freeListHead < INVALID_INDEX);
// get next free entry after freelistHead
auto nextFree = m_listIndices[m_freeListHead];
@@ -48,17 +50,20 @@ bool UsedChunkList::insert(mepoo::SharedChunk chunk) noexcept
m_listIndices[m_freeListHead] = m_usedListHead;
m_usedListHead = m_freeListHead;
+ iox::cxx::Ensures(m_usedListHead < INVALID_INDEX);
m_listData[m_usedListHead] = DataElement_t(chunk);
// set freeListHead to the next free entry
m_freeListHead = nextFree;
+ self_check();
/// @todo can we do this cheaper with a global fence in cleanup?
m_synchronizer.clear(std::memory_order_release);
return true;
}
else
{
+ self_check();
return false;
}
}
@@ -67,10 +72,11 @@ template <uint32_t Capacity>
bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepoo::SharedChunk& chunk) noexcept
{
auto previous = INVALID_INDEX;
-
+ self_check();
// go through usedList with stored chunks
for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
{
+ iox::cxx::Ensures(current < INVALID_INDEX);
if (!m_listData[current].isLogicalNullptr())
{
// does the entry match the one we want to remove?
@@ -85,6 +91,7 @@ bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepoo::SharedChunk& chunk) noexcept
}
else
{
+ iox::cxx::Ensures(previous < INVALID_INDEX);
m_listIndices[previous] = m_listIndices[current];
}
@@ -94,11 +101,13 @@ bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepoo::SharedChunk& chunk) noexcept
/// @todo can we do this cheaper with a global fence in cleanup?
m_synchronizer.clear(std::memory_order_release);
+ self_check();
return true;
}
}
previous = current;
}
+ self_check();
return false;
}
@@ -115,7 +124,7 @@ void UsedChunkList<Capacity>::cleanup() noexcept
data.releaseToSharedChunk();
}
}
-
+ self_check();
init(); // just to save us from the future self
}
@@ -147,9 +156,19 @@ void UsedChunkList<Capacity>::init() noexcept
data.releaseToSharedChunk();
}
+ self_check();
m_synchronizer.clear(std::memory_order_release);
}
+template <uint32_t Capacity>
+void UsedChunkList<Capacity>::self_check() noexcept {
+ for (uint32_t i = 0U; i < Capacity; ++i) {
+ iox::cxx::Ensures(m_listIndices[i] <= INVALID_INDEX);
+ }
+ iox::cxx::Ensures(m_freeListHead <= INVALID_INDEX);
+ iox::cxx::Ensures(m_usedListHead <= INVALID_INDEX);
+}
+
} // namespace popo
} // namespace iox
```
Running with this patch, we observed a violation of the check ahead of the m_listIndices[previous] = m_listIndices[current]; line. This seems like a clear bug to me, as it implies an out-of-bounds access on the array. Do you agree?

Furthermore, it looks to me like this assertion can be triggered only if the else branch is taken in the initial iteration of the loop (as any subsequent iteration will have previous < INVALID_INDEX, as ensured by the loop condition). The only way I can see this happening is if m_usedListHead is modified between the start of the loop and the later if check. Is that possible? If so, does the code properly handle that case? Wouldn't this need some sort of retry loop?
Expected result or behaviour: No out-of-bounds accesses
Conditions where it occurred / Performed steps:
The bug is hard to reproduce, but on our system it appears reliably after a few thousand runs of our periodic pipeline. Changing the size of the images being processed affects whether the problem occurs, which we assume is due to higher load on the system.
Additional helpful information