eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0

Out-of-bounds memory access in UsedChunkList #2311

Open tobiasstarkwayve opened 1 week ago

tobiasstarkwayve commented 1 week ago

Required information

Operating system:

$ uname -a
Linux 5.10.104-l4t-r35.3.ga+g1ef0a8bc637b #1 SMP PREEMPT Wed Jun 5 17:27:33 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Compiler version: Clang 16

Eclipse iceoryx version: v2.0.3 (but the affected code is unchanged since then)

Observed result or behaviour: While tracking down a segmentation fault, we added the following assertions to the used-chunk list:

Diff

```diff
✔ ~/iceoryx [v2.0.3|✚ 2…3] 12:25 $ git diff
diff --git a/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp b/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
index 01e1c9ac4..0bcb35975 100644
--- a/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
+++ b/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
@@ -44,6 +44,9 @@ struct ChunkReceiverData : public ChunkQueueDataType
     /// has to return one to not brake the contract. This is aligned with AUTOSAR Adaptive ara::com
     static constexpr uint32_t MAX_CHUNKS_IN_USE = MaxChunksHeldSimultaneously + 1U;
     UsedChunkList<MAX_CHUNKS_IN_USE> m_chunksInUse;
+
+private:
+    void self_check();
 };
 
 } // namespace popo
diff --git a/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl b/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
index eb867880c..310317610 100644
--- a/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
+++ b/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
@@ -38,9 +38,11 @@ UsedChunkList<Capacity>::UsedChunkList() noexcept
 template <uint32_t Capacity>
 bool UsedChunkList<Capacity>::insert(mepoo::SharedChunk chunk) noexcept
 {
+    self_check();
     auto hasFreeSpace = m_freeListHead != INVALID_INDEX;
     if (hasFreeSpace)
     {
+        iox::cxx::Ensures(m_freeListHead < INVALID_INDEX);
         // get next free entry after freelistHead
         auto nextFree = m_listIndices[m_freeListHead];
@@ -48,17 +50,20 @@ bool UsedChunkList<Capacity>::insert(mepoo::SharedChunk chunk) noexcept
         m_listIndices[m_freeListHead] = m_usedListHead;
         m_usedListHead = m_freeListHead;
+        iox::cxx::Ensures(m_usedListHead < INVALID_INDEX);
         m_listData[m_usedListHead] = DataElement_t(chunk);
 
         // set freeListHead to the next free entry
         m_freeListHead = nextFree;
+        self_check();
 
         /// @todo can we do this cheaper with a global fence in cleanup?
         m_synchronizer.clear(std::memory_order_release);
         return true;
     }
     else
     {
+        self_check();
         return false;
     }
 }
@@ -67,10 +72,11 @@ template <uint32_t Capacity>
 bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepoo::SharedChunk& chunk) noexcept
 {
     auto previous = INVALID_INDEX;
-
+    self_check();
     // go through usedList with stored chunks
     for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
     {
+        iox::cxx::Ensures(current < INVALID_INDEX);
         if (!m_listData[current].isLogicalNullptr())
         {
             // does the entry match the one we want to remove?
@@ -85,6 +91,7 @@ bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepo
                 }
                 else
                 {
+                    iox::cxx::Ensures(previous < INVALID_INDEX);
                     m_listIndices[previous] = m_listIndices[current];
                 }
 
@@ -94,11 +101,13 @@ bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepo
                 /// @todo can we do this cheaper with a global fence in cleanup?
                 m_synchronizer.clear(std::memory_order_release);
+                self_check();
                 return true;
             }
         }
 
         previous = current;
     }
+    self_check();
     return false;
 }
@@ -115,7 +124,7 @@ void UsedChunkList<Capacity>::cleanup() noexcept
             data.releaseToSharedChunk();
         }
     }
-
+    self_check();
     init(); // just to save us from the future self
 }
@@ -147,9 +156,19 @@ void UsedChunkList<Capacity>::init() noexcept
         data.releaseToSharedChunk();
     }
 
+    self_check();
     m_synchronizer.clear(std::memory_order_release);
 }
 
+template <uint32_t Capacity>
+void UsedChunkList<Capacity>::self_check() noexcept {
+    for (uint32_t i = 0U; i < Capacity; ++i) {
+        iox::cxx::Ensures(m_listIndices[i] <= INVALID_INDEX);
+    }
+    iox::cxx::Ensures(m_freeListHead <= INVALID_INDEX);
+    iox::cxx::Ensures(m_usedListHead <= INVALID_INDEX);
+}
+
 } // namespace popo
 } // namespace iox
```

Running with this patch, we observed a violation of the check right before the `m_listIndices[previous] = m_listIndices[current];` line. This looks like a clear bug to me, as it implies an out-of-bounds access on the array. Do you agree?

Furthermore, it looks to me like this assertion can only be triggered if the else branch is taken in the initial iteration of the loop (any subsequent iteration will have `previous < INVALID_INDEX`, as ensured by the loop condition). The only way I can see this happening is if `m_usedListHead` is modified between the start of the loop and the later if check. Is that possible? If so, does the code handle that case properly? Wouldn't this need some sort of retry loop?
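For reference, here is a condensed paraphrase of the `remove()` control flow (my own sketch, not the verbatim implementation), which is why I believe the assertion can only fire on the very first loop iteration:

```cpp
// Condensed paraphrase of UsedChunkList<Capacity>::remove(); details omitted.
// INVALID_INDEX equals Capacity, so every valid slot index is strictly smaller.
auto previous = INVALID_INDEX;
for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
{
    if (/* m_listData[current] holds the chunk belonging to chunkHeader */)
    {
        if (current == m_usedListHead)
        {
            m_usedListHead = m_listIndices[current];           // removing the head, 'previous' not needed
        }
        else
        {
            // reached with previous == INVALID_INDEX only if current != m_usedListHead
            // already in the first iteration, i.e. m_usedListHead changed under our feet
            m_listIndices[previous] = m_listIndices[current];  // the failing assertion guards this line
        }
        // ... move 'current' back to the free list, release fence, return true ...
    }
    previous = current;                                        // from here on previous < INVALID_INDEX
}
return false;
```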

Expected result or behaviour: No out-of-bounds accesses

Conditions where it occurred / Performed steps:

The bug is hard to reproduce, but on our system it appears reliably after a few thousand runs of our periodic pipeline. Changing the size of the images being processed seems to affect whether the problem occurs, which we assume is due to higher load on the system.

Additional helpful information

elBoberido commented 1 week ago

> The only way I can see this happening is if `m_usedListHead` is modified between the start of the loop and the later if check. Is that possible?

Are you accessing the endpoint (publisher, subscriber, client, server) from multiple threads? I think this is the only way for the scenario you describe. The endpoints are not thread-safe and you need to use e.g. a smart_lock with them. You can use multiple endpoints with the same ServiceDescription in multiple threads, but not the same instance of an endpoint across multiple threads without a mutex.
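For illustration, a minimal sketch of what "guarded by the same mutex" means, here with the untyped subscriber as a stand-in and a plain `std::mutex` (a `smart_lock` wraps the same idea; names and error handling are simplified):

```cpp
#include <mutex>

#include "iceoryx_posh/popo/untyped_subscriber.hpp"

// One mutex per endpoint instance that is shared between threads.
std::mutex subscriberMutex;
iox::popo::UntypedSubscriber* subscriber{nullptr}; // placeholder for the shared endpoint

void takeAndReleaseOneSample()
{
    // Every call that touches the endpoint, including the release of a sample,
    // has to happen under the same mutex, otherwise the UsedChunkList is raced.
    std::lock_guard<std::mutex> guard(subscriberMutex);
    auto result = subscriber->take();
    if (!result.has_error())
    {
        const void* userPayload = result.value();
        // ... use the sample ...
        subscriber->release(userPayload);
    }
}
```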

tobiasstarkwayve commented 6 days ago

> The only way I can see this happening is if `m_usedListHead` is modified between the start of the loop and the later if check. Is that possible?

> Are you accessing the endpoint (publisher, subscriber, client, server) from multiple threads? I think this is the only way for the scenario you describe. The endpoints are not thread-safe and you need to use e.g. a smart_lock with them. You can use multiple endpoints with the same ServiceDescription in multiple threads, but not the same instance of an endpoint across multiple threads without a mutex.

@elBoberido I don't think so. We don't interact with the subscription directly at all, only through the ROS executor. And even the multi-threaded ROS executor seems to meticulously return the loaned message right after the callback has run (i.e., while still under the callback-group protection against concurrent runs).

Looking at it through that lens, do the two accesses really need to run in parallel? Or is it enough for them to run in quick succession on different CPU cores? In that case, one load of `m_usedListHead` might read a stale value from the CPU cache while the next load reads the updated value from memory.

I don't fully understand the intricacies of the memory fences in this code, so maybe I'm wrong, but without any acquire fence on the cleanup/insert path I think this would be possible, right? (edit: on further investigation I'm not convinced an acquire fence would even help here. Anyway, let's first identify the problem and then speculate about solutions.)

elBoberido commented 6 days ago

@tobiasstarkwayve if the endpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized.

I didn't have a deep look into the code but was relying on your investigation, i.e. "modified between the start of the loop and the later if check. Is that possible?". This is not possible if the endpoint is used in the intended way, i.e. only single-threaded or with all accesses guarded by the same mutex. The fence in the UsedChunkList is not intended to make it thread-safe and accessible from multiple threads, but only to have the memory synchronized when the application dies and RouDi does the cleanup, i.e. no concurrent access but a transfer of the single-threaded ownership to another participant.

elBoberido commented 6 days ago

@tobiasstarkwayve oh, can you print the thread ID in the insert and release methods? One thing to keep in mind is that the samples also access the endpoint when they are released. So if you run in a multi-threaded environment and release samples without holding the mutex, you will also observe the issue you describe.
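Something along these lines would be enough (just a sketch; it assumes the chunk header is reachable via `chunk.getChunkHeader()` on the insert side):

```cpp
#include <iostream>
#include <thread>

// Hypothetical helper for the experiment; call it at the top of
// UsedChunkList<Capacity>::insert() and UsedChunkList<Capacity>::remove().
inline void logUsedChunkListAccess(const char* operation, const void* chunkHeader)
{
    std::cout << operation << " tid=" << std::this_thread::get_id()
              << " chunkHeader=" << chunkHeader << std::endl;
}

// in insert():  logUsedChunkListAccess("insert", chunk.getChunkHeader());
// in remove():  logUsedChunkListAccess("remove", chunkHeader);
```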

elBoberido commented 6 days ago

@tobiasstarkwayve I had a closer look at the code and I agree with your assessment. The only way this could happen is when the else branch is taken in the initial loop iteration, and the only way that can happen is if there is unsynchronized access to the UsedChunkList. My bet would be on the release of the samples. Do you use the typed or the untyped API?

tobiasstarkwayve commented 6 days ago

> @tobiasstarkwayve if the endpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized.

Yes, that's true. But that's not quite what I meant. What I meant is a subtly different situation: there are two threads, both of which access the endpoint, but at different times and synchronised only by an atomic variable.

Here is the code for the multi-threaded ROS executor (https://github.com/ros2/rclcpp/blob/c743c173e68d92af872cf163e10721a8dbe51dd0/rclcpp/src/rclcpp/executor.cpp). The threads of the multi-threaded executor basically run `get_next_executable` and `execute_any_executable` in an endless loop. When `get_next_executable` picks up a subscription, it sets the `can_be_taken_from` atomic bool to false (preventing the subscription from being picked up again), and when `execute_any_executable` returns, it sets it back to true.

So the scenario I'm thinking of is:

  1. subscription is picked up by thread 1
  2. everything runs, samples are released, etc.
  3. can_be_taken_from is set to true
  4. same subscription is picked up by thread 2 on a different core
  5. everything runs, but there was no synchronisation point so this run is racy with step (2)

Unfortunately I'm not enough of an expert on memory models to know whether this is actually plausible. But I definitely don't see a clear synchronisation point between the two. Granted, the store to `can_be_taken_from` is sequentially consistent, but I'm not sure the modifications to the UsedChunkList "happen before" the store to `can_be_taken_from`.

> Do you use the typed or the untyped API?

This is the stacktrace we observed when the test assertion failed:

```
Condition: previous < INVALID_INDEX in bool iox::popo::UsedChunkList<257>::remove(const mepoo::ChunkHeader *, mepoo::SharedChunk &) [Capacity = 257] is violated. (iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:94)
2024-07-02 21:45:34.180 [ Error ]: ICEORYX error! EXPECTS_ENSURES_FAILED
Unhandled C++ exception:
librmw_cyclonedds.so(iox::errorHandler(iox::Error, std::function<void ()> const&, iox::ErrorLevel)+0x68) [0xfffef4cd401c]
librmw_cyclonedds.so(iox::cxx::internal::Require(bool, char const*, int, char const*, char const*)+0x1d4) [0xfffef4cd3ae8]
librmw_cyclonedds.so(iox::popo::UsedChunkList<257u>::remove(iox::mepoo::ChunkHeader const*, iox::mepoo::SharedChunk&)+0x268) [0xfffef4cb4ac0]
librmw_cyclonedds.so(iox::popo::SubscriberPortUser::releaseChunk(iox::mepoo::ChunkHeader const*)+0x3c) [0xfffef4cb3cc4]
librmw_cyclonedds.so(iox_sub_release_chunk+0x60) [0xfffef4ca8a54]
librmw_cyclonedds.so(dds_data_allocator_free+0xc4) [0xfffef4c976b0]
librmw_cyclonedds.so(+0x1b261c) [0xfffef4bd261c]
librmw_cyclonedds.so(+0x198280) [0xfffef4bb8280]
librmw_cyclonedds.so(rmw_return_loaned_message_from_subscription+0x20) [0xfffef4bb7f8c]
our_binary(rcl_return_loaned_message_from_subscription+0x80) [0xaaaaab8528ac]
our_binary(rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)+0x218) [0xaaaaab79a8cc]
our_binary(rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)+0x278) [0xaaaaab799e18]
our_binary(rclcpp::executors::MultiThreadedExecutor::run(unsigned long)+0x2dc) [0xaaaaab7a3820]
```

So I think that is the untyped API, but I'm not 100% sure.

elBoberido commented 6 days ago

`can_be_taken_from` is used with `std::memory_order_seq_cst` for the store as well as the load. Without digging too deep into the code, this should be sufficient to hand over ownership between threads. But I think the issue is not taking the samples but releasing them. Let's assume the executor moves the subscription to another thread but there are still samples somewhere which are accessed and released from the initial thread. Now there is concurrent access to a data structure which is not thread-safe, and data races are introduced. You can test my theory by printing the thread ID and the chunkHeader pointer to the terminal. If I'm correct with my assessment, you will see e.g. an insert with tid 1 and chunkHeader 0x42, followed by an insert with tid 2 and chunkHeader 0x13, and then a remove with tid 1 and chunkHeader 0x42 ... the tid might even be something totally different.
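To make the first point a bit more concrete, here is a stand-alone sketch of such a hand-over (not the rclcpp code; `endpoint_state` is just a stand-in for the non-atomic endpoint data). A seq_cst store also has release semantics and a seq_cst load also has acquire semantics, so once the next thread observes the flag it is guaranteed to see everything written before the store:

```cpp
#include <atomic>
#include <cassert>

std::atomic<bool> can_be_taken_from{false};
int endpoint_state{0}; // stand-in for the endpoint / UsedChunkList memory

// Thread A: runs the callback, releases the samples, then hands the subscription back.
void handOver()
{
    endpoint_state = 42;                                       // plain, non-atomic writes
    can_be_taken_from.store(true, std::memory_order_seq_cst);  // includes release semantics
}

// Thread B: picks the subscription up again.
void pickUp()
{
    if (can_be_taken_from.load(std::memory_order_seq_cst))     // includes acquire semantics
    {
        // The load observing the store above establishes synchronizes-with,
        // so the plain write in handOver() happens-before this read.
        assert(endpoint_state == 42);
    }
}
```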

If it is Cyclone DDS, it is indeed the untyped API. This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS, which is something I do not have much knowledge about.

tobiasstarkwayve commented 2 days ago

> `can_be_taken_from` is used with `std::memory_order_seq_cst` for the store as well as the load. Without digging too deep into the code, this should be sufficient to hand over ownership between threads. But I think the issue is not taking the samples but releasing them. Let's assume the executor moves the subscription to another thread but there are still samples somewhere which are accessed and released from the initial thread. Now there is concurrent access to a data structure which is not thread-safe, and data races are introduced. You can test my theory by printing the thread ID and the chunkHeader pointer to the terminal. If I'm correct with my assessment, you will see e.g. an insert with tid 1 and chunkHeader 0x42, followed by an insert with tid 2 and chunkHeader 0x13, and then a remove with tid 1 and chunkHeader 0x42 ... the tid might even be something totally different.

> If it is Cyclone DDS, it is indeed the untyped API. This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS, which is something I do not have much knowledge about.

Hmm, I think that can't really happen. The samples are always released when the user callback completes, and the user callback doesn't move between threads while it runs. I will try your proposed experiment, though; let's see what comes up.

> This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS, which is something I do not have much knowledge about.

Do you have any recommendations on who to ask instead? Is this a question for the CycloneDDS project?

elBoberido commented 2 days ago

@tobiasstarkwayve I have too little knowledge about the ROS executor, but everything points to an unsynchronized access.

@eboasson did you experience something similar to what Tobias describes with Cyclone DDS and the multi-threaded ROS executor?