eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0
1.69k stars 393 forks source link

loffli concurrent bug #2351

Closed tang-qh closed 1 month ago

tang-qh commented 2 months ago

Required information

the abaCounter in loffli is not unique:

do
    {
        // we are empty if next points to an element with index of Size
        if (oldHead.indexToNextFreeIndex >= m_size || !m_nextFreeIndex)
        {
            return false;
        }

        // NOLINTNEXTLINE(cppcoreguidelines-pro-bounds-pointer-arithmetic) upper limit of index set by m_size
        newHead.indexToNextFreeIndex = m_nextFreeIndex.get()[oldHead.indexToNextFreeIndex];
        newHead.abaCounter += 1;
    } while (!m_head.compare_exchange_weak(oldHead, newHead, std::memory_order_acq_rel, std::memory_order_acquire));

when the swap failed,may be the m_head have been changed multiple times, so newHead.abaCounter += 1 is not enough;

and we did encounter this problem(although the probability is very low): Mempool [m_chunkSize = 56, numberOfChunks = 5214, used_chunks = 193 ] has no more space left 2024-09-22_13-32-46.367778 [ Error ]ICEORYX error! POSH__MEMPOOL_POSSIBLE_DOUBLE_FREE

newHead.abaCounter += 1 should be -> newHead.abaCounter = oldHead.abaCounter +1 it can fix it;

Operating system: E.g. Ubuntu 18.04 LTS

Compiler version: E.g. GCC 7.4.0

Eclipse iceoryx version: E.g. v1.2.3 or main branch

Observed result or behaviour: A clear and precise description of the observed result.

Expected result or behaviour: What do you expect to happen?

Conditions where it occurred / Performed steps: Describe how one can reproduce the bug.

Additional helpful information

If there is a core dump, please run the following command and add the output to the issue in a separate comment

gdb --batch \
   --ex "shell printf '\n\033[33m#### Local Variables ####\033[m\n'"  --ex "info locals" \
   --ex "shell printf '\n\033[33m#### Threads ####\033[m\n'"          --ex "info threads" \
   --ex "shell printf '\n\033[33m#### Shared Libraries ####\033[m\n'" --ex "info sharedlibrary" \
   --ex "shell printf '\n\033[33m#### Stack Frame ####\033[m\n'"      --ex "info frame" \
   --ex "shell printf '\n\033[33m#### Register ####\033[m\n'"         --ex "info register" \
   --ex "shell printf '\n\033[33m#### Backtrace ####\033[m'"          --ex "thread apply all bt" \
   --core coreDumpFile binaryFile
elfenpiff commented 2 months ago

@tang-qh this is an amazing finding! You are completely right and we need to fix this - the algorithm is not correct here.

And thank you a lot for providing the fix with it!

Btw. when we ported the lock-free code to next-gen iceoryx2 we fixed this by accident, see this here: https://github.com/eclipse-iceoryx/iceoryx2/blob/main/iceoryx2-bb/lock-free/src/mpmc/unique_index_set.rs#L474

elfenpiff commented 2 months ago

@tang-qh Since you found the bug, would you like to have the honor of fixing it and creating a pull request?

If not, I would fix it by applying your solution.

tang-qh commented 2 months ago

@elfenpiff I don't know which branch to submit to, please fix it directly~

tang-qh commented 2 months ago
diff --git a/iceoryx_hoofs/source/concurrent/loffli.cpp b/iceoryx_hoofs/source/concurrent/loffli.cpp
index 54e178c..6d382cc 100644
--- a/iceoryx_hoofs/source/concurrent/loffli.cpp
+++ b/iceoryx_hoofs/source/concurrent/loffli.cpp
@@ -62,7 +62,7 @@ bool LoFFLi::pop(Index_t& index) noexcept

         // NOLINTNEXTLINE(cppcoreguidelines-pro-bounds-pointer-arithmetic) upper limit of index set by m_size
         newHead.indexToNextFreeIndex = m_nextFreeIndex.get()[oldHead.indexToNextFreeIndex];
-        newHead.abaCounter += 1;
+        newHead.abaCounter = oldHead.abaCounter + 1;
     } while (!m_head.compare_exchange_weak(oldHead, newHead, std::memory_order_acq_rel, std::memory_order_acquire));

     /// comes from outside, is not shared and therefore no synchronization is needed
@@ -104,7 +104,7 @@ bool LoFFLi::push(const Index_t index) noexcept
         // NOLINTNEXTLINE(cppcoreguidelines-pro-bounds-pointer-arithmetic) index is limited by capacity
         m_nextFreeIndex.get()[index] = oldHead.indexToNextFreeIndex;
         newHead.indexToNextFreeIndex = index;
-        newHead.abaCounter += 1;
+        newHead.abaCounter = oldHead.abaCounter + 1;
     } while (!m_head.compare_exchange_weak(oldHead, newHead, std::memory_order_acq_rel, std::memory_order_acquire));
elBoberido commented 2 months ago

@tang-qh can you create a pull request?

If you cannot sign the eclipse contribution agreement, we could of course also create a PR and merge this. But since you found and fixed the bug you should also have the honor to be the author of the commit.

tang-qh commented 1 month ago

@elBoberido hi, I have submitted a repaired pull request;