QuantumLeaps / lock-free-ring-buffer

"Lock-Free Ring Buffer" (LFRB) is a minimal, customizable implementation of a ring buffer (a.k.a. circular buffer) in C, specifically suitable for embedded systems.
MIT License
34 stars 7 forks source link

Possible documentation issue for ARM Cortex-M and 'strong memory consistency' restriction #7

Closed geosmall closed 1 month ago

geosmall commented 1 month ago

From here (https://developer.arm.com/documentation/ddi0406/cb/Appendixes/Barrier-Litmus-Tests/Introduction/Overview-of-memory-consistency) we read that the ARM Cortex-M CPUs generally follow a weak memory consistency model. This means that memory operations (loads and stores) can be reordered by the processor, which can lead to non-intuitive behavior in multi-threaded applications if not properly managed. ARM provides memory barrier instructions (such as DMB, DSB, and ISB) to enforce ordering constraints when needed to ensure that memory operations are completed in the desired order to maintain data consistency concurrent environments.

README.md indicates the lock-free-ring-buffer is specifically suitable for embedded single-core microcontrollers including ARM Cortex-M CPU. It also later indicates (as a Lock-Free Restriction) that the ring buffer does not require any "locking" (mutual exclusion mechanism) as long as the system has strong memory consistency. My reference above indicates that ARM Cortex-M CPU does not meet this restriction.

Is lock-free-ring-buffer suitable for use on an ARM Cortex-M7 as is? Alternatively does it need modification to be suitable for use on m7 assuming only one thread/interrupt can produce data into the ring buffer, only one thread/interrupt can consume data from the ring buffer and both the producer and consumer run on the same CPU?

geosmall commented 1 month ago

See https://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular#heading_how_it_works

quantum-leaps commented 1 month ago

Hi George, Good question. The Cortex-M7 often uses cache memories, so strictly speaking it does not meet the restrictions of "systems without cache memory".

But the issue should certainly be investigated and here is where the open-source community comes in. So if you'd like to use LFRB on your Cortex-M7 safely, please do the work to properly investigate how this works in this particular situation. If any changes/restrictions are needed, please submit a pull request for the code and/or documentation update. I hope that my comments make sense to you.

--MMS

geosmall commented 1 month ago

Thank you for the comments, they do make sense. I will continue with the research you mention and hope to come back with a pull request. Feel free to close this issue if you think it makes sense.

Best Regards George

geosmall commented 1 month ago

From my research so far:

A weak memory consistency model is generally true for all ARM Cortex-M processors. This model allows for memory operations to be reordered for performance optimization, which can be beneficial but requires careful synchronization in multi-threaded applications. Cache on m7 further complicates the issue, but as far as I can tell all ARM Cortex-M need consideration here.

Nice retro history lesson here: https://preshing.com/20120930/weak-vs-strong-memory-models/

As it applies to lock free circular buffers, it is essential that the head and tail indexes must be updated only after the element (get() or put()) is read or written. Properly sequenced updates to the head or tail are needed to ensure proper access to the elements. For this to work, the read and writes of the head and tail must be atomic and must not be re-ordered.

My next step will be to understand an efficient and portable means to accomplish this.

George

quantum-leaps commented 1 month ago

... it is essential that the head and tail indexes must be updated only after the element (get() or put()) is read or written

I believe that this is the case in the LFRB implementation, and you can easily verify it in the source code. No bigger research than that is needed to settle this.

... the read and writes of the head and tail must be atomic and must not be re-ordered.

I thought that in the LFRB implementation this should be guaranteed by the C Standard and here specifically by the concept of "C Sequence Points". But I guess the literature on the subject discusses some corner cases. This could be investigated further.

--MMS

geosmall commented 1 month ago

OK here is where I ended up in my research:

While a variable that is a single instruction load/store operations on a processor does provide some level of atomicity, there are a few additional considerations:

Atomicity vs. Memory Ordering

Atomicity: Ensures that a read or write operation completes without interruption. For up to uint32_t variables, this is generally true on ARM Cortex-M processors.

Memory Ordering: Ensures the correct sequence of operations across multiple threads. This is where std::memory_order comes into play.

C Sequence Points: Define points in the code where all side effects of previous evaluations are complete, and no side effects of subsequent evaluations have started.

While sequence points help with ordering within a single thread, they don’t provide guarantees for memory visibility across multiple threads.

Key reasons why memory ordering is considered relevant:

Reordering by the Compiler: The C and C++ standards allow the compiler to reorder instructions for optimization unless explicitly told not to. For example, without memory barriers or memory ordering, the compiler could:

Reordering by the Processor: While a Cortex-M4 has a simpler memory model than modern multi-core processors, it's still possible for the memory subsystem to perform some reordering, particularly with peripheral memory accesses.

Interrupts: ISR could preempt the main thread at any point, including in the middle of a read-modify-write sequence, leading to inconsistent states if proper synchronization isn't enforced. If the write_index is being updated at the same time the main thread reads it, the value read by the main thread could be incomplete or inconsistent.

Sequence Points in C: C sequence points (or in C++, sequencing rules) provide guarantees about when side effects (like memory writes) occur relative to other operations within the same thread. However, these guarantees do not extend across multiple threads or between an interrupt handler and the main thread. This is why memory ordering is required to ensure that changes made by one thread (or ISR) are visible to another in the correct order.

Even though variables up to unin32_t are single instruction load/store, without explicit memory ordering, there's no guarantee that:

Use of acquire-release semantics ensures that:

With all this I ended up concluding there is indeed an issue that needs consideration for my use case, at least for ARM Cortex m4 and m7 that I am targeting.

Note: I also ran into issues mixing and in my C++ application that sent me down a rabbit hole. This surprised me as it is different than other std types. I'm thinking I need to stick to a C++ implementation.

I see from the release history and the literature added to /lit that you have been researching the issue discussed here. I'll close this issue at this point. Thanks!

Best Regards, George