Possible documentation issue for ARM Cortex-M and 'strong memory consistency' restriction

geosmall commented 1 month ago

From here (https://developer.arm.com/documentation/ddi0406/cb/Appendixes/Barrier-Litmus-Tests/Introduction/Overview-of-memory-consistency) we read that the ARM Cortex-M CPUs generally follow a weak memory consistency model. This means that memory operations (loads and stores) can be reordered by the processor, which can lead to non-intuitive behavior in multi-threaded applications if not properly managed. ARM provides memory barrier instructions (such as DMB, DSB, and ISB) to enforce ordering constraints when needed to ensure that memory operations are completed in the desired order to maintain data consistency concurrent environments.

README.md indicates the lock-free-ring-buffer is specifically suitable for embedded single-core microcontrollers including ARM Cortex-M CPU. It also later indicates (as a Lock-Free Restriction) that the ring buffer does not require any "locking" (mutual exclusion mechanism) as long as the system has strong memory consistency. My reference above indicates that ARM Cortex-M CPU does not meet this restriction.

Is lock-free-ring-buffer suitable for use on an ARM Cortex-M7 as is? Alternatively does it need modification to be suitable for use on m7 assuming only one thread/interrupt can produce data into the ring buffer, only one thread/interrupt can consume data from the ring buffer and both the producer and consumer run on the same CPU?

geosmall commented 1 month ago

See https://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular#heading_how_it_works

quantum-leaps commented 1 month ago

Hi George, Good question. The Cortex-M7 often uses cache memories, so strictly speaking it does not meet the restrictions of "systems without cache memory".

But the issue should certainly be investigated and here is where the open-source community comes in. So if you'd like to use LFRB on your Cortex-M7 safely, please do the work to properly investigate how this works in this particular situation. If any changes/restrictions are needed, please submit a pull request for the code and/or documentation update. I hope that my comments make sense to you.

--MMS

geosmall commented 1 month ago

Thank you for the comments, they do make sense. I will continue with the research you mention and hope to come back with a pull request. Feel free to close this issue if you think it makes sense.

Best Regards George

geosmall commented 1 month ago

From my research so far:

A weak memory consistency model is generally true for all ARM Cortex-M processors. This model allows for memory operations to be reordered for performance optimization, which can be beneficial but requires careful synchronization in multi-threaded applications. Cache on m7 further complicates the issue, but as far as I can tell all ARM Cortex-M need consideration here.

Nice retro history lesson here: https://preshing.com/20120930/weak-vs-strong-memory-models/

As it applies to lock free circular buffers, it is essential that the head and tail indexes must be updated only after the element (get() or put()) is read or written. Properly sequenced updates to the head or tail are needed to ensure proper access to the elements. For this to work, the read and writes of the head and tail must be atomic and must not be re-ordered.

My next step will be to understand an efficient and portable means to accomplish this.

George

quantum-leaps commented 1 month ago

... it is essential that the head and tail indexes must be updated only after the element (get() or put()) is read or written

I believe that this is the case in the LFRB implementation, and you can easily verify it in the source code. No bigger research than that is needed to settle this.

... the read and writes of the head and tail must be atomic and must not be re-ordered.

I thought that in the LFRB implementation this should be guaranteed by the C Standard and here specifically by the concept of "C Sequence Points". But I guess the literature on the subject discusses some corner cases. This could be investigated further.

--MMS

geosmall commented 1 month ago

OK here is where I ended up in my research:

While a variable that is a single instruction load/store operations on a processor does provide some level of atomicity, there are a few additional considerations:

Atomicity vs. Memory Ordering

Atomicity: Ensures that a read or write operation completes without interruption. For up to uint32_t variables, this is generally true on ARM Cortex-M processors.

Memory Ordering: Ensures the correct sequence of operations across multiple threads. This is where std::memory_order comes into play.

C Sequence Points: Define points in the code where all side effects of previous evaluations are complete, and no side effects of subsequent evaluations have started.

While sequence points help with ordering within a single thread, they don’t provide guarantees for memory visibility across multiple threads.

Key reasons why memory ordering is considered relevant:

Reordering by the Compiler: The C and C++ standards allow the compiler to reorder instructions for optimization unless explicitly told not to. For example, without memory barriers or memory ordering, the compiler could:

Write the head pointer (indicating that new data is available) before the data itself is written to the buffer.
Read the head pointer before reading the data, leading to the consumer seeing stale or partially written data. Even if write_index and buffer[] are uint32_t, this reordering could result in the producer marking the data as ready (by updating the write_index) before the data itself has been fully written.

Reordering by the Processor: While a Cortex-M4 has a simpler memory model than modern multi-core processors, it's still possible for the memory subsystem to perform some reordering, particularly with peripheral memory accesses.

Interrupts: ISR could preempt the main thread at any point, including in the middle of a read-modify-write sequence, leading to inconsistent states if proper synchronization isn't enforced. If the write_index is being updated at the same time the main thread reads it, the value read by the main thread could be incomplete or inconsistent.

Sequence Points in C: C sequence points (or in C++, sequencing rules) provide guarantees about when side effects (like memory writes) occur relative to other operations within the same thread. However, these guarantees do not extend across multiple threads or between an interrupt handler and the main thread. This is why memory ordering is required to ensure that changes made by one thread (or ISR) are visible to another in the correct order.

Even though variables up to unin32_t are single instruction load/store, without explicit memory ordering, there's no guarantee that:

The producer writes data to the buffer before updating the write_index.
The consumer reads the buffer data only after it sees the updated write_index.

Use of acquire-release semantics ensures that:

Release semantics (memory_order_release) in the put() function guarantee that any prior writes (to the buffer) are visible before the write_index is updated.
Acquire semantics (memory_order_acquire) in the get() function ensure that after reading the write_index, the consumer sees the correct data in the buffer.

With all this I ended up concluding there is indeed an issue that needs consideration for my use case, at least for ARM Cortex m4 and m7 that I am targeting.

Note: I also ran into issues mixing and in my C++ application that sent me down a rabbit hole. This surprised me as it is different than other std types. I'm thinking I need to stick to a C++ implementation.

I see from the release history and the literature added to /lit that you have been researching the issue discussed here. I'll close this issue at this point. Thanks!

Best Regards, George

QuantumLeaps / lock-free-ring-buffer

Possible documentation issue for ARM Cortex-M and 'strong memory consistency' restriction #7