HAL_ETH does non-atomic Test-and-Set operations on gState which is not thread-safe

KarstenHohmeier commented 3 weeks ago

HAL_ETH uses gState as a "poor man's mutex" to guard it's internal state and associated hardware registers against concurrent access and enforce a specific order of initialization.

When gState gets tested and set to BUSY for example, HAL_ETH does critical things to its own state or the hardware:

https://github.com/STMicroelectronics/stm32h7xx_hal_driver/blob/ceda3ceeca2ee77a76d2de2ee6b560ad87c90a48/Src/stm32h7xx_hal_eth.c#L863-L866

Yet these test and set operations are not done atomically (e.g. inside a critical section) so the HAL_ETH locking via gState is vulnerable to race conditions in situations with preemption (e.g. with lwIP + FreeRTOS)

ASEHSTM commented 2 weeks ago

Hello @KarstenHohmeier,

Thank you for your contribution. This function has been reliably used in previous projects without any race condition issues, and tests show that the code functions correctly. Interrupts are disabled to prevent any interference during critical operations via the __HAL_ETH_DMA_DISABLE_IT function at line 873, which includes transmission, reception, and fatal bus interrupts.

If this point has caused you any issues, please inform us about your specific use case. We remain open to further discussion.

Thank you for your understanding and cooperation. With Regards

KarstenHohmeier commented 2 weeks ago

The code referenced is a general example of the code-pattern ST Micro uses to deal with concurrency and locking inside HAL_ETH.

The real issues arise in the TX and RX functions in HAL_ETH where you omit this locking pattern entirely or execute this Test-and-Set pattern only half-way (test for STARTED but no set to BUSY) see #71 and #72.

If you would try and fix #71 and #72 by using the gState locking pattern - as you do throughout HAL_ETH - it would still not be entirely thread-safe because you don't do it atomically. This is what this issue is about. To remind you of the locking pattern you seem to use and make the case that it is conceptually broken by being non-atomic and would thus not be suitable (without being fixed itself) for fixing the real concurrency issues related to the TX and RX function (#71, #72).

This is also not something your tests would show. The realization as a developer that non-atomic test-and-set of a locking variable is flawed and should be fixed regardless of prior track record or existing tests.

ASEHSTM commented 1 week ago

Hello @KarstenHohmeier,

The HAL driver is designed to be a generic library, usable in a variety of projects, whether they use an RTOS or not. This flexibility means that certain responsibilities, such as managing race conditions, are delegated to the end user. This approach helps maintain the lightness and simplicity of the HAL code. Systematically adding critical sections or synchronization primitives to the HAL code could introduce performance overhead and additional complexity.

In many cases, users do not need these mechanisms, especially in non-preemptive environments or simple applications.

With Regards,

KarstenHohmeier commented 5 days ago

If you do not guarantee thread-safety for HAL_ETH - which is a valid design decision - then you should add this as a giant warning to the API / function descriptions.

Also you should probably talk to your colleagues that are responsible for generating ethernetif.c when setting up a H7-project from CubeIDE with LWIP and FreeRTOS.

If you are right, then the code they generate for low_level_input() and low_level_output() does not use HAL_ETH in a thread-safe manner for H7 projects which results in a "sub-optimal user experience" ;)

ethernetif.c is where I fixed the issue for our project - by scrapping the generated ST code and wrapping the HAL_ETH calls and associated critical operations into a mutex.

To think a little more about "non-preemptive environments": Imagine a scenario without preemption (no RTOS) where the user code runs in a main-loop and sends one TX packet every second. Then you connect a button to the system that triggers an interrupt and sends a time-critical notification packet directly from the IRQ handler. You would expect this to be ok, since the packets are handed to the DMA via descriptor lists. But effectively your IRQ code is preempting the main-loop and might cause the same concurrency issues that RTOS preemption would if the calls to HAL_ETH are not thread-safe. This gets even worse with nested IRQs where even interrupts can be interrupted. So even in a bare-metal scenario with only IRQs the user would have to guard the HAL_ETH usage themselves to make it function correctly under all circumstances. So whatever performance you think you gain by omitting thread-safety cannot be gained "legitimately" because the user has to add back thread-safety in the user code.

But this is more thought experiment and colored by my opinion. I am sure you will do the right thing and sort this out in a sensible fashion and thank you for taking these bug reports seriously.

STMicroelectronics / stm32h7xx_hal_driver

HAL_ETH does non-atomic Test-and-Set operations on gState which is not thread-safe #70