eclipse-threadx / threadx

Eclipse ThreadX is an advanced real-time operating system (RTOS) designed specifically for deeply embedded applications.
https://github.com/eclipse-threadx/rtos-docs/blob/main/rtos-docs/threadx/index.md
MIT License

Higher-priority ready thread does not start executing because no SGI is sent to the target core #209

Open meitlin opened 1 year ago

meitlin commented 1 year ago

Our system is an 8-core SMP. We found that a ready thread with the highest priority (at that moment) did not start executing. Details are shown in the figure (image: threadx_issue_01). "Task" means the same as "thread" in the description below.

Task A, running on core 4, was suspended from _tx_event_flag_set(), which set _tx_thread_current_ptr[4] to null (M0). Core 4 then entered _tx_thread_schedule() to schedule the next task (Task B), and _tx_thread_schedule() updated _tx_thread_current_ptr[4] to Task B (M1). Task C, running on core 5, set the event flag to resume Task A, and all control variables were updated. However, Task C read _tx_thread_current_ptr[4] as null and therefore skipped sending SGI 0 to core 4 in _tx_thread_smp_core_interrupt() (M2). As a result, Task A was never scheduled, even though it was already in the ready list with the highest priority and _tx_thread_execute_ptr[4] had been updated to Task A. This looks like a timing issue that causes core 5 not to read the latest value of _tx_thread_current_ptr[4].
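To make the M0/M1/M2 sequence concrete, here is a minimal single-threaded model in C. It is only a sketch of the behaviour described above, not ThreadX source: every `model_` name is hypothetical, the arrays are stand-ins for `_tx_thread_current_ptr[]` and `_tx_thread_execute_ptr[]`, and the check in `model_should_send_sgi()` is assumed from the description (skip the SGI when the observed current pointer is null or already equals the thread to run).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CORES 8u  /* the report describes an 8-core SMP system */

typedef struct { const char *name; } model_thread;

/* Hypothetical stand-ins for _tx_thread_current_ptr[] and _tx_thread_execute_ptr[]. */
static model_thread *model_current[MODEL_CORES];
static model_thread *model_execute[MODEL_CORES];

static model_thread model_task_a = { "Task A" };
static model_thread model_task_b = { "Task B" };

/* Assumed form of the check: no SGI when the target core looks idle (null)
   or already appears to run the thread that should execute. */
static bool model_should_send_sgi(const model_thread *observed_current,
                                  const model_thread *to_execute)
{
    return observed_current != NULL && observed_current != to_execute;
}

/* Replays M0 -> M1 -> M2 for core 4. When stale_read is true, the resuming
   core still observes the M0 value (null) instead of the M1 value (Task B).
   Returns true when an SGI would be sent to core 4. */
static bool model_replay(bool stale_read)
{
    const unsigned core = 4u;
    assert(core < MODEL_CORES);

    model_current[core] = NULL;                 /* M0: Task A suspended */
    model_thread *stale = model_current[core];  /* value core 5 may still see */
    model_current[core] = &model_task_b;        /* M1: Task B scheduled */
    model_execute[core] = &model_task_a;        /* M2: Task C resumes Task A */

    const model_thread *observed = stale_read ? stale : model_current[core];
    return model_should_send_sgi(observed, model_execute[core]);
}
```

With `model_replay(true)` the check sees null and no SGI is sent, which matches the reported hang. With `model_replay(false)` the check sees Task B and the SGI is sent.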

We ran two experiments:

Experiment 1. Always send SGI 0 to the target core, without checking whether _tx_thread_current_ptr[core x] is null or equal to _tx_thread_execute_ptr[core x].

Experiment 2. Add a memory barrier (DMB ISH) after updating _tx_thread_current_ptr[core x] in _tx_thread_schedule().

Both experiments appear to fix the issue.
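The two experiments can be sketched in portable C11 as follows. This is a model under stated assumptions, not the actual port code: all `sketch_` names are hypothetical, `atomic_thread_fence(memory_order_seq_cst)` stands in for the DMB ISH (compilers targeting AArch64 typically emit a DMB ISH for it), and the fence on the resuming side is an assumption of this sketch rather than something the experiments tested.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int id; } sketch_thread;

/* Hypothetical stand-ins for _tx_thread_current_ptr[4] and _tx_thread_execute_ptr[4]. */
static _Atomic(sketch_thread *) sketch_current;
static _Atomic(sketch_thread *) sketch_execute;

static sketch_thread sketch_task_a = { 1 };
static sketch_thread sketch_task_b = { 2 };

/* Experiment 2, scheduler side: store the new current thread, then issue a
   full fence so the store is ordered before anything the core does next. */
static void sketch_publish_current(sketch_thread *next)
{
    atomic_store_explicit(&sketch_current, next, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* Resuming side: record the thread to execute, then decide whether to
   interrupt the target core. Returns true when an SGI would be sent. */
static bool sketch_resume_needs_sgi(sketch_thread *resumed, bool always_interrupt)
{
    assert(resumed != NULL);
    atomic_store_explicit(&sketch_execute, resumed, memory_order_relaxed);

    if (always_interrupt) {
        return true;  /* Experiment 1: skip the check entirely */
    }

    /* Assumed reader-side fence: orders the store above before the load below. */
    atomic_thread_fence(memory_order_seq_cst);
    sketch_thread *cur = atomic_load_explicit(&sketch_current, memory_order_relaxed);
    return cur != NULL && cur != resumed;
}
```

The trade-off is visible in the return paths: Experiment 1 interrupts the target core on every resume, including when it is idle, while Experiment 2 keeps the check and pays for a barrier instead.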

However, we are not sure which solution is better, or whether there are other concerns we have overlooked.

Thanks.

goldscott commented 1 year ago

Hi @meitlin thank you for the excellent description and diagram, that is extremely helpful!

What architecture and toolchain are you using? What version of ThreadX (the version number in tx_thread_schedule)?

My first thought is that the memory barrier would be the better solution, since it avoids unnecessary interrupts and the associated overhead.

meitlin commented 1 year ago

Hi @goldscott, thanks for the quick reply. We use Cortex-A55-SMP/GNU, and the ThreadX version is 6.1.

carll00226 commented 1 year ago

Hi meitlin, is this problem caused by the store buffer between the CPU and the cache? If so, and you add a barrier to flush the store buffer after the write, do you also need to add a barrier to drain the invalidate queue before the read on core 5?