yuhiping opened this issue 9 months ago
Hi, I just noticed that there is no "-O3" parameter in CMAKE_ASM_FLAGS_RELEASE in the arm-none-eabi.cmake file (https://github.com/azure-rtos/threadx/blob/master/cmake/arm-none-eabi.cmake). So, is "-O3" forbidden for ASM compilation?
Historically, GCC -O3 optimization did not produce reliable code. I would not recommend using -O3 until/unless the entire ThreadX source is built with -O3 and passes all verification tests. That said, it would be useful to issue a pull request for your proposed change so it can be evaluated further.
This is a very cool bug.
Calling a dummy function from another .c file after TX_RESTORE would probably resolve it, since the compiler then cannot know whether the global has changed.
Another approach would be to define pool_ptr as a pointer to volatile.
What do you think, @yuhiping?
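For illustration, a rough sketch of the dummy-function idea (hypothetical names, not the actual ThreadX code; it assumes LTO is disabled so the call stays opaque). Because the definition lives in a separate translation unit, the compiler cannot prove the call leaves the pool untouched, so values such as tx_byte_pool_fragments have to be reloaded after it:

/* tx_opaque_barrier.c -- separate translation unit, intentionally empty */
void tx_opaque_barrier(void)
{
}

/* in _tx_byte_pool_search(), right after TX_RESTORE (usage sketch): */
extern void tx_opaque_barrier(void);

    TX_RESTORE
    tx_opaque_barrier();   /* opaque call: the compiler must assume globals may have changed */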
This is indeed an interesting issue. I like using volatile more than calling a dummy function, just from a performance perspective. In your fix, did you just add volatile to the pool_ptr API parameter?
I haven't tried the fix, but adding volatile to the parameter is indeed what I had in mind.
@amgross @williamelamie Sorry for the late reply. Yes, as you suggested, adding volatile to the parameter was our solution. Also, is it still necessary to issue a pull request?
Hey @williamelamie, I actually don't think this is the right solution: it exposes the underlying implementation via a public interface and is not very localized. Personally, I think the function should make local copies while appropriately managing the variable properties (e.g. volatile). It also doesn't address potential issues that may arise in SMP; for example, a barrier might be needed in SMP designs, and using volatile in the function prototype won't solve that.
I think a cleaner alternative that keeps the changes localized to this function is to create a local variable that is volatile and keep the function signature the same.
I reproduced this with the same compiler version, compiling for the M33 (GCC 6.2), as well as with GCC 8. The two-line change below fixes the issue for me, and we don't need to change the public API. Note that the bug is also present with -O2. With optimization for size (-Os) or debug (-Og), which are the flags most embedded developers use, the bug does not appear.
This is actually a pretty nasty little problem... Here's my fix:
UCHAR *_tx_byte_pool_search(TX_BYTE_POOL *pool_ptr_in, ULONG memory_size)
{
<snip>
// add a local volatile pointer so pool fields are re-read from memory after TX_RESTORE
TX_BYTE_POOL volatile * pool_ptr = pool_ptr_in;
Original assembly:
<snip>
total_theoretical_available = pool_ptr -> tx_byte_pool_available + ((pool_ptr -> tx_byte_pool_fragments - 2) * ((sizeof(UCHAR *)) + (sizeof(ALIGN_TYPE))));
/**
----> R7 loaded with the tx_byte_pool_fragments count at function entry, and re-used throughout the whole function call
*/
80092f0: 68c7 ldr r7, [r0, #12]
<snip>
TX_DISABLE
/* Determine if anything has changed in terms of pool ownership. */
if (pool_ptr -> tx_byte_pool_owner != thread_ptr)
8009330: 6a04 ldr r4, [r0, #32]
8009332: 42a5 cmp r5, r4
8009334: d03f beq.n 80093b6 <_tx_byte_pool_search+0xd6>
tx_byte_pool_search.c:270
{
/* Pool changed ownership in the brief period interrupts were
enabled. Reset the search. */
current_ptr = pool_ptr -> tx_byte_pool_search;
examine_blocks = pool_ptr -> tx_byte_pool_fragments + ((UINT) 1);
/**
-----> Bug here... R7 is reused from earlier and not reloaded
-----> (in the original bug on the Cortex-R5 it used LR, but the bug is the same).
-----> R8 holds the variable examine_blocks
**/
8009336: f107 0801 add.w r8, r7, #1
<snip>
}
} while(examine_blocks != ((UINT) 0));
800933e: f1b8 0f00 cmp.w r8, #0
And using the local volatile, it's fixed:
<snip>
TX_DISABLE
{
/* Pool changed ownership in the brief period interrupts were
enabled. Reset the search. */
current_ptr = pool_ptr -> tx_byte_pool_search;
8009330: 6942 ldr r2, [r0, #20]
tx_byte_pool_search.c:270
/**
----> Fixed: R3 is now the register holding tx_byte_pool_fragments. This correctly reloads R3 with the fragments count from the pool pointer instead of reusing a stale register.
*/
examine_blocks = pool_ptr -> tx_byte_pool_fragments + ((UINT) 1);
8009332: 68c3 ldr r3, [r0, #12]
tx_byte_pool_search.c:273
/* Setup our ownership again. */
pool_ptr -> tx_byte_pool_owner = thread_ptr;
8009334: 6206 str r6, [r0, #32]
tx_byte_pool_search.c:270
/**
----> add 1 to the fragments count in R3 to get examine_blocks; this is correct now
*/
examine_blocks = pool_ptr -> tx_byte_pool_fragments + ((UINT) 1);
8009336: 3301 adds r3, #1
tx_byte_pool_search.c:275
}
} while(examine_blocks != ((UINT) 0));
8009338: 2b00 cmp r3, #0
800933a: d1eb bne.n 8009314 <_tx_byte_pool_search+0x34>
I agree, Pat. It is nicer to have the use of volatile hidden from the outside.
Another option would be to make the tx_byte_pool_owner structure member in TX_BYTE_POOL volatile, like the following:
struct TX_THREAD_STRUCT volatile    *tx_byte_pool_owner;
This is a slightly smaller change, and something that might also be worth applying to the head of each suspension list, since those are used by the tx_*_prioritize functions in a similar manner to the tx_byte_pool_owner member.
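For the suspension list heads, the analogous change would look roughly like this (a sketch only; I'm assuming the head member in TX_BYTE_POOL is tx_byte_pool_suspension_list as declared in tx_api.h):

struct TX_THREAD_STRUCT volatile    *tx_byte_pool_suspension_list;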
Is it possible that this is an issue with the cpu-specific port?
If we look at the implementation of TX_RESTORE in threadx/ports/cortex_r5/gnu/inc/tx_port.h, it is implemented like this:
#define TX_RESTORE asm volatile (" MSR CPSR_c,%0 "::"r" (interrupt_save) );
Other implementations, like the Cortex-M4, have a memory-clobber in the inline-assembly that restores the interrupt-posture:
__asm__ volatile ("MSR PRIMASK,%0": : "r" (int_posture): "memory");
I can't test it right now, but I think it would be worth checking whether adding a memory clobber in the Cortex-R5 port fixes this. My understanding is that it should prevent the compiler from reusing the cached register value.
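For reference, the Cortex-R5 TX_RESTORE with the clobber added would look roughly like this (untested, just a sketch modeled on the Cortex-M4 variant above):

#define TX_RESTORE    asm volatile (" MSR CPSR_c,%0 " : : "r" (interrupt_save) : "memory");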
Is it possible that this is an issue with the cpu-specific port?
Nice observation. Yes, you found the issue; the clobbers are wrong and the compiler fence is missing. I'm using BASEPRI for Cortex-M33/M4 instead of CPSID, and there's actually a bug in the ThreadX BASEPRI support that triggers the same issue as on the Cortex-R5:
#ifdef TX_PORT_USE_BASEPRI
__attribute__( ( always_inline ) ) static inline void __set_basepri_value(UINT basepri_value)
{
__asm__ volatile ("MSR BASEPRI,%0 ": : "r" (basepri_value));
}
#else
Note the missing clobber, so no compiler fence is inserted.
For the non-BASEPRI approach (CPSID):
__attribute__( ( always_inline ) ) static inline UINT __disable_interrupts(void)
{
UINT int_posture;
int_posture = __get_interrupt_posture();
#ifdef TX_PORT_USE_BASEPRI
__set_basepri_value(TX_PORT_BASEPRI);
#else
__asm__ volatile ("CPSID i" : : : "memory");
#endif
return(int_posture);
}
Here the barrier is present. So the barrier is there for the CPSID approach, but if you use BASEPRI, the memory fence/barrier is missing.
The fix for BASEPRI (verified) is:
#ifdef TX_PORT_USE_BASEPRI
__attribute__( ( always_inline ) ) static inline void __set_basepri_value(UINT basepri_value)
{
__asm__ volatile ("MSR BASEPRI,%0 ": : "r" (basepri_value) : "memory");
}
#else
This fixes the issue for Cortex-M33/M4 when using BASEPRI. For anyone not using BASEPRI, this won't be an issue on Cortex-M33/M4, as the non-BASEPRI path correctly inserts a compiler barrier.
For the Cortex-R5, TX_DISABLE is defined as:
#define TX_DISABLE asm volatile (" MRS %0,CPSR; CPSID if ": "=r" (interrupt_save) );
This is incorrect; the fence is missing. It should be (note the empty input operand list, so that "memory" lands in the clobber list):
#define TX_DISABLE asm volatile (" MRS %0,CPSR; CPSID if " : "=r" (interrupt_save) : : "memory");
Summary: there are 2 bugs in the ThreadX ports:
1) the barrier is missing in the BASEPRI path for Cortex-M
2) the barrier is missing in TX_DISABLE for Cortex-R
With the barriers in place, the problem is gone.
This is actually a big problem; these missing fences can trigger some really hard-to-find issues.
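To see the effect in isolation, here is a minimal standalone sketch (not from the ThreadX sources; names are illustrative). Compiling it for a Cortex-M target with something like arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O2 -S and comparing the two unlock variants shows the difference: without the "memory" clobber the second read of g_shared may reuse the register loaded before the asm, while with the clobber it must be reloaded from RAM.

/* g_shared stands in for a field like tx_byte_pool_fragments; it is defined
   elsewhere so the compiler cannot constant-fold it away */
extern unsigned int g_shared;

static inline void unlock_no_fence(unsigned int key)
{
    __asm__ volatile ("MSR BASEPRI,%0" : : "r" (key));             /* no compiler fence */
}

static inline void unlock_with_fence(unsigned int key)
{
    __asm__ volatile ("MSR BASEPRI,%0" : : "r" (key) : "memory");  /* compiler fence */
}

unsigned int demo(unsigned int key)
{
    unsigned int before = g_shared + 1u;   /* g_shared loaded into a register */
    unlock_no_fence(key);                  /* swap in unlock_with_fence() to compare */
    unsigned int after  = g_shared + 1u;   /* without the fence this may reuse the cached value */
    return before + after;
}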
What about ARC EM, does it also need to be changed? https://github.com/eclipse-threadx/threadx/blob/485a02faec6edccef14812ddce6844af1d7d2eef/ports/arc_em/metaware/inc/tx_port.h#L306-L307
What about ARC EM, does it also need to be changed?
Looks like it's using a compiler built-in; the compiler built-ins take care of the memory fences for you.
Asm("seti %0" : : "ir" (key) : "memory");
The way we use byte pool
We use the byte pool as a heap in our project. The code was copied from the following link, which implements the CMSIS API on top of ThreadX: https://github.com/STMicroelectronics/stm32_mw_cmsis_rtos_tx/blob/main/cmsis_os2.c
In cmsis_os2.c, the byte pool is created in MemInit(), and after that we can just use MemAlloc() and MemFree() to request and release memory.
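For context, such a wrapper maps onto the ThreadX byte pool API roughly like this (a simplified sketch with hypothetical names and sizes; error handling and the real cmsis_os2.c details are omitted):

#include "tx_api.h"

static TX_BYTE_POOL heap_pool;        /* pool control block (a global)            */
static ULONG        heap_area[4096];  /* 16 KB of ULONG-aligned backing memory    */

UINT MemInit(void)
{
    /* create the byte pool once over the static buffer */
    return tx_byte_pool_create(&heap_pool, "heap", heap_area, sizeof(heap_area));
}

void *MemAlloc(ULONG size)
{
    VOID *ptr = TX_NULL;

    /* tx_byte_allocate() returns TX_NO_MEMORY when the search cannot satisfy the request */
    if (tx_byte_allocate(&heap_pool, &ptr, size, TX_NO_WAIT) != TX_SUCCESS)
    {
        return TX_NULL;
    }
    return ptr;
}

void MemFree(void *ptr)
{
    tx_byte_release(ptr);
}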
What happened?
MemAlloc() and MemFree() worked fine most of the time, but occasionally MemAlloc() failed with TX_NO_MEMORY. When we looked into the issue and did a lot of debugging, we found that there was still enough memory, but the allocation algorithm was not working correctly.
Why did this happen?
After days of debugging, we found that the value of tx_byte_pool_fragments went wrong whenever the issue happened. To prove this, we defined a global variable of the same type and performed the same operation whenever tx_byte_pool_fragments was increased or decreased. This is simple because tx_byte_pool_fragments only changes in _tx_byte_pool_search(), which is defined in threadx/common/src/tx_byte_pool_search.c. It turned out that when MemAlloc() failed with TX_NO_MEMORY, the value of tx_byte_pool_fragments in the TX_BYTE_POOL struct differed from the global variable we defined.
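The instrumentation was along these lines (a sketch with hypothetical names; the shadow updates were placed next to every spot where tx_byte_pool_search.c modifies tx_byte_pool_fragments):

/* shadow copy of tx_byte_pool_fragments, updated in lockstep with the pool */
static volatile UINT g_shadow_fragments;

/* placed next to each  pool_ptr -> tx_byte_pool_fragments++ / --  in tx_byte_pool_search.c */
#define SHADOW_FRAGMENTS_INC()   (g_shadow_fragments++)
#define SHADOW_FRAGMENTS_DEC()   (g_shadow_fragments--)

/* consistency check, e.g. run when MemAlloc() fails with TX_NO_MEMORY */
static UINT shadow_fragments_consistent(TX_BYTE_POOL *pool_ptr)
{
    return (g_shadow_fragments == pool_ptr -> tx_byte_pool_fragments);
}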
The root cause
After a deep dive into the ThreadX source code, we found that _tx_byte_pool_search() can be interrupted after the TX_RESTORE at line 258. This is a good design, since it lets higher-priority tasks allocate from the pool first. But as mentioned above, the control block (the so-called handle) is defined as a global variable, which may be changed by another, higher-priority task.
Further explanation
Further proof
The following is the asm code of _tx_byte_pool_search(). When tx_byte_pool_fragments increases or decreases, it uses the value from register LR instead of reloading the value from RAM, which may already have changed. The asm code may differ with different GCC versions or MCUs; in our case the MCU is Cortex-R5 based and the GCC version is arm-none-eabi-gcc 6.2.1.
Looking forward to your reply!