eclipse-threadx / threadx

Eclipse ThreadX is an advanced real-time operating system (RTOS) designed specifically for deeply embedded applications.
https://github.com/eclipse-threadx/rtos-docs/blob/main/rtos-docs/threadx/index.md
MIT License
2.96k stars 801 forks source link

tx_queue_receive(), TX_WAIT_FOREVER, Stack Corruption #137

Closed ajitbasarur closed 3 years ago

ajitbasarur commented 3 years ago

When tx_queue_receive() function is called inside a thread with TX_WAIT_FOREVER option, the thread exit is not possible. If we break the thread execution or exit the thread, then the execution goes to HardFault_Handler().

How did I verify? Before, tx_queue_receive() is called, I checked the stack where thread exit or return address is stored. After the call to tx_queue_receive(), the return address is altered in the stack. The return address in the stack gets corrupted. The issue is observed on STM32H7B3L processor with Azure RTOS 1.1.0 version pack.

If TX_NO_WAIT option is selected, then the stack contents are not altered and the thread can be exited without any Hard Fault.

For example, void msc_process_thread_entry(ULONG arg) {

UINT status ; FX_MEDIA *fx_media;

while(1) { / Wait infinitely for the usb message / status = tx_queue_receive(&ux_app_MsgQueue_msc, &fx_media, TX_WAIT_FOREVER); break; }

Check the source file ux_host_msc.c in the attached ZIP folder. USBX.zip @gdf8gdn8

goldscott commented 3 years ago

How does the return address in the stack get corrupted? What is your call to tx_queue_receive returning?

gdf8gdn8 commented 3 years ago

How does the return address in the stack get corrupted?

I will be overwritten by tx_queue_receive. I've set a UINT timeout right above initialized with 1000. After the call the timeout value is invalid. The stack is invalid also. thread_ptr outside the thread has invalid address. On my debug sessions 0x8.

What is your call to tx_queue_receive returning?

TX_SUCCESS

goldscott commented 3 years ago

I'm not following your description very well.

Can you step through tx_queue_receive and find where it is corrupting the thread stack?

ajitbasarur commented 3 years ago

Hi Scott,

I will clear up certain things. Before reading further, please download the source code that is attached with the issue. (In the first comment).

  1. In file ux_host_msc.c, we are waiting for the message by calling the following function "tx_queue_receive(&ux_app_MsgQueue_msc, &fx_media, TX_WAIT_FOREVER);". Then, the code waits there forever. Note that fx_media is the destination address, where the data from queue has to be copied.
  2. Once USB is inserted, the execution control goes to the function "tx_queue_send(&ux_app_MsgQueue_msc, media, TX_NO_WAIT); in the 'app_usbx_host.c' file".
  3. Single stepping leads us to the function "_tx_queue_send()" in the tx_queue_send.c file. And inside this function, the execution comes to the following MACRO "TX_QUEUE_MESSAGE_COPY(source, destination, size)". This is where we believe the problem occurs.
  4. In MACRO "TX_QUEUE_MESSAGE_COPY()", the destination pointer is fx_media. It seems that TX_QUEUE_MESSAGE_COPY() is copying an extra word and that is leading to stack corruption.
  5. If you look closely in file 'ux_host_msc.c' and with in the thread "msc_process_thread_entry()", fx_media is allocated in the stack next to the place, where return address of the thread "msc_process_thread_entry" is stored.
  6. Once the TX_QUEUE_MESSAGE_COPY() writes an extra word, it corrupts the stack and therefore, the return address of thread "msc_process_thread_entry()" is lost forever.
  7. If we want to exit the thread "msc_process_thread_entry()", then some random address is returned from the stack. This random address is popped back into the memory and it results in an invalid instruction execution. An invalid instruction execution is considered as a Hard Fault in STM32H7 chip.

On question though.. Why TX_QUEUE_MESSAGE_COPY() is not behaving in the expected way? I think this is the root cause of the problem.

P.S. We have enabled MISRA compliance in our stack and therefore, the MISRA memcopy library is included in the code. However, we are unable to expand the MACRO "TX_QUEUE_MESSAGE_COPY()" and single step into MISRA memcopy library.

Please feel free to ask if you have frurther questions. I will try to clarify them as much as possible.

Thanks and Best Regards, Ajit B

@gdf8gdn8 @ajitbasarur @goldscott

goldscott commented 3 years ago

Hi Ajit (@gdf8gdn8 @ajitbasarur),

Thank you for the detailed description - that is very helpful!

I think the problem is in your tx_queue_create arguments, specifically the size of the message. Take a look at the documentation here: https://docs.microsoft.com/en-us/azure/rtos/threadx/chapter4#tx_queue_create

You have it as sizeof(FX_MEDIA*). I believe it should be sizeof(FX_MEDIA*)/sizeof(ULONG).

ajitbasarur commented 3 years ago

Hi Scott,

Thanks for the prompt response. I really appreciate it. I too kind of figured out the same thing couple of hours ago. The issue lies with the improper usage of ThreadX Queue APIs. Thanks for the confirmation from your end. My colleague is also looking into the same problem but on a different project. Once he confirms it, I will close this issue on this rep.

We took the sample code from the following link and modified it for our project. We did not modify the usage of ThreadX Queue APIs. Therefore, I will create an issue in that Repo and upload the patch. In case you are associated with that Repo, then we could avoid the repetition and modify the source. In any case, let me know your comments.

I wish you a nice evening and weekend.

Best Regards, Ajit B

(@gdf8gdn8 @goldscott @ajitbasarur),

goldscott commented 3 years ago

Hi Ajit,

That repo is maintained by ST.

goldscott commented 3 years ago

Adding @xiaocq2001 to take a look at the ST stack.

ajitbasarur commented 3 years ago

Thanks Scott.

@xiaocq2001, this bug or the incorrect usage is present in many example projects. Also, there is an additional improvement for the API usage. Right now, it is implemented as pass by pointer. That brings down the whole purpose of queue system in ThreadX.

Instead of "passing by value" the pointer is passed. Although this works, it is not correct and can cause problems.
In send tx_queue_send(&ux_app_MsgQueue_msc, &media, TX_NO_WAIT); In receive tx_queue_receive(&ux_app_MsgQueue_msc, &fx_media, TX_WAIT_FOREVER);

We could discuss more about this in STM32 Repo if you want.

Best Regards, Ajit B (@gdf8gdn8 @goldscott @ajitbasarur),

goldscott commented 3 years ago

@gdf8gdn8 @ajitbasarur If you are discussing in ST repo, can you post a link here?

ajitbasarur commented 3 years ago

We have not yet started discussing about this issue in ST Repo. We were waiting until now reply from @xiaocq2001. If I do not get response from him with in this week, then I will post the issue in ST repo and share the link here.

ajitbasarur commented 3 years ago

Hi @goldscott,

STM is addressing the problem on their Github. I will close the issue here. Thanks for your support.

Regards, Ajit B