Hardfault unaligned access in _nx_tcp_socket_receive_queue_flush when disconnecting

gregtwice commented 1 year ago

What target device are you using? LPC55S69
Which version of Azure RTOS? 6.1
What toolchain and environment? arm-none-eabi-* + WSL

Hi, I have a thread that handles the transfer of packets between two network interfaces, and I use curl to transfer files between them. When I disconnect one of the interfaces, I sometimes get a HardFault due to an unaligned memory access.

To reproduce:

1. Build my project with arm-none-eabi-gcc.
2. Launch a transfer between the two interfaces.
3. Fiddle with one of the interfaces (disconnect, reconnect).
4. When NetX considers the connection closed, _nx_tcp_socket_receive_queue_flush will sometimes generate a HardFault due to an unaligned memory access.

Stack frame:
 R0  = aaaaaaaa
 R1  = 00000000
 R2  = 2000d6a8
 R3  = 0000000a
 R12 = 00000000
 LR  = 00000001
 PC  = 00022710
 PSR = 21000000
- FSR/FAR:
 CFSR = 01000000
 HFSR = 40000000
 DFSR = 00000000
 AFSR = 00000000
 BFAR = e000ed38
 MFAR = e000ed34

When looking at the disassembly of my executable, I see that the instruction at 0x22710 is LDR r6, [r0, #32]. Looking at my stack frame, I see that r0 equals 0xAAAAAAAA, which is the value for an allocated packet. It would seem that the instruction is attempting to load the value at 0xAAAAAACA (0xAAAAAAAA + 0x20), which is not 4-byte aligned, resulting in a crash.
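The reading of the register dump above can be checked with a few lines of C. This is a minimal host-side sketch, not code from the project: the helper names are hypothetical, and the bit positions (UNALIGNED at CFSR bit 24, FORCED at HFSR bit 30) are taken from the Arm Cortex-M fault status register layout.

```c
#include <stdint.h>

/* Cortex-M fault status bits (Arm architecture definitions). */
#define CFSR_UFSR_UNALIGNED (1UL << 24) /* UsageFault: unaligned access   */
#define HFSR_FORCED         (1UL << 30) /* fault escalated to a HardFault */

/* Hypothetical helper: returns 1 when the dump describes an
 * unaligned-access UsageFault that was escalated to a HardFault. */
static int is_escalated_unaligned_fault(uint32_t cfsr, uint32_t hfsr)
{
    return (cfsr & CFSR_UFSR_UNALIGNED) != 0 && (hfsr & HFSR_FORCED) != 0;
}

/* Effective address of "LDR Rt, [Rn, #imm]": base register plus offset. */
static uint32_t ldr_effective_address(uint32_t base, uint32_t imm)
{
    return base + imm;
}

/* A 32-bit load is word-aligned when the two low address bits are zero. */
static int is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}
```

With the values from the dump (CFSR = 0x01000000, HFSR = 0x40000000, R0 = 0xAAAAAAAA, offset 32), these helpers report an escalated unaligned fault on a load from 0xAAAAAACA, which is consistent with a marker value being dereferenced as if it were a packet pointer.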

Have you ever seen this bug? If so, are there steps to fix it? Or should I update NetX?

Best regards,

TiejunMS commented 1 year ago

There is a race condition fix, as described in #43. Could you upgrade to the latest version and see if this issue is solved?

gregtwice commented 1 year ago

Thanks, the issue is solved

TEcwbg commented 1 year ago

I have a similar problem, and upgrading to the latest version (v6.2.1) does not solve the issue. Is there anything I should try?

Target board: Renesas EK-RA6M5
Azure RTOS NetX Duo version: v6.2.1
Toolchain: GCC ARM Embedded
Toolchain Version: 10.3.1.20210824

[Image: e2studio_cap]

Best regards.

gregtwice commented 1 year ago

After further testing, it appears the issue is still there, still in _nx_tcp_socket_receive_queue_flush, but the address accessed is now 0xEEEEEEEE + 0x20.

TiejunMS commented 1 year ago

Could you share a packet trace capture by wireshark or tcpdump?
Can this issue be reproduced easily? If so, please describe it.
Did you encounter this issue from beginning of using NetX Duo or after upgrade?
Did you port the network driver, or it is provided by NXP/Renesas/MSFT?

TEcwbg commented 1 year ago

> Could you share a packet trace capture by wireshark or tcpdump?
I don't have an environment to capture one right now. I can provide it when ready.

> Can this issue be reproduced easily? If so, please describe it.
At first I experienced this issue when transmitting data to a server by FTP. It happened when the server was under heavy load (congestion). After that, I found that quickly disconnecting and reconnecting the Ethernet cable while transmitting TCP data can reproduce the issue. With this method, it can be reproduced within a few minutes.

> Did you encounter this issue from beginning of using NetX Duo or after upgrade?
From the beginning. I started with v6.1.11.

> Did you port the network driver, or it is provided by NXP/Renesas/MSFT?
I use the network driver provided by Renesas.

TiejunMS commented 1 year ago

I'm not sure if we are looking at the same version of the Renesas network driver. Could you search for nx_packet_release in rm_netxduo_ether.c? Where the TX BD is released, replace the function call with nx_packet_transmit_release. I suspect the issue is caused by multiple releases of the same packet.

TEcwbg commented 1 year ago

I replaced nx_packet_release with nx_packet_transmit_release in the file /ra/fsp/src/rm_netxduo_ether.c (6 occurrences replaced).

The result was the same: the issue can still be reproduced.

TiejunMS commented 1 year ago

Could you compile your project with NX_ENABLE_PACKET_DEBUG_INFO defined? When you hit the hard fault, add socket_ptr -> nx_tcp_socket_transmit_sent_head to the watch list. Follow the nx_packet_union_next.nx_packet_tcp_queue_next links until you reach 0xaaaaaaaa. For the last packet, please share the values of nx_packet_debug_file, nx_packet_debug_line and nx_packet_debug_thread.
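The manual walk described above (follow the next links until a marker value appears, then report the last real packet) can be sketched as follows. This is a host-side illustration only: mock_packet is a hypothetical stand-in for NX_PACKET reduced to the fields named in this thread, the MARKER_* names are invented for the example, and treating 0xEEEEEEEE as a second marker is an assumption based on the address reported earlier in the thread.

```c
#include <stdint.h>
#include <stdio.h>

/* Marker values seen in this thread; the names are hypothetical. */
#define MARKER_ALLOCATED 0xAAAAAAAAu
#define MARKER_ENQUEUED  0xEEEEEEEEu /* assumption: also a marker, not a pointer */

/* Hypothetical stand-in for NX_PACKET, not the real structure. */
typedef struct mock_packet {
    uintptr_t     tcp_queue_next; /* address of next packet OR a marker */
    const char   *debug_file;     /* mirrors nx_packet_debug_file   */
    unsigned long debug_line;     /* mirrors nx_packet_debug_line   */
    const char   *debug_thread;   /* mirrors nx_packet_debug_thread */
} mock_packet;

static int is_marker(uintptr_t value)
{
    return value == MARKER_ALLOCATED || value == MARKER_ENQUEUED;
}

/* Follow the chain from head and return the last real packet, i.e. the one
 * whose next field is NULL or a marker. max_hops bounds the walk so that a
 * corrupted, cyclic chain cannot loop forever; NULL means "gave up". */
static const mock_packet *last_packet_before_marker(const mock_packet *head,
                                                    unsigned max_hops)
{
    const mock_packet *p = head;
    if (p == NULL) {
        return NULL;
    }
    while (max_hops-- > 0) {
        uintptr_t next = p->tcp_queue_next;
        if (next == 0 || is_marker(next)) {
            return p;
        }
        p = (const mock_packet *)next;
    }
    return NULL;
}

/* Print where the packet was last touched, as the debug fields record it. */
static void print_packet_origin(const mock_packet *p)
{
    printf("%s:%lu (thread %s)\n", p->debug_file, p->debug_line, p->debug_thread);
}
```

The hop limit is a deliberate choice: after memory corruption the chain cannot be trusted to terminate, so a bounded walk is safer than a plain while loop.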

TEcwbg commented 1 year ago

Conclusion first: the issue was my fault.

Two threads were each using a different FTP client while sharing the same packet pool. Sharing the pool is not a problem in itself. However, whenever packet loss happened and nx_ftp_client_xxx() returned an error, I was recreating (delete and create) the packet pool as a just-in-case measure.

So when packet loss happened on both threads and each thread disconnected and recreated the packet pool, there was a chance of accessing a deleted (or re-initialized) packet pool while it still had packets in use.

I removed the code that recreated the packet pool, and the issue no longer reproduces.
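The hazard described above can be illustrated with a small model. This is not the NetX Duo API: model_pool, model_packet and the generation counter are hypothetical constructs, used only to show why a packet held by one thread becomes invalid the moment another thread deletes and recreates the shared pool.

```c
#include <stdint.h>

/* Hypothetical model of a shared packet pool; the generation counter
 * changes every time the pool is deleted and created again. */
typedef struct {
    uint32_t generation;
} model_pool;

/* Hypothetical packet that remembers which pool generation it came from. */
typedef struct {
    const model_pool *pool;
    uint32_t          generation;
} model_packet;

static model_packet model_allocate(const model_pool *pool)
{
    model_packet pkt = { pool, pool->generation };
    return pkt;
}

/* Models "delete + create": every packet handed out earlier now refers
 * to pool memory that has been re-initialized. */
static void model_recreate(model_pool *pool)
{
    pool->generation++;
}

/* A stale packet is one allocated before the most recent recreate. */
static int model_packet_is_stale(const model_packet *pkt)
{
    return pkt->generation != pkt->pool->generation;
}
```

In this model, if thread B still holds a packet when thread A calls model_recreate, thread B's packet is stale even though thread B did nothing wrong, which matches the failure pattern described in this comment.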

NX_ENABLE_PACKET_DEBUG_INFO really helped. Thank you.

@gregtwice I'm not sure whether this is the same situation as yours, but I hope it helps.

TiejunMS commented 1 year ago

Glad to know your issue is resolved, @TEcwbg! I will keep this issue open for a while in case @gregtwice still has questions.

TiejunMS commented 1 year ago

Closing.

eclipse-threadx / netxduo

Hardfault unaligned access in _nx_tcp_socket_receive_queue_flush when disconnecting #171