espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.

[ieee802154] transmit with ie fields results in wrong state during RX abort (IDFGH-13124) #14068

Closed no2chem closed 2 months ago

no2chem commented 4 months ago

Answers checklist.

IDF version.

v5.4-dev-826-g8760e6d2a7

Espressif SoC revision.

esp32h2-20221101

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

esp32-h2-devboard-m

Power Supply used.

USB

What is the expected behavior?

It should not assert, and it should send the frame with IE fields properly.

What is the actual behavior?

An assert in isr_handle_rx_abort, IEEE802154_ASSERT(s_ieee802154_state == IEEE802154_STATE_RX); is hit.

Steps to reproduce.

  1. Apply PR #14060 to allow time sync messages to work
  2. build RCP with time sync enabled
  3. run otbr-posix with time sync enabled
  4. change state to leader
  5. when time service sends first message with header ie, the assert will occur.

Debug Logs.

assert failed: 0x4080a08a
0x4080a08a: isr_handle_rx_abort at esp-idf/components/ieee802154/driver/esp_ieee802154_dev.c:543 (discriminator 1)
 (inlined by) ieee802154_isr at esp-idf/components/ieee802154/driver/esp_ieee802154_dev.c:689 (discriminator 1)

Stack dump detected
Core  0 register dump:
MEPC    : 0x408007a0  RA      : 0x40805d8a  SP      : 0x40810c60  GP      : 0x4080edc4
0x408007a0: panic_abort at esp-idf/components/esp_system/panic.c:463
0x40805d8a: __ubsan_include at esp-idf/components/esp_system/ubsan.c:311

TP      : 0x4081d3b0  T0      : 0x37363534  T1      : 0x7271706f  T2      : 0x33323130
S0/FP   : 0x4080a08e  S1      : 0x00000010  A0      : 0x40810c64  A1      : 0x00000061
0x4080a08e: ieee802154_isr at esp-idf/components/ieee802154/driver/esp_ieee802154_dev.c:631 (discriminator 2)

A2      : 0x00000003  A3      : 0x40810c78  A4      : 0x00000001  A5      : 0x40818000
A6      : 0x00000000  A7      : 0x76757473  S2      : 0x00000002  S3      : 0x600a3000
S4      : 0x00300020  S5      : 0x00000004  S6      : 0x40813ad8  S7      : 0x40814058
S8      : 0x00000081  S9      : 0x40814128  S10     : 0x00000001  S11     : 0x00000010
T3      : 0x6e6d6c6b  T4      : 0x6a696867  T5      : 0x66656463  T6      : 0x62613938
MSTATUS : 0x00001881  MTVEC   : 0x40800001  MCAUSE  : 0x00000007  MTVAL   : 0x00000000
0x40800001: _vector_table at esp-idf/components/riscv/vectors_intc.S:54

MHARTID : 0x00000000

Backtrace:

panic_abort (details=0x40810c64 <xIsrStack+1428> "assert failed: 0x4080a08a") at esp-idf/components/esp_system/panic.c:463
463     *((volatile int *) 0) = 0; // NOLINT(clang-analyzer-core.NullDereference) should be an invalid operation on targets
#0  panic_abort (details=0x40810c64 <xIsrStack+1428> "assert failed: 0x4080a08a") at esp-idf/components/esp_system/panic.c:463
#1  0x40805d8a in esp_system_abort (details=<optimized out>) at esp-idf/components/esp_system/port/esp_system_chip.c:92
Backtrace stopped: frame did not save the PC
ELF file SHA256: 38afeeef0

More Information.

No response

no2chem commented 4 months ago

Debugging into this more, it looks like the problem is that we get an IEEE802154_RX_ABORT_BY_RX_RESTART while we are in state IEEE802154_STATE_TX_ACK. Perhaps the RX gets aborted in ieee802154_transmit, but we are still handling the TX_SFD_DONE event (due to OPENTHREAD_CONFIG_TIME_SYNC_ENABLE).

no2chem commented 4 months ago

So it appears that we actually end up in this erroneous state because of CslTransmit.

Specifically, it looks like in time sync messages a delay is set: aFrame->mInfo.mTxInfo.mTxDelay != 0, so we use the transmit_at function - esp_ieee802154_transmit_at.

I think when the delay is large enough (in a time sync message, perhaps), there is some interaction that causes a RX abort in a TX state.
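
The path selection described above can be sketched roughly as follows. This is a self-contained model rather than the actual radio glue code: `model_tx_info_t`, `model_tx_request_t`, and `model_dispatch_tx` are hypothetical stand-ins, and computing the target time as base time plus delay is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TX metadata attached to a frame. */
typedef struct {
    uint32_t tx_delay_base_time; /* reference timestamp, in microseconds */
    uint32_t tx_delay;           /* 0 means "send immediately" */
} model_tx_info_t;

typedef enum { MODEL_TX_IMMEDIATE, MODEL_TX_DELAYED } model_tx_path_t;

/* Records which path was chosen and, for delayed sends, the target time. */
typedef struct {
    model_tx_path_t path;
    uint32_t target_time;
} model_tx_request_t;

/* A non-zero delay selects the timer-driven ("transmit at") path;
 * a zero delay selects the immediate path. */
model_tx_request_t model_dispatch_tx(const model_tx_info_t *info)
{
    assert(info != NULL);
    model_tx_request_t req = { MODEL_TX_IMMEDIATE, 0u };
    if (info->tx_delay != 0u) {
        req.path = MODEL_TX_DELAYED;
        /* Assumption: target = base + delay (wraps modulo 2^32). */
        req.target_time = info->tx_delay_base_time + info->tx_delay;
    }
    return req;
}
```

In this model, only frames that carry a delay ever reach the timer-driven path, which would be consistent with the assert showing up only once time sync messages are sent.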

no2chem commented 4 months ago

Okay - I think I've isolated the issue but I'm not sure what the fix is yet.

It looks like when ieee802154_transmit_at is called, the RX state isn't checked. So if the timer expires while we are in the middle of an RX, we'll get an RX abort, but it will only get picked up after the state has transitioned to TX_ACK.
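
To make the suspected ordering concrete, here is a minimal model of the race. It is not the driver code: the `model_*` names and the three-state enum are hypothetical simplifications, and the "latched abort" flag is an assumption about how the event reaches the ISR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified driver states (not the real enum). */
typedef enum {
    MODEL_STATE_IDLE,
    MODEL_STATE_RX,
    MODEL_STATE_TX_ACK
} model_state_t;

typedef struct {
    model_state_t state;
    bool rx_abort_pending; /* abort event latched, not yet serviced */
} model_radio_t;

void model_radio_start_rx(model_radio_t *r)
{
    assert(r != NULL);
    r->state = MODEL_STATE_RX;
    r->rx_abort_pending = false;
}

/* Delayed-TX timer fires. The RX state is not checked: an in-progress
 * reception is aborted (event latched) and the state moves on anyway. */
void model_radio_timer_fired(model_radio_t *r)
{
    assert(r != NULL);
    if (r->state == MODEL_STATE_RX) {
        r->rx_abort_pending = true;
    }
    r->state = MODEL_STATE_TX_ACK;
}

/* Services a latched RX abort. Returns false in the situation where an
 * assert on "state == RX" would have failed. */
bool model_radio_service_rx_abort(model_radio_t *r)
{
    assert(r != NULL);
    if (!r->rx_abort_pending) {
        return true; /* nothing to service */
    }
    r->rx_abort_pending = false;
    return r->state == MODEL_STATE_RX;
}
```

Calling `model_radio_start_rx`, then `model_radio_timer_fired`, then `model_radio_service_rx_abort` returns false in this model, which mirrors the failing assert in `isr_handle_rx_abort`. If no reception is in progress when the timer fires, no stale abort is left behind.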

zwx1995esp commented 4 months ago

Hi @no2chem, thanks for such a detailed analysis! I will try to reproduce this issue and fix it later.

no2chem commented 4 months ago

Hi @zwx1995esp - I'm not sure if you've made progress on this, but it seems like when isr_handle_rx_abort is hit, ieee802154_timer0_get_value() will return a value. So for some reason RX is still happening when the timer is loaded. There's not much documentation on the ieee802154 driver, but I think we need to make sure no other operation takes place until the timer resets.
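
One way to read the suggestion above is as a guard that keeps other operations from starting while the delayed-TX timer is pending. The sketch below only illustrates that idea under stated assumptions: `model_sched_t` and its functions are hypothetical, and nothing here is confirmed to match how the hardware or the eventual fix behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a pending delayed transmission. */
typedef struct {
    bool timer_pending; /* delayed-TX timer armed, not yet expired */
    bool rx_active;     /* a reception is in progress */
} model_sched_t;

/* Arms the delayed-TX timer. Any reception is stopped first, so no RX
 * abort can be raised later when the timer expires. */
void model_sched_arm_timer(model_sched_t *s)
{
    assert(s != NULL);
    s->rx_active = false;
    s->timer_pending = true;
}

/* Refuses to start a reception while the timer is pending. */
bool model_sched_try_start_rx(model_sched_t *s)
{
    assert(s != NULL);
    if (s->timer_pending) {
        return false;
    }
    s->rx_active = true;
    return true;
}

/* Timer expired (or was reset): other operations may start again. */
void model_sched_timer_expired(model_sched_t *s)
{
    assert(s != NULL);
    s->timer_pending = false;
}
```

A guard like this would leave the radio unable to receive for the whole delay, so for long delays it is likely too blunt; it is shown only to make the proposed invariant explicit.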

no2chem commented 4 months ago

@zwx1995esp adding further to this, it does seem to go away if I disable RX_WHEN_IDLE. However, that is not very useful, as any router will need to RX when idle in order to talk to children. I can't tell if it's just RX_WHEN_IDLE causing the problem, or if turning off RX_WHEN_IDLE just reduced the number of RXes.

zwx1995esp commented 4 months ago

Hi @no2chem, I have already added this issue to my todo list, but sorry, I have been a little busy these days (I am working on C5 IEEE802154 feature support). I think I will start on this issue next week.

no2chem commented 4 months ago

Thanks! I see. I think I fixed the issue and will submit a PR shortly. Edit: never mind, it seems to still happen; still looking into it.

zwx1995esp commented 4 months ago

OK, PRs are welcome :) if you fix it.

no2chem commented 4 months ago

Hi @zwx1995esp - I submitted a PR which seems to address my issues. https://github.com/espressif/esp-idf/pull/14089

As I don't have any documentation on the 802.15.4 hardware, please check to see if my edits make sense.

Thanks!

chshu commented 3 months ago

@no2chem see https://github.com/espressif/esp-idf/pull/14089#issuecomment-2264960998, could you please test with the latest IDF master branch, and let us know if you see any further issues.