espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.33k stars 7.2k forks source link

Wifi stack hangs (IDFGH-9724) (IDFGH-9725) #11060

Closed rojer closed 1 week ago

rojer commented 1 year ago

Answers checklist.

IDF version.

4.4 release, tried latest 4.4 latest as well

Operating System used.

Linux

How did you build your project?

Command line with Make

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-PICO-KIT

Power Supply used.

USB

What is the expected behavior?

Wifi connects every time

What is the actual behavior?

Sometimes (approximately 1 in hundreds to a thousand) the wifi stack gets into an inconsistent state and is unable to connect or scan properly. The tell-tale sign is appearance of W (67573) wifi:m f probe req l=0 messages on the console, and additionally, wifi can no longer be properly stopped, after 5 seconds of blocking it prints the following:

[Mar 24 09:26:53.884] W (19912322) wifi:TX Q not empty: 500, TXQ_BLOCK=0
[Mar 24 09:26:53.884] W (19912322) wifi:force witi stop
[Mar 24 09:26:53.890] I (19912322) wifi:flush txq
[Mar 24 09:26:53.890] I (19912332) wifi:stop sw txq
[Mar 24 09:26:53.897] I (19912332) wifi:lmac stop hw txq
[Mar 24 09:26:53.897] W (19912332) wifi:sw tx 0 state not idle, potential error!
[Mar 24 09:26:53.903] W (19912342) wifi:force sw tx 0 state to idle, ebuf flag=3009

restarting does not solve the issue, and deinit causes a core dump due to what looks like a double-free on some internal buffer:

[Mar 24 09:26:53.909] I (19912342) wifi:Deinit lldesc rx mblock:4
[Mar 24 09:26:53.915].
[Mar 24 09:26:53.915] assert failed: tlsf_free heap_tlsf.c:964 (!block_is_free(block) && "block already marked as free")

Steps to reproduce.

I have firmware that reboots 5 seconds after wifi connects and it will get into this state within an hour or two.

Debug Logs.

No response

More Information.

No response

rojer commented 1 year ago

here is a typical "good" boot log: https://gist.github.com/rojer/85c3abd0157a5b7935ca7249ff04ea1c and a bad one (this is one, in total i have 3 occurences out of approx 8000 samples so far): https://gist.github.com/rojer/f2adba071ad424167bce5930d44f2f11 things to note:

  1. normally a device is online within a few seconds, not so in the bad case
  2. it does manage to associate but gets its wires crossed somewhere before obtaining a DHCP lease
  3. at line 138 wifi is restarted but it already exhibits the unhealthy behavior - "forced stop", "potential error", etc. and never recovers. scans are failing and eventually the "probe req l=0" message appears.
rojer commented 1 year ago

this seems to suggest that this has been fixed already. can i get a confirmation on whether that is the case?

igrr commented 1 year ago

No, that "duplicate" label is just a bug of our issue tracking system. Somehow two internal issues got created for this GitHub issue; the second one was marked as "duplicate", and the resolution got synced back to this issue. I'll clear the label.

AxelLin commented 1 year ago

No, that "duplicate" label is just a bug of our issue tracking system. Somehow two internal issues got created for this GitHub issue; the second one was marked as "duplicate", and the resolution got synced back to this issue. I'll clear the label.

The question is this issue is duplicated with which issue? I don't find the link.

rojer commented 1 year ago

@AxelLin it's not duplicated with anything, labeling was a mistake. it's still waiting for investigation.

lszelejewski commented 1 year ago

I can confirm the same issue but on 4.2.4

Exactly same symptoms related to frequent network connection and disconnection event. Happens more frequently when WiFi signal is weak compared to sites with better network coverage.

W (21654) wifi:TX Q not empty: 500, TXQ_BLOCK=0 W (21654) wifi:force witi stop I (21654) wifi:flush txq I (21654) wifi:stop sw txq I (21654) wifi:lmac stop hw txq W (21654) wifi:sw tx 0 state not idle, potential error! W (21654) wifi:force sw tx 0 state to idle, ebuf flag=0

esp_restart() does not help - only power cycle.

AxelLin commented 1 year ago

I can confirm the same issue but on 4.2.4

v4.2 is already EOL, maybe you can try with newer esp-idf version?

lszelejewski commented 1 year ago

v4.4 has been tried as well. Same issue. The biggest problem is that esp_restart does not help and the only solution is to power cycle device manually - very problematic for our customers.

AxelLin commented 1 year ago

No, that "duplicate" label is just a bug of our issue tracking system. Somehow two internal issues got created for this GitHub issue; the second one was marked as "duplicate", and the resolution got synced back to this issue. I'll clear the label.

This issue has been reported for several months, but there is no response from other espressif developers?

Xiehanxin commented 1 year ago

hi @rojer , I have seen your log and I would like to confirm that if you have used the BLE and Wi-Fi coex? and could you share the minimum reproduce code to us

Xiehanxin commented 1 year ago

hi @lszelejewski could you describe your application scenario in details

tomasznowik commented 1 year ago

On behalf of @lszelejewski I will describe our problem. Majority of our devices uses v4.2.4. We found this issue because due to our bug the device frees already freed pointer and calls esp_restart just after dropped wifi and reconnecting. We see

CORRUPT HEAP: multi_heap.c:540 detected at 0x3ffbb094
abort() was called at PC 0x40094971 on core 0

before the reboot.

But the problem is that WiFi can't connect anymore after startup. Software watchdog resets the device with esp_restart but issue persists until power cycle. We can recreate this error only with weak wifi signal when connection is dropped from time to time so our bug corrupts the heap.

So it looks like this:

  1. device works in environment with low wifi signal
  2. before the issue the device lost wifi connection, connects again but obtaining address from DHCP didn't finish
  3. abort() and CORRUPT HEAP occurred due to double free call (our bug) and then reboot
  4. after reboot the device was unable to connect to wifi
  5. software watchdog resets the device each 3 minutes but it doesn't help
  6. powercycle helped.

The log:

E (1894) UDP Server: Socket created
I (1894) UDP Server: Socket bound, port 1234
I (1894) UDP Server: Waiting for data
I (1984) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (1984) wifi:state: init -> auth (b0)
I (2384) WATCHDOG: Subscribe watchdog to task esp_timer
I (2984) wifi:state: auth -> init (200)
I (2984) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
W (2984) WiFi: wifi event handler 5
I (2984) WiFi: Retry to connect to the AP
I (2994) EventHandlers: NETWORKING event ID 1
I (4294) wifi:new:<6,0>, old:<1,0>, ap:<255,255>, sta:<6,0>, prof:1
I (4294) wifi:state: init -> auth (b0)

01:44:06.678    I (5294) wifi:state: auth -> init (200)
I (5294) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
W (5294) WiFi: wifi event handler 5
I (5294) WiFi: Retry to connect to the AP
I (5294) EventHandlers: NETWORKING event ID 1

01:44:08.036    I (7704) wifi:new:<11,0>, old:<6,0>, ap:<255,255>, sta:<11,0>, prof:1
I (7704) wifi:state: init -> auth (b0)

01:44:09.395    I (8704) wifi:state: auth -> init (200)
I (8704) wifi:new:<11,0>, old:<11,0>, ap:<255,255>, sta:<11,0>, prof:1
W (8704) WiFi: wifi event handler 5
I (8714) WiFi: Retry to connect to the AP
I (8714) EventHandlers: NETWORKING event ID 1

01:44:12.419    W (12074) WiFi: wifi event handler 5
I (12074) WiFi: Retry to connect to the AP
I (12074) EventHandlers: NETWORKING event ID 1

01:44:13.783    I (12124) wifi:new:<1,0>, old:<11,0>, ap:<255,255>, sta:<1,0>, prof:1
I (12124) wifi:state: init -> auth (b0)
I (13124) wifi:state: auth -> init (200)
I (13124) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
W (13124) WiFi: wifi event handler 5
I (13124) WiFi: Retry to connect to the AP
I (13124) EventHandlers: NETWORKING event ID 1

01:44:15.135    I (14424) wifi:new:<6,0>, old:<1,0>, ap:<255,255>, sta:<6,0>, prof:1
I (14434) wifi:state: init -> auth (b0)

01:44:16.494    I (15434) wifi:state: auth -> init (200)
I (15434) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
W (15434) WiFi: wifi event handler 5
I (15434) WiFi: Retry to connect to the AP
I (15434) EventHandlers: NETWORKING event ID 1
W (15674) wifi:m f probe req l=0

W (15914) wifi:m f probe req l=0

W (16154) wifi:m f probe req l=0

01:44:17.846    W (16394) wifi:m f probe req l=0

W (16634) wifi:m f probe req l=0

W (16874) wifi:m f probe req l=0

W (17114) wifi:m f probe req l=0

W (17354) wifi:m f probe req l=0

01:44:19.192    W (17604) wifi:m f probe req l=0

W (17844) wifi:m f probe req l=0

I (17944) wifi:new:<11,0>, old:<6,0>, ap:<255,255>, sta:<11,0>, prof:1
I (17944) wifi:state: init -> auth (b0)

01:44:20.545    I (18944) wifi:state: auth -> init (200)
I (18944) wifi:new:<11,0>, old:<11,0>, ap:<255,255>, sta:<11,0>, prof:1
W (18944) WiFi: wifi event handler 5
W (18944) WiFi: Using fallback network

01:44:24.286    W (23954) wifi:TX Q not empty: 500, TXQ_BLOCK=0
W (23954) wifi:force witi stop
I (23954) wifi:flush txq
I (23954) wifi:stop sw txq
I (23954) wifi:lmac stop hw txq
W (23954) wifi:sw tx 0 state not idle, potential error!
W (23954) wifi:force sw tx 0 state to idle, ebuf flag=412

It looks like inconsistent state is kept during reset with esp_restart.

Is there a change that triggering hardware watchdog would reset WiFi better than esp_restart()?

Xiehanxin commented 1 year ago

hi @tomasznowik from your log, there might be somerhing wrong with phy or the Phy uses the wrong calibration data , could you tell me 1.how much device have this problem 2.how often the problem recurs and tell me the reproduce steps if it is convient. 3.provide your sdkconfig.

tomasznowik commented 1 year ago
  1. It seems that many random devices has this issues so it's rather software issue. We managed to create bad wifi environment that 4 out of 10 devices hanged due this issue in 24 hours. Normally it almost never happens.
  2. It's hard to recreate this problem but it can happen once a day when there's weak wifi signal. But we are sure it was triggered due to our bug in the code that corrupted heap memory and triggered esp_restart in other thread. Steps to reproduce are in previous message.
  3. Please see attachment: sdkconfig.zip
tomasznowik commented 10 months ago

@Xiehanxin We still observe the issue. After fixing our own bug the device doesn't reset so often so it looked like the issue is solved but it's not. After writing a configuration to flash we call esp_system_abort to reboot. And rarely the devices can't connect to wifi anymore because of the error I described previously.

W 19585 | wifi             force witi stop
I 19585 | wifi             flush txq
I 19585 | wifi             stop sw txq
I 19585 | wifi             lmac stop hw txq
W 19585 | wifi             sw tx 0 state not idle, potential error!
W 19585 | wifi             force sw tx 0 state to idle, ebuf flag=412

You wrote "there might be somerhing wrong with phy or the Phy uses the wrong calibration data". Is there anything we could try to fix this? We use:

CONFIG_ESP32_PHY_CALIBRATION_AND_DATA_STORAGE=y
# CONFIG_ESP32_PHY_INIT_DATA_IN_PARTITION is not set
CONFIG_ESP32_PHY_MAX_WIFI_TX_POWER=20
CONFIG_ESP32_PHY_MAX_TX_POWER=20
AxelLin commented 8 months ago

@Xiehanxin Any further feedback about https://github.com/espressif/esp-idf/issues/11060#issuecomment-1669941226 and https://github.com/espressif/esp-idf/issues/11060#issuecomment-1807573122 ?

Espressif-liuuuu commented 2 months ago

Hi @rojer , we find the cause of Wi-Fi connect issue after reset when coexisting with BLE scan. https://github.com/espressif/esp-idf/issues/13598#issuecomment-2188088182
If its similiar in your case, you can try to verify it on v5.2 as well.

rojer commented 2 months ago

@Espressif-liuuuu great news, thanks! we've been carrying a really hacky workarond for this for a while. i'll try to repro it with our current code (w/o workaround) and then try the fix.

AxelLin commented 1 month ago

i'll try to repro it with our current code (w/o workaround) and then try the fix.

@rojer Does the fix work for you?

Sherry616 commented 1 week ago

Thanks for reporting, will close due to short of feedback, feel free to reopen with more updates. Thanks for using our Espressif product!

rojer commented 1 week ago

yeah, i'm sorry - i have actually tried to reproduce it, but wasn't able to. something has changed somewhere, but the bottom line is we don't see these crashes anymore, and that's good.