espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.37k stars 7.21k forks source link

[Crash] [release/v4.4] LoadProhibited in esp_wifi/src/wifi_netif.c:110, when switching from STA to AP (IDFGH-7092) #8702

Open chipweinberger opened 2 years ago

chipweinberger commented 2 years ago

ESP32-S3 ESP-IDF release/v4.4 (c29343eb94d2f2ca17b3a5b38c82452e556147f2) from a few days ago.

Been trying to get AP -> APSTA mode switching working. Hit a crash in ESP-IDF.

netif or esp_wifi does not appear to be cleaning up after itself correctly. See the logs. My Logs make it clear how I am calling ESP-IDF code.

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x42079352  PS      : 0x00060830  A0      : 0x82078c11  A1      : 0x3fcb58b0  
0x42079352: esp_wifi_get_if_mac at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_wifi/src/wifi_netif.c:110

A2      : 0x00000000  A3      : 0x3fcb58f6  A4      : 0x3fca5bbc  A5      : 0x00060823  
A6      : 0x00000001  A7      : 0x0000000c  A8      : 0x1873e946  A9      : 0x1873e946  
A10     : 0x00060823  A11     : 0x3fcb58f6  A12     : 0x00000001  A13     : 0x00060823  
A14     : 0x00060023  A15     : 0x00000003  SAR     : 0x00000017  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000008  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0x00000000  

Backtrace:0x4207934f:0x3fcb58b00x42078c0e:0x3fcb58e0 0x42078d19:0x3fcb5920 0x420f8679:0x3fcb5950 0x420f8d57:0x3fcb5990 0x420f8e36:0x3fcb59e0 0x40385b9a:0x3fcb5a10 
0x4207934f: esp_wifi_get_if_mac at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_wifi/src/wifi_netif.c:107

0x42078c0e: wifi_start at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_wifi/src/wifi_default.c:41

0x42078d19: wifi_default_action_sta_start at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_wifi/src/wifi_default.c:69

0x420f8679: handler_execute at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_event/esp_event.c:145

0x420f8d57: esp_event_loop_run at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_event/esp_event.c:598 (discriminator 3)

0x420f8e36: esp_event_loop_run_task at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_event/esp_event.c:115 (discriminator 15)

0x40385b9a: vPortTaskWrapper at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/freertos/port/xtensa/port.c:131

ELF file SHA256: 29e5e3aefec34919

I (7796) esp_core_dump_flash: Save core dump to flash...
I (7803) esp_core_dump_flash: Erase flash 36864 bytes @ 0xc000
I (8331) esp_core_dump_flash: Write end offset 0x87c4, check sum length 4
I (8331) esp_core_dump_flash: Core dump has been saved to flash.
CPU halted.

Logs:

I (61210) pd wifi: Entering STA Mode
I (2680) pd wifi: esp_netif_init()
I (2680) pd wifi: esp_event_loop_create_default()
I (2700) pd wifi: esp_netif_create_default_wifi_sta()
I (2700) pd wifi: esp_wifi_init()
I (2780) pd wifi: esp_event_handler_instance_register(ESP_EVENT_ANY_ID)
I (2790) pd wifi: esp_event_handler_instance_register(IP_EVENT_STA_GOT_IP)
I (2800) pd wifi: esp_wifi_set_mode(WIFI_MODE_STA)
I (2820) pd wifi: esp_wifi_set_config()
I (2820) pd wifi: esp_wifi_start()

I (61210) pd wifi: STA Mode Is Running
I (2910) pd wifi: Wifi Event: STA Mode - Start
I (2910) pd mdns: init
I (2930) pd api init: Starting HTTP Server
I (2950) pd wifi: Wifi Event: STA Mode - Connected
I (3680) pd wifi: IP Event: Station Mode - Got IP
I (3690) pd wifi: got ip: 192.168.0.180

I (61210) pd wifi: Stopping STA Mode
I (61220) pd wifi: esp_wifi_stop()
I (61220) pd wifi: Wifi Event: STA Mode - Disconnected
E (61220) pd wifi: Disconnect reason : (8) Assoc Leave
I (61240) pd wifi: Wifi Event: STA Mode - Stop
I (61260) pd wifi: esp_event_handler_instance_unregister(IP_EVENT_STA_GOT_IP)
I (61260) pd wifi: esp_event_handler_instance_unregister(ESP_EVENT_ANY_ID)
I (61270) pd wifi: esp_wifi_deinit()
I (61290) pd wifi: esp_netif_destroy(0x3fce2eb4)

I (61210) pd wifi: Entering AP Mode
I (61290) pd wifi: esp_netif_create_default_wifi_ap()
I (61300) pd wifi: esp_wifi_init()
I (61370) pd wifi: esp_event_handler_instance_register(ESP_EVENT_ANY_ID)
I (61370) pd wifi: esp_event_handler_instance_register(IP_EVENT_STA_GOT_IP)
I (61380) pd wifi: esp_wifi_set_mode(WIFI_MODE_APSTA)
I (61400) pd wifi: esp_wifi_set_config()
...crashed...
chipweinberger commented 2 years ago

Is there any code with a working example of switching between APSTA and STA modes?

What I've tried:

  1. just using wifi_stop() / esp_wifi_set_config() / esp_wifi_set_mode(). The code shows success but my iPhone refuses to connect to the AP. It seems this is because I initialized with esp_netif_create_default_wifi_sta(). Edit: this was due to a mdns hostname conflict. I had another device plugged in.

  2. also calling esp_netif_destroy() / esp_netif_create_default_wifi_ap(), but then I hit this error: "netif already added", see existing issue: https://github.com/espressif/esp-idf/issues/4061

  3. on top of that, also calling esp_wifi_deinit() / esp_wifi_init(). I then hit this crash.

ESP-IDF should really have a "WifiSwitching" example that is maintained and tested with HTTP, MDNS, etc.

chipweinberger commented 2 years ago

Edit: I've tried everything. Deinitializing everything in reverse.., wifi, netif destroy, event loop, mdns, http server, all the way back to blank slate and then reinitializing. But still bugs. hit new crashes in ESP-IDF code after switching a couple times (PC 0x400553b6, corrupted backtrace), but even without the crash my phone still refuses to connect to the AP. DHCP seems to just not work after switching. My advice: switching wifi modes should be avoided. I have no confidence it will work reliably even if I got it to work. It's does not seem to be tested by Espressif.

Edit2: I fixed the non-connection issue. It was a stupid mistake. I had 2 devices both advertising the same mdns hostname by mistake and never realized it. I unplugged the other device and mdns started working.

If you aren't using MDNS, HTTP, etc, maybe you can get away with it. But for anything more complicated, wifi switching is a mountain of pain.

Arnaud-Oechslin commented 2 years ago

Are you using esp_netif_new() when you're switching ? They had a similar problem caused by esp_netif_init() when called mutliple times.

chipweinberger commented 2 years ago

Yes. My understanding is the esp_netif_new() is called when I call esp_netif_create_default_wifi_ap().

From my above logs:

I (61210) pd wifi: Entering AP Mode
I (61290) pd wifi: esp_netif_create_default_wifi_ap()
chipweinberger commented 2 years ago

Of course, worth reiterating that this crash should still be fixed and replaced with a log + err code if something is called incorrectly. Hopefully my logs + esp-idf code inspection are enough to find and fix it.

david-cermak commented 2 years ago

Hi @chipweinberger

Could you please try to call esp_netif_destroy_default_wifi() instead of esp_netif_destroy()?

The init-deinit sequence is being tested, not very thoroughly though, for example here: https://github.com/espressif/esp-idf/blob/36b83fcb9811437102a5b6b156221eeffe1e8fd7/components/esp_netif/test/test_esp_netif.c#L325

chipweinberger commented 2 years ago

Crashes. No backtrace =/

// Starting APSTA mode
I (2687) pd wifi: Entering AP Mode
I (2687) pd wifi: esp_netif_init()
I (2697) pd wifi: esp_event_loop_create_default()
I (2697) pd wifi: esp_netif_create_default_wifi_ap()
I (2707) pd wifi: esp_wifi_init()
I (2787) pd wifi: esp_event_handler_instance_register(ESP_EVENT_ANY_ID)
I (2797) pd wifi: esp_event_handler_instance_register(IP_EVENT_STA_GOT_IP)
I (2797) pd wifi: esp_wifi_set_mode(WIFI_MODE_APSTA)
I (2817) pd wifi: esp_wifi_set_config(WIFI_IF_AP)
I (2827) pd wifi: esp_wifi_start()
I (2907) pd wifi: Wifi Event: STA Mode - Start
I (2907) pd mdns: init
I (2907) pd wifi: Wifi Event: AP Mode - Start
I (2927) pd api init: Starting HTTP Server

// client connected
I (27367) pd wifi: Wifi Event: AP Mode - Client Connected
I (27367) pd wifi: Client connected: 96:c8:b6:c9:5a:df join, AID=1
I (27657) esp_netif_lwip: DHCP server assigned IP to a station, IP is: 192.168.4.2
I (30967) pd api init: Client connected 58
I (35767) pd api2 http: GET /internal called

// I now pushed a button on the device to switch to APSTA mode

// Stop APSTA mode
I (44407) pd wifi: Leaving AP Mode
I (44417) pd api init: Stopping HTTP Server
I (44417) pd api init: httpd_stop()
I (44427) pd api init: Client disconnected 57
I (44527) pd mdns: deinit
I (44537) pd wifi: esp_wifi_stop()
I (44537) pd wifi: Wifi Event: STA Mode - Stop
I (44537) pd wifi: Wifi Event: AP Mode - Client Disconnected
I (44537) pd wifi: Client disconnected: 96:c8:b6:c9:5a:df leave, AID=1
I (44547) pd wifi: Wifi Event: AP Mode - Stop
I (44587) pd wifi: esp_event_handler_instance_unregister(IP_EVENT_STA_GOT_IP)
I (44587) pd wifi: esp_event_handler_instance_unregister(ESP_EVENT_ANY_ID)
I (44597) pd wifi: esp_wifi_deinit()
I (44617) pd wifi: esp_netif_destroy_default_wifi(0x3fce2f74)
I (44617) pd wifi: esp_event_loop_delete_default()

// Start STA mode
I (44627) pd wifi: Entering Home Wifi Mode
I (44637) pd wifi: esp_event_loop_create_default()
I (44637) pd wifi: esp_netif_create_default_wifi_sta()
I (44647) pd wifi: esp_wifi_init()
I (44717) pd wifi: esp_event_handler_instance_register(ESP_EVENT_ANY_ID)
I (44717) pd wifi: esp_event_handler_instance_register(IP_EVENT_STA_GOT_IP)
I (44727) pd wifi: esp_wifi_set_mode(WIFI_MODE_STA)
I (44747) pd wifi: switching to 'Connect To Home Wifi' mode
I (44747) pd wifi: esp_wifi_set_config(WIFI_IF_STA)
I (44757) pd wifi: esp_wifi_start()
I (44767) pd wifi: Wifi Event: STA Mode - Start
I (44877) pd wifi: Wifi Event: STA Mode - Connected
I (48697) esp_netif_handlers: sta ip: 192.168.0.180, mask: 255.255.255.0, gw: 192.168.0.1
I (48697) pd wifi: IP Event: Station Mode - Got IP
I (48697) pd wifi: got ip: 192.168.0.180
I (48707) pd_sntp: Starting SNTP
LED: solid r:0 g:255 b:0
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4005544b  PS      : 0x00060c30  A0      : 0x820509f0  A1      : 0x3fcb78f0  
A2      : 0x00000000  A3      : 0x3fceea08  A4      : 0x000000a0  A5      : 0x0000ff00  
A6      : 0x00ff0000  A7      : 0xff000000  A8      : 0x00000000  A9      : 0x3fcb7520  
A10     : 0x0000000b  A11     : 0xffffffff  A12     : 0x0000000b  A13     : 0x3fcb7730  
A14     : 0x0000000b  A15     : 0x3fcb7730  SAR     : 0x0000001d  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x400556d5  LEND    : 0x400556e5  LCOUNT  : 0xfffffffd  

Backtrace:0x40055448:0x3fcb78f0 |<-CORRUPTED
david-cermak commented 2 years ago

@chipweinberger Please refer to a very simple example which repeatably creates and destroys wifi in STA mode and AP+STA mode in turn. This works okay on my side.

https://gist.github.com/david-cermak/612487bc96df8a4dd56610d97b20dc3d

I noticed, that you also run mdns and SNTP client on some interfaces. Please note that these modules have to be properly deininitalized, separately; calling esp_netif_destroy() doesn't automatically delete and deinitializes all network modules on that interface.

chipweinberger commented 2 years ago

Really appreciate you trying to replicate this crash. It's good to know it works for you and appreciate your example code! That code will be very useful for other users in the future.

I did not know that SNTP also has to be deinitialized, it is possible that is causing the later crash I experienced after the 2nd wifi switch.

Your code convinces me it seems possible to get this to work. At the moment I changed my code to just reboot, which seems a lot easier heh. Perhaps with SNTP deinit this will start working.

It seems worthwhile for Espressif to maintain an example that uses all of the web stack (http, mdns, sntp, etc), and successfully preforms a wifi switch. It's very nontrivial! =) and probably will easily be broken unless tested. Your code seems like a good starting point for such an example.

david-cermak commented 2 years ago

Updated the gist to use SNTP

Perhaps with SNTP deinit this will start working.

this is very easy, just run sntp_stop()

It seems worthwhile for Espressif to maintain an example ...

Good idea. I agree that it's not very straight-forward to correctly deinit a module. What usually helps is to look at the unit tests of these components, as almost all have a simple init-deinit case (e.g. the one for esp-netif, referenced in the first comment).

chipweinberger commented 2 years ago

Updated the gist to use SNTP

Thanks ❤️

I agree that it's not very straight-forward to correctly deinit a module.

The bigger issue is much higher level imo. how all the pieces fit together.

  1. knowing that something needs to be deinited at all. HTTP, mdns, sntp, none of them take pointers to netif or the wifi stack, so there’s no reason to know they need to be deinited when switching.

  2. it’s hard to know at what point these things need to be deinitialized. after wifi_deinit(), after netif_destroy(), after either? do you need to wait for the wifi_stop event to fire before doing deinit. etc.

this code is complicated but important enough to warrant an example. to figure out 1 and 2 I basically had to spend a day guessing and checking.

david-cermak commented 2 years ago

knowing that something needs to be deinited at all.

The rule of thumb is that if the module uses "some specific" network interface it should be deinitialized before destroying the underlying esp-netif. For example HTTP server doesn't have to, unless it's bound to a specific netif.

it’s hard to know at what point these things need to be deinitialized.

Here, it's also quite simple IMO, the higher layered module should be initialized after the lower layered, and deinited before the lower layered. (saw that I made a mistake in my gist, shouldve deinited sntp 2 lines before)

this code is complicated but important enough to warrant an example

Agree that an example would be useful, but it wouldn't cover all usecases, or all options we have. e.g. in your case, I think you don't have to deinitialize the network at all when switching from STA to AP+STA, you could simply reuse the esp-netif handle. But you need a separate handle for the station interface and one for the AP interface.

chipweinberger commented 2 years ago

Appreciate all the information.

If not an example, perhaps a dedicated documentation page on Wifi Switching? I hate to think all this useful information will be hard to find! It would be especially useful for newer users =) Wifi switching should be made easy!

For the original crash this issue was filed for, feel free to close this issue if you don't think there is enough to go on. I think with code inspection there might be enough info to find it. It looks like a null dereference here:

esp_err_t esp_wifi_get_if_mac(wifi_netif_driver_t ifx, uint8_t mac[6])
{
    wifi_interface_t wifi_interface = ifx->wifi_if; // null deref

    return esp_wifi_get_mac(wifi_interface, mac);
}

Again, appreciate you helping! Launching a product this year =)

chipweinberger commented 2 years ago

Bump.

@david-cermak, is this worth investigating for you? If not, feel free to close.

My last comment has as much detail as I can provide. It may or may not be enough to go on.

chipweinberger commented 2 years ago

An easy and IMO still "proper" solution, given that we already return an error, return INVALID_ARG:

esp_err_t esp_wifi_get_if_mac(wifi_netif_driver_t ifx, uint8_t mac[6])
{
    if (ifx == NULL) {
        return ESP_ERR_INVALID_ARG;
    }

    wifi_interface_t wifi_interface = ifx->wifi_if; // null deref

    return esp_wifi_get_mac(wifi_interface, mac);
}

And might as well do the same for these:

esp_err_t esp_wifi_register_if_rxcb(wifi_netif_driver_t ifx, esp_netif_receive_t fn, void * arg)
{
    if (ifx == NULL) {
        return ESP_ERR_INVALID_ARG;
    }
    ....
}

Arguably, to be consistently "null safe", we should make every function handle NULL. It is pretty obvious that a NULL pointer is not "ready".

bool esp_wifi_is_if_ready_when_started(wifi_netif_driver_t ifx)
{
    if (ifx == NULL) {return false;}
    ....
}