SOTAmat / SOTAcat

CAT control for Elecraft KX radios and SOTAMAT
https://sotamat.com
Other
9 stars 2 forks source link

Waking from the dead (deep sleep) #10

Closed jeffkowalski closed 4 months ago

jeffkowalski commented 4 months ago

Brian states: Recent changes to the code seem to have caused a memory leak. Actually I think it is a stack overflow since the ESP32 ESP-IDF framework does not have a dynamically adjusting stack and instead you have to declare your stack size for each thread. Predicting what that stack size needs to be is hard and error prone. The error manifests as a random reboot after sitting idle for a while my Windows PC browser has the sotacat.local page open (which refreshes the battery every minute). In fact, the browser will call the battery voltage REST and the battery Percentage REST sometimes at the same moment, and I suspect the WiFi code gets an overflow.

jeffkowalski commented 4 months ago

https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-reference/system/heap_debug.html#how-to-diagnose-memory-leaks will be useful here

brianmathews commented 4 months ago

Letting the SOTACAT run on the bench while logging was enabled, I saw what I thought was our memory leak/crash/stack overflow reboot issue.

Turns out this is NOT a leak/crash causing the reboot (at least not what I saw). The log below shows:

  1. the unit's auto-sleep timer has reached the inactivity limit and it performs a shutdown to deep sleep sequence.
  2. The unit appears to properly go to sleep
  3. Somehow while in deep sleep, a WiFi event occurs and wakes the device. I did not know this was even possible. Everywhere I read in the docs there were two types of sleep: (a) light sleep where wifi might remain active, and (b) deep sleep where wifi is not active.
  4. Upon waking, the unit calls a WiFi event handler which crashes because the sequence of entering deep sleep got rid of all the wifi context needed by the handler.
  5. The crash of the handler causes the unit to do a full reboot
  6. The reboot runs into the issue where (reported in another issue / bug ID) we think the ESP sends garbage to the KX radio causing the KX to lock up until rebooted.
  7. The ESP32 keeps trying to connect to the (now hung) KX and loops forever until the 60 second boot watchdog shuts the unit down into deep sleep.
  8. ...and here we go again!

Here is the log:

I (3494178) SOTAmat: Blinks 4 I (3497778) SOTAmat: Battery voltage: 3.928 V I (3497778) SOTAmat: Blinks 4 I (3501378) SOTAmat: Battery voltage: 3.930 V I (3501378) SOTAmat: Blinks 5 I (3501828) SOTAmat: Powering off due to inactivity I (3501828) SOTAmat: Entering deep sleep... I (3501828) wifi:state: run -> init (0) I (3501828) wifi:pm stop, total sleep time: 3079622303 us / 3500412134 us

I (3501828) wifi:idx I (3501838) wifi:idx I (3501838) wifi:new:<5,0>, old:<5,0>, ap:<255,255>, sta:<5,0>, prof:1 I (3501848) SOTAmat: WiFi event handler called with event_base: WIFI_EVENT, event_id: 5 I (3501858) SOTAmat: ============ Failed to connect to home WiFi, setting up stand-alone AP mode... I (3501858) SOTAmat: Initializing WiFi in AP mode ESP_ERROR_CHECK failed: esp_err_t 0x3014 (ESP_ERR_WIFI_STOP_STATE) at 0x4200324e file: "src/wifi.c" line 33 func: wifi_init_ap expression: esp_wifi_set_mode(WIFI_MODE_AP)

abort() was called at PC 0x403890a5 on core 0 Core 0 register dump: MEPC : 0x4038166a RA : 0x403890b0 SP : 0x3fc9fec0 GP : 0x3fc92e00 TP : 0x3fc5e3c8 T0 : 0x37363534 T1 : 0x7271706f T2 : 0x33323130 S0/FP : 0x00000004 S1 : 0x3fc9ff24 A0 : 0x3fc9feec A1 : 0x3fc9ff22 A2 : 0x00000000 A3 : 0x3fc9ff19 A4 : 0x00000001 A5 : 0x3fc99000 A6 : 0x7a797877 A7 : 0x76757473 S2 : 0x3fc92000 S3 : 0x3c0a148c S4 : 0xffffffff S5 : 0x00000000 S6 : 0xffffffff S7 : 0x3fca05a0 S8 : 0x00000000 S9 : 0x00000000 S10 : 0x00000000 S11 : 0x00000000 T3 : 0x6e6d6c6b T4 : 0x6a696867 T5 : 0x66656463 T6 : 0x62613938 MSTATUS : 0x00001881 MTVEC : 0x40380001 MCAUSE : 0x00000007 MTVAL : 0x00000000 MHARTID : 0x00000000

Stack memory: 3fc9fec0: 0x3c0a1e2c 0x42003252 0x3fc9ff20 0x4038fa92 0x00001800 0x00000018 0x3fca063c 0x3fc940f4 3fc9fee0: 0x3fc9ff24 0x3fc94110 0x3fc9ff20 0x726f6261 0x20292874 0x20736177 0x6c6c6163 0x61206465 3fc9ff00: 0x43502074 0x34783020 0x39383330 0x20356130 0x63206e6f 0x2065726f 0x00000030 0x00000000 3fc9ff20: 0x3fc90030 0x38333034 0x35613039 0x3c0a1400 0x3fc92000 0x3c0a148c 0x3fcacea4 0x403890a8 3fc9ff40: 0x3fc92000 0x3c0a148c 0x3fcacea4 0x42003252 0x4038c17c 0x4038c1a6 0x3fc9a11c 0x00000001 3fc9ff60: 0x00000003 0x41544f53 0x00746163 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fc9ff80: 0x00000000 0x34333231 0x38373635 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fc9ffa0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fc9ffc0: 0x00000000 0x00000107 0x00000004 0x00000400 0x00000000 0x00000000 0x00000000 0x00000000 3fc9ffe0: 0x00000000 0x00000000 0x00000000 0x00000000 0x3fc92000 0x3c0a148c 0x3fc92000 0x42003344 3fca0000: 0xffffffff 0x3fc9f2f4 0x00000000 0x00000004 0x00000000 0x3fc9f2f4 0x3fca022c 0x4209cdf4 3fca0020: 0x00000000 0x3fc9f2f4 0x00000000 0x4209d538 0x00000101 0x3c0c6734 0x00000005 0x3fca8868 3fca0040: 0x00000101 0x3c0c6734 0x00000005 0x3fca8868 0x00000000 0x00000000 0x00000000 0x00000000 3fca0060: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x3fc9f2f4 0x4209d66a 3fca0080: 0x00000000 0x00000000 0x00000000 0x4038c020 0x00000000 0x00000000 0x00000000 0x00000000 3fca00a0: 0x00000000 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 3fca00c0: 0x00000154 0x3fc9fdd0 0x3fc99314 0x3fc95b64 0x3fc95b64 0x3fca00c4 0x3fc95b5c 0x00000005 3fca00e0: 0x3fca0668 0x3fca0668 0x3fca00c4 0x00000000 0x00000014 0x3fc9f5c0 0x5f737973 0x00747665 3fca0100: 0x1f8a245a 0x00dbae4a 0x00000000 0x3fca00b0 0x00000014 0x00000001 0x3fca8794 0x4203fea0 3fca0120: 0x00000000 0x3fc99d00 0x3fc99d68 0x3fc99dd0 0x00000000 0x00000000 0x00000001 0x00000000 3fca0140: 0x00000000 0x00000000 0x420855f2 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fca0160: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fca0180: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fca01a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fca01c0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fca01e0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 3fca0200: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x3f000000 0x0000000c 0x00000000 3fca0220: 0x3fca022c 0x00000000 0x00000010 0x3c0c6734 0x3fca0240 0x3fca0480 0x3fca05a0 0x0000000c 3fca0240: 0x3fca0250 0x00000000 0x3fca0238 0x0000000c 0x4200329a 0x00000000 0x3fca0248 0x00000060 3fca0260: 0xf4ca4c40 0x000020f4 0x3fca02c4 0x3fca02d4 0x3fca02e4 0x4209a7f6 0x4209a720 0x3fca02e4 3fca0280: 0x00000000 0x00000000 0x3fca0470 0x4201b67c 0x4201b668 0x4201b654 0x00000001 0x00000000 3fca02a0: 0x00000000 0x00000000 0x00000001 0x00000099 0x00000000 0x3fca044c 0x3fca0460 0x00000064

ELF file SHA256: 1dc063cb57c822f4

Rebooting... ESP-ROM:esp32c3-api1-20210207 Build:Feb 7 2021 rst:0x3 (RTC_SW_SYS_RST),boot:0xd (SPI_FAST_FLASH_BOOT) Saved PC:0x4038121e SPIWP:0xee mode:DIO, clock div:1 load:0x3fcd5820,len:0x16a0 load:0x403cc710,len:0x968 load:0x403ce710,len:0x2e78 entry 0x403cc710 I (24) boot: ESP-IDF 5.0.2 2nd stage bootloader I (24) boot: compile time 20:25:58 I (24) boot: chip revision: v0.4 I (26) boot.esp32c3: SPI Speed : 80MHz I (30) boot.esp32c3: SPI Mode : DIO I (35) boot.esp32c3: SPI Flash Size : 4MB I (40) boot: Enabling RNG early entropy source... I (45) boot: Partition Table: I (49) boot: ## Label Usage Type ST Offset Length I (56) boot: 0 nvs WiFi data 01 02 00009000 00006000 I (64) boot: 1 phy_init RF data 01 01 0000f000 00001000 I (71) boot: 2 factory factory app 00 00 00010000 00100000 I (79) boot: End of partition table I (83) esp_image: segment 0: paddr=00010020 vaddr=3c0a0020 size=425b0h (271792) map I (132) esp_image: segment 1: paddr=000525d8 vaddr=3fc92600 size=025d4h ( 9684) load I (134) esp_image: segment 2: paddr=00054bb4 vaddr=40380000 size=0b464h ( 46180) load I (146) esp_image: segment 3: paddr=00060020 vaddr=42000020 size=9f0c4h (651460) map I (244) esp_image: segment 4: paddr=000ff0ec vaddr=4038b464 size=07168h ( 29032) load I (250) esp_image: segment 5: paddr=0010625c vaddr=50000000 size=00018h ( 24) load I (256) boot: Loaded app from partition at offset 0x10000 I (256) boot: Disabling RNG early entropy source... I (273) cpu_start: Pro cpu up. I (282) cpu_start: Pro cpu start user code I (282) cpu_start: cpu freq: 160000000 Hz I (282) cpu_start: Application information: I (285) cpu_start: Project name: SOTACATkx I (290) cpu_start: App version: 39def25-dirty I (295) cpu_start: Compile time: Feb 20 2024 20:25:36 I (302) cpu_start: ELF file SHA256: 1dc063cb57c822f4... I (308) cpu_start: ESP-IDF: 5.0.2 I (312) cpu_start: Min chip rev: v0.3 I (317) cpu_start: Max chip rev: v0.99 I (322) cpu_start: Chip rev: v0.4 I (327) heap_init: Initializing. RAM available for dynamic allocation: I (334) heap_init: At 3FC99300 len 00043410 (269 KiB): DRAM I (340) heap_init: At 3FCDC710 len 00002950 (10 KiB): STACK/DRAM I (347) heap_init: At 50000028 len 00001FD8 (7 KiB): RTCRAM I (354) spi_flash: detected chip: generic I (358) spi_flash: flash io: dio I (362) sleep: Configure to isolate all GPIO pins in sleep state I (368) sleep: Enable automatic switching of GPIO sleep configuration I (376) cpu_start: Starting scheduler. I (380) SOTAmat: app_main() started I (380) SOTAmat: Setup starting... I (380) SOTAmat: Current setup() task priority is 1 I (390) SOTAmat: Shutdown watchdog started... I (400) SOTAmat: NVS initialized I (400) gpio: GPIO[2]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 I (410) SOTAmat: ADC calibration scheme is: Curve Fitting I (410) SOTAmat: Calibration Success I (420) SOTAmat: ADC configured and ready to read.... I (420) SOTAmat: Initializing WiFi in STA mode I (430) pp: pp rom version: 9387209 I (430) net80211: net80211 rom version: 9387209 I (450) wifi:wifi driver task: 3fca2c38, prio:23, stack:6656, core=0 I (450) system_api: Base MAC address is not set I (450) system_api: read default base MAC address from EFUSE I (460) wifi:wifi firmware version: 57982fe I (460) wifi:wifi certification version: v7.0 I (460) wifi:config NVS flash: enabled I (470) wifi:config nano formating: disabled I (470) wifi:Init data frame dynamic rx buffer num: 32 I (480) wifi:Init management frame dynamic rx buffer num: 32 I (480) wifi:Init management short buffer num: 32 I (490) wifi:Init dynamic tx buffer num: 32 I (490) wifi:Init static tx FG buffer num: 2 I (490) wifi:Init static rx buffer size: 1600 I (500) wifi:Init static rx buffer num: 10 I (500) wifi:Init dynamic rx buffer num: 32 I (510) wifi_init: rx ba win: 6 I (510) wifi_init: tcpip mbox: 32 I (510) wifi_init: udp mbox: 6 I (520) wifi_init: tcp mbox: 6 I (520) wifi_init: tcp tx win: 5744 I (520) wifi_init: tcp rx win: 5744 I (530) wifi_init: tcp mss: 1440 I (530) wifi_init: WiFi IRAM OP enabled I (540) wifi_init: WiFi RX IRAM OP enabled I (550) phy_init: phy_version 950,11a46e9,Oct 21 2022,08:56:12 I (580) wifi:mode : sta (40:4c:ca:f4:f4:20) I (580) wifi:enable tsf I (580) SOTAmat: WiFi event handler called with event_base: WIFI_EVENT, event_id: 2 I (590) SOTAmat: ============ Connecting to WiFi... I (600) wifi:new:<5,0>, old:<1,0>, ap:<255,255>, sta:<5,0>, prof:1 I (1300) wifi:state: init -> auth (b0) I (1300) wifi:state: auth -> assoc (0) I (1310) wifi:state: assoc -> run (10) I (1320) wifi:connected with whipitgood, aid = 33, channel 5, BW20, bssid = a0:36:bc:b6:e4:88 I (1330) wifi:security: WPA2-PSK, phy: bgn, rssi: -38 I (1340) wifi:pm start, type: 1

I (1340) wifi:set rx beacon pti, rx_bcn_pti: 0, bcn_timeout: 0, mt_pti: 25000, mt_time: 10000 I (1340) SOTAmat: WiFi event handler called with event_base: WIFI_EVENT, event_id: 4 I (1350) SOTAmat: ============ Success connecting to existing WiFi network... I (1420) wifi:AP's beacon interval = 102400 us, DTIM period = 1 I (1450) wifi:idx:0 (ifx:0, a0:36:bc:b6:e4:88), tid:0, ssn:0, winSize:64 I (1790) SOTAmat: WiFi initialization complete, returning. I (1790) SOTAmat: WiFi initialized I (1800) SOTAmat: mDNS initialized I (1800) SOTAmat: Completed defining Webserver callbacks I (1800) SOTAmat: Web Server initialized E (2240) wifi:CCMP replay detected: A1=40:4c:ca:f4:f4:20 A2=a0:36:bc:b6:e4:88 PN=9, RSC=10 seq=15920 I (2240) wifi:idx:1 (ifx:0, a0:36:bc:b6:e4:88), tid:6, ssn:4, winSize:64 I (2360) esp_netif_handlers: sta ip: 192.168.50.51, mask: 255.255.255.0, gw: 192.168.50.1 I (2900) SOTAmat: No response received for baud rate 4800 I (3400) SOTAmat: No response received for baud rate 9600 I (3900) SOTAmat: No response received for baud rate 19200 I (4400) SOTAmat: No response received for baud rate 38400 I (4900) SOTAmat: No response received for baud rate 4800 I (5400) SOTAmat: No response received for baud rate 9600 I (5900) SOTAmat: No response received for baud rate 19200 I (6400) SOTAmat: No response received for baud rate 38400

jeffkowalski commented 4 months ago

Perhaps we need to wait for WIFIEVENT[STA/AP]_STOP event after calling esp_wifi_stop(), to know it's really stopped?

brianmathews commented 4 months ago

I don't know. I have different ESP32 models in my prototypes that have very different behaviors in shutting down. In fact some of the ESP-IDF shutdown routines crash on this XAIO ESP32C3. The C3 is based on a different MCU with a different machine language and hardware than other ESP32's. In other words, I am not an expert in shutdown. I am surprised about the ability to wake as my understanding is that a wake is dependent on setting a trigger such as a clock/timer, or a GPIO pin transition, etc. and I haven't configured any triggers (unless there is a default?).

jeffkowalski commented 4 months ago

I don't think it's waking. I think that during the sleep process, various listeners are notified, including our wifi_event_handler, which may get a WIFIEVENT[STA/AP]_STOP. The handler would be confused. Just conjecturing. I'll dig in.

brianmathews commented 4 months ago

You are correct. The crash happens at shutdown, not later. It just looks that way in the sequence, but in reality the device isn't shut down.

The log shows timestamps in 1/1000ths of seconds (milliseconds):

jeffkowalski commented 4 months ago

Interestingly, the bug doesn't seem to occur if the device is running in AP mode rather than STA mode. If I cause it to fail to connect to my home Wifi, and allow it to broadcast an AP, then without connecting it to KX I see it correctly times out and goes into deep sleep:

...
I (60417) SOTAmat: Startup watchdog timer expired.  Shutting down.
I (60417) SOTAmat: Entering deep sleep...
I (60417) wifi:station: 5a:ce:f7:4b:3c:51 leave, AID = 1, bss_flags is 658547, bss:0x3fcaeb6c
I (60417) wifi:new:<1,0>, old:<1,0>, ap:<1,1>, sta:<255,255>, prof:1
I (60427) wifi:<ba-del>idx:2, tid:0
I (60427) wifi:<ba-del>idx:3, tid:7
I (60437) SOTAmat: WiFi event handler called with event_base: WIFI_EVENT, event_id: 15
I (60447) SOTAmat: WiFi event handler called with event_base: WIFI_EVENT, event_id: 13
W (60457) httpd_txrx: httpd_sock_err: error in recv : 113
W (60457) httpd_txrx: httpd_sock_err: error in recv : 113
W (60467) httpd_txrx: httpd_sock_err: error in recv : 113
I (60487) wifi:flush txq
I (60487) wifi:stop sw txq
I (60487) wifi:lmac stop hw txq
I (60487) SOTAmat: Wifi stopped...
I (60487) SOTAmat: ADC deregister Curve Fitting calibration scheme
I (60487) SOTAmat: ADC shutdown...
I (60497) SOTAmat: GPIO's off and isolated...
I (60497) SOTAmat: Entering Deep Sleep...
I (60507) SOTAmat: Goodnight!
Disconnected (device reports readiness to read but returned no data (device disconnected or multiple access on port?))
Reconnecting to /dev/ttyACM0 ...........
brianmathews commented 4 months ago

Interesting. I think in the error case the WiFi system we asked to shut down first does a network disconnect, which in turn might call our prior WiFi event handler, and might be trying to pass us a "disconnected from router" event. We might need to disconnect our WiFi event handler after we connect (at boot time) or we may need it to accept a "disconnect" event which it isn't coded to handle. Not sure.