arendst / Tasmota

Alternative firmware for ESP8266 and ESP32 based devices with easy configuration using webUI, OTA updates, automation using timers or rules, expandability and entirely local control over MQTT, HTTP, Serial or KNX. Full documentation at
https://tasmota.github.io/docs
GNU General Public License v3.0
21.97k stars 4.77k forks source link

Devices with energy monitor: Tasmota boot loops on powerup UNLESS web client is attempting to connect at boot, the webserver is disabled, or the energy monitoring connections are removed from template #21094

Closed jgow closed 5 months ago

jgow commented 5 months ago

PROBLEM DESCRIPTION

This issue affects Knightsbridge CU9KW UK wall sockets with embedded energy monitoring and the template specified in this report.

It is important to note that if the energy monitoring is disabled (by disconnecting the three GPIO lines from 'BL0937 CF', 'HLWBL SEL_i and HLWBL_CF1 and specifying them as 'none'), the issue does NOT occur and the device functions normally, albeit with no energy monitoring. The issue ONLY arises when the energy monitoring is enabled by connecting it in the template.

It took quite some time and effort to drill down in to the exact circumstances that triggered this issue, so I have tried to be brief here but there is considerably more information below in 'Additional Context' that may or may not be helpful in tracking this down:

If energy monitoring is connected AND the webserver is enabled on the device (WebServer 1 or 2), the device will not complete its boot, and restarts before connecting to the MQTT host UNLESS a web client is attempting to connect to the web UI at the time of boot - in which case the boot completes and the device becomes operational.

If EITHER the lines associated with energy monitoring are disconnected OR the web server is disabled (WebServer 0), boot continues normally, MQTT connects and the device functions normally.

Once the device has been encouraged to complete its boot sequence successfully, it functions normally and all functions work (including the energy monitoring if enabled)

There seems to be some interaction between the energy monitoring and the web server that causes a boot issue.

REQUESTED INFORMATION

Make sure your have performed every step and checked the applicable boxes before submitting your issue. Thank you!

- [X] If using rules, provide the output of this command: `Backlog Rule1; Rule2; Rule3`:
```lua
  No rules are used.
- [X] Set `weblog` to 4 and then, when you experience your issue, provide the output of the Console log:
```lua
2024-04-04T10:06:38.520898+01:00 PS1-LAB-4443 ESP-HDW: ESP8266EX
2024-04-04T10:06:38.522729+01:00 PS1-LAB-4443 ESP-CFG: Loaded from flash at F5, Count 235
2024-04-04T10:06:38.524157+01:00 PS1-LAB-4443 ESP-QPC: Count 1
2024-04-04T10:06:38.525534+01:00 PS1-LAB-4443 ESP-CFG: CR 370/699, Busy 0
2024-04-04T10:06:38.526949+01:00 PS1-LAB-4443 ESP-TYA: Active=0
2024-04-04T10:06:38.528456+01:00 PS1-LAB-4443 ESP-ROT: Mode 1
2024-04-04T10:06:38.531185+01:00 PS1-LAB-4443 ESP-NRG: Init driver 1
2024-04-04T10:06:38.532668+01:00 PS1-LAB-4443 ESP-SRC: Restart
2024-04-04T10:06:38.534050+01:00 PS1-LAB-4443 ESP-Project PS - PS1-LAB Version 13.4.0.4(tasmota)-2_7_6(2024-04-03T17:41:14)
2024-04-04T10:06:48.501041+01:00 PS1-LAB-4443 ESP-HDW: ESP8266EX
2024-04-04T10:06:48.505318+01:00 PS1-LAB-4443 ESP-CFG: Loaded from flash at F5, Count 235
2024-04-04T10:06:48.505916+01:00 PS1-LAB-4443 ESP-CFG: CR 370/699, Busy 0
2024-04-04T10:06:48.507254+01:00 PS1-LAB-4443 ESP-TYA: Active=0
2024-04-04T10:06:48.508659+01:00 PS1-LAB-4443 ESP-ROT: Mode 1
2024-04-04T10:06:48.510066+01:00 PS1-LAB-4443 ESP-NRG: Init driver 1
2024-04-04T10:06:48.512239+01:00 PS1-LAB-4443 ESP-SRC: Restart
2024-04-04T10:06:48.513608+01:00 PS1-LAB-4443 ESP-Project PS - PS1-LAB Version 13.4.0.4(tasmota)-2_7_6(2024-04-03T17:41:14)
2024-04-04T10:06:48.515003+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:48.516502+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:48.517844+01:00 PS1-LAB-4443 ESP-WIF: Network (re)scan started...
2024-04-04T10:06:48.519934+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:48.521514+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:48.522134+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:48.524901+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:48.525625+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:48.527001+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:48.528435+01:00 PS1-LAB-4443 ESP-WIF: Network 0, AP-, SSId <ssid1>, Channel 3, BSSId <id>, RSSI -52, Encryption 1
2024-04-04T10:06:48.529893+01:00 PS1-LAB-4443 ESP-WIF: Network 1, AP1, SSId <ssid2>, Channel 3, BSSId <id>, RSSI -52, Encryption 1
2024-04-04T10:06:48.534139+01:00 PS1-LAB-4443 ESP-WIF: Connecting to AP1 <ssid2> Channel 3 BSSId <id> in mode 11n as PS1-LAB-4443...
2024-04-04T10:06:48.534742+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:48.536237+01:00 PS1-LAB-4443 ESP-WIF: Connected
2024-04-04T10:06:48.540465+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:06:48.751926+01:00 PS1-LAB-4443 ESP-HTP: Web server active on PS1-LAB-4443 with IP address <ip-local>
2024-04-04T10:06:49.323301+01:00 PS1-LAB-4443 ESP-WIF: Sending Gratuitous ARP
2024-04-04T10:06:49.324666+01:00 PS1-LAB-4443 ESP-NTP: Sync time...
2024-04-04T10:06:49.326084+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:06:49.326774+01:00 PS1-LAB-4443 ESP-RTC: UTC 2024-04-04T09:06:49Z, DST 2024-03-31T02:00:00, STD 2024-10-27T03:00:00
2024-04-04T10:06:49.327523+01:00 PS1-LAB-4443 ESP-RTC: Synced by NTP
2024-04-04T10:06:58.463789+01:00 PS1-LAB-4443 ESP-HDW: ESP8266EX
2024-04-04T10:06:58.468600+01:00 PS1-LAB-4443 ESP-CFG: Loaded from flash at F5, Count 235
2024-04-04T10:06:58.470016+01:00 PS1-LAB-4443 ESP-CFG: CR 370/699, Busy 0
2024-04-04T10:06:58.471611+01:00 PS1-LAB-4443 ESP-TYA: Active=0
2024-04-04T10:06:58.472896+01:00 PS1-LAB-4443 ESP-ROT: Mode 1
2024-04-04T10:06:58.474375+01:00 PS1-LAB-4443 ESP-NRG: Init driver 1
2024-04-04T10:06:58.475100+01:00 PS1-LAB-4443 ESP-SRC: Restart
2024-04-04T10:06:58.476421+01:00 PS1-LAB-4443 ESP-Project PS - PS1-LAB Version 13.4.0.4(tasmota)-2_7_6(2024-04-03T17:41:14)
2024-04-04T10:06:58.477782+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:58.479180+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:58.480578+01:00 PS1-LAB-4443 ESP-WIF: Network (re)scan started...
2024-04-04T10:06:58.481977+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:58.483368+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:58.485620+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:58.486962+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:58.489103+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:58.490528+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:06:58.491872+01:00 PS1-LAB-4443 ESP-WIF: Network 0, AP-, SSId <ssid1>, Channel 3, BSSId <id>, RSSI -53, Encryption 1
2024-04-04T10:06:58.493394+01:00 PS1-LAB-4443 ESP-WIF: Network 1, AP1, SSId <ssid2>, Channel 3, BSSId <id>, RSSI -52, Encryption 1
2024-04-04T10:06:58.494750+01:00 PS1-LAB-4443 ESP-WIF: Connecting to AP1 <ssid2> Channel 3 BSSId 86:CC:9C:47:66:07 in mode 11n as PS1-LAB-4443...
2024-04-04T10:06:58.496862+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:06:58.498277+01:00 PS1-LAB-4443 ESP-WIF: Connected
2024-04-04T10:06:58.500380+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:06:58.715299+01:00 PS1-LAB-4443 ESP-HTP: Web server active on PS1-LAB-4443 with IP address <ip-local>
2024-04-04T10:06:59.280409+01:00 PS1-LAB-4443 ESP-WIF: Sending Gratuitous ARP
2024-04-04T10:06:59.281823+01:00 PS1-LAB-4443 ESP-NTP: Sync time...
2024-04-04T10:06:59.284067+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:06:59.285430+01:00 PS1-LAB-4443 ESP-RTC: UTC 2024-04-04T09:06:59Z, DST 2024-03-31T02:00:00, STD 2024-10-27T03:00:00
2024-04-04T10:06:59.290350+01:00 PS1-LAB-4443 ESP-RTC: Synced by NTP
.....repeats from start of this fragment.

TO REPRODUCE

Configure device with template {"NAME":"Knightsbridge Dual Socket","GPIO":[321,544,320,544,225,2720,1,1,2624,193,2656,224,192,1],"FLAG":0,"BASE":18}

Device will not fully boot and repeatedly restart if webserver is enabled AND a web client is not attempting to connect during the boot. If webserver is disabled (WebServer 0), device boots normally and functions normally If webserver is enabled (WebServer 1), device does not fully boot (keeps restarting) unless EITHER:

EXPECTED BEHAVIOUR

Device should power up, complete boot and connect to MQTT irrespective of whether the webserver is enabled and/or a web client is attempting connection at the time of boot.

SCREENSHOTS

If applicable, add screenshots to help explain your problem.

ADDITIONAL CONTEXT

I use a version of Tasmota compiled from source, as my MQTT infrastructure uses TLS with a private CA and it is necessary to include the CA certificate in the build. I have included the user_config_override.h at the end of this section.

The issue only seems to affect the Knightsbridge devices with energy monitoring enabled. I have other Tasmota-based devices using exactly the same binaries and these work with no issues at all.

The initial problem manifested as the devices failing to connect to the MQTT broker every time (without exception) there was a power cut or a significant brownout. If I subsequently connected to the device via the web UI, the device would then connect to the MQTT broker but its configuration had reverted to a Sonoff Basic.

Reenabling the configuration and rebooting via the web UI seemed to work. I then recompiled the source but with FALLBACK_MODULE set to the custom module, ensuring that the configuration was not lost on boot. However the devices seemed to only start working and connecting to the MQTT broker when the device web UI was connected to via a web browser. Switching off the boot loop detection (SetOption36 0) and enabling syslog finally identified that the device was not completing its boot cycle and continually resetting. The syslog dump given above shows the output when the issue is experienced.

After some experimentation I discovered that the device would only successfully complete its boot sequence if a web client was attempting to connect to the UI exactly at the time of boot. The syslog dump below shows a successful boot sequence when a web client is connected at time of boot:

2024-04-04T10:14:49.119453+01:00 PS1-LAB-4443 ESP-HDW: ESP8266EX
2024-04-04T10:14:49.120899+01:00 PS1-LAB-4443 ESP-CFG: Loaded from flash at FB, Count 237
2024-04-04T10:14:49.122229+01:00 PS1-LAB-4443 ESP-QPC: Count 1
2024-04-04T10:14:49.123590+01:00 PS1-LAB-4443 ESP-CFG: CR 370/699, Busy 0
2024-04-04T10:14:49.125014+01:00 PS1-LAB-4443 ESP-TYA: Active=0
2024-04-04T10:14:49.126410+01:00 PS1-LAB-4443 ESP-ROT: Mode 1
2024-04-04T10:14:49.127793+01:00 PS1-LAB-4443 ESP-NRG: Init driver 1
2024-04-04T10:14:49.129263+01:00 PS1-LAB-4443 ESP-SRC: Restart
2024-04-04T10:14:49.131320+01:00 PS1-LAB-4443 ESP-Project PS - PS1-LAB Version 13.4.0.4(tasmota)-2_7_6(2024-04-03T17:41:14)
2024-04-04T10:14:49.132786+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:14:49.134864+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:14:49.136325+01:00 PS1-LAB-4443 ESP-WIF: Network (re)scan started...
2024-04-04T10:14:49.136967+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:14:49.138366+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:14:49.139781+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:14:49.144046+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:14:49.146204+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:14:49.146861+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:14:49.148397+01:00 PS1-LAB-4443 ESP-WIF: Network 0, AP-, SSId <ssid1>, Channel 3, BSSId <id>, RSSI -54, Encryption 1
2024-04-04T10:14:49.150345+01:00 PS1-LAB-4443 ESP-WIF: Network 1, AP1, SSId <ssid2>, Channel 3, BSSId <id>, RSSI -53, Encryption 1
2024-04-04T10:14:49.151778+01:00 PS1-LAB-4443 ESP-WIF: Connecting to AP1 <ssid2> Channel 3 BSSId <id> in mode 11n as PS1-LAB-4443...
2024-04-04T10:14:49.153160+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:14:49.155262+01:00 PS1-LAB-4443 ESP-WIF: Connected
2024-04-04T10:14:49.156822+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:14:49.376729+01:00 PS1-LAB-4443 ESP-HTP: Web server active on PS1-LAB-4443 with IP address <ip-local>
2024-04-04T10:14:49.645836+01:00 PS1-LAB-4443 ESP-HTP: Main Menu
2024-04-04T10:14:50.005128+01:00 PS1-LAB-4443 ESP-QPC: Reset
2024-04-04T10:14:50.007229+01:00 PS1-LAB-4443 ESP-WIF: Sending Gratuitous ARP
2024-04-04T10:14:50.008625+01:00 PS1-LAB-4443 ESP-NTP: Sync time...
2024-04-04T10:14:50.012892+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:14:50.013560+01:00 PS1-LAB-4443 ESP-RTC: UTC 2024-04-04T09:14:49Z, DST 2024-03-31T02:00:00, STD 2024-10-27T03:00:00
2024-04-04T10:14:50.014977+01:00 PS1-LAB-4443 ESP-RTC: Synced by NTP
2024-04-04T10:14:51.845143+01:00 PS1-LAB-4443 ESP-MQT: Attempting connection...
2024-04-04T10:14:51.847206+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:14:51.848585+01:00 PS1-LAB-4443 ESP-MQT: TLS connected in 1213 ms, max ThunkStack used 5284
2024-04-04T10:14:51.849996+01:00 PS1-LAB-4443 ESP-MQT: Connected
2024-04-04T10:14:51.852304+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/LWT = Online (retained)
2024-04-04T10:14:51.853514+01:00 PS1-LAB-4443 ESP-MQT: cmnd/PS1-LAB/POWER = 
2024-04-04T10:14:51.855712+01:00 PS1-LAB-4443 ESP-MQT: Subscribe to cmnd/PS1-LAB/#
2024-04-04T10:14:51.857784+01:00 PS1-LAB-4443 ESP-MQT: Subscribe to cmnd/tasmotas/#
2024-04-04T10:14:51.859917+01:00 PS1-LAB-4443 ESP-MQT: Subscribe to cmnd/DVES_25B15B_fb/#
2024-04-04T10:14:51.861303+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO1 = {"Info1":{"Module":"Knightsbridge Dual Socket","Version":"13.4.0.4(tasmota)","FallbackTopic":"cmnd/DVES_25B15B_fb/","GroupTopic":"cmnd/tasmotas/"}}
2024-04-04T10:14:51.863410+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO2 = {"Info2":{"WebServerMode":"Admin","Hostname":"PS1-LAB-4443","IPAddress":"<local-ip>"}}
2024-04-04T10:14:51.864885+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":"Software/System restart","BootCount":45}}
2024-04-04T10:14:51.866951+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/RESULT = {"POWER1":"OFF"}
2024-04-04T10:14:51.868502+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/POWER1 = OFF
2024-04-04T10:14:51.869862+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/RESULT = {"POWER2":"OFF"}
2024-04-04T10:14:51.872000+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/POWER2 = OFF
2024-04-04T10:14:51.878937+01:00 PS1-LAB-4443 ESP-APP: Boot Count 45
2024-04-04T10:14:52.304683+01:00 PS1-LAB-4443 ESP-CFG: Saved to flash at FA, Count 238, Bytes 4096
.........normal operation follows.

Similarly, if I disable the web server (WebServer 0), the boot proceeds normally and the device works as expected. The syslog dump below shows this:

2024-04-04T10:22:38.769614+01:00 PS1-LAB-4443 ESP-HDW: ESP8266EX
2024-04-04T10:22:38.777462+01:00 PS1-LAB-4443 ESP-CFG: Loaded from flash at F9, Count 239
2024-04-04T10:22:38.778790+01:00 PS1-LAB-4443 ESP-QPC: Count 1
2024-04-04T10:22:38.780244+01:00 PS1-LAB-4443 ESP-CFG: CR 370/699, Busy 0
2024-04-04T10:22:38.780886+01:00 PS1-LAB-4443 ESP-TYA: Active=0
2024-04-04T10:22:38.781587+01:00 PS1-LAB-4443 ESP-ROT: Mode 1
2024-04-04T10:22:38.783006+01:00 PS1-LAB-4443 ESP-NRG: Init driver 1
2024-04-04T10:22:38.784409+01:00 PS1-LAB-4443 ESP-SRC: Restart
2024-04-04T10:22:38.785818+01:00 PS1-LAB-4443 ESP-Project PS - PS1-LAB Version 13.4.0.4(tasmota)-2_7_6(2024-04-03T17:41:14)
2024-04-04T10:22:38.786548+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:22:38.787961+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:22:38.789382+01:00 PS1-LAB-4443 ESP-WIF: Network (re)scan started...
2024-04-04T10:22:38.790837+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:22:38.792291+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:22:38.796483+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:22:38.797096+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:22:38.798592+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:22:38.799927+01:00 PS1-LAB-4443 ESP-WIF: Attempting connection...
2024-04-04T10:22:38.802135+01:00 PS1-LAB-4443 ESP-WIF: Network 0, AP-, SSId <ssid1>, Channel 3, BSSId <id>, RSSI -52, Encryption 1
2024-04-04T10:22:38.803457+01:00 PS1-LAB-4443 ESP-WIF: Network 1, AP1, SSId <ssid2>, Channel 3, BSSId <id>, RSSI -54, Encryption 1
2024-04-04T10:22:38.805636+01:00 PS1-LAB-4443 ESP-WIF: Connecting to AP1 WF2-PRINTERNET Channel 3 BSSId 86:CC:9C:47:66:07 in mode 11n as PS1-LAB-4443...
2024-04-04T10:22:38.807038+01:00 PS1-LAB-4443 ESP-WIF: Checking connection...
2024-04-04T10:22:38.807669+01:00 PS1-LAB-4443 ESP-WIF: Connected
2024-04-04T10:22:38.809029+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:22:39.721622+01:00 PS1-LAB-4443 ESP-QPC: Reset
2024-04-04T10:22:39.723744+01:00 PS1-LAB-4443 ESP-WIF: Sending Gratuitous ARP
2024-04-04T10:22:39.725177+01:00 PS1-LAB-4443 ESP-NTP: Sync time...
2024-04-04T10:22:39.726660+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:22:39.728845+01:00 PS1-LAB-4443 ESP-RTC: UTC 2024-04-04T09:22:39Z, DST 2024-03-31T02:00:00, STD 2024-10-27T03:00:00
2024-04-04T10:22:39.730169+01:00 PS1-LAB-4443 ESP-RTC: Synced by NTP
2024-04-04T10:22:41.508746+01:00 PS1-LAB-4443 ESP-MQT: Attempting connection...
2024-04-04T10:22:41.510807+01:00 PS1-LAB-4443 ESP-WIF: DNS resolved '<ip>' (<ip>) in 0 ms
2024-04-04T10:22:41.512312+01:00 PS1-LAB-4443 ESP-MQT: TLS connected in 1219 ms, max ThunkStack used 5284
2024-04-04T10:22:41.513674+01:00 PS1-LAB-4443 ESP-MQT: Connected
2024-04-04T10:22:41.515024+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/LWT = Online (retained)
2024-04-04T10:22:41.517291+01:00 PS1-LAB-4443 ESP-MQT: cmnd/PS1-LAB/POWER = 
2024-04-04T10:22:41.518680+01:00 PS1-LAB-4443 ESP-MQT: Subscribe to cmnd/PS1-LAB/#
2024-04-04T10:22:41.520814+01:00 PS1-LAB-4443 ESP-MQT: Subscribe to cmnd/tasmotas/#
2024-04-04T10:22:41.522173+01:00 PS1-LAB-4443 ESP-MQT: Subscribe to cmnd/DVES_25B15B_fb/#
2024-04-04T10:22:41.524265+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO1 = {"Info1":{"Module":"Knightsbridge Dual Socket","Version":"13.4.0.4(tasmota)","FallbackTopic":"cmnd/DVES_25B15B_fb/","GroupTopic":"cmnd/tasmotas/"}}
2024-04-04T10:22:41.525564+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":"Software/System restart","BootCount":46}}
2024-04-04T10:22:41.527650+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/RESULT = {"POWER1":"OFF"}
2024-04-04T10:22:41.529826+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/POWER1 = OFF
2024-04-04T10:22:41.531316+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/RESULT = {"POWER2":"OFF"}
2024-04-04T10:22:41.532727+01:00 PS1-LAB-4443 ESP-MQT: stat/PS1-LAB/POWER2 = OFF
2024-04-04T10:22:41.589859+01:00 PS1-LAB-4443 ESP-APP: Boot Count 46
2024-04-04T10:22:41.889255+01:00 PS1-LAB-4443 ESP-CFG: Saved to flash at F8, Count 240, Bytes 4096
.......normal operation follows.

If I disable the energy monitoring by removing it completely from the template, then with the web server enabled (WebServer 2) the device boots normally, even if there is no web client attempting to connect at the time of boot.

Relevant parts of my user_config_override.h (sanitized, comment lines removed to save space):

#undef PROJECT
#define PROJECT         "PS"         // PROJECT is used as the default topic delimiter

#define MODULE          USER_MODULE
#define USER_TEMPLATE   "{\"NAME\":\"Knightsbridge Dual Socket\",\"GPIO\":[321,544,320,544,225,2720,1,1,2624,193,2656,224,192,1],\"FLAG\":0,\"BASE\":18}"
#ifdef FALLBACK_MODULE
#undef FALLBACK_MODULE
#endif
#define FALLBACK_MODULE USER_MODULE

#undef USE_MQTT_TLS
#ifndef FIRMWARE_MINIMAL
#ifndef USE_MQTT_TLS
    #define USE_MQTT_TLS
    #define USE_MQTT_TLS_CA_CERT              
    #define USE_MQTT_AWS_IOT 
    #undef  MQTT_TLS_ENABLED
    #define MQTT_TLS_ENABLED       true
    #define INCLUDE_LOCAL_CERT
    #define OMIT_LETS_ENCRYPT_CERT
    #define OMIT_AWS_CERT
    #define USE_MQTT_TLS_FORCE_EC_CIPHER 
#endif
#endif

#undef  STA_SSID1
#define STA_SSID1         "<ssid-2>" 

#undef  STA_PASS1
#define STA_PASS1         "<password>"     

#undef  MQTT_HOST
#define MQTT_HOST         "<ip>" 

#undef  MQTT_PORT
#define MQTT_PORT         8883

#undef  NTP_SERVER1
#undef  NTP_SERVER2
#undef  NTP_SERVER3

#define NTP_SERVER1      "<ntp-ip>"   
#define NTP_SERVER2      "<ntp-ip>"
#define NTP_SERVER3      "<ntp-ip>"

#undef USE_DOMOTICZ
#undef USE_EMULATION_HUE
#undef USE_EMULATION_WEMO
#undef USE_SONOFF_RF
#undef USE_SHELLY_DIMMER
#undef USE_I2C
#undef USE_TELEGRAM

#undef BOOT_LOOP_OFFSET
#define BOOT_LOOP_OFFSET 12

(Please, remember to close the issue when the problem has been addressed)

arendst commented 5 months ago

Doing TLS on ESP8266 is a challenge due to lack of RAM. I suggest you enable heap logging (SO130 1) and watch it decline during boot.

I noticed you're having Tuya in the mix. If not needed I suggest to remove it to get some more free RAM.

EDIT: Having both GPIO1 and GPIO3 configured as LedLink doesn't help either....

jgow commented 5 months ago

Ok - in the back of my mind was a possible memory issue but was distracted from this by the fact that all my other ESP8266-based devices were fully working with the same binary...

arendst commented 5 months ago

You're right regarding the non heap logging in syslog.

The exception 2 is worrying though as it shouldn't be there. May have many causes but this is a known one: Occurs when out of stack (chk with debug and FreeMem 1).

Stack space is just max 4k and depending on function deep calls it is used up quitte soon.

Pls try to find out if you always get the same exception 2 with the same call stack (it differs on every build you make so keep the same build and the accompanied map file for debugging) en see what happens at the call chain addresses.

jgow commented 5 months ago

Interestingly I could not repeat the exception 2. However, on the next successful boot after the failure an exception 28 is repeatable. The call trace is very similar but not exactly the same:

2024-04-04T15:35:17.238523+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"0000036e","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40257ebd","40257ea8","4028b7d0","401011c5","40258487","4025837b","4028b6b4","402584b4"]},"BootCount":65}}

tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"0d194b4d","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4028dfbc","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","402580ed","40257ea8","4028b7d0","401011c5","40258487","4025837b","4028b6b4"]},"BootCount":66}}
15:28:51.160-021 MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"00002923","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40258105","40257ea8","4028b770","4015b4a0","4028b7d0","40258487","4025837b","4028b6b4"]},"BootCount":61}}

tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"000016fb","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40258074","40257ea8","4028b7d0","40258487","4025837b","4028b6b4","402584b4","4028b6b4"]},"BootCount":63}}

tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"01dd6f3d","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40257eca","40257ea8","4028b7d0","40258487","4025837b","4028b6b4","402584b4","4028b6b4"]},"BootCount":64}}

Looking at the call trace and map file, 0x4024d9ec corresponds to tcp_process_refused_data in liblwip2-1460.a. At the other end 0x4028b6b4 corresponds to br_sha256_vtable, within libbearssl. I will keep looking but I am unfamiliar with bearssl internals.

jgow commented 5 months ago

It seems fairly consistent across builds. The exception occurs when an interrupt (possibly a timer in the TCP stack as the call sequence is lwip_cyclic_timers -> tcp_tmr -> tcp_fasttmr -> tcp_process_refused_data) occurs during the execution of one or more functions in sha2small.c, usually br_sha224_update or br_sha256_vtable.

I can't trace this back further in brssl thanks to the vtables used therein, and there is no way I can use any form of serial debugging (this is a nonisolated mains powered device - a wall socket - and I currently do not have an isolated serial interface available to me). However I think the exception may be in the initial MQTT connection code, as if I look in the broker logs in the fault condition, there is an attempt to initiate the connection but the handshake is never completed - this also seems to be consistent with the point in the boot sequence that the device restarts.

Have tried a ham-fisted approach by disabling interrupts around the functions in sha2small.c, but the exception just occurs earlier.

I suspect, but can't confirm, that if stack space is tight then this interrupt in the middle of some TLS processing may just be enough.

This is about as far as I can get. Disablling the web server certainly prevents the issue from occurring, so I suspect that the interrupt is somehow related to the web server.

arendst commented 5 months ago

You might want to try increase TLS stackspace in file \lib\lib_ssl\tls_mini\src\StackThunk_light.cpp line 43 and up.

//#define _stackSize (5600/4)
#if defined(USE_MQTT_AWS_IOT) || defined(USE_MQTT_AWS_IOT_LIGHT) || defined(USE_MQTT_AZURE_IOT)
  #define _stackSize (5300/4)   // using a light version of bearssl we can save 300 bytes
#else
  #define _stackSize (4800/4)   // no private key, we can reduce a little, max observed 4300
#endif

As you enabled USE_MQTT_AWS_IOT try to change 5300/4 into 5600/4 or 5800/4

jgow commented 5 months ago

That fixed it - setting the stack size to 5600/4. I will update the other sockets to this build and allow the devices to run for a while to check for any unintended consequences of the increased stack size, then close issue closed if all is well.

I will just have to remember to change this in the build when I upgrade the firmware on these devices in future (unless this value could be made user-configurable?).

Can I check that I really need to enable USE_MQTT_AWS_IOT in order just to include the private CA? I only need to include my private CA certificate; I do not care about AWS or LetsEncrypt. The documentation seems to suggest that I do, and if I disable the setting the device will not connect so I assume this is correct?

Out of curiosity, as the problem occurred when an interrupt clearly overran the stack, as BearSSL has its own stack it would appear that the interrupt handlers do not switch the stack back to the core before servicing the interrupt - is that correct? Thus, it would appear that the BearSSL stack must be larger than is strictly needed by BearSSL in order to accommodate any interrupts that may come in during SSL processing - and this increase would be somewhat determined by the handlers implemented elsewhere in the firmware. Is this additional size requirement known?

jgow commented 5 months ago

I spoke too soon - using 5600/4 improved the situation but the device still failed on some boots. I had to increase this to 5800/4 to get stable operation and the fault no longer appears on reboots.

I will close this issue now as I can build a version of Tasmota that is stable and appears to work reliably with these devices.