Closed jgow closed 5 months ago
Doing TLS on an ESP8266 is a challenge due to the lack of RAM. I suggest you enable heap logging (SO130 1) and watch it decline during boot.
I noticed you have Tuya in the mix. If it is not needed, I suggest removing it to free up some RAM.
EDIT: Having both GPIO1 and GPIO3 configured as LedLink doesn't help either...
Ok - in the back of my mind was a possible memory issue, but I was distracted from this by the fact that all my other ESP8266-based devices were fully working with the same binary...
00:00:00.001 HDW: ESP8266EX
00:00:00.050-033 CFG: Loaded from flash at F7, Count 269
00:00:00.062-027 NRG: Init driver 1
00:00:00.065-027 Project PS - PS1-LAB Version 13.4.0.4(tasmota)-2_7_6(2024-04-04T13:56:50)
00:00:04.001-027 WIF: Connecting to AP1 <ssid> Channel 3 BSSId <id> in mode 11n as PS1-LAB-4443...
00:00:05.533-026 WIF: Connected
00:00:05.783-024 HTP: Web server active on PS1-LAB-4443 with IP address <local-ip>
14:05:58.445-023 MQT: Attempting connection...
14:05:59.958-022 MQT: TLS connected in 1209 ms, max ThunkStack used 5236
14:05:59.959-022 MQT: Connected
14:06:00.016-022 MQT: tele/PS1-LAB/LWT = Online (retained)
14:06:00.025-022 MQT: cmnd/PS1-LAB/POWER =
14:06:00.110-022 MQT: tele/PS1-LAB/INFO1 = {"Info1":{"Module":"Knightsbridge Dual Socket","Version":"13.4.0.4(tasmota)","FallbackTopic":"cmnd/DVES_25B15B_fb/","GroupTopic":"cmnd/tasmotas/"}}
14:06:00.131-022 MQT: tele/PS1-LAB/INFO2 = {"Info2":{"WebServerMode":"Admin","Hostname":"PS1-LAB-4443","IPAddress":"<ip>"}}
14:06:00.160-022 MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":2,"Reason":"Exception","EPC":["3fff662c","00000000","00000000"],"EXCVADDR":"3fff662c","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40258215","40257ea8","4028b7d0","401011c5","40258487","4025837b","4028b6b4","402584b4"]},"BootCount":59}}
14:06:00.215-022 MQT: stat/PS1-LAB/RESULT = {"POWER1":"OFF"}
I would have thought that the greater heap usage (and a greater likelihood of running out of RAM) would occur with the web client connected, not in the condition where the web client is not connected - the condition where the failure occurs.
#undef USE_TUYA_MCU
#undef USE_WS2812
#undef USE_RULES
You're right regarding the non heap logging in syslog.
The exception 2 is worrying though, as it shouldn't be there. It may have many causes, but this is a known one: it occurs when out of stack (check with debug and FreeMem 1).
Stack space is at most 4k and, depending on how deep the function calls go, it is used up quite soon.
Please try to find out if you always get the same exception 2 with the same call stack (it differs on every build you make, so keep the same build and the accompanying map file for debugging) and see what happens at the call chain addresses.
Interestingly I could not repeat the exception 2. However, on the next successful boot after the failure an exception 28 is repeatable. The call trace is very similar but not exactly the same:
2024-04-04T15:35:17.238523+01:00 PS1-LAB-4443 ESP-MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"0000036e","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40257ebd","40257ea8","4028b7d0","401011c5","40258487","4025837b","4028b6b4","402584b4"]},"BootCount":65}}
tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"0d194b4d","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4028dfbc","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","402580ed","40257ea8","4028b7d0","401011c5","40258487","4025837b","4028b6b4"]},"BootCount":66}}
15:28:51.160-021 MQT: tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"00002923","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40258105","40257ea8","4028b770","4015b4a0","4028b7d0","40258487","4025837b","4028b6b4"]},"BootCount":61}}
tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"000016fb","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40258074","40257ea8","4028b7d0","40258487","4025837b","4028b6b4","402584b4","4028b6b4"]},"BootCount":63}}
tele/PS1-LAB/INFO3 = {"Info3":{"RestartReason":{"Exception":28,"Reason":"Exception","EPC":["4024d8fc","00000000","00000000"],"EXCVADDR":"01dd6f3d","DEPC":"00000000","CallChain":["4024d9ec","4024e02c","4024da1c","4024e034","4024e120","40275ab8","40275ac5","40275b0a","40243dd0","40000f49","40000f49","40000f49","40000e19","40105871","40105877","4010000d","4026e530","4026e4e1","40100230","401011c5","401011b4","401010ec","4000050c","40257eca","40257ea8","4028b7d0","40258487","4025837b","4028b6b4","402584b4","4028b6b4"]},"BootCount":64}}
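The chains above are long enough that differences are easy to miss by eye. A minimal sketch of a frame-by-frame comparison follows; the helper name `differing_frames` is made up for illustration, and the two arrays are frames 21..25 (zero-based) copied from the exception 2 report (BootCount 59) and the first exception 28 report (BootCount 65).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Return the frame indices at which two call chains differ.
// Frames present in only one chain (length mismatch) also count as differences.
std::vector<std::size_t> differing_frames(const std::vector<std::string>& a,
                                          const std::vector<std::string>& b) {
    std::vector<std::size_t> diff;
    const std::size_t n = std::max(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const bool in_both = i < a.size() && i < b.size();
        if (!in_both || a[i] != b[i]) diff.push_back(i);
    }
    return diff;
}

// Excerpt only: frames 21..25 of the CallChain arrays quoted in this thread.
const std::vector<std::string> kException2Frames = {
    "401010ec", "4000050c", "40258215", "40257ea8", "4028b7d0"};
const std::vector<std::string> kException28Frames = {
    "401010ec", "4000050c", "40257ebd", "40257ea8", "4028b7d0"};
```

Calling `differing_frames(kException2Frames, kException28Frames)` reports a single differing position within the excerpt, which corresponds to frame 23 of the full chains. The same helper can be run over the complete 31-entry arrays.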
Looking at the call trace and map file, 0x4024d9ec corresponds to tcp_process_refused_data in liblwip2-1460.a. At the other end 0x4028b6b4 corresponds to br_sha256_vtable, within libbearssl. I will keep looking but I am unfamiliar with bearssl internals.
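The address-to-symbol lookup described above can be sketched as a search for the nearest preceding symbol in a table sorted by start address. This is only an illustration: the `Symbol` struct and `symbol_for` function are hypothetical names, and the table has to be filled from the map file of the exact build that produced the crash, since addresses move between builds.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

struct Symbol {
    std::uint32_t start;  // start address taken from the map file
    std::string name;
};

// Return the name of the symbol with the greatest start address <= addr,
// or an empty string if addr lies below every symbol.
// Precondition: `table` is sorted by ascending start address.
std::string symbol_for(const std::vector<Symbol>& table, std::uint32_t addr) {
    auto it = std::upper_bound(
        table.begin(), table.end(), addr,
        [](std::uint32_t a, const Symbol& s) { return a < s.start; });
    if (it == table.begin()) return "";
    return std::prev(it)->name;
}
```

Because the table holds start addresses only, the result means "nearest preceding symbol". An address that falls in a gap after the end of a symbol is still attributed to it, so a surprising match is worth checking against the map file by hand.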
It seems fairly consistent across builds. The exception occurs when an interrupt (possibly a timer in the TCP stack as the call sequence is lwip_cyclic_timers -> tcp_tmr -> tcp_fasttmr -> tcp_process_refused_data) occurs during the execution of one or more functions in sha2small.c, usually br_sha224_update or br_sha256_vtable.
I can't trace this back further in brssl thanks to the vtables used therein, and there is no way I can use any form of serial debugging (this is a nonisolated mains powered device - a wall socket - and I currently do not have an isolated serial interface available to me). However I think the exception may be in the initial MQTT connection code, as if I look in the broker logs in the fault condition, there is an attempt to initiate the connection but the handshake is never completed - this also seems to be consistent with the point in the boot sequence that the device restarts.
Have tried a ham-fisted approach by disabling interrupts around the functions in sha2small.c, but the exception just occurs earlier.
I suspect, but can't confirm, that if stack space is tight then this interrupt in the middle of some TLS processing may just be enough.
This is about as far as I can get. Disabling the web server certainly prevents the issue from occurring, so I suspect that the interrupt is somehow related to the web server.
You might want to try increasing the TLS stack space in file \lib\lib_ssl\tls_mini\src\StackThunk_light.cpp, line 43 and up.
//#define _stackSize (5600/4)
#if defined(USE_MQTT_AWS_IOT) || defined(USE_MQTT_AWS_IOT_LIGHT) || defined(USE_MQTT_AZURE_IOT)
#define _stackSize (5300/4) // using a light version of bearssl we can save 300 bytes
#else
#define _stackSize (4800/4) // no private key, we can reduce a little, max observed 4300
#endif
As you enabled USE_MQTT_AWS_IOT, try changing 5300/4 to 5600/4 or 5800/4.
That fixed it - setting the stack size to 5600/4. I will update the other sockets to this build and allow the devices to run for a while to check for any unintended consequences of the increased stack size, then close the issue if all is well.
I will just have to remember to change this in the build when I upgrade the firmware on these devices in future (unless this value could be made user-configurable?).
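One way such a user-configurable value could look is an `#ifndef` guard around the existing defaults. This is a sketch only: `TLS_THUNK_STACK_BYTES` is a hypothetical macro name, not an existing Tasmota setting as far as this thread shows, and the sketch assumes the override header is visible to StackThunk_light.cpp, which has not been verified here.

```cpp
#include <cstddef>

// Hypothetical knob: a build could define TLS_THUNK_STACK_BYTES (for example
// in user_config_override.h) to override the defaults below.
#ifndef TLS_THUNK_STACK_BYTES
  #if defined(USE_MQTT_AWS_IOT) || defined(USE_MQTT_AWS_IOT_LIGHT) || defined(USE_MQTT_AZURE_IOT)
    #define TLS_THUNK_STACK_BYTES 5300  // default quoted in this thread
  #else
    #define TLS_THUNK_STACK_BYTES 4800  // default quoted in this thread
  #endif
#endif

// The stack is allocated in 32-bit words, so reject sizes that do not divide evenly.
static_assert(TLS_THUNK_STACK_BYTES % 4 == 0,
              "TLS_THUNK_STACK_BYTES must be a multiple of 4");

#define _stackSize (TLS_THUNK_STACK_BYTES / 4)

// Small accessor so the effective size can be checked or logged.
constexpr std::size_t thunk_stack_words() { return _stackSize; }
```

With this shape, a line such as `#define TLS_THUNK_STACK_BYTES 5800` in the override file would survive firmware upgrades without editing the library source each time.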
Can I check that I really need to enable USE_MQTT_AWS_IOT in order just to include the private CA? I only need to include my private CA certificate; I do not care about AWS or LetsEncrypt. The documentation seems to suggest that I do, and if I disable the setting the device will not connect so I assume this is correct?
Out of curiosity: the problem occurred when an interrupt clearly overran the stack, and since BearSSL has its own stack, it would appear that the interrupt handlers do not switch the stack back to the core before servicing the interrupt. Is that correct? If so, the BearSSL stack must be larger than is strictly needed by BearSSL itself in order to accommodate any interrupts that arrive during SSL processing, and the size of that margin would depend on the handlers implemented elsewhere in the firmware. Is this additional size requirement known?
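For context on the headroom question: the successful boot log earlier in this thread reports "max ThunkStack used 5236". Assuming that log came from the default 5300-byte build, only 5300 - 5236 = 64 bytes were left for anything else that runs on that stack. A common way to obtain such a high-water mark is stack painting. The following is a generic sketch of that technique with made-up names (`PaintedStack`, `kPaint`), not a description of how Tasmota computes its figure.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kPaint = 0xdeadbeef;  // arbitrary fill pattern

template <std::size_t Words>
struct PaintedStack {
    std::array<std::uint32_t, Words> words;

    // Fill the whole region with the pattern before the stack is used.
    void paint() { words.fill(kPaint); }

    // Assumes the stack grows downward from the highest index. Scanning up
    // from index 0, the first word that no longer holds the pattern marks
    // the deepest point the stack has reached since paint().
    std::size_t max_used_bytes() const {
        std::size_t untouched = 0;
        while (untouched < Words && words[untouched] == kPaint) ++untouched;
        return (Words - untouched) * sizeof(std::uint32_t);
    }
};
```

The measurement is a lower bound: it only captures the runs that actually happened, so an interrupt that arrives at the deepest point of the handshake on a later boot can still exceed it. That is consistent with 5600/4 helping but 5800/4 being needed for stable operation.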
I spoke too soon - using 5600/4 improved the situation but the device still failed on some boots. I had to increase this to 5800/4 to get stable operation and the fault no longer appears on reboots.
I will close this issue now as I can build a version of Tasmota that is stable and appears to work reliably with these devices.
PROBLEM DESCRIPTION
This issue affects Knightsbridge CU9KW UK wall sockets with embedded energy monitoring and the template specified in this report.
It is important to note that if the energy monitoring is disabled (by disconnecting the three GPIO lines from 'BL0937 CF', 'HLWBL SEL_i' and 'HLWBL_CF1' and specifying them as 'none'), the issue does NOT occur and the device functions normally, albeit with no energy monitoring. The issue ONLY arises when the energy monitoring is enabled by connecting it in the template.
It took quite some time and effort to drill down into the exact circumstances that trigger this issue. I have tried to be brief here, but there is considerably more information below in 'Additional Context' that may or may not be helpful in tracking this down:
If energy monitoring is connected AND the webserver is enabled on the device (WebServer 1 or 2), the device will not complete its boot, and restarts before connecting to the MQTT host UNLESS a web client is attempting to connect to the web UI at the time of boot - in which case the boot completes and the device becomes operational.
If EITHER the lines associated with energy monitoring are disconnected OR the web server is disabled (WebServer 0), boot continues normally, MQTT connects and the device functions normally.
Once the device has been encouraged to complete its boot sequence successfully, it functions normally and all functions work (including the energy monitoring if enabled)
There seems to be some interaction between the energy monitoring and the web server that causes a boot issue.
REQUESTED INFORMATION
Make sure you have performed every step and checked the applicable boxes before submitting your issue. Thank you!
Backlog Template; Module; GPIO 255
Status 0
TO REPRODUCE
Configure device with template {"NAME":"Knightsbridge Dual Socket","GPIO":[321,544,320,544,225,2720,1,1,2624,193,2656,224,192,1],"FLAG":0,"BASE":18}
Device will not fully boot and repeatedly restarts if the webserver is enabled AND a web client is not attempting to connect during the boot.
If the webserver is disabled (WebServer 0), the device boots normally and functions normally.
If the webserver is enabled (WebServer 1), the device does not fully boot (keeps restarting) unless EITHER:
EXPECTED BEHAVIOUR
Device should power up, complete boot and connect to MQTT irrespective of whether the webserver is enabled and/or a web client is attempting connection at the time of boot.
SCREENSHOTS
If applicable, add screenshots to help explain your problem.
ADDITIONAL CONTEXT
I use a version of Tasmota compiled from source, as my MQTT infrastructure uses TLS with a private CA and it is necessary to include the CA certificate in the build. I have included the user_config_override.h at the end of this section.
The issue only seems to affect the Knightsbridge devices with energy monitoring enabled. I have other Tasmota-based devices using exactly the same binaries and these work with no issues at all.
The initial problem manifested as the devices failing to connect to the MQTT broker every time (without exception) there was a power cut or a significant brownout. If I subsequently connected to the device via the web UI, the device would then connect to the MQTT broker but its configuration had reverted to a Sonoff Basic.
Reenabling the configuration and rebooting via the web UI seemed to work. I then recompiled the source but with FALLBACK_MODULE set to the custom module, ensuring that the configuration was not lost on boot. However the devices seemed to only start working and connecting to the MQTT broker when the device web UI was connected to via a web browser. Switching off the boot loop detection (SetOption36 0) and enabling syslog finally identified that the device was not completing its boot cycle and continually resetting. The syslog dump given above shows the output when the issue is experienced.
After some experimentation I discovered that the device would only successfully complete its boot sequence if a web client was attempting to connect to the UI exactly at the time of boot. The syslog dump below shows a successful boot sequence when a web client is connected at time of boot:
Similarly, if I disable the web server (WebServer 0), the boot proceeds normally and the device works as expected. The syslog dump below shows this:
If I disable the energy monitoring by removing it completely from the template, then with the web server enabled (WebServer 2) the device boots normally, even if there is no web client attempting to connect at the time of boot.
Relevant parts of my user_config_override.h (sanitized, comment lines removed to save space):
(Please, remember to close the issue when the problem has been addressed)