Open luckylinux opened 1 month ago
I'm also struggling with the ESP32 just rebooting/crashing/freezing thus causing an Inverter Trip, whenever I reboot the WiFi.
At first I thought disabling the fallback ap and the `captive_portal` could have fixed this (according to https://github.com/esphome/issues/issues/1679 `reboot_timeout` is ignored when the fallback ap is enabled).
Today I rebooted the WiFi again after applying those fixes and ... the ESP32 Crashed/Rebooted/Froze again and of course the Inverter Tripped.
@Sleeper85 , @MrPabloUK: Is there some timing Issue connected to WiFi loss, whereby the code automatically triggers a Restart ?
Configuration File (finally SNTP Time Sync and Home Assistant Time Sync work correctly by the way): https://github.com/luckylinux/jk-bms-build-helpers/blob/main/esphome-jk-bms-can/esp32-ble-1.17.5.yaml
This Commit seems to Fix the issue related to WiFi AP (and/or Home Assistant Server and/or MQTT Server and/or ... whatever):
https://github.com/luckylinux/jk-bms-build-helpers/commit/b60d922b7966af904dc10909634094d517847a9d
After testing stopping each Service Individually, I took it one step further and stopped them ALL one by one. Still no Trip / Watchdog event.
Issued 5 Reboots (from a Normal Operating State) now and nothing happened again.
It's now Working [for now] as you (and I) would expect ...
NOT sure in the end what really helped:
I also had a configuration issue in /etc/dhcpcd.conf with a Typo on an Interface Name (NOT related to either WLAN or LAN).
Furthermore there was a configuration issue in my "Headless" Management Script `/usr/local/sbin/check-network.sh`, which tried to bring up/down the wrong (non-existing) Interface Name when the ping to the LAN Gateway failed:
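
    #!/bin/bash
    # Restart the LAN interface if the gateway stops answering pings.
    gateway="192.168.1.1"
    interface="eth0"

    # ping exits non-zero when no reply is received (or on error)
    if ! ping -c4 "$gateway" > /dev/null
    then
        echo "No network connection, restarting $interface"
        /sbin/ifdown "$interface"
        sleep 5
        /sbin/ifup --force "$interface"
    fi
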
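
Since the problem described above was the script acting on an Interface Name that did not exist, a guard along these lines might catch such a typo early. This is only a sketch: the helper names `iface_exists` and `restart_iface` and the `SYS_NET_DIR` override are made up for illustration, and it assumes a Linux system that lists its interfaces under `/sys/class/net`.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.
# SYS_NET_DIR can be overridden for testing; on Linux it defaults to /sys/class/net.
SYS_NET_DIR="${SYS_NET_DIR:-/sys/class/net}"

# Succeed if the named interface exists on this machine.
iface_exists() {
    [ -n "$1" ] && [ -e "$SYS_NET_DIR/$1" ]
}

# Restart the interface only if it exists; otherwise report the problem and fail.
restart_iface() {
    local interface="$1"
    if ! iface_exists "$interface"; then
        echo "Interface '$interface' does not exist, not touching it" >&2
        return 1
    fi
    /sbin/ifdown "$interface"
    sleep 5
    /sbin/ifup --force "$interface"
}
```

In the script, `restart_iface "$interface"` would then take the place of the `ifdown` / `sleep` / `ifup` lines.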
The CAN bus status will be marked down after 20 loops (a different CAN ID is sent per loop) without response from the inverter.
Deye only responds after receiving ID 0x356. For me (at home) the links are turned OFF 1s every 2 hours and I don't know why but that doesn't pose a problem. If the link was turned OFF due to non-response from the inverter it will be OFF for 120s so this problem does not come from my code.
After the link is marked down, the code stops sending CAN IDs for 120s before testing again for the presence of an inverter.

    interval:
      - interval: 120s
        then:
          - lambda: id(can_ack_counter) = 0; // Reset ACK counter for test inverter ACK

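
To reason about the timing, the link-down behaviour described above can be modelled in a few lines of shell. This is a toy model and not the project's code: the function names are invented, and it assumes the count of unanswered loops resets whenever the inverter answers. Only the threshold of 20 loops and the 120s pause come from the description.

```shell
#!/bin/bash
# Toy model of the link supervision described above, NOT the project's code.
# Assumption: the count of unanswered loops resets whenever the inverter answers.
MAX_MISSED_LOOPS=20   # loops without response before the link is marked down
RETRY_PAUSE_S=120     # seconds without sending before the inverter is tested again

# Arguments: one value per loop, 1 = inverter answered, 0 = no answer.
# Prints "down" as soon as MAX_MISSED_LOOPS loops in a row go unanswered, else "up".
can_link_state() {
    local missed=0 result
    for result in "$@"; do
        if [ "$result" = "1" ]; then
            missed=0
        else
            missed=$((missed + 1))
        fi
        if [ "$missed" -ge "$MAX_MISSED_LOOPS" ]; then
            echo "down"
            return 0
        fi
    done
    echo "up"
}

# Arguments: link state ("up"/"down") and seconds since the link was marked down.
# Succeeds if frames should be sent: the link is up, or the pause has elapsed.
can_should_send() {
    [ "$1" = "up" ] || [ "$2" -ge "$RETRY_PAUSE_S" ]
}
```

For example, `can_link_state 0 0 0 1 0 0` prints `up`, because the answer in the fourth loop resets the count long before 20 is reached.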
In my case I am using PYLON+ Protocol (IIRC that was recommended a while ago by either you or MrPablo). Not sure if there are major Differences though ...
There is no point in using "PYLON +" with Deye because additional IDs (0x70, 0x371 and 0x379) are not supported by Deye. Just use the name "PYLON" with the protocol "PYLON 1.2".
IF (when :smile:) I have another maintenance stop I'll switch the Protocol to Pylon 1.2.
While you are very likely right, I'm a bit scared of touching the system, given how susceptible it was to this elusive bug I told you about (watchdog triggering if WiFi/MQTT/HA/... goes down) ...
Not necessarily an issue with the Code, there is probably some interaction going on between the different Components (and I probably have more Sensors enabled for tuning/troubleshooting than you do, so more RAM used, etc).
The "fix" (workaround), as I said, seems to be increasing the Watchdog Timeout to 30s (increasing it to 10s improved the situation, but did NOT solve it), possibly combined with some of the other things I did (although, since it was a Watchdog triggering the reboot, increasing the Watchdog Timeout is probably the actual solution). As I told you, I was not really able to diagnose why the ESP32 would "Freeze" / Hang and then trigger the Watchdog in the first place.
Debug logs via USB showed everything normal then ... Watchdog Triggered ... Rebooting.
I'm more of the Attitude right now ... "If it works, don't touch it" :laughing: .
PS: maybe add a small note somewhere (or at least keep it in the back of your mind): if you have `reboot_timeout` set to 0s or say 24h for all of `api`, `mqtt` and `wifi` and the ESP32 still reboots, then connect the ESP32 via USB and set logging to `DEBUG` level. Most likely this is the ESP32 freezing/hanging and the Watchdog triggering a Reboot, thus tripping the Inverter due to lack of BMS Communication.
https://github.com/luckylinux/jk-bms-build-helpers/commit/b60d922b7966af904dc10909634094d517847a9d
Again, not saying there is an issue with the Code, this is probably an Edge Case for some Reason ... but it was driving me crazy !
Workaround:

    sdkconfig_options:
      CONFIG_ESP_TASK_WDT: y
      CONFIG_ESP_TASK_WDT_TIMEOUT_S: "30"
      CONFIG_BT_BLE_42_FEATURES_SUPPORTED: y

I did NOT test this but maybe it could also help, at least to some extent (MQTT Options):
The System has been running quite well for approximately 3 weeks now.
However today, out of the blue, the Inverter Tripped due to BMS Communication Failure (BMS-Err_Stop enabled on Deye Inverter).
Looking at the Home Assistant Dashboard, the main thing that caught my eye is this:

![image](https://github.com/Sleeper85/esphome-jk-bms-can/assets/7126291/5f9a7ebc-d671-49eb-9c81-1e702e2534a7)

14:24:44 is the time when the Issue occurred. Looking at the logbook, there are plenty of Entries like this every day (yesterday there were 22, so basically 11 pairs of brief on/off moments lasting a few seconds).
Not sure why it happened.
Also weird is that the ESP32 (both of them, both the one of the "dumb" battery and the one connected to the Deye Inverter) seem to have rebooted, if we can trust the Diagnostic Data.
ESP32 not Connected to any Inverter: ![image](https://github.com/Sleeper85/esphome-jk-bms-can/assets/7126291/2ea4d289-cb6f-4829-af3d-1b0cd3b76256)

ESP32 Connected to the Deye Inverter: ![image](https://github.com/Sleeper85/esphome-jk-bms-can/assets/7126291/067b9651-43ad-4a79-8970-683680181853)
This should not happen, since both ESP32s and the Rock 5B SBC that runs Home Assistant, MQTT etc, are connected to a 230VAC UPS. The Rock 5B SBC didn't reboot:
EDIT 1: Adding Uptime Sensor evolution when the Issue Occurred (for the Battery that is connected to the Deye Inverter only)
Exact Value seems to be 1'710'511 [seconds], which translates into 475.141944444 hours or 19.7975810185 days.
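
The conversion above can be double-checked with a one-liner. This is just a convenience sketch: the function name `seconds_to_hours_days` is made up, and it assumes `awk` is available.

```shell
#!/bin/bash
# Convert an uptime in seconds to hours and days, rounded to two decimals.
# Hypothetical helper name, used here only to re-check the arithmetic.
seconds_to_hours_days() {
    awk -v s="$1" 'BEGIN { printf "%.2f hours, %.2f days\n", s / 3600, s / 86400 }'
}

seconds_to_hours_days 1710511
```

For the Uptime Sensor value of 1'710'511 seconds this gives roughly 475.14 hours or 19.80 days, matching the figures above.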
Not sure if it's a specific Event which is time-based (cannot remember if I read issues about ESP32 resetting themselves every 3 weeks / 21 days or so), or it was just the right combo of glitch in timing with respect to when the Deye "checks" that the CANbus Communication is actually working.