Sleeper85 / esphome-jk-bms-can

GNU General Public License v3.0

Possible Glitch / Timing issue #48

Open luckylinux opened 1 month ago

luckylinux commented 1 month ago

The System has been running quite well for approximately 3 weeks now.

However today, out of the blue, the Inverter Tripped due to BMS Communication Failure (BMS-Err_Stop enabled on Deye Inverter).

Looking at the Home Assistant Dashboard, the main thing that caught my eye is this: [image]

14:24:44 is the time when the Issue occurred.

Looking at the logbook, there are plenty of Entries like this every day (yesterday there were 22, so basically 11 on/off pairs, each lasting a few seconds).

Not sure why it happened.

Also weird is that both ESP32s (the one on the "dumb" battery and the one connected to the Deye Inverter) seem to have rebooted, if we can trust the Diagnostic Data.

ESP32 not Connected to any Inverter: [image]

ESP32 Connected to the Deye Inverter: [image]

This should not happen, since both ESP32s and the Rock 5B SBC that runs Home Assistant, MQTT etc. are connected to a 230VAC UPS. The Rock 5B SBC didn't reboot:

18:10:11 up 20 days,  1:12,  1 user,  load average: 0.80, 1.01, 1.01

EDIT 1: Adding Uptime Sensor evolution when the Issue Occurred (for the Battery that is connected to the Deye Inverter only)

[image]

The exact Value seems to be 1'710'511 [seconds], which translates into 475.141944444 hours or 19.7975810185 days.

Not sure if it's a specific time-based Event (I cannot remember whether I read issues about ESP32s resetting themselves every 3 weeks / 21 days or so), or whether it was just an unlucky timing glitch relative to the moment the Deye "checks" that the CANbus Communication is actually working.

luckylinux commented 1 month ago

I'm also struggling with the ESP32 rebooting/crashing/freezing whenever I reboot the WiFi, which in turn causes an Inverter Trip.

At first I thought disabling the fallback ap and the captive_portal would fix this (according to https://github.com/esphome/issues/issues/1679 reboot_timeout is ignored when the fallback ap is enabled).
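
For reference, a minimal sketch of what disabling the fallback ap and the captive_portal looks like in an ESPHome YAML. This is not copied from the linked configuration file; the secret names `wifi_ssid` and `wifi_password` are assumptions for illustration.

```yaml
# Sketch only: wifi block with the fallback AP and captive portal disabled.
wifi:
  ssid: !secret wifi_ssid          # assumed secret name
  password: !secret wifi_password  # assumed secret name
  # Fallback hotspot disabled (commented out):
  # ap:
  #   ssid: "Fallback Hotspot"

# Captive portal disabled as well (it only makes sense together with the fallback AP):
# captive_portal:
```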

Today I rebooted the WiFi again after applying those fixes and ... the ESP32 Crashed/Rebooted/Froze again and of course the Inverter Tripped.

@Sleeper85, @MrPabloUK: Is there some timing Issue connected to WiFi loss, whereby the code automatically triggers a Restart?

Configuration File (finally SNTP Time Sync and Home Assistant Time Sync work correctly by the way): https://github.com/luckylinux/jk-bms-build-helpers/blob/main/esphome-jk-bms-can/esp32-ble-1.17.5.yaml

luckylinux commented 1 month ago

This Commit seems to Fix the issue related to WiFi AP (and/or Home Assistant Server and/or MQTT Server and/or ... whatever):

https://github.com/luckylinux/jk-bms-build-helpers/commit/b60d922b7966af904dc10909634094d517847a9d

Other Remarks

After stopping each Service individually as a test, I took it one step further and stopped them ALL, one after another. Still no Trip / Watchdog event.

I have now issued 5 Reboots (from a Normal Operating State) and again nothing happened.

It's now Working [for now] as you (and I) would expect ...

NOT sure in the end what really helped:

I also had a configuration issue in /etc/dhcpcd.conf with a Typo in an Interface Name (NOT related to either WLAN or LAN). Furthermore, there was a configuration issue in my "Headless" Management Script /usr/local/sbin/check-network.sh, which pings the LAN Gateway and, on failure, brings the Interface down and up again: it was using the wrong (non-existing) Interface Name.

#!/bin/bash

# Ping the LAN gateway; if it is unreachable, restart the network interface.
gateway="192.168.1.1"
interface="eth0"

if ! ping -c4 "$gateway" > /dev/null
then
  echo "No network connection, restarting $interface"
  /sbin/ifdown "$interface"
  sleep 5
  /sbin/ifup --force "$interface"
fi

Sleeper85 commented 1 month ago

The CAN bus status will be marked down after 20 loops (a different CAN ID is sent per loop) without response from the inverter.

The number of IDs differs depending on the protocol chosen; with Deye I advise you to use "PYLON 1.2".

Deye only responds after receiving ID 0x356. For me (at home) the links are turned OFF for 1s every 2 hours; I don't know why, but that doesn't pose a problem. If the link had been turned OFF due to non-response from the inverter, it would stay OFF for 120s, so this problem does not come from my code.

After the link is marked down, the code stops sending CAN IDs for 120s before testing again for the presence of an inverter.

interval:
  - interval: 120s
    then:
      - lambda: id(can_ack_counter) = 0; // Reset ACK counter for test inverter ACK

luckylinux commented 1 month ago

In my case I am using PYLON+ Protocol (IIRC that was recommended a while ago by either you or MrPablo). Not sure if there are major Differences though ...

Sleeper85 commented 1 month ago

> In my case I am using PYLON+ Protocol (IIRC that was recommended a while ago by either you or MrPablo). Not sure if there are major Differences though ...

There is no point in using "PYLON +" with Deye because the additional IDs (0x70, 0x371 and 0x379) are not supported by Deye. Just use the name "PYLON" with the protocol "PYLON 1.2".

luckylinux commented 1 month ago

IF (when :smile:) I have another maintenance stop, I'll switch the Protocol to Pylon 1.2.

While you are very likely right, I'm a bit scared of the system, given how susceptible it was to this elusive bug I told you about (watchdog triggering if WiFi/MQTT/HA/... goes down) ...

It's not necessarily an issue with the Code; there is probably some interaction going on between the different Components (and I probably have more Sensors enabled for tuning/troubleshooting than you do, so more RAM used, etc).

The "fix" (workaround), as I said, seems to be increasing the Watchdog Timeout to 30s (increasing it to 10s improved the situation, but did NOT solve it), possibly combined with some of the other stuff I did (although, since it was a Watchdog triggering a reboot, increasing the Watchdog Timeout is probably the solution). Why the ESP32 would "Freeze" / Hang and then trigger the Watchdog in the first Place, as I told you, I was not really able to diagnose.

Debug logs via USB showed everything normal then ... Watchdog Triggered ... Rebooting.

I'm more of the Attitude right now ... "If it works, don't touch it" :laughing: .

PS: maybe add a small note somewhere (or at least keep it in the back of your mind): if you have reboot_timeout set to 0s (or, say, 24h) for api, mqtt and wifi and the ESP32 still reboots when the connection is lost, then connect the ESP32 via USB and set logging to DEBUG level. Most likely this is the ESP32 freezing/hanging and the Watchdog triggering a Reboot, thus tripping the Inverter due to lack of BMS Communication.
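
The settings mentioned in this note could look roughly like the following in an ESPHome YAML. This is a sketch, not taken from the linked commit: the secret names and the broker address are placeholders, and only the `reboot_timeout` and `level` lines matter here.

```yaml
# Sketch only: rule out connectivity-triggered reboots, then watch the serial log over USB.
wifi:
  ssid: !secret wifi_ssid          # assumed secret name
  password: !secret wifi_password  # assumed secret name
  reboot_timeout: 0s               # 0s disables the reboot on WiFi loss

api:
  reboot_timeout: 0s               # no reboot when no Home Assistant client is connected

mqtt:
  broker: 192.168.1.10             # hypothetical broker address
  reboot_timeout: 0s               # no reboot when the MQTT broker is unreachable

logger:
  level: DEBUG                     # verbose log output on the USB serial port
```

If the ESP32 still restarts with these timeouts disabled, the reboot is not coming from the wifi, api or mqtt components, which points towards the Watchdog.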

https://github.com/luckylinux/jk-bms-build-helpers/commit/b60d922b7966af904dc10909634094d517847a9d

Again, not saying there is an issue with the Code, this is probably an Edge Case for some Reason ... but it was driving me crazy !

Workaround:

    sdkconfig_options:
      CONFIG_ESP_TASK_WDT: y
      CONFIG_ESP_TASK_WDT_TIMEOUT_S: "30"
      CONFIG_BT_BLE_42_FEATURES_SUPPORTED: y

I did NOT test this, but maybe it could also help, at least to some extent (MQTT Options): [image]