Open tschundler opened 2 years ago
Updated with logs. I took a look at the total_daily_energy code, and there isn't even division by anything other than constants, so this is very odd. But it's very reproducible if I comment out total_daily_energy sensor, > 1day uptme. Re-enable it, and I get reboots.
Maybe I'll look at the diff of the generated code
Code diff looks as-expected
$ diff s31_4/src/esphome.h s31_4.dp/src/esphome.h
59a60
> #include "esphome/components/total_daily_energy/total_daily_energy.h"
$ diff s31_4/src/main.cpp s31_4.dp/src/main.cpp
46a47,48
> total_daily_energy::TotalDailyEnergy *total_daily_energy_totaldailyenergy;
> mqtt::MQTTSensorComponent *mqtt_mqttsensorcomponent_5;
491a494,518
> // sensor.total_daily_energy:
> // platform: total_daily_energy
> // name: S31 4 Daily Energy
> // power_id: s31_4_power
> // disabled_by_default: false
> // mqtt_id: mqtt_mqttsensorcomponent_5
> // force_update: false
> // device_class: energy
> // state_class: measurement
> // last_reset_type: auto
> // id: total_daily_energy_totaldailyenergy
> // time_id: sntp_sntpcomponent
> // min_save_interval: 0s
> total_daily_energy_totaldailyenergy = new total_daily_energy::TotalDailyEnergy();
> App.register_component(total_daily_energy_totaldailyenergy);
> App.register_sensor(total_daily_energy_totaldailyenergy);
> total_daily_energy_totaldailyenergy->set_name("S31 4 Daily Energy");
> total_daily_energy_totaldailyenergy->set_disabled_by_default(false);
> total_daily_energy_totaldailyenergy->set_device_class("energy");
> total_daily_energy_totaldailyenergy->set_state_class(sensor::STATE_CLASS_MEASUREMENT);
> total_daily_energy_totaldailyenergy->set_last_reset_type(sensor::LAST_RESET_TYPE_AUTO);
> total_daily_energy_totaldailyenergy->set_force_update(false);
> mqtt_mqttsensorcomponent_5 = new mqtt::MQTTSensorComponent(total_daily_energy_totaldailyenergy);
> App.register_component(mqtt_mqttsensorcomponent_5);
> total_daily_energy_totaldailyenergy->set_parent(s31_4_power);
509a537,538
> total_daily_energy_totaldailyenergy->set_time(sntp_sntpcomponent);
> total_daily_energy_totaldailyenergy->set_min_save_interval(0);
Next reboot logs.. another watchdog reset
13:57:44][D][text_sensor:015]: 'S31 4 Uptime Human': Sending state '48m 44s'
[13:57:44][D][sensor:131]: 'S31 4 Uptime': Sending state 2924.18799 s with 0 decimals of accuracy
[13:58:05][D][sensor:131]: 'S31 4 Power': Sending state 172.48734 W with 1 decimals of accuracy
[13:58:05][D][sensor:131]: 'S31 4 Daily Energy': Sending state 411.97653 Wh with 3 decimals of accuracy
..reboot..
[14:00:13][I][mqtt:212]: MQTT Connected!
...
[14:00:14][D][debug:196]: Reset Reason: Software Watchdog
[14:00:14][D][debug:197]: Reset Info: Fatal exception:4 flag:3 (Software Watchdog) epc1:0x40003b4d epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000000 depc:0x00000000
...
[14:00:54][D][text_sensor:015]: 'S31 4 Uptime Human': Sending state '49s'
...
[14:00:57][D][sensor:131]: 'S31 4 Power': Sending state 159.15079 W with 1 decimals of accuracy
[14:00:57][D][sensor:131]: 'S31 4 Daily Energy': Sending state 417.18314 Wh with 3 decimals of accuracy
Wh in the first reboot: ~157 Wh in the next reboot: ~262
So it's not a rollover thing, since those are very different amounts. Also it is hitting WDT, not something else
Since I'm using NTP, I wonder if this is related to #2003 / #2299
...adding memory debugging. I wouldn't expect to see WDT for out of ram though. And why does it seem to die faster on the devices pulling more energy? (Maybe it's a coincidence and really related to packet drops? The device with the most power is closest to the WiFi router)
I'll add free memory logging.
[17:22:42][V][mqtt:402]: Publish(topic='s31-5/sensor/s31_5_power/state' payload='157.4' retain=1)
[17:22:59][D][heap:128]: Free heap: 27144 bytes
[17:23:09][D][text_sensor:015]: 'S31 5 Uptime Human': Sending state '5h 8m 7s'
[17:23:09][V][mqtt:402]: Publish(topic='s31-5/sensor/s31_5_uptime_human/state' payload='5h 8m 7s' retain=1)
[17:23:09][V][sensor:037]: 'S31 5 Uptime': Received new state 18487.203125
[17:23:09][D][sensor:131]: 'S31 5 Uptime': Sending state 18487.20312 s with 0 decimals of accuracy
[17:23:09][V][mqtt:402]: Publish(topic='s31-5/sensor/s31_5_uptime/state' payload='18487' retain=1)
[17:25:30][I][mqtt:212]: MQTT Connected!
[17:25:30][C][sntp:022]: Setting up SNTP...
...so not a RAM thing. Though will try changing from NTP to HASS as time source
Not NTP either - switched to homeassistant API as the time source, and still have regular WDT reboots.
Now trying disabling MQTT completely... maybe it is related to sending more data than it did before?
With MQTT disabled, and only using Native API, the device is now up > 24h without reboot. So the problem seems to be related to MQTT. This also means I have a solution - don't use MQTT.
But there is still a bug, and somehow it affects some devices more that others. Maybe related to power usage, now I suspect more likely related to wifi. Bug might be with MQTT or might be with wifi itself.
I suppose I could tcpdump / wireshark so see if there are incomplete transmissions or something. I took a quick look, and the biggest packet sizes for MQTT writes are ~501b (excluding header, with daily power) vs 455 (excluding header, without daily power). So it is far from the MTU limit. Most of the content is debug messaging for the power monitoring. Looking at a device running the native protobuf API, packets are up to 449 bytes. ...is it a problem only with packets > ~500b?
Maybe next test is MQTT on, daily power on, debug logging off.
MQTT on / debug logging off still failed, and the device rebooted. So it's not related to packet size. Also an experiment leaving API only running longer lead to a reboot.
And it still seems very tied to how much energy is used, which makes it all the more confusing - one device that I forgot I enabled it on rebooted a few times today after being up for days but switched off. Now switched on, it rebooted once at ~380Wh & again ~90Wh later.
Maybe it's not related to how much power is used but rather how often the number is updated? No reboots when the number stays at 0?
oops, I wasn't clear in the last comments.
Also an experiment leaving API only running longer lead to a reboot.
That was a reference to MQTT off. It still rebooted. It just took over a day (maybe by luck, who knows)
I had the same problem: device didn't boot after flashing with example yaml snippet. Disabling time:
and platform: total_daily_energy
made it check into wifi and operate just fine. I didn't dig deeper to figure out which one of the two helped.
The problem
The devices work file with
total_daily_energy
disabled. But withtotal_daily_energy
enabled, they periodically reboot. I added uptime measurement, and it isn't a consistent amount of time - as little as a few minutes and as long as many hours. But I have noticed it reboots sooner on devices using more power.Is there some sort of possible overflow + divide by zero happening.
It is measuring Watt-hours, not kWh. I'm suspicious this might occur 1000x less if I switch to kWh, but I haven't tested that yet.
Which version of ESPHome has the issue?
v2021.8.2
What type of installation are you using?
pip
Which version of Home Assistant has the issue?
2021.8.8
What platform are you using?
ESP8266
Board
Sonoff S31
Component causing the issue
total_daily_energy
Example YAML snippet
Anything in the logs that might be useful for us?
boot after update:
last entries before reboot:
reboot:
Additional information
No response