UtilitechAS / amsreader-firmware

ESP8266 and ESP32 compatible firmware to read, interpret and publish data to MQTT from smart electrical meters. Both DLMS and DSMR are supported.

Pow-U reboots often, with indication "Reason: Vbat power on reset (1/0)" #627

Closed ArnieO closed 1 year ago

ArnieO commented 1 year ago

Message from the amsleser.no team (@gskjold and myself):

We are opening this issue in order to obtain two things:

  1. Get feedback from other users who might be seeing the same issue. If you are seeing this issue on your Pow-U ESP32 device, please leave a comment with details here, or email us at post@amsleser.no. Please post screenshots of your Info and Config pages, including the header that identifies the firmware version. Please also indicate power meter brand and model.
  2. Update those affected on the status of our debugging process.

Summary

We have over the last few weeks received a few reports of Pow-U devices that have started to reboot often, sometimes several times a day. These reports suddenly started to appear for a product that has not been changed since 2022, and for the same PCBA production batch that we have sold since January 2023.

We do not see the issue on our own device, so debugging this is a challenge!

We currently have 4 reports related to Aidon meters, 1 report related to a Kaifa/Nuri meter.

The number of reports is small compared to the number of devices sold, but that does not reduce the problem for those affected.

What we have tried so far

Some technical background

There is in principle (by design) only one way the ESP32 can report "Vbat power on reset (1/0)", and that is if the voltage has for some reason dropped below approx. 2.85 V.

The device has a voltage supervisor chip that controls the ENABLE line to the ESP32 module. It is implemented with a hysteresis, so that:

This ensures that the ESP32 never enters the so-called brownout voltage region, which is voltage below 2.8 V. If the ESP32 reboots because it detected such a low voltage while operating, it will reboot with a different notification, saying it recovered from brownout.
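
To illustrate what a supervisor with hysteresis does in general, here is a minimal Python sketch. It is not the Pow-U circuit or firmware: the class name, the `trace` helper and both threshold values in the example are placeholders chosen for illustration (the actual thresholds are not given in this thread).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Supervisor:
    """Toy model of a voltage supervisor driving an ENABLE line with hysteresis."""
    enable_above_v: float   # ENABLE goes high once Vcc rises to this level (placeholder)
    disable_below_v: float  # ENABLE goes low once Vcc falls below this level (placeholder)
    enabled: bool = False

    def update(self, vcc: float) -> bool:
        # Between the two thresholds the previous state is kept; that gap is the hysteresis.
        if self.enabled and vcc < self.disable_below_v:
            self.enabled = False
        elif not self.enabled and vcc >= self.enable_above_v:
            self.enabled = True
        return self.enabled


def trace(sup: Supervisor, samples: List[float]) -> List[bool]:
    """Feed a series of Vcc samples and return the ENABLE state after each one."""
    return [sup.update(v) for v in samples]


# Hypothetical thresholds, for illustration only.
example = Supervisor(enable_above_v=3.10, disable_below_v=2.85)
states = trace(example, [2.9, 3.2, 3.0, 2.8, 3.0, 3.15])
print(states)  # [False, True, True, False, False, True]
```

Note how 3.0 V keeps the module enabled on the way down but does not re-enable it on the way up; a power-on reset only happens after the voltage has recovered past the upper threshold.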

The Pow-U generates its operating voltage (nominal 3.3V) from the M-bus signal. That signal is at 24 V (34 V in mid-European meters) in between datagrams, and varies between 24 (or 34) V and 12V lower during data reception. Moreover, it uses a 1F super capacitor to keep the voltage stable while the ESP32 pulls current pulses during data transmission. This has proven to be a successful and stable design - until the issue described herein suddenly started to appear.

Call for assistance

As we are unable to recreate the issue, we kindly ask for assistance in getting closer to the cause of the issue.

Our first step will be to confirm whether the issue is indeed voltage drop as described in previous paragraph. The way we intend to get that sorted out will be to make a Test firmware to replace the current firmware on your device.

The Test firmware will:

The test firmware will be posted by @gskjold in this thread as soon as it is available.

Those who are willing to participate in the test must be familiar with how to upload new firmware to the device via the USB cable. It will probably NOT be possible to go back to normal firmware via OTA one-click upgrade. If this is unfamiliar to you, please do not install the test firmware.

simenovrebo commented 1 year ago

Hi,

I'm seeing this a lot with my unit from January 2023. Sometimes it stays up for 2-3 days, but most of the time it seems to like several reboots a day. I would be happy to assist with testing.

[Screenshot 2023-09-01 at 16.13.05]

[Screenshot 2023-09-01 at 16.14.03]

wknutsen commented 1 year ago

I have this issue randomly. I started saving vcc and uptime to a time-series database on 1 September. Since then I have had two reboots: one vbat and one software reset. This is the trend from the vbat incident, with a lowest voltage of 2.94 V: [image: AMS]

Update: Last night it rebooted 24 times. I don't know the reason. By looking at voltage it is most likely a vbat power on reset issue. [image: AMS2]

I would like to participate with the test firmware.

Is the reboot reason posted to MQTT?

Hardware information:

Country: Norway
Meter: Aidon
Encryption enabled: No
AMS reader: ESP32S2 Pow-U+
M-bus adapter (if applicable):

Relevant firmware information:

Version: 2.2.21
MQTT: Yes
MQTT payload type: JSON
HAN GPIO: Unknown
HAN baud and parity: Unknown
Temperature sensors: No
ENTSO-E API enabled: Yes

ArnieO commented 1 year ago

@wknutsen Thank you for reporting and agreeing to run the test firmware when it becomes available. Could you be so kind as to post a screenshot of your config page? I'm particularly curious to see your Power saving setting.

Miklagarur commented 1 year ago

Same issue here. It usually reboots 3-4 times a day, but from time to time it lasts much longer; the longest so far is 4 days. The meter is a KAIFA/Nuri MA304H3E with the cable modified as recommended. The power option is set to Minimum and the device shows 3.29 V, with -48 dBm signal strength. My test shows that it will reboot once wifi is not available regardless of chosen settings. That said, I really appreciate this excellent product (Pow-U).

wknutsen commented 1 year ago

This is my config: image

ArnieO commented 1 year ago

@wknutsen

By looking at voltage it is most likely a vbat power on reset issue.

Thank you very much for providing the voltage plots, this is very useful for us.

gskjold commented 1 year ago

@wknutsen That plot was very interesting, thank you. It suggests that the WiFi modem is kept active more often in that period. Do you have anything on the network requesting data from (or pinging) the device?

I am also wondering if there could be some types of multicast that could affect the device in this manner, but not sure what that could be yet.

gskjold commented 1 year ago

As a start, I have created a version of the official firmware that has low voltage logging embedded. I think this will be sufficient. The lowest voltage, together with a timestamp and a clear button, can be found on the status page: [Screenshot from 2023-09-05 10-29-22]

The firmware is contained in this zip. Use "firmware.bin" and upload it via the status page: esp32s2.zip
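
The low voltage logging described above (lowest value, timestamp, clear button) can be sketched in a few lines of Python. This is only an illustration of the idea, not the firmware code; the class and method names are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple


class LowVoltageLog:
    """Remembers the lowest Vcc sample seen so far and when it occurred."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock  # injectable clock, so the behaviour can be tested
        self._lowest: Optional[Tuple[float, float]] = None  # (volts, timestamp)

    def record(self, vcc: float) -> None:
        # Only a new minimum replaces the stored value and its timestamp.
        if self._lowest is None or vcc < self._lowest[0]:
            self._lowest = (vcc, self._clock())

    def lowest(self) -> Optional[Tuple[float, float]]:
        return self._lowest

    def clear(self) -> None:
        # Corresponds to the "clear" button: start tracking from scratch.
        self._lowest = None


log = LowVoltageLog()
for sample in (3.27, 3.29, 2.94, 3.25):
    log.record(sample)
print(log.lowest()[0])  # 2.94
```

Keeping only the minimum and its timestamp costs almost no memory, and it is enough to tell whether a reboot was preceded by a voltage dip.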

ArnieO commented 1 year ago

@Miklagarur

My test shows that it will reboot once wifi is not available regardless of chosen settings.

If you have this option activated, the cause of your reboots is probably connection issues in your network: [image] In that case, you should not see "Reason: Vbat power on reset (1/0)" on the Info-page. Kindly check and confirm.

wknutsen commented 1 year ago

@gskjold

@wknutsen That plot was very interesting, thank you. It suggests that the WiFi modem is kept active more often in that period. Do you have anything on the network requesting data from (or pinging) the device?

I have a web page open 24/7 with AMS reader; other than that, nothing I know of. I think I have a 2.4 GHz WiFi adapter lying around that can be used to sniff traffic. When I have time I will set up a Raspberry Pi to do some sniffing and give you an update.

I am also wondering if there could be some types of multicast that could affect the device in this manner, but not sure what that could be yet.

I have a couple of Google TV and Google Home devices that probably generate some multicast traffic.

gskjold commented 1 year ago

Thanks for the quick feedback. It should survive having the webpage open over time, but I will look into whether there are any edge cases where it ends up requesting too often from the device.

ArnieO commented 1 year ago

UPDATE

Based on the various feedback we have received, the problem seems to be voltage stability on the internal 3.3V, generated by a buck converter that pulls power from the M-bus signal line.

We do not yet understand why and how; we're working on digging further into this.

Remedy strategy

  1. For those who are affected in a way that causes issues, the emergency solution will be to power the device from the USB connector (from a wall wart or a powerbank, depending on situation).
  2. One of the largest issues is for those of you who rely on the running energy estimate. Each reboot destroys that estimate. To remedy this, @gskjold will release a patch that stores the accumulated value to flash memory, so that the device wakes from reboot with a relatively correct estimate. However, not every estimate can be stored to flash memory, as flash tolerates only a limited number of write cycles. The implementation will therefore store the value for every e.g. 1 kWh of accumulated energy (or maybe a bit less). This means that the recovered value will potentially be a "bad estimate" - but not as bad as starting from zero. So... far from perfect, but better.
  3. We continue working on finding the root cause and fixing it.
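
The flash-friendly storage strategy in point 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patch itself: the `EnergyCheckpoint` name, the dictionary standing in for flash memory and the `energy_kwh` key are all hypothetical.

```python
class EnergyCheckpoint:
    """Accumulates energy in RAM and writes to 'flash' only once per step."""

    def __init__(self, store: dict, step_kwh: float = 1.0):
        self.store = store        # stand-in for flash memory (hypothetical)
        self.step_kwh = step_kwh  # how much energy may be lost on a reboot
        # After a reboot, resume from the last stored value instead of zero.
        self.total_kwh = store.get("energy_kwh", 0.0)
        self.writes = 0           # counts flash writes, to show the saving

    def add(self, delta_kwh: float) -> None:
        self.total_kwh += delta_kwh
        saved = self.store.get("energy_kwh", 0.0)
        # Write only when at least one full step has accumulated since the last write.
        if self.total_kwh - saved >= self.step_kwh:
            self.store["energy_kwh"] = self.total_kwh
            self.writes += 1


flash = {}
acc = EnergyCheckpoint(flash, step_kwh=1.0)
for _ in range(10):
    acc.add(0.25)             # ten updates of 0.25 kWh each

print(acc.total_kwh)          # 2.5
print(acc.writes)             # 2 writes instead of 10

# Simulated reboot: a new object restores the last checkpoint.
restored = EnergyCheckpoint(flash, step_kwh=1.0)
print(restored.total_kwh)     # 2.0, i.e. 0.5 kWh lost, less than one step
```

The step size is the trade-off: a smaller step gives a better estimate after a reboot but wears the flash faster.
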

Miklagarur commented 1 year ago

@Miklagarur

My test shows that it will reboot once wifi is not available regardless of chosen settings.

If you have this option activated, the cause of your reboots is probably connection issues in your network: [image] In that case, you should not see "Reason: Vbat power on reset (1/0)" on the Info-page. Kindly check and confirm.

Hi Arne, I can confirm that the device will reboot regardless if this option is checked or not, when I power down my wifi router. I have an Asus zenWIFI AX router with Smart Connect enabled. To me it seems like wifi is the culprit: once there are over 20 clients and a lot of wifi activity, the AMS unit goes into an endless reboot loop, or at least is no longer accessible. Here are my 2.4ghz settings, anything that should be changed perhaps? [image]

ArnieO commented 1 year ago

@Miklagarur Thank you for the interesting input! I know @gskjold is more competent than me on the more advanced WiFi settings, so I will let him respond.

gskjold commented 1 year ago

I can confirm that the device will reboot regardless if this option is checked or not, when I power down my wifi router.

Interesting indeed, currently trying to reproduce that.

Here are my 2.4ghz settings, anything that should be changed perhaps?

All this depends on what you use that network for:

gskjold commented 1 year ago

I can also confirm that it will reboot when losing wifi, for the simple reason that it burns too much power while searching for the AP. This could mean that this whole thing is related to some connection instability.

ArnieO commented 1 year ago

I can also confirm that it will reboot when losing wifi, for the simple reason that it burns too much power while searching for the AP. This could mean that this whole thing is related to some connection instability.

Veeery interesting observation, indeed!

gskjold commented 1 year ago

I have a firmware for you guys to test, this should be less power hungry when wifi is lost. Hopefully it is enough

esp32s2.zip

gskjold commented 1 year ago

By the way, if any of you have an FTDI adapter that fits the header on the board, debug output is available there if you enable serial debugging in settings and set it to INFO level. You will find the reason for the disconnect there.

wknutsen commented 1 year ago

I have a firmware for you guys to test, this should be less power hungry when wifi is lost. Hopefully it is enough

The price API doesn't work in this version, which makes it kind of annoying to test this version over time.

gskjold commented 1 year ago

Strange, are you supplying your own API key or not?

simenovrebo commented 1 year ago

I'm seeing this also on the price API, but I got it working by removing my API key and enabling without API key after reboot.

My lowest vcc by the way: 2.86

wknutsen commented 1 year ago

I'm seeing this also on the price API, but I got it working by removing my API key and enabling without API key after reboot.

Thank you. That solved it

gskjold commented 1 year ago

You are absolutely right, using ENTSO-E API key does not work with that version, weird... I have reverted a bump in espressif platform version I did earlier, which resolves the issue.

I have added a feature to preserve realtime data between software reboots. I also added a feature that disconnects WiFi for 30 s if power drops too low. I'm not entirely sure how this impacts operation of the device, so only test if you are confident you can re-flash with the official release version if this goes south.
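
The "disconnect WiFi for 30 s when power drops too low" behaviour can be pictured as a small state machine. The sketch below is a hedged Python illustration, not the firmware code: the class name is hypothetical, and the 2.95 V threshold in the example is an assumed placeholder, since the actual trigger level is not stated in this thread.

```python
from typing import Optional


class WifiBackoff:
    """Decides whether WiFi may be active, pausing it after a low-voltage sample."""

    def __init__(self, low_v: float, pause_s: float = 30.0):
        self.low_v = low_v      # trigger level in volts (placeholder, not from the firmware)
        self.pause_s = pause_s  # 30 s pause, as described in the comment above
        self.paused_until: Optional[float] = None

    def wifi_allowed(self, vcc: float, now: float) -> bool:
        # While a pause is running, WiFi stays off regardless of the voltage.
        if self.paused_until is not None:
            if now < self.paused_until:
                return False
            self.paused_until = None
        # A low sample starts a new pause, giving the supply time to recover.
        if vcc < self.low_v:
            self.paused_until = now + self.pause_s
            return False
        return True


guard = WifiBackoff(low_v=2.95)
print(guard.wifi_allowed(3.27, now=0.0))   # True: voltage is fine
print(guard.wifi_allowed(2.90, now=10.0))  # False: dip detected, pause until t=40
print(guard.wifi_allowed(3.27, now=20.0))  # False: still pausing
print(guard.wifi_allowed(3.27, now=40.0))  # True: pause over, voltage recovered
```

The point of the fixed pause is that the radio is the largest load, so switching it off for a while lets the supply voltage recover before the next connection attempt.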

esp32s2.zip

gskjold commented 1 year ago

I have been running this for the last few days in a Ubiquiti environment with channel optimization enabled. Earlier the Pow-U was rebooting every night during the channel optimization, but now it has survived the last few nights. Feel free to test this firmware and report back.

senspix commented 1 year ago

#637 is possibly a duplicate of this issue, although it is a POW-P1 device.

wknutsen commented 1 year ago

Update using the first firmware in thread with ENTSO-E api key issue. I have an all-time high with 7 days uptime. This looks very promising :)

image

ArnieO commented 1 year ago

Update using the first firmware in thread with ENTSO-E api key issue. I have an all-time high with 7 days uptime. This looks very promising :)

Are you able to make a plot of Vcc only, including y-axes? I am puzzled by the Vcc downstep caused by the firmware change, so would like to see which average value it is now at.

wknutsen commented 1 year ago

Are you able to make a plot of Vcc only, including y-axes? I am puzzled by the Vcc downstep caused by the firmware change, so would like to see which average value it is now at.

image

Firmware 2.2.21 image

New test firmware image

ArnieO commented 1 year ago

@wknutsen Thanks a lot, very interesting. The difference is not large, but it is a distinct difference in voltage level.

I'll run some tests with the same two firmware versions and see if I can recreate this.

enedberg commented 1 year ago

I am also having this issue

Pow-U was just rebooted because of a change in the settings ("Auto reboot on connection problem"), but it has been showing the "Vbat power on reset (1/0)" error. I have noticed wrong readings in HA, so I suspect this problem started earlier this summer.

[Screenshot 2023-09-21 at 08.37.49] [Screenshot 2023-09-21 at 08.38.11]

gskjold commented 1 year ago

I made some final adjustments that can be tested here: #641

gskjold commented 1 year ago

Quick summary from v2.2.22 testing: This issue will be fixed for most cases in this version. Disabling 802.11b will further reduce the chance for reboot. Some still have issues, which we will continue to monitor and investigate.

pomok commented 1 year ago

Observation before trying #641: Voltage displayed is around 3.24-3.29V; readings of about 3.27V seem to be the norm.

[AMS reader b9d6f16]
Up 6 days - the highest uptime for quite some time
Chip: esp32s2 (240MHz)
Device: [Pow-U+]
Last boot: 24.09.2023 03:41
Reason: Software reset (3/0)

Manufacturer: Kaifa
Model: MA105H2E

ArnieO commented 1 year ago

Observation before trying #641:

v2.2.22 is released now - so you can just update it directly.

My status: Pow-K+ on a Kamstrup. I am used to seeing several reboots per day, which I have struggled to understand. I logged the Pow-K Vcc on the board during operation (using a voltage logger) and found unexplained voltage drops lasting up to 10 seconds.

Now: Since installing v2.2.22 and disabling 802.11b legacy rates: Still no reboot after 20 hours. So this looks very promising!

senspix commented 1 year ago

More than 48 hours uptime so far! Looking good

ArnieO commented 1 year ago

My Pow-K too has not rebooted since the upgrade Friday afternoon. 😃

pomok commented 1 year ago

Have not upgraded to .22 yet. Old test version has gone stable on me...

[AMS reader b9d6f16]
Up 7 days
Chip: esp32s2 (240MHz)
Device: Pow-U+
Last boot: 24.09.2023 03:41
Reason: Software reset (3/0)

Manufacturer: Kaifa
Model: MA105H2E

bmork commented 1 year ago

I see that we already have Vcc plots here, but just to add another data point (running v2.2.22): [image: pow-u-plus-vcc]

Still missing the uptime plot, but the dips do of course correlate with

Last boot: 02.10.2023 04:06
Reason: Vbat power on reset (1/0)

The level changes puzzled me too. They happen on every reboot. My best guess is that there is some boot time calibration ending up with slightly different results here.

ArnieO commented 1 year ago

The level changes puzzled me too.

Thank you for the plot! There is a phenomenon we see that may be related to this. We see it when we connect the devices to a current measurement system (Nordic Power Profiler II) that shows us the power consumption dynamically, with great resolution.

The phenomenon is: Almost systematically, the device slightly changes "consumption profile" with each reboot, giving a distinctly different average consumption. This could indeed be the reason for the (small) voltage difference.

For the moment we do not understand this - and are in an "information collection mode" regarding this phenomenon. We see that it is HW-independent (we see the same on all our products).

If anyone in the community with hands-on experience has already seen something like that before - it would be great if you could share your experience with us!

The cause could in principle be "anything", from HW error in the ESP, to library error from Espressif, to error in Arduino implementation. Or some "obscure" setting that we have missed somewhere. We find it very peculiar that it "toggles" for each reset.

dbeinder commented 1 year ago

The ADCs in ESP32s are notoriously bad, I wouldn't be surprised to see their output change with every reboot. But do you really see the voltage level change when measured independently? What chip do you use as a voltage regulator? The average consumption change can't be more than 1 mA, surely? And ~10mV/mA load dependency seems way too high for any voltage regulator made this century ;)

Edit: For the next HW iteration, it may be worth it to simply allow the regulator to fill the supercap to 3.6V, the max voltage the ESP can handle:

E = C x (Umax^2 - Umin^2)/2
with 1F from 3.3V to 2.85V: 1.4J
with 1F from 3.6V to 2.85V: 2.4J (+75% energy reserve, likely +66% if the ESP draws the same current)

Though, the way to get the most out of the supercap is charging it to 5.5V and adding a second dc/dc:
5.5V to 2.85V: 11.1J (with an additional buck)
5.5V to 1.5V: 14.0J (with an additional buck/boost)
Of course you take a small efficiency hit but get 10x your current reservoir.

That was the original plan for my board, before I settled on making it Kaifa-only, which works with 150µF @ 27V (~0.05J)
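
For anyone who wants to check the arithmetic, the formula E = C x (Umax^2 - Umin^2)/2 can be evaluated with a few lines of Python. This is only a sketch: the function name and the case labels are made up here, and the voltage pairs are the ones discussed in this comment. Note that the factor 1/2 is included, as in the formula.

```python
def usable_energy_j(capacitance_f: float, u_max: float, u_min: float) -> float:
    """Energy in joules released by a capacitor discharging from u_max to u_min."""
    return 0.5 * capacitance_f * (u_max ** 2 - u_min ** 2)


# (capacitance in farad, upper voltage, lower voltage)
cases = {
    "1F, 3.3V -> 2.85V": (1.0, 3.3, 2.85),
    "1F, 3.6V -> 2.85V": (1.0, 3.6, 2.85),
    "1F, 5.5V -> 2.85V": (1.0, 5.5, 2.85),
    "1F, 5.5V -> 1.5V": (1.0, 5.5, 1.5),
}

baseline = usable_energy_j(*cases["1F, 3.3V -> 2.85V"])
for label, params in cases.items():
    energy = usable_energy_j(*params)
    # Print the energy and how it compares to the 3.3V baseline.
    print(f"{label}: {energy:.2f} J ({energy / baseline:.1f}x baseline)")
```

With these inputs the baseline comes out at about 1.4 J, charging to 3.6 V gives roughly 1.75x that, and the 5.5 V to 1.5 V case about 10x, which matches the ratios quoted above.
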

ArnieO commented 1 year ago

Hi @dbeinder, I appreciate your comments! The puzzling topic of different current draw modes deserves a dedicated discussion thread, so I started one now, please go to https://github.com/UtilitechAS/amsreader-firmware/discussions/648

I prefer to use a fixed-voltage LDO in these designs, and do not want to "stretch margins" by increasing operating voltage to 3.6V. So thank you for the idea - but ... no.

ArnieO commented 1 year ago

Still no reboot since I installed v2.2.22 and disabled 802.11b legacy rates (29th Sep). Pow-K, TP-link Deco M5 mesh network.

I will now enable 802.11b legacy rates to verify whether it is this setting that makes the difference.

wknutsen commented 1 year ago

A quick update using 2.2.22 with legacy rates disabled. I just had a vbat incident. I hope this was just an unlucky coincidence. Voltage seems more stable with this release. I have been away this weekend, and when I checked the status page yesterday the last reason was software reset. [image: fw2 2 22]

ArnieO commented 1 year ago

I will now enable 802.11b legacy rates to verify whether it is this setting that makes the difference.

My Pow-K rebooted with "Vbat power on reset (1/0)" only 1 hour after I enabled 802.11b legacy rates. So 802.11b legacy rates is definitely what has made my unit reboot frequently. "Disabled" it is, then.

simenovrebo commented 1 year ago

Seems like disabling 802.11b legacy rates did it for me too. I don't think I've seen 6 days uptime ever before on my Pow-U+. 😊

gskjold commented 1 year ago

It's good to hear that the issue is mostly resolved. For those who had reboot issues resolved by v2.2.22, could you please also check the attached firmware to confirm that this is still the case before I release v2.2.23?

esp32s2.zip esp8266.zip

Changes:

dbeinder commented 1 year ago

@gskjold Have you verified this works on ESP8266? When I looked into it, I figured the API you used does not disable 11b, since there are only 3 options: [image] The default has to be PHY_MODE_11N. But the default behavior is b/g/n support, so I figured PHY_MODE_11G is simply going to limit it to b/g.

https://www.espressif.com/sites/default/files/documentation/2c-esp8266_non_os_sdk_api_reference_en.pdf

Looking over it again, it should actually be doable using wifi_set_user_rate_limit and wifi_set_user_limit_rate_mask. As I read it, they don't allow you to get rid of 11b completely, but you can still disable the problematic 1 & 2 Mbps rates in 11b mode which should work just as well.

I really like my cheap Mikrotik AP to test these things, since you can configure almost any aspect, drop the WiFi signal 40dB by closing your fist around it and get realtime client stats: image

gskjold commented 1 year ago

I have had mixed results on this, but I found a few references where setting 11g resolved power and connection issues, so I decided to test it.

Looking over it again, it should actually be doable using wifi_set_user_rate_limit and wifi_set_user_limit_rate_mask

I was looking at the same thing earlier, but decided to test setting the PHY mode instead. I think you are right that we should be using this; I will run a new test a bit later today.

bmork commented 1 year ago

For those who had reboot issues resolved by v2.2.22, could you please also check the attached firmware to confirm that this is still the case before I release v2.2.23?

Not completely resolved for me, but definitely improved by v2.2.22. A rough guesstimate is that I have about twice the uptime I had before - now typically a day or two.

I just installed the version you attached. Will let you know if there are any significant changes.