CircuitSetup / Expandable-6-Channel-ESP32-Energy-Meter

Hardware & Software documentation for the CircuitSetup Expandable 6 Channel ESP32 Energy Meter. Works with ESPHome and Home Assistant.
https://circuitsetup.us/product/expandable-6-channel-esp32-energy-meter/
MIT License

Feature requests: EmonESP #5

Closed: presslab-us closed this 4 years ago

presslab-us commented 4 years ago

I'm not sure of the best place to document this, so I put it here.

I'd like to request a few features for the EmonESP software. I have been using the ESPHome software because I require the configuration options but I have these issues:

Mainly I would like more flexibility in the configuration of EmonESP:

Thanks for reading!

presslab-us commented 4 years ago

I have implemented these features here: https://github.com/presslab-us/Expandable-6-Channel-ESP32-Energy-Meter/tree/config

CircuitSetup commented 4 years ago

@presslab-us thanks so much for doing this. If you want, you can put together a pull request so I can merge the changes/features.

presslab-us commented 4 years ago

Thanks. Yes I can do a pull request once I get some time on it. I have this new code installed on two units. One of them has been working fine and the other crashes after some time. I'd like to figure out this issue first.

CircuitSetup commented 4 years ago

Hmm, is the ESP32 different between the two, or is one using MQTT while the other is not? You may want to monitor the available memory as well.

presslab-us commented 4 years ago

They both connect to Emoncms and MQTT. The one that is failing has 3 add-on boards, the working one only has the main board. The free memory is pretty much the same on both and does not look to decrease over time.

I think I saw this same problem with ESPHome, as sometimes the values would not update for a bit. Perhaps the difference with EmonESP is that it does not enable the watchdog timer. ESPHome would just mask the problem by rebooting. I'm not sure of the validity of that theory yet.

presslab-us commented 4 years ago

Before going to bed I plugged in a USB power adapter (in combination with the AC wall transformer). It was still working this morning. I then unplugged the USB power adapter and it crashed within 20 minutes.

I will keep looking at it to see what I can figure out.

CircuitSetup commented 4 years ago

To rule out some things, are you using the latest versions of the async libraries to compile? There was a bug that was recently fixed that would cause connections to not be properly closed. This would happen after visiting the web interface or if it was left open.

Also, do you have the latest esp32-arduino?

CircuitSetup commented 4 years ago

Also, thanks so much for refactoring all of the redundant code! My approach that lacked arrays was a bit rushed.

presslab-us commented 4 years ago

My AsyncTCP library is version 1.1.1, the latest from GitHub. My esp32-arduino is version 1.0.4, which looks like the latest released version. It came from here: https://dl.espressif.com/dl/package_esp32_index.json

I have made a few changes to the power supply. I am going to run it overnight to see how it goes.

presslab-us commented 4 years ago

It didn't lock up last night, so that is good. But the psent value is resetting to zero; I assume that's because the unit has reset. I see there is something in the code that will reset the unit after some number of communication errors. I will disable this and continue testing. https://github.com/CircuitSetup/Expandable-6-Channel-ESP32-Energy-Meter/blob/894a906c3a43a59bbede0b8fa6303a8b46e84f99/Software/EmonESP/src_6chan/emoncms.cpp#L89

[Image: plot of the psent value over time]

CircuitSetup commented 4 years ago

Hmm, that line will only execute if it fails to send data to emoncms >30 times while connected to wifi. Maybe it is connected to wifi but is being blocked, or for some reason emoncms is becoming unavailable to the ESP32. Is there any reason that would happen? A firewall?
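A minimal sketch of the kind of guard described above, written as plain C++ so it can be compiled off-device. The struct and member names are hypothetical and not taken from emoncms.cpp; the limit of 30 comes from the comment above, and resetting the count on a successful post is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper: counts failed posts while Wi-Fi is up and reports
// when a restart would be due. Names are illustrative only.
struct PostFailureGuard {
    static constexpr uint32_t kMaxFailures = 30;  // ">30 times" per the thread
    uint32_t failures = 0;

    // Returns true when the caller should restart the device.
    bool record(bool wifi_connected, bool post_ok) {
        if (post_ok) {
            failures = 0;      // assumption: any success clears the streak
            return false;
        }
        if (!wifi_connected) {
            return false;      // failures while offline are not counted
        }
        ++failures;
        return failures > kMaxFailures;
    }
};
```

On the device, a true return value is where a call such as ESP.restart() would go. Logging the count before restarting would make it easier to tell these deliberate resets apart from crashes.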

presslab-us commented 4 years ago

With that line disabled it's not rebooted (or locked up) once so far, but we'll see. It's currently at Successful messages: 13543/13584 99.69817432273263%. I don't know if all the lost ones happened at once or what. There is no firewall between the units and the server.
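For reference, the arithmetic behind that status line can be sketched as below. The function names are hypothetical; the point is that 13543 of 13584 means 41 lost messages, which is more than the 30-failure threshold would need if they happened in one run.

```cpp
#include <cstdint>

// Percentage of successful posts; returns 0 when nothing has been sent yet.
double success_percent(uint32_t ok, uint32_t total) {
    if (total == 0) return 0.0;
    return 100.0 * static_cast<double>(ok) / static_cast<double>(total);
}

// Number of posts that failed (assumes ok <= total).
uint32_t lost_messages(uint32_t ok, uint32_t total) {
    return total - ok;
}
```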

Honestly I don't mind so much if that causes a reboot. I just want to be sure the locking up is fixed, and that the reboots aren't related to the locking up problem, or a software bug.

I believe the locking up problem was caused by insufficient capacitance on the supply input. Would you like me to create a separate issue for that?

CircuitSetup commented 4 years ago

I'll have to take a closer look at that line, and why it may be executing.

Can you take note of whether you visited the web interface after a reboot, and whether that correlates with it locking up again?

Also, there may be something I haven't discovered yet with the async libraries and the MQTT library. Previously there was a bug where it would block everything for 14 min if it lost a connection to MQTT. That was fixed, but it could still block wifi functionality for 5 seconds at a time if MQTT is lost. It should really be changed to an async MQTT library to prevent this.

Yes, put anything you found with capacitance in a new issue, since that's hardware related.

presslab-us commented 4 years ago

In my plot above, psent went to zero when the unit reset. You can see it had rebooted a couple of times during the night, and I did not access the web page between reboots (I was sleeping).

I have not seen any lock ups since modifying the power supply, only reboots now.

It sounds like reboots might have been a problem before my modifications? Do you think what I am seeing may not be related to the changes I have made? If so maybe another issue should be created for the reboots.

presslab-us commented 4 years ago

I have captured a serial log when the reboot happened. Looks like it is crashing in get_http(). I've put in additional logging to try and narrow it down.

.
Plain old HTTP
Guru Meditation Error: Core  1 panic'ed (InstrFetchProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0xc000610e  PS      : 0x00060d30  A0      : 0x800e6e3a  A1      : 0x3ffb1dd0
A2      : 0x3ffc1c48  A3      : 0xc000610e  A4      : 0x000004c0  A5      : 0x0e6350da
A6      : 0x07fd6bdc  A7      : 0x07fd6bdc  A8      : 0x8018617a  A9      : 0x3ffd9f2b
A10     : 0x3ffd7d9c  A11     : 0x0e6350da  A12     : 0x3ffc2758  A13     : 0x0e6350da
A14     : 0x3ffb88d0  A15     : 0x3ffb88d0  SAR     : 0x00000008  EXCCAUSE: 0x00000014
EXCVADDR: 0xc000610c  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000

Backtrace: 0x4000610e:0x3ffb1dd0 0x400e6e37:0x3ffb1df0 0x400e6ea9:0x3ffb1e10 0x400e7ff1:0x3ffb1e30 0x400d4141:0x3ffb1e70 0x400d34a5:0x3ffb1ed0 0x400d4b86:0x3ffb1f60 0x400ea895:0x3ffb1fb0 0x40088$

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x17 (SPI_FAST_FLASH_BOOT)

CircuitSetup commented 4 years ago

Hmm, I'm not sure what would be causing a panic related to get_http(). Definitely let me know what you find.

Previously I had assumed that the memory leaks were causing all the issues, but they may have only been a part of it.

presslab-us commented 4 years ago

I'm getting closer as I narrow it down. Right now I suspect it is an issue of memory fragmentation with the use of String.

CircuitSetup commented 4 years ago

I see the recent changes you made moving things from the heap to static memory. Good idea. I was going to say it looks like it's running out of heap in the energy_meter_loop() in energy_meter.cpp and/or mqtt_publish() in mqtt.cpp.

presslab-us commented 4 years ago

Well, yes I think the static memory is better too. But it still crashed in the same place. :) I was able to decode the stack and see that it is crashing somewhere inside http.begin(). Possibly it has some problem when it tries to reuse the connection. I have turned on more debug logging, so we'll see. It crashes only a few times a day so it makes debugging lengthy...

presslab-us commented 4 years ago

I was able to get GDB working, and I can see the crash is here:

Remote debugging using /dev/ttyUSB0
0x40189394 in HTTPClient::connected (this=0x3ffc3bd0 <http>)
    at /home/rpress/Arduino/hardware/espressif/esp32/libraries/HTTPClient/src/HTTPClient.cpp:395
395         return ((_client->available() > 0) || _client->connected());
(gdb) bt
#0  0x40189394 in HTTPClient::connected (this=0x3ffc3bd0 <http>)
    at /home/rpress/Arduino/hardware/espressif/esp32/libraries/HTTPClient/src/HTTPClient.cpp:395
#1  0x400e6b6a in HTTPClient::disconnect (this=0x3ffc3bd0 <http>, preserveClient=false)
    at /home/rpress/Arduino/hardware/espressif/esp32/libraries/HTTPClient/src/HTTPClient.cpp:359
#2  0x400e6bdc in HTTPClient::end (this=0x3ffc3bd0 <http>)
    at /home/rpress/Arduino/hardware/espressif/esp32/libraries/HTTPClient/src/HTTPClient.cpp:347
#3  0x400e7d50 in HTTPClient::begin (this=0x3ffc3bd0 <http>, url=...)
    at /home/rpress/Arduino/hardware/espressif/esp32/libraries/HTTPClient/src/HTTPClient.cpp:216
#4  0x400d4504 in get_http (host=<optimized out>, 
    url=0x3ffc19f0 <url> "/emoncms/input/post.json?json={temp:28.0,freq:60.02,V1:123.38,V2:123.17,CT1:3.3000,PF1:0.802,W1:328.38,VA1:408.68,CT2:0.2700,PF2:0.546,W2:18.30,VA2:33.53,CT3:0.3150,PF3:0.969,W3:75.34,VA3:77.76,CT4:-0"...) at /tmp/arduino_build_301065/sketch/http.cpp:98
#5  0x400d38dc in emoncms_publish (data=<optimized out>) at /tmp/arduino_build_301065/sketch/emoncms.cpp:63
#6  0x400d4bc9 in loop () at /home/rpress/Expandable-6-Channel-ESP32-Energy-Meter/Software/EmonESP/src_6chan/src_6chan.ino:122
#7  0x400ea500 in loopTask (pvParameters=<optimized out>) at /home/rpress/Arduino/hardware/espressif/esp32/cores/esp32/main.cpp:19
#8  0x4008df21 in vPortTaskWrapper (pxCode=0x400ea4e8 <loopTask(void*)>, pvParameters=0x0)
    at /home/rpress/esp/esp-idf/components/freertos/port.c:143

However it turns out that using HTTPClient like this is deprecated. I've changed the code to use WiFiClient. Now I need to wait a day to see if this crashes.

CircuitSetup commented 4 years ago

Thanks for the update! That's interesting! I didn't realize that use of HTTPClient was deprecated. Any idea why that would cause a crash here but not in the meter with fewer current channels? Maybe because of the amount of data being transferred?

presslab-us commented 4 years ago

The way that begin() was being called was deprecated, and this seemed to contribute to the crash. The code in the backtrace would only be reached if there was some conflict running in the deprecated mode. Not all of HTTPClient was deprecated; I could have just used the new begin() call (which actually uses WiFiClient). But to keep the code common with the HTTPS code I just switched everything to WiFiClient.
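As a rough illustration of what sending the request without HTTPClient involves, here is a sketch that only builds the request text. It is not the code from the branch; the function name is hypothetical and the Connection: close header is an assumption (one request per connection). On the ESP32 the resulting string would be written to a connected WiFiClient, which is left out so the sketch compiles on a host.

```cpp
#include <string>

// Builds the text of a minimal HTTP/1.1 GET request for the given host and
// path (path includes the query string, e.g. the emoncms input URL).
std::string build_get_request(const std::string& host, const std::string& path) {
    std::string req;
    req += "GET " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Connection: close\r\n";  // assumption: no connection reuse
    req += "\r\n";                   // blank line ends the header section
    return req;
}
```

Not reusing the connection avoids the end()/disconnect() path seen in the backtrace, at the cost of a new TCP connection per post.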

I don't know what the exact problem with deprecated HTTPClient is, but it was always crashing in that same place. I suspect if I let the 6 channel meter run for longer it too would eventually crash.

The 24 channel meter is still working (with the updated code), over 22 hours so far. Successful messages: 79867/79868 99.99874793409124%

Likely the only reason it dropped a packet is because I restarted the apache2 service running emoncms...

presslab-us commented 4 years ago

There have been no crashes at all using WiFiClient instead of HTTPClient.

I've been working on the task watchdog timer. Since I have enabled it, I have found numerous places where the default timeouts violate the 5 second watchdog. This 5 second timeout is in the IDF sdkconfig, so it's pretty hard to change especially when using the Arduino toolchain.

It's obvious that most of the Arduino libraries are not designed to work with a watchdog. But I feel a watchdog is really a necessity with any embedded product, and also if the watchdog were enabled I feel the unit would not have locked up with the power glitch. Not that a watchdog is a proper solution, but rebooting is better than locking up.
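One way to keep a long wait compatible with a 5 second watchdog is to split it into shorter slices and feed the watchdog between them. The sketch below is an assumption about how that could look, not the code from the branch: the function name and the callback parameters are hypothetical. On the ESP32, feed would typically wrap esp_task_wdt_reset() and sleep_ms would wrap delay().

```cpp
#include <cstdint>
#include <functional>

// Polls `ready` for at most `timeout_ms`, sleeping in slices of at most
// `slice_ms` and calling `feed` after each slice so that no single stretch
// of waiting exceeds the watchdog period.
bool wait_with_watchdog(const std::function<bool()>& ready,
                        const std::function<void()>& feed,
                        const std::function<void(uint32_t)>& sleep_ms,
                        uint32_t timeout_ms,
                        uint32_t slice_ms) {
    if (slice_ms == 0) slice_ms = 1;  // avoid a zero-length slice looping forever
    uint32_t waited = 0;
    while (true) {
        if (ready()) return true;
        if (waited >= timeout_ms) return false;
        uint32_t step = timeout_ms - waited;
        if (step > slice_ms) step = slice_ms;
        sleep_ms(step);
        waited += step;
        feed();
    }
}
```

This only helps for waits the application controls. A timeout buried inside a library call still blocks for its full length.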

CircuitSetup commented 4 years ago

That's great! Thanks so much for testing and figuring this out. I never would have seen that WiFiClient should have been used there.

What have you found that is triggering the watchdog? It definitely makes sense to use, but I also want to prevent it from getting triggered when something else can be changed or fall back to another state.

If you feel like this is stable enough, create a PR and I'll merge it in.

CircuitSetup commented 4 years ago

Never mind, I see you made some changes to the HTTP timeout and MQTT timeout/error handling.

Btw, I was looking into changing the MQTT library to something that supports async. Otherwise, if it can't connect to MQTT, everything else is blocked while it tries to reconnect.
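Short of moving to an async library, the blocking can at least be limited by attempting a reconnect only once per interval instead of retrying in a loop. A minimal sketch, assuming a millisecond clock such as millis(); the class name and interval are hypothetical, and a single connect call would still block for its own timeout.

```cpp
#include <cstdint>

// Decides when a reconnect attempt is allowed, so the main loop never sits
// inside a retry loop. Times are in milliseconds.
class ReconnectThrottle {
public:
    explicit ReconnectThrottle(uint32_t interval_ms) : interval_ms_(interval_ms) {}

    // Returns true if the caller should attempt a reconnect now.
    bool should_attempt(uint32_t now_ms) {
        // Unsigned subtraction stays correct when the clock wraps around.
        if (attempted_ && (now_ms - last_attempt_ms_) < interval_ms_) {
            return false;
        }
        attempted_ = true;
        last_attempt_ms_ = now_ms;
        return true;
    }

private:
    uint32_t interval_ms_;
    uint32_t last_attempt_ms_ = 0;
    bool attempted_ = false;
};
```

In the main loop this would gate the reconnect call while the rest of the loop (meter reads, HTTP posts) keeps running between attempts.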

presslab-us commented 4 years ago

It seems stable so far, I was thinking to let it run for a week to see how it goes.

There are still some watchdog "boobytraps". For example, WiFiClient (which is used by both the HTTP GET and MQTT) has a transmit timeout of 10 seconds, so it will trip the watchdog if it happens to connect but later cannot transmit. This timeout can be changed by editing WiFiClient.cpp, but that's hacky; there really should be a method to configure this. For the time being, I guess a reboot due to a timeout is better than a potential lock up.

Async MQTT would be good. I've been playing around with another branch that encodes the MQTT as JSON for sending to Telegraf, InfluxDB, and Grafana. It might make more sense to send directly to InfluxDB, not sure.

CircuitSetup commented 4 years ago

Thanks for explaining - that makes sense. It's strange that there isn't a function to change the watchdog timeout.

The method for sending to InfluxDB and others probably depends more on how the user has things set up. I've seen a couple of users modify the existing code or write their own to send directly to MySQL too. It may actually be better as a separate option from MQTT in the config.

presslab-us commented 4 years ago

I tried changing the timeout to no avail. I read about, and can see, the hard coded timeout in the sdkconfig file. My guess is that it's hard coded so some runaway code can't accidentally change it while running.

I needed to do some preprocessing of the sensors (summing phases) before sending to InfluxDB. While I did get math operations to work in InfluxDB, it slowed it to a crawl. I have a small Python script that does this instead, so MQTT is actually better for that.

That's one area Iotawatt does pretty well.

CircuitSetup commented 4 years ago

This was all merged in PR #10. Thanks for contributing - awesome work!