emsesp / EMS-ESP32

ESP32 firmware to read and control EMS and Heatronic compatible equipment such as boilers, thermostats, solar modules, and heat pumps
https://emsesp.github.io/docs
GNU Lesser General Public License v3.0

Upgrade core to latest Arduino/Espressif core #473

Closed proddy closed 1 year ago

proddy commented 2 years ago

The core ESP32 arduino framework has been upgraded to v2.0.0 and PlatformIO will upgrade automatically to this unless we force it to stay on 3.5. Which is fine, but breaks a few things. At first glance, we need to modify the UART code and replace our LittleFS library with the core's version. While we're at it we could look at upgrading to the brand new NodeJS 18.0 and also migrating to ReactJS 18.

MichaelDvP commented 2 years ago

emsuart_esp32.h: insert #include "soc/uart_struct.h"
NTPSettingsService.h: insert #include <esp_sntp.h>
OneWire_direct_gpio.h: change all rtc_gpio_desc[pin] to rtc_io_desc[pin]
WebStatusService.cpp: change info.disconnected.reason to info.prov_fail_reason
WebAuthentication.cpp: change md5 calls to

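  mbedtls_md5_init(&_ctx);
  mbedtls_md5_starts_ret(&_ctx);            // begin a new MD5 computation
  mbedtls_md5_update_ret(&_ctx, data, len); // hash the input
  mbedtls_md5_finish_ret(&_ctx, _buf);      // write the 16-byte digest to _buf, not over the input
  mbedtls_md5_free(&_ctx);
  // previous calls:
  // mbedtls_md5_starts(&_ctx);
  // mbedtls_md5_update(&_ctx, data, len);
  // mbedtls_md5_finish(&_ctx, _buf);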

(see here) Compiles, but does not connect to WiFi ;-( It seems there are some changes in WiFi handling.

Edit: With a fixed address WiFi connects. Removing the WiFi.config(INADDR_NONE..) for DHCP results in a connection, but without getting a DHCP address, and the ESP is not reachable.

The UART seems to ignore the register settings and does not detect the breaks. I don't know what triggers the interrupt, but incoming telegrams have arbitrary length, starting somewhere in the middle of normal telegrams.

proddy commented 2 years ago

we should lock the arduino core version in the platformio.ini to prevent the builds from failing, and then create a branch with these changes which we can work on for the next major release. Still need to get the damn 3.4 out first!

MichaelDvP commented 2 years ago

I've made a branch with the first changes (LittleFS, Dallas, etc.). It compiles, but some things are not working, as mentioned. WiFi DHCP gets the right address, but emsesp is not reachable. With a fixed address it works. ETH also works with DHCP. The UART is very strange: it receives data, but the IRQ seems not to be called on break. I could not find the changes in Arduino or the IDF that could cause it. The IDF seems mainly unchanged; the IDF driver reads the FIFO only on timeout/buffer-full, and a break generates a message but does not read the FIFO into the buffer. I also have a changed LittleFS library with compatible names (LittleFS instead of LITTLEFS). Changing back to framework 3.5.0 only needs changing the ARDUINO_EVENTS back.

proddy commented 2 years ago

Thanks for making a start. I'll scout the web forums to see if anyone else is experiencing similar issues with the WiFi/DHCP and also look at some of the examples. It may just be the sequence in which it's initiated. As for the UART, that is going to take some more work. I'm wondering if we can now use C++20 instead of 17, which would offer some further code optimizations.

MichaelDvP commented 2 years ago

I've changed the UART to the IDF driver; for me it's working now. But my boiler accepts nearly any timing and all tx-modes. I have not checked the timing with a logic analyser. Please check on your boiler. The logic to make it work is bad: I have to set fifo-full to one byte to read every incoming byte with an IRQ, which copies it to the transfer buffer and generates the event; this is read out by the event task and copied to the telegram buffer. That is a lot of calls/copies to receive a single telegram. The driver rx buffer of 256 bytes seems too large for single-byte receive, but the driver crashes with a smaller buffer size. The ems-tx-mode now checks for a new queue entry, generated from the interrupt after receiving a byte.
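
The receive path described above (one event per byte, with the event task collecting bytes until a break) can be sketched without any hardware. This is only an illustration under assumptions, not the code in the branch: `RxEvent` and `TelegramAssembler` are hypothetical names, the length limit is arbitrary, and the ESP-IDF driver calls are left out on purpose.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical event, as it might be queued by the per-byte interrupt handler.
struct RxEvent {
    enum class Kind { Byte, Break } kind;
    uint8_t value; // only meaningful for Kind::Byte
};

// Collects bytes until a break arrives, then hands out the complete telegram.
class TelegramAssembler {
  public:
    explicit TelegramAssembler(std::size_t max_len)
        : max_len_(max_len) {
    }

    // Feed one event. Returns true and fills `out` only when a break
    // closes a non-empty telegram that did not exceed max_len.
    bool feed(const RxEvent & ev, std::vector<uint8_t> & out) {
        if (ev.kind == RxEvent::Kind::Byte) {
            if (buf_.size() < max_len_) {
                buf_.push_back(ev.value);
            } else {
                too_long_ = true; // discard this telegram at the next break
            }
            return false;
        }
        // Break: the telegram ends here, valid or not
        bool ok = !too_long_ && !buf_.empty();
        if (ok) {
            out = buf_;
        }
        buf_.clear();
        too_long_ = false;
        return ok;
    }

  private:
    std::size_t          max_len_;
    std::vector<uint8_t> buf_;
    bool                 too_long_ = false;
};
```

In a real build the events would come from the driver's queue; here they can be fed by hand, which also makes the number of copies per telegram easy to see.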

Funny side effect: WiFi DHCP now also works, without any change in the WiFi code. But the WiFi issue also affects other people, see here.

proddy commented 2 years ago

I'll check this weekend, I've been out on business these last 2 weeks. I did notice a new core version which may resolve some of the wifi issues https://github.com/espressif/arduino-esp32/releases/tag/2.0.2

MichaelDvP commented 2 years ago

I've tested everything with the E32, because the ETH connection works with the new Arduino core, but I have no Ethernet near the boiler and need WiFi to check the UART. The E32 now also has stable WiFi, but on an MH-ET I cannot connect in WiFi AP or STA mode; it fluctuates and disconnects after a few seconds. Same software as on the E32! I'm not sure Arduino 2.0.2 is in PlatformIO yet, but platform=develop has the same issue.

MichaelDvP commented 2 years ago

I've seen that here a different platform from tasmota is used. I have to change the OneWire as mentioned here. This OneWire works on all platforms. The tasmota platform gives much smaller filesize (~350kB less), but crashes on boot.

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:184
load:0x40078000,len:12596
load:0x40080400,len:2916
entry 0x400805c4

┌──────────────────────────────────────┐
│ EMS-ESP version 3.4.0b15rc1           │
│ https://github.com/emsesp/EMS-ESP32  │
│                                      │
│ type help to show available commands │
└──────────────────────────────────────┘
ems-esp:$ Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
Core  1 register dump:
PC      : 0x4012eafd  PS      : 0x00060533  A0      : 0x8010410b  A1      : 0x3ffb26e0
A2      : 0xffffffff  A3      : 0xffffff7e  A4      : 0x00000000  A5      : 0x3ffbdc6c
A6      : 0x00000020  A7      : 0x00000000  A8      : 0x00000000  A9      : 0x3ffb26a0
A10     : 0x3ffbdc70  A11     : 0x3ffc5c20  A12     : 0x3ffc5c24  A13     : 0x0000abab
A14     : 0x00060523  A15     : 0x00060520  SAR     : 0x00000013  EXCCAUSE: 0x0000001c
EXCVADDR: 0x000000b0  LBEG    : 0x4008a084  LEND    : 0x4008a09a  LCOUNT  : 0xffffffff

Backtrace:0x4012eafa:0x3ffb26e0 0x40104108:0x3ffb2700 0x400f6dcc:0x3ffb2720 0x40101f3a:0x3ffb2760 0x400f823a:0x3ffb27a0 0x400fa0eb:0x3ffb2800 0x4012c6c6:0x3ffb2820
ELF file SHA256: 0000000000000000
Rebooting...

But now, with the current develop platform, the MH-ET boots and connects with DHCP or a static address.

proddy commented 2 years ago

I just tried your latest branch with WiFi and haven't seen any issues yet. What would you like me to test/ try out? I have both ETH and WiFi here

MichaelDvP commented 2 years ago

Yes, WiFi also works for me with the development platform. ETH was always working. Is the UART in EMS mode working for you? EMS+ and HT3 have fixed timings, but EMS reads back the master echo, and this is now a bit different. On my boiler all modes and timings work, and I prefer the hardware mode.

proddy commented 2 years ago

looks ok, 43 minutes and only 6 failed Rx, using TxMode EMS with ETH. I would need to compare against the previous 3.4 but it looks solid enough.

image

MichaelDvP commented 2 years ago

Good. The few extra rx-fails are by design: the old UART code ignores the first telegrams after start and telegrams not ending with a break (zero). For this test I wanted to filter less to see what's coming in. We can add those filters again if we want to reduce the rx-fail count to bad-CRC telegrams only.
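
The two filters mentioned above could look roughly like this. It is a sketch under assumptions, not the old UART code: `RxFilter` and `skip_after_start` are hypothetical names, and it assumes the break shows up as a trailing zero byte in the raw buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Drops the first few telegrams after start and any telegram that does not
// end with a zero byte (assumed here to be how the break is stored).
class RxFilter {
  public:
    explicit RxFilter(std::size_t skip_after_start)
        : to_skip_(skip_after_start) {
    }

    // Returns true if the raw telegram should be passed on for CRC checking.
    bool accept(const std::vector<uint8_t> & raw) {
        if (to_skip_ > 0) {
            --to_skip_; // still in the start-up phase, ignore whatever came in
            return false;
        }
        return !raw.empty() && raw.back() == 0x00;
    }

  private:
    std::size_t to_skip_;
};
```

With such a filter in front of the CRC check, only telegrams that reach the check and fail it would count as rx-fails.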

proddy commented 2 years ago

I'm getting restarts every 1-3hrs though. Needs some more debugging... I'll leave it running and try to catch the reason code

MichaelDvP commented 2 years ago

I don't see restarts (11h uptime), but the UART buffer is not checked for overflow. I'll update.

My MH-ET shows ~25k less free heap, a memory leak? But it seems to be stable. I'll check ETH, i think the heap is in same range as before.

MichaelDvP commented 2 years ago

I've updated the uart and merged your latest dev.

With this and the dev I checked the free heap on different ESP32 boards. Filesize with the new IDF is lower. For the E32 (WiFi connected) free heap increases a bit, but the MH-ET/S32 have less free heap. The difference between MH-ET and S32 is because OTA was disabled on the S32 (I saw that later). Heap values are from the web system page after all EMS entities are detected. I think this is due to changes in the framework and nothing to worry about.

Framework 3.5: Filesize: 1758 kB
MH-ET: Heap: 190148 / 113792 (standalone, ems/mqtt not connected)
MH-ET: Heap: 174620 / 108114 (ems/mqtt connected)
E32: Heap: 135012 / 70864 (ems connected, wifi)
S32: Heap: 195872 / 113792 (standalone, ems/mqtt not connected)

Framework 4.4: Filesize: 1723 kB
MH-ET: Heap: 172148 / 110580 (standalone, ems/mqtt not connected)
MH-ET: Heap: 155536 / 102388 (ems/mqtt connected)
E32: Heap: 139712 / 90100 (ems connected, wifi)
S32: Heap: 176556 / 110580 (standalone, ems/mqtt not connected)

proddy commented 2 years ago

I'm running your dev build now and will report back in a few hours.

proddy commented 2 years ago

Just had the first restart after 7hrs

MichaelDvP commented 2 years ago

Sad. Any useful reset reason information? I have 9h uptime for the MH-ET and 10h for the E32. I'll switch the E32 to tx-mode 1 now and leave the other on tx-mode 4.

proddy commented 2 years ago

no error, just "Last system reset reason Core0: Software reset CPU, Core1: Software reset CPU". Free Mem is constant around 170K and not falling. This is with TxMode 1

MichaelDvP commented 2 years ago

My E32 now has 30h uptime, 10h with tx-mode 4 and 20h with tx-mode 1: no tx-errors (~19,000 reads), 19 rx-fails within ~194,000 receives (0.01%). I cannot reproduce the reboots. (BTW: SDK is shown as v4.4-beta1-189-ga79dc75f0a, do you have the same?)
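
For comparing runs, the fail rate is simply fails divided by total receives. A small helper makes the figure above easy to reproduce; `fail_percent` is a hypothetical name used only for this sketch.

```cpp
#include <cstdint>

// Percentage of failed transfers; returns 0 when nothing has been counted yet.
double fail_percent(uint32_t fails, uint32_t total) {
    if (total == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(fails) / static_cast<double>(total);
}
```

With the approximate numbers quoted above, `fail_percent(19, 194000)` comes out just under 0.0098, which rounds to the stated 0.01%.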

proddy commented 2 years ago

SDK is the same. I'm running it again, if it crashes I'll start turning off the services (NTP, MQTT, AP)

MichaelDvP commented 2 years ago

I think it's the UART buffer. Yesterday I added a buffer check, but forgot to read out the rx buffer, so after an overflow the UART only throws garbage. This happened after ~3-5h. Try again with the current code.
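
The bug described above (overflow detected, but the stale bytes never drained) can be shown with a small fixed-size buffer. This is a hardware-independent sketch with hypothetical names (`RxBuffer`, `overflows`), not the driver code from the branch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Fixed-size receive buffer that empties itself when it overflows.
template <std::size_t N>
class RxBuffer {
  public:
    // Returns false if the byte did not fit. In that case the buffer is
    // flushed so the next telegram starts clean instead of being mixed
    // with stale data.
    bool push(uint8_t b) {
        if (len_ >= N) {
            flush();
            ++overflows_;
            return false;
        }
        data_[len_++] = b;
        return true;
    }

    void flush() {
        len_ = 0;
    }

    std::size_t size() const {
        return len_;
    }

    std::size_t overflows() const {
        return overflows_;
    }

  private:
    std::array<uint8_t, N> data_{};
    std::size_t            len_       = 0;
    std::size_t            overflows_ = 0;
};
```

Detecting the overflow without the `flush()` call would leave the buffer full forever, which matches the "only throws garbage" behaviour reported here.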

proddy commented 2 years ago

testing now...

proddy commented 2 years ago

looking good, no glitches in 6hrs...

image

proddy commented 2 years ago

it's been running now for 24hrs without any crashes. Rx 82825/26 fail and Tx Read is 19138/16. Which is good enough. I should compare against the 3.4b to see if those Tx Read failures are normal

MichaelDvP commented 2 years ago

I've updated the idf4 branch to the latest dev and changed the UART code for rx and tx-mode 1. I have slightly fewer rx-fails and no tx-fails. Please check.

proddy commented 2 years ago

impressive. been running for 1hr+ with 0 fails

proddy commented 2 years ago

After 20hrs only 14 failed Tx and 12 failed Tx Reads, at 100% quality for both. The Rx fail count is about half of what it was in the previous idf4 dev release. So all good.

MichaelDvP commented 2 years ago

Have you logged the Tx fails?

proddy commented 2 years ago

I'll do some tracing over the weekend to see why the Tx errors are high. On v3.4 I'm getting 0 at the moment:

image

proddy commented 2 years ago

react18 was upgraded in 3.4b18

proddy commented 2 years ago

Let's get 3.4.1 out with the latest fixes and make 3.4.2 based on espressif arduino v2

MichaelDvP commented 2 years ago

ok.

proddy commented 2 years ago

With the latest 3.4.2b I'm still seeing UART errors on both Rx/Tx. With the previous espressif 3.5 which I had running for 12 days I had zero fails. I'll do some tracing and debugging.

image

MichaelDvP commented 2 years ago

Have you checked what kind of Tx errors these are? Are they random, or is there any time or telegram pattern? I cannot reproduce it; my EMS master is not timing-critical and any tx-mode works (I mostly use tx-mode 4). I'm curious about feedback from EMS+ and HT3 users.

image

BTW: I've added a syslog count/fail. Is this useful, should I add it to dev?

proddy commented 2 years ago

It's hard to find the Tx errors without adding some extra debug code. They happen randomly every few hours and are difficult to reproduce and capture without flooding the logs with raw telegrams.

proddy commented 2 years ago

syslog is good to add, although it'll show a lot of messages depending on the Level

MichaelDvP commented 2 years ago

You should see the tx telegram (to_string) in error-log-level with log-time: https://github.com/emsesp/EMS-ESP32/blob/794b3c04712ccdb0e278e4118670557fb5edda42/src/telegram.cpp#L596-L599
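
To make a failed Tx telegram readable in a single error-level log line, the raw bytes can be rendered as hex. The helper below is a hypothetical sketch for illustration; it is not the project's `to_string` from the linked telegram.cpp, and the name `hex_dump` is made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats raw telegram bytes as space-separated, two-digit uppercase hex.
std::string hex_dump(const uint8_t * data, std::size_t len) {
    std::string out;
    char        buf[4];
    for (std::size_t i = 0; i < len; i++) {
        std::snprintf(buf, sizeof(buf), "%02X", data[i]);
        if (i > 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Logging only the failed telegram this way, at error level, avoids flooding the log with every raw telegram.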

proddy commented 2 years ago

I'm more concerned about the Rx fails, it's one every 50mins. That's on txmode=1. I'll try 4 (hardware) now

proddy commented 2 years ago

No Tx errors with TxMode 4 (Hardware) after 1d16h on latest dev build. Rx has 22 fails from 134,883 which isn't bad. Still not as solid as 3.4.1 but close.

proddy commented 1 year ago

all done. works fine