esphome / issues

Issue Tracker for ESPHome
https://esphome.io/
293 stars 36 forks source link

Regular reboot of ESPHome device #2299

Closed jason-but closed 3 years ago

jason-but commented 3 years ago

The problem

I have two ESP Devices, one Shelly1 and a small ESP8266 board with a relay on it. Both are configured with a fully functional yaml file but also regularly experience crashes and reboots approximately every 20-25 minutes

Which version of ESPHome has the issue?

1.20.4 (also 1.19.x)

What type of installation are you using?

Home Assistant Add-on

What platform are you using?

ESP8266

Board

Shelly1 and Geekcreit® ESP8266 Development Board

Component causing the issue

Unknown

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Stack trace:

Exception 29: StoreProhibited: A store referenced a page mapped with an attribute that does not permit stores
PC: 0x4000df64
EXCVADDR: 0x00000000

Decoding stack results
0x4022c265: do_memp_malloc_pool at core/memp.c line 255
0x4022c20d: ethernet_output at netif/ethernet.c line 312
0x4022c207: ethernet_output at netif/ethernet.c line 312
0x402295f6: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x40225455: _printf_i at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c line 194
0x402295f6: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x402295f6: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x40225580: _printf_i at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c line 246
0x402295f6: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x4022952c: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 180
0x402295f6: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x40233970: mem_malloc at core/mem.c line 210
0x4022b2dc: std::string::_Rep::_S_create(unsigned int, unsigned int, std::allocator  const&) at /workdir/arena.x86_64/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.tcc line 621
0x4022f182: pbuf_alloc_LWIP2 at core/pbuf.c line 284
0x402301d8: tcp_write at core/tcp_out.c line 263
0x4022b2f4: std::string::_Rep::_M_destroy(std::allocator  const&) at /workdir/arena.x86_64/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.tcc line 452
0x4022b10d: operator delete(void*) at /workdir/repo/gcc/libstdc++-v3/libsupc++/del_op.cc line 48
0x4022b2f4: std::string::_Rep::_M_destroy(std::allocator  const&) at /workdir/arena.x86_64/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.tcc line 452
0x4022b316: std::string::_Rep::_M_dispose(std::allocator  const&) at /workdir/arena.x86_64/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.h line 254
0x4022b332: std::basic_string  , std::allocator  >::~basic_string() at /workdir/arena.x86_64/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.h line 541

Additional information

An alternate stack trace implied a crash in ip4.c and eventually in utoa.c

The trace is not from the Shelly, that is in the roof. I have previously had my own code running on this ESP8266 running non-trivial tasks for months with no crash or similar error

jason-but commented 3 years ago

This is the board:

https://au.banggood.com/Geekcreit-ESP8266-Development-Board-WIFI-Relay-Module-220V-10A-Relay-p-1436275.html

oxan commented 3 years ago

Can you share your YAML?

jason-but commented 3 years ago

When pasting yaml code with comments, it screws up the formatting, how to make this easier?

jason-but commented 3 years ago

While I can't figure out how to add the code here nicely, I have created a pastebin

oxan commented 3 years ago

Is there anything about a failed allocation below the stack dump in the serial logs? I suspect this is an out of memory issue.

jason-but commented 3 years ago

So the actual serial logs don't say much

`[0;36m[D][api:067]: Disconnecting Home Assistant 2021.8.3 (192.168.128.6) [D][api:067]: Disconnecting Home Assistant 2021.8.3 (192.168.128.6) [D][api:067]: Disconnecting Home Assistant 2021.8.3 (192.168.128.6)

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Exception (29): epc1=0x40219eba epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000008 depc=0x00000000

stack>>>

ctx: sys sp: 3fffe8e0 end: 3fffffb0 offset: 0190 3fffea70: 3fffeaa0 76726573 3fffeba2 00000000
3fffea80: 3fffedc0 3fffede8 00000000 4021ec30

.........

3fffff90: 3fffdad0 00000000 3fff0edc 4020f4e8
3fffffa0: 3fffdad0 00000000 3fff0edc 4021f0c4
<<<stack<<<

last failed alloc call: 4021B18E(544)

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

ets Jan 8 2013,rst cause:1, boot mode:(3,7)

load 0x4010f000, len 3584, room 16 tail 0 chksum 0xb0 csum 0xb0 v2843a5ac ~ld [I][logger:170]: Log initialized [C][ota:378]: There have been 0 suspected unsuccessful boot attempts. [I][app:029]: Running through setup()...

`

jason-but commented 3 years ago

But that was only on the last crash, the previous two had nothing logged after the

<<<stack<<<

line

jason-but commented 3 years ago

And more recently

last failed alloc call: 4021B14E(540)

oxan commented 3 years ago

Yeah, you're running out of memory. v1.21 will improve this (the last failed alloc call sometimes is missing, even when the issue is still out-of-memory).

jason-but commented 3 years ago

OK, thanks for debugging that down

I only recently switched to HomeAssistant+ESPHome. Prior to that I developed my own code and used Blynk. But their new model drove me away. When I discovered ESPHome I thought it was awesome.

I moved from my code primarily for the easy management and updating of existing devices. My code had http flash update built in as well but required manual mangement from my end

In some ways it is a little disappointing that this runs out of memory. My old code did much the same thing plus Blynk protocol support, running a web site with web based configuration/update, plus logging back to Blynk and it never ran out of memory. I wonder what is chewing it up?

I tried disabling logging but it did not help, still rebooting, only now I have no logs

I presume the same is happening on the Shelly 1?

My yaml for that is the Basic Configuration from (https://www.esphome-devices.com/devices/Shelly-1)

I can't see it being more simple/basic than that yet still regular crashes :-(

jason-but commented 3 years ago

Thinking a little about this. The fact that the devices manage to run for 20-30 minutes before failing appears to indicate a memory leak type situation rather than just running close to the edge

jason-but commented 3 years ago

Mistakenly closed due to pushing wrong button

jason-but commented 3 years ago

@oxan Were you ever able to find out what specifically was causing the system to run out of memory (or memory leak) or is it just a generic leak?

Also, is there something I could add to the YAML that would cause the current memory usage to get output to the log periodically so I could use this approach to see if the Shelly is suffering the same problems?

oxan commented 3 years ago

No, I haven't been able to look into detail at this yet.

You can try e.g. this:

interval:
  - interval: 1min
    then:
      - lambda: |-
          ESP_LOGD("heap", "Free heap: %d bytes", ESP.getFreeHeap());
jason-but commented 3 years ago

Thanks

Here is the logging with that line and default log level (Shelly 1 with basic YAML from esphome site). You can see a slow drop in available memory, speeding up over time. I am having trouble uploading a firmware with a higher debug setting at present

[19:31:40][D][api.connection:730]: Client 'Home Assistant 2021.8.7 (192.168.128.6)' connected successfully! [19:32:05][D][heap:042]: Free heap: 39120 bytes [19:33:05][D][heap:042]: Free heap: 39256 bytes [19:34:05][D][heap:042]: Free heap: 39232 bytes [19:35:05][D][heap:042]: Free heap: 39184 bytes [19:36:05][D][heap:042]: Free heap: 38776 bytes [19:36:29][I][ota:046]: Boot seems successful, resetting boot loop counter. [19:37:05][D][heap:042]: Free heap: 39000 bytes [19:38:05][D][heap:042]: Free heap: 38944 bytes [19:39:05][D][heap:042]: Free heap: 38744 bytes [19:40:05][D][heap:042]: Free heap: 38584 bytes [19:41:05][D][heap:042]: Free heap: 38008 bytes [19:42:05][D][heap:042]: Free heap: 37912 bytes [19:43:05][D][heap:042]: Free heap: 37648 bytes [19:44:05][D][heap:042]: Free heap: 36512 bytes [19:45:05][D][heap:042]: Free heap: 35880 bytes [19:46:05][D][heap:042]: Free heap: 34824 bytes [19:47:05][D][heap:042]: Free heap: 34056 bytes [19:48:05][D][heap:042]: Free heap: 32720 bytes [19:49:05][D][heap:042]: Free heap: 31536 bytes [19:50:05][D][heap:042]: Free heap: 29904 bytes [19:51:05][D][heap:042]: Free heap: 28296 bytes [19:52:05][D][heap:042]: Free heap: 26168 bytes [19:53:05][D][heap:042]: Free heap: 23784 bytes [19:54:05][D][heap:042]: Free heap: 21144 bytes [19:55:05][D][heap:042]: Free heap: 17992 bytes [19:56:05][D][heap:042]: Free heap: 14856 bytes [19:57:05][D][heap:042]: Free heap: 11192 bytes [19:58:05][D][heap:042]: Free heap: 7272 bytes WARNING Disconnected from API: Timeout while waiting for message response! INFO Connecting to light-kitchen-bench.local:6053 (192.168.192.128) WARNING Couldn't connect to API (Timeout while waiting for message response!). Trying to reconnect in 1 seconds INFO Connecting to light-kitchen-bench.local:6053 (192.168.192.128) WARNING Couldn't connect to API (Timeout while waiting for message response!). Trying to reconnect in 1 seconds INFO Connecting to light-kitchen-bench.local:6053 (192.168.192.128) WARNING Couldn't connect to API (Timeout while waiting for message response!). Trying to reconnect in 2 seconds INFO Connecting to light-kitchen-bench.local:6053 (192.168.192.128) INFO Successfully connected to light-kitchen-bench.local [19:59:43][D][api.connection:730]: Client 'Home Assistant 2021.8.7 (192.168.128.6)' connected successfully! [20:00:07][D][heap:042]: Free heap: 38992 bytes

oxan commented 3 years ago

I tried, but I can't actually reproduce this (using the block under "Basic Configuration" from this page). I don't have any Shelly hardware though, so tried it on a NodeMCU, but can't imagine that making a difference. Free memory stays constant at 37320 bytes after HA connects.

Can you try setting enable_mdns: false under wifi?

jason-but commented 3 years ago

I'll wait till the next failure. I ran with verbose logging and no different output, currently running with debug level logging

jason-but commented 3 years ago

About to start the upload now, very strangely while a firmware with default logging and verbose logging always ran out of memory and rebooted after 20-25 minutes, running logging in debug mode lost no memory for the first five minutes before falling into the normal pattern. This time it stayed up for 30 minutes, no different output

jason-but commented 3 years ago

Just started running with enable_mdns = false

Will I need to do some black magic to upload a new firmware with mdns disabled?

jason-but commented 3 years ago

On a side note, here is the Validated YAML to confirm the settings

substitutions:
  device_name: light-kitchen-bench
  light_name: Kitchen Bench Light
  wifi_ssid: !secret 'wifi_ssid'
  wifi_password: !secret 'wifi_password'
  ap_name: Kitchen Bench AP
  ap_password: !secret 'ap_password'
esphome:
  name: light-kitchen-bench
  platform: ESP8266
  board: esp01_1m
  arduino_version: platformio/espressif8266@2.6.2
  build_path: light-kitchen-bench
  platformio_options: {}
  esp8266_restore_from_flash: false
  board_flash_mode: dout
  includes: []
  libraries: []
  name_add_mac_suffix: false
ota:
  password: <redacted>
  safe_mode: true
  port: 8266
  reboot_timeout: 5min
  num_attempts: 10
interval:
- interval: 1min
  then:
  - lambda: !lambda |-
      ESP_LOGD("heap", "Free heap: %d bytes", ESP.getFreeHeap());
logger:
  level: DEBUG
  baud_rate: 115200
  tx_buffer_size: 512
  deassert_rts_dtr: false
  hardware_uart: UART0
  logs: {}
  esp8266_store_log_strings_in_flash: true
api:
  port: 6053
  password: ''
  reboot_timeout: 15min
wifi:
  enable_mdns: false
  ap:
    ssid: Kitchen Bench AP
    password: !secret 'ap_password'
    ap_timeout: 1min
  domain: .local
  reboot_timeout: 15min
  power_save_mode: NONE
  fast_connect: false
  output_power: 20.0
  networks:
  - ssid: !secret 'wifi_ssid'
    password: !secret 'wifi_password'
    priority: 0.0
  use_address: light-kitchen-bench.local
captive_portal: {}
binary_sensor:
- platform: gpio
  id: manual_switch
  pin:
    number: 5
    mode: INPUT
    inverted: false
  on_state:
  - then:
    - light.toggle:
        id: lightid
  disabled_by_default: false
  name: manual_switch
  internal: true
output:
- platform: gpio
  id: relay
  pin:
    number: 4
    mode: OUTPUT
    inverted: false
light:
- platform: binary
  name: Kitchen Bench Light
  output: relay
  id: lightid
  disabled_by_default: false
  restore_mode: RESTORE_DEFAULT_OFF
oxan commented 3 years ago

About to start the upload now, very strangely while a firmware with default logging and verbose logging always ran out of memory and rebooted after 20-25 minutes, running logging in debug mode lost no memory for the first five minutes before falling into the normal pattern. This time it stayed up for 30 minutes, no different output

The default log level is DEBUG. I suspect this bug is triggered by network activity, so that could be the difference.

Will I need to do some black magic to upload a new firmware with mdns disabled?

You might need to set use_address under wifi to the IP address of the device.

On a side note, here is the Validated YAML to confirm the settings

Please place code within code blocks, as this is unreadable.

oxan commented 3 years ago

Also, I see a fallback AP & captive portal in that config, so you're not actually running the basic configuration from the page you linked, but you modified it. That's pretty important to mention.

jason-but commented 3 years ago

OK, did not help

[21:49:34][D][heap:042]: Free heap: 40800 bytes [21:50:34][D][heap:042]: Free heap: 40752 bytes [21:51:34][D][heap:042]: Free heap: 40728 bytes [21:52:34][D][heap:042]: Free heap: 40632 bytes [21:53:34][D][heap:042]: Free heap: 40608 bytes [21:53:47][I][ota:046]: Boot seems successful, resetting boot loop counter. [21:54:34][D][heap:042]: Free heap: 40416 bytes [21:55:34][D][heap:042]: Free heap: 40368 bytes [21:56:34][D][heap:042]: Free heap: 40008 bytes [21:57:34][D][heap:042]: Free heap: 39888 bytes [21:58:34][D][heap:042]: Free heap: 39240 bytes [21:59:34][D][heap:042]: Free heap: 38928 bytes [22:00:34][D][heap:042]: Free heap: 37640 bytes [22:01:34][D][heap:042]: Free heap: 36944 bytes [22:02:34][D][heap:042]: Free heap: 35600 bytes [22:03:34][D][heap:042]: Free heap: 34448 bytes [22:04:34][D][heap:042]: Free heap: 32936 bytes [22:05:34][D][heap:042]: Free heap: 31328 bytes [22:06:34][D][heap:042]: Free heap: 29336 bytes [22:07:34][D][heap:042]: Free heap: 27104 bytes [22:08:34][D][heap:042]: Free heap: 24440 bytes [22:09:34][D][heap:042]: Free heap: 22064 bytes [22:10:34][D][heap:042]: Free heap: 18752 bytes [22:11:34][D][heap:042]: Free heap: 15632 bytes [22:12:34][D][heap:042]: Free heap: 11624 bytes [22:13:34][D][heap:042]: Free heap: 8096 bytes [22:14:34][D][heap:042]: Free heap: 3320 bytes

Then the remote logger would not re-attach because it could not find the MDNS device

jason-but commented 3 years ago

I'll try the use_address, and disable the captive portal next

The code blocks are not working for me, when I use the code-block/back-ticks with the yaml it is still a mess (tried to edit the last message)

jason-but commented 3 years ago

I've just uploaded a firmware with enable_mdns FALSE, and the captive portal and fallback AP disabled

jason-but commented 3 years ago

No change with the Fallback AP and captive portal disabled. Started with more free memory but ran out just as quickly :-(

[22:30:33][D][api.connection:730]: Client 'Home Assistant 2021.8.7 (192.168.128.6)' connected successfully! [22:31:16][D][heap:042]: Free heap: 42784 bytes [22:32:16][D][heap:042]: Free heap: 42760 bytes [22:33:16][D][heap:042]: Free heap: 42712 bytes [22:34:16][D][heap:042]: Free heap: 42640 bytes [22:35:16][D][heap:042]: Free heap: 42568 bytes [22:35:18][I][ota:046]: Boot seems successful, resetting boot loop counter. [22:36:16][D][heap:042]: Free heap: 42400 bytes [22:37:16][D][heap:042]: Free heap: 42328 bytes [22:38:16][D][heap:042]: Free heap: 41968 bytes [22:39:16][D][heap:042]: Free heap: 41848 bytes [22:40:16][D][heap:042]: Free heap: 41128 bytes [22:41:16][D][heap:042]: Free heap: 40888 bytes [22:42:16][D][heap:042]: Free heap: 39432 bytes [22:43:16][D][heap:042]: Free heap: 38928 bytes [22:44:16][D][heap:042]: Free heap: 37248 bytes [22:45:16][D][heap:042]: Free heap: 36360 bytes [22:46:16][D][heap:042]: Free heap: 34224 bytes [22:47:16][D][heap:042]: Free heap: 32928 bytes [22:48:16][D][heap:042]: Free heap: 30432 bytes [22:49:16][D][heap:042]: Free heap: 28536 bytes [22:50:16][D][heap:042]: Free heap: 25632 bytes [22:51:16][D][heap:042]: Free heap: 23064 bytes [22:52:16][D][heap:042]: Free heap: 19368 bytes [22:53:16][D][heap:042]: Free heap: 16272 bytes [22:54:16][D][heap:042]: Free heap: 11880 bytes [22:55:16][D][heap:042]: Free heap: 8496 bytes [22:56:16][D][heap:042]: Free heap: 3624 bytes WARNING Disconnected from API: Timeout while waiting for message response! INFO Connecting to 192.168.192.128:6053 (192.168.192.128)

jason-but commented 3 years ago

In case the network setup environment is of any use:

The network has two Wireless Access Points, both running on unique channels (6 and 11) with a channel width of 20MHz, and the same SSID and WiFi WAP2 password

The system has a central server (Intel i5) which has a bridged interface on the same subnet as the esp devices

The server is running a QEMU virtualisation environment hosting a single VM running the HASS OS Image bridged through directly to the local LAN

The HASS image has ESPHome installed within the same system

oxan commented 3 years ago

OK, did not help

That's a bummer.

The code blocks are not working for me, when I use the code-block/back-ticks with the yaml it is still a mess (tried to edit the last message)

See GitHub's manual, you've to use triple backticks.

No change with the Fallback AP and captive portal disabled. Started with more free memory but ran out just as quickly :-(

I'm starting to run out of ideas at this point. You can still try to remove the output, light and binary_sensor, and if you're able to flash/read logs over serial the api and ota, but I don't think it'll make a difference. It might also be worth trying to see if v1.18 works, and/or try it with an older Arduino version.

I suspect it's some kind of network traffic that triggers an issue somewhere in the TCP/IP stack. However, that's really hard to debug, let alone without being able to reproduce it.

jason-but commented 3 years ago

Fixed code snippet

That page on version implies the default is 2.7.4, yet my previous validate yaml states platformio/espressif8266@2.6.2

In either case I've tried to drop to 2.5.2

My system snapshots have ESPHome 1.19.4, 1.20.4 and the new 2021.8.0. This problem has persisted from the first version I used. I presume when you state v1.18 you are talking older than what I already have

oxan commented 3 years ago

That page on version implies the default is 2.7.4, yet my previous validate yaml states platformio/espressif8266@2.6.2

The version numbering between the platformio packages and Arduino versions is off. 2.7.4 is actually a shorthand for platformio/espressif8266@2.6.2, which is platform 2.6.2 with Arduino v2.7.4. This table has the matching versions.

My system snapshots have ESPHome 1.19.4, 1.20.4 and the new 2021.8.0. This problem has persisted from the first version I used. I presume when you state v1.18 you are talking older than what I already have

Yeah, I specificially said v1.18 because some people claim this problem started with v1.19.x series. Not too familar with how the HA addon works to know whether it's possible to downgrade further.

jason-but commented 3 years ago

OK, this may have to wait until another day when I have more time, you can't specify esphome_version in 2021.8.0

I tried to restore both my 1.20.4 and 1.19.4 snapshots and both logged an error message for no such file or directory in /data/tmp

None of them were able to actually start complaining they couldn't write any options

Maybe I'll find some time to try again on the weekend

dougle03 commented 3 years ago

you can't specify esphome_version in 2021.8.0

That's true, but you can install older versions when you use a standalone Docker container if you've a spare Linux box hanging around, or a Windows PC.

jason-but commented 3 years ago

OK, my normal HA runs in a VM with the ESPHome add-on

To try again, I used my desktop running on real hardware, and installed ESPHome via PIP install. I installed and tried on both v1.18.0 and the current 2021.8.0

I retried on the relay-board which is sitting on my desk, and I pared the yaml file down to the absolute minimum as per below:

esphome:
  name: garagedoor
  platform: ESP8266
  board: esp12e

logger:

api:

ota:
  password: <redacted>
wifi:
  ssid: <redacted>
  password: <redacted>

interval:
  - interval: 1min
    then:
      - lambda: |-
          ESP_LOGD("heap", "Free heap: %d bytes", ESP.getFreeHeap());

Uploading this yaml using BOTH versions of ESPHome resulted in a memory leak that caused the device to run out of memory after 25-26 minutes

I really can't understand why this doesn't seem to be a problem that affects other people, I can't imagine any way to make this yaml any simpler

jason-but commented 3 years ago

And just to be sure it wasn't the HA API causing problems, I cut the YAML back to just:

esphome:
  name: garagedoor
  platform: ESP8266
  board: esp12e

logger:

ota:
  password: <redacted>
wifi:
  ssid: <redacted>
  password: <redacted>

interval:
  - interval: 1min
    then:
      - lambda: |-
          ESP_LOGD("heap", "Free heap: %d bytes", ESP.getFreeHeap());

Lasted a bit longer, likely because removing the API resulted with more memory to start with, but overall still lasted only 29 minutes before running out of memory

oxan commented 3 years ago

I really can't understand why this doesn't seem to be a problem that affects other people, I can't imagine any way to make this yaml any simpler

I left the exact config from this comment running overnight and heap usage hasn't increased with a single byte. So it's something specific either to your environment or your device. As I said before, I suspect it's some kind of specific network traffic that triggers this, but that's just a gut feeling.

If you're looking at the logs over serial, it might be interesting to see if this is still reproducable when the wifi is turned off.

jason-but commented 3 years ago

OK, I'll have to do that late at night when everyone else is in bed :-)

I'll give it a go as soon as I can

I had earlier run tcpdump to capture traffic to the device and the only traffic to/from that IP address was:

However this was only captured over a minute or so as I was curious of the ESPHome API line protocol

jason-but commented 3 years ago

I have two wireless access points at home (same SSID/password, different channels - channel 6 + channel 11) and I did try switching one of them off to no avail

jason-but commented 3 years ago

OK, wow, it is a network problem, turning the access point off fixed everything

Except now I have no network!!!

Is there some way to turn debugging on within ESPHome on the network stack?

oxan commented 3 years ago

Setting the log level to very verbose enables some debugging, but I don't think that's enough to debug this.

logger:
    level: VERY_VERBOSE

It seems further debugging would require rebuilding the network stack from Arduino, and I'm not sure how to do that.

jason-but commented 3 years ago

No, I tried that earlier and it didn't output anything useful :-(

I've now tried with just my main access point on with the secondary AP off and with just the secondary AP on and main AP off and free memory is dropping again

Currently capturing all traffic to ESP MAC address and broadcast address to see if I can find anything useful

jason-but commented 3 years ago

So nothing intersting in the TCPDUMP files, lots and lots of ARP requests flying around, regular MDS cache flushes from the ESP devices, followed by hosts always re-ARPing the ESPHome device

Also, semi-regular NTP-client packets from the ESPHome to the NTP Server

jason-but commented 3 years ago

OK, I disabled my NTP server and it was causing the problems.

I don't really want to get rid of my NTP server, can NTP be disabled on ESPHome?

jason-but commented 3 years ago

To advise why it may be an issue, I advertise my NTP server via DHCP to all hosts in the network

jason-but commented 3 years ago

Time to go to bed but final debugging comments for the day

  1. The DHCP Server advertises my local time server as part of IP assignment
  2. As long as my NTP server is disabled, the devices do NOT leak memory, as soon as I enable it, memory starts leaking
  3. I have NOT configured any time component or source in the YAML files
  4. The logs from ESPHome do not print any time or NTP messages to the screen
  5. The ESP8266 is continuously sending out NTPv4 Client messages to my time server
  6. When the time server responds with an NTPv4 Server message, memory leaks
oxan commented 3 years ago

What software + exact version does your NTP server use? Might be useful to reproduce this.

jason-but commented 3 years ago

Server information: Gentoo Linux

DHCP: ISC Dynamic Host Configuration Protocol (DHCP) client/server version 4.4.2

NTP: Version 4.2.8p15 from http://www.ntp.org/

Wireshark dump of NTP Client packet from ESP8266

Frame 190: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Ethernet II, Src: Espressi_0b:83:57 (dc:4f:22:0b:83:57), Dst: ASRockIn_f7:29:a4 (00:25:22:f7:29:a4)
Internet Protocol Version 4, Src: 192.168.192.16, Dst: 192.168.128.1
User Datagram Protocol, Src Port: 52542, Dst Port: 123
Network Time Protocol (NTP Version 4, client)
    Flags: 0x23, Leap Indicator: no warning, Version number: NTP Version 4, Mode: client
    [Response In: 191]
    Peer Clock Stratum: unspecified or invalid (0)
    Peer Polling Interval: invalid (0)
    Peer Clock Precision: 1.000000 seconds
    Root Delay: 0.000000 seconds
    Root Dispersion: 0.000000 seconds
    Reference ID: NULL
    Reference Timestamp: (0)Jan  1, 1970 00:00:00.000000000 UTC
    Origin Timestamp: (0)Jan  1, 1970 00:00:00.000000000 UTC
    Receive Timestamp: (0)Jan  1, 1970 00:00:00.000000000 UTC
    Transmit Timestamp: (0)Jan  1, 1970 00:00:00.000000000 UTC

Wireshark Dump of NTP Server packet from server

Frame 191: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Ethernet II, Src: ASRockIn_f7:29:a4 (00:25:22:f7:29:a4), Dst: Espressi_0b:83:57 (dc:4f:22:0b:83:57)
Internet Protocol Version 4, Src: 192.168.128.1, Dst: 192.168.192.16
User Datagram Protocol, Src Port: 123, Dst Port: 52542
Network Time Protocol (NTP Version 4, server)
    Flags: 0xe4, Leap Indicator: unknown (clock unsynchronized), Version number: NTP Version 4, Mode: server
    [Request In: 190]
    [Delta Time: 0.000124000 seconds]
    Peer Clock Stratum: unspecified or invalid (0)
    Peer Polling Interval: invalid (3)
    Peer Clock Precision: 0.000000 seconds
    Root Delay: 0.000000 seconds
    Root Dispersion: 0.000000 seconds
    Reference ID: Unidentified reference source ''
    Reference Timestamp: Jul 31, 2021 06:20:15.456986556 UTC
    Origin Timestamp: (0)Jan  1, 1970 00:00:00.000000000 UTC
    Receive Timestamp: Aug 23, 2021 13:33:55.825646792 UTC
    Transmit Timestamp: Aug 23, 2021 13:33:55.825745308 UTC
oxan commented 3 years ago

Do you also have a (hex)dump of the raw bytes instead of the parsed message? Bit easier to replay that way ;)

jason-but commented 3 years ago

NTP Client message Would you prefer format A 002522f729a4dc4f220b835708004500004c00520000ff11f9ebc0a8c010c0a88001cd3e007b00384d61230000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 Or format B

0000   00 25 22 f7 29 a4 dc 4f 22 0b 83 57 08 00 45 00
0010   00 4c 00 52 00 00 ff 11 f9 eb c0 a8 c0 10 c0 a8
0020   80 01 cd 3e 00 7b 00 38 4d 61 23 00 00 00 00 00
0030   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0040   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0050   00 00 00 00 00 00 00 00 00 00
oxan commented 3 years ago

Either way is fine, thanks.

jason-but commented 3 years ago

NTP Packet from server dc4f220b8357002522f729a4080045b8004c12884000401165fec0a88001c0a8c010007bcd3e0038c1ace40003e900000000000000007f000001e4af6a1f74fd122a0000000000000000e4ce2243d35d9694e4ce2243d3640b68

alternate

0000   dc 4f 22 0b 83 57 00 25 22 f7 29 a4 08 00 45 b8
0010   00 4c 12 88 40 00 40 11 65 fe c0 a8 80 01 c0 a8
0020   c0 10 00 7b cd 3e 00 38 c1 ac e4 00 03 e9 00 00
0030   00 00 00 00 00 00 7f 00 00 01 e4 af 6a 1f 74 fd
0040   12 2a 00 00 00 00 00 00 00 00 e4 ce 22 43 d3 5d
0050   96 94 e4 ce 22 43 d3 64 0b 68