ESP32 devices show offline, mDNS not working correctly

tomm1ed commented 1 month ago

The problem

Running ESPHome 2024.9.2 in a Docker container with 9 devices connected. The ESP8266-based devices stay online and their mDNS keeps working, but the ESP32 devices all drop off after some time. When I reboot an ESP32 it works for a while and then drops off again. The connection to HA keeps working, and when I put [name].local in my /etc/hosts file the ESP32 devices can be updated and logs also work. There seems to be a difference between the ESP8266 and ESP32 mDNS code, because ESP8266-based boards stay online.

Running aioesphomeapi-discover from within the Docker container appears to show working mDNS:

ONLINE |vindriktning-keuken             |192.168.12.50  |e8db84d079c4|2024.8.3        |ESP8266   |d1_mini
ONLINE |p1-reader                       |192.168.12.14  |98cdac3191aa|2024.8.3        |ESP8266   |d1_mini
ONLINE |vindriktning-garage             |192.168.12.49  |e8db84dc374e|2024.8.3        |ESP8266   |d1_mini
ONLINE |everything-presence-lite-4f0df0 |192.168.12.241 |08a6f74f0df0|2024.9.2        |ESP32     |esp32dev

5 other ESP32 boards are offline and the EPL will be offline soon enough (it was booted about 15 minutes ago). Any idea how to fix this, other than using the PING setting? It happens with both the arduino and esp-idf frameworks. The devices do not go down; they are pingable and usable in Home Assistant without issue.

Which version of ESPHome has the issue?

2024.9.2

What type of installation are you using?

Docker

Which version of Home Assistant has the issue?

2024.9.3

What platform are you using?

ESP32

Board

Mini D1 ESP8266, Custom Onju Voice PCB, Custom Everything Presence Lite board, M5 Stamp S3

Component causing the issue

No response

Example YAML snippet

substitutions:
  name: btproxy-1
  friendly_name: btproxy-1

esphome:
  name: ${name}
  friendly_name: ${friendly_name}
  project:
    name: esphome.bluetooth-proxy
    version: "1.0"
  platformio_options:
    board_build.f_flash: 40000000L
    board_build.flash_mode: dio
    board_build.flash_size: 4MB

esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_BT_BLE_42_FEATURES_SUPPORTED: y
      CONFIG_BT_BLE_50_FEATURES_SUPPORTED: n

dashboard_import:
  package_import_url: github://esphome/firmware/bluetooth-proxy/esp32-generic.yaml@main

esp32_ble_tracker:
  scan_parameters:
    # We currently use the defaults to ensure Bluetooth
    # can co-exist with WiFi. In the future we may be able to
    # enable the built-in coexistence logic in ESP-IDF.
    active: true

bluetooth_proxy:
  active: true

button:
  - platform: safe_mode
    name: Safe Mode Boot
    entity_category: diagnostic

# Enable logging
logger:
  level: INFO

# Enable Home Assistant API
api:
  encryption:
    key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx="

ota:
  platform: esphome
  password: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "Btproxy-1 Fallback Hotspot"
    password: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

captive_portal:

Anything in the logs that might be useful for us?

INFO ESPHome 2024.9.2
INFO Reading configuration /config/onju-voice-woonkamer.yaml...
INFO Starting log output from onju-voice-woonkamer.local using esphome API
WARNING Can't connect to ESPHome API for onju-voice-woonkamer.local: Error resolving IP address: [Errno -2] Name or service not known (APIConnectionError)

Directly after a reboot of the Onju Voice, it shows up in the aioesphomeapi-discover output:
ONLINE |onju-voice-woonkamer            |192.168.12.16  |ecda3b5480e0|2024.8.3        |ESP32     |esp32-s3-devkitc-1

and logs work:
INFO ESPHome 2024.9.2
INFO Reading configuration /config/onju-voice-woonkamer.yaml...
INFO Starting log output from 192.168.12.16 using esphome API
INFO Successfully connected to onju-voice-woonkamer @ 192.168.12.16 in 0.060s
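
The `[Errno -2] Name or service not known` failure above is what the system resolver reports when a name cannot be resolved. The following is a minimal sketch, not part of ESPHome, for checking several device names in one go with the Python standard library. The function name `resolve_all` is hypothetical, and whether `.local` names are resolved through mDNS at all depends on how the host or container resolver is configured, which is an assumption here.

```python
import socket


def resolve_all(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to its first IPv4 address, or None if the lookup fails."""
    results = {}
    for name in hostnames:
        try:
            # Ask the system resolver for IPv4 addresses only; the port is irrelevant here.
            infos = resolver(name, None, socket.AF_INET)
            # Each entry is (family, type, proto, canonname, (address, port)).
            results[name] = infos[0][4][0]
        except socket.gaierror:
            # Same failure class as "[Errno -2] Name or service not known".
            results[name] = None
    return results
```

Calling `resolve_all(["p1-reader.local", "btproxy-1.local"])` shortly after a reboot and again 15 minutes later would show which names stop resolving. The `resolver` parameter exists only so the function can be exercised without a network.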

Additional information

No response

ssieb commented 1 month ago

Can you resolve the mdns name from another computer on the network?

tomm1ed commented 1 month ago

Tested from Ubuntu 22.04, Ubuntu 24.04, macOS 15.0.1 and Windows Server 2019, all with the same result: ESP8266 devices work; ESP32 devices don't, except shortly after a reboot.

Example from MacOS:

tom@MacBook-Air-van-Tom ~ % ping p1-reader.local
PING p1-reader.local (192.168.12.14): 56 data bytes
[...]
tom@MacBook-Air-van-Tom ~ % ping btproxy-1.local
ping: cannot resolve btproxy-1.local: Unknown host

Immediately after reboot of btproxy-1:

tom@MacBook-Air-van-Tom ~ % ping btproxy-1.local
PING btproxy-1.local (192.168.12.13): 56 data bytes

ssieb commented 1 month ago

I've heard of this happening before, but it was a wifi or router firewall issue in that case. I've never seen it happen and it hasn't come up on discord. What is your wifi?

tomm1ed commented 1 month ago

I use Aruba InstantAP for WiFi and Fortigate as a firewall.

Doesn't the fact that the ESP8266's (that are on the exact same network as the ESP32's) don't have the same issue suggest that there is a definitive difference between the two implementations and that the ESP32 one is somehow 'flawed' (for lack of a better term)?

(Also FWIW; my HomePods, AppleTVs, my Macs, my Shairport-sync Raspberry Pi's and my printer all also work without issue on the same WiFi using Bonjour/mDNS)

ssieb commented 1 month ago

Clearly there's some sort of difference, but if no one else can reproduce the issue, it suggests something specific to your network. Can you do any network sniffing to see what's happening? I suggest coming to the esphome discord server for further discussion.

tomm1ed commented 1 month ago

Definitely not the only one that has this, see amongst others here: https://github.com/esphome/issues/issues/3003#issuecomment-1047746276

Here someone with the same WiFi brand as me: https://github.com/esphome/issues/issues/3003#issuecomment-1341571840

Anyway; tcpdumping on the docker host of the esphome installation shows me that all WiFi mDNS queries come from the AP that is the virtual controller.

When booting, the device announces itself (through the AP with IP 192.168.2.53):

25.333819 192.168.2.53 224.0.0.251 MDNS 511 Standard query response 0x0000 SRV, cache flush 0 0 6053 onju-voice-zolder.local A, cache flush 192.168.12.17 TXT, cache flush A, cache flush 192.168.12.17 A, cache flush 192.168.12.17

It is never mentioned again, so it could indeed be that the AP is somehow dropping the answers that it deems invalid. Following the link to https://github.com/espressif/esp-idf/issues/7453: is there any way to force esphome to enable the CONFIG_MDNS_STRICT_MODE option?
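
To check whether a device still answers when asked directly, rather than waiting for another announcement in tcpdump, one option is to send a single mDNS question by hand. This is a rough sketch under stated assumptions: the names `build_a_query` and `query_once` are hypothetical, the query is sent from an ephemeral port (so a responder would be expected to reply by unicast), and the default network interface is assumed to be the right one.

```python
import socket
import struct

MDNS_GROUP = "224.0.0.251"
MDNS_PORT = 5353


def build_a_query(hostname):
    """Build a DNS question packet asking for the A record of hostname."""
    # Header: ID 0, flags 0 (standard query), 1 question, no other records.
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)
    qname = b""
    for label in hostname.rstrip(".").split("."):
        encoded = label.encode("ascii")
        qname += bytes([len(encoded)]) + encoded
    qname += b"\x00"  # the root label terminates the name
    # QTYPE 1 = A record, QCLASS 1 = IN.
    return header + qname + struct.pack("!2H", 1, 1)


def query_once(hostname, timeout=3.0):
    """Send one query to the mDNS group; return the first replier's IP or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.settimeout(timeout)
        sock.sendto(build_a_query(hostname), (MDNS_GROUP, MDNS_PORT))
        try:
            # Only the source address of the reply is used; the records are not parsed.
            _data, (addr, _port) = sock.recvfrom(1500)
            return addr
        except socket.timeout:
            return None
    finally:
        sock.close()
```

If `query_once("onju-voice-zolder.local")` returns an address right after boot and `None` later while the device is still pingable, that would suggest the question or the answer is being lost somewhere between the host and the device. It would not by itself show where.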

ssieb commented 1 month ago

It looks to me like all the problems listed in that issue were due to network equipment. As far as I can tell from the esp-idf link, it is already fixed in the current version.

tomm1ed commented 1 month ago

So it happens on Ubiquiti, FritzBox and Aruba WiFi networks and only with ESPHome ESP32s, but the problem is with the network? (My network is riddled with mDNS devices and only the ESPHome ESP32s stop working.) That cannot logically be the conclusion. Is there anything I can do to supply you with info to get to the bottom of this?

randybb commented 1 month ago

I have not read the linked issues, but on Ubiquiti UniFi mDNS is working fine, even between VLANs. I have been using only mDNS names (HA connects to ESP8266 and ESP32 MCUs via mDNS names, and everything connects to MQTT and other services via their mDNS names too) and have never had any issues.

ssieb commented 1 month ago

I think everyone in that linked issue resolved their issues by fixing their network equipment or at least conclusively proved that it was the cause. We don't know of any issues and can't reproduce it, so you'll have to find some proof that it's an esphome issue because there's nothing we can do without that.

tomm1ed commented 1 month ago

Since I also have a lot of Tasmota devices, I enabled mDNS on one of the ESP32 ones, and it does not have this issue. For comparison, an ESPHome ESP32 device:

11:26:36.160  Add        2  11 local.               _esphomelib._tcp.    onju-voice-woonkamer
11:29:30.316  Rmv        0  11 local.               _esphomelib._tcp.    onju-voice-woonkamer

After the Remove nothing is heard from this device until I reboot it.

Tasmota32 works:

11:26:36.160  Add        2  11 local.               _http._tcp.          M5stack-temp-woonkamer
11:29:30.316  Rmv        0  11 local.               _http._tcp.          M5stack-temp-woonkamer
11:30:27.308  Add        2  11 local.               _http._tcp.          M5stack-temp-woonkamer

As do the ESP8266 ESPHome devices (and every other device that uses mDNS, from my Hikvision cams to my OTGW to my Laserjet to my Apple devices):

11:26:36.159  Add        3  11 local.               _esphomelib._tcp.    vindriktning-garage
11:29:30.316  Rmv        1  11 local.               _esphomelib._tcp.    vindriktning-garage
11:30:27.308  Add        3  11 local.               _esphomelib._tcp.    vindriktning-garage

Other examples:

11:26:36.160  Add        3  11 local.               _http._tcp.          HP LaserJet 500 MFP M525 [3DFDE6]
11:29:30.316  Rmv        1  11 local.               _http._tcp.          HP LaserJet 500 MFP M525 [3DFDE6]
11:30:27.308  Add        3  11 local.               _http._tcp.          HP LaserJet 500 MFP M525 [3DFDE6]

11:26:36.160  Add        3  11 local.               _http._tcp.          OTGW
11:29:30.316  Rmv        1  11 local.               _http._tcp.          OTGW
11:30:27.308  Add        3  11 local.               _http._tcp.          OTGW

So it is literally just the ESPHome ESP32 devices that stop working after one announcement. How can the conclusion not be that they do something different from every other device on my network?
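
The comparison above can be automated by scanning a browse log for services whose most recent event is a removal. The sketch below assumes the column layout shown in the excerpts (timestamp, Add/Rmv, flags, interface, domain, service type, instance name); the function names are hypothetical.

```python
def last_events(log_text):
    """Return {(service_type, instance_name): last event ('Add' or 'Rmv')}."""
    state = {}
    for line in log_text.splitlines():
        # Columns: timestamp, Add/Rmv, flags, interface, domain, service type, instance name.
        parts = line.split(None, 6)
        if len(parts) != 7 or parts[1] not in ("Add", "Rmv"):
            continue  # skip headers, blank lines and anything else
        _ts, event, _flags, _iface, _domain, service, instance = parts
        # Instance names may contain spaces, so they are kept as one trailing field.
        state[(service, instance.strip())] = event
    return state


def missing_services(log_text):
    """List instance names whose most recent event was a removal."""
    events = last_events(log_text)
    return sorted(name for (_svc, name), event in events.items() if event == "Rmv")
```

Run against the excerpts above, this would be expected to report only onju-voice-woonkamer, since every other service has a later Add line.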

@randybb FYI, HA has no issues connecting to the devices for me. It is only the ESPHome dashboard that shows them as offline, and the mDNS name gets removed and never added again.

ssieb commented 1 month ago

Did you add the devices to HA by name or IP address?

I guess I need to clarify my previous comment. It's not enough to know that esphome seems to be doing something different on your network, you need to know why or what. Without that, there's nothing we can do.

tomm1ed commented 1 month ago

Some by IP, some automatically detected.

I am more than happy to try and get more info. Tell me how I can help get to the bottom of it and I'll do my best.

ssieb commented 1 month ago

If you added the device to HA by IP, then of course it's going to work because it doesn't need to do the MDNS lookup. I don't really know how to test this with your hardware. I use openwrt on all my APs, so I can just run tcpdump on the AP and find out what's happening. Can another wifi device see the MDNS requests? Can you take one of the devices to another network to see if it still happens there? Do you have another AP you can set up independently to test with?
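
For the question of whether another device on the wifi can see the MDNS traffic, a small listener that joins the multicast group is one way to look without tcpdump. This is a sketch only: `parse_header` and `listen` are hypothetical names, the default interface is assumed, and on some systems binding port 5353 next to an already running mDNS daemon may additionally need SO_REUSEPORT.

```python
import socket
import struct

MDNS_GROUP = "224.0.0.251"
MDNS_PORT = 5353


def parse_header(packet):
    """Decode the 12-byte DNS header that starts every mDNS packet."""
    _ident, flags, qd, an, _ns, _ar = struct.unpack("!6H", packet[:12])
    return {
        "is_response": bool(flags & 0x8000),  # QR bit: 1 = response, 0 = query
        "questions": qd,
        "answers": an,
    }


def listen(max_packets=20):
    """Join the mDNS multicast group and print who sends queries and responses."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MDNS_PORT))
    # Ask the kernel to deliver traffic for the group on the default interface.
    membership = struct.pack(
        "4s4s", socket.inet_aton(MDNS_GROUP), socket.inet_aton("0.0.0.0")
    )
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    try:
        for _ in range(max_packets):
            data, (addr, _port) = sock.recvfrom(9000)
            if len(data) < 12:
                continue  # too short to carry a DNS header
            info = parse_header(data)
            kind = "response" if info["is_response"] else "query"
            print(f"{addr:15} {kind:8} q={info['questions']} a={info['answers']}")
    finally:
        sock.close()
```

Running `listen()` on a wifi client while an ESP32 reboots would show whether its announcement, and any later responses, reach that client and from which source address.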

KodinLanewave commented 1 month ago

I'd start by grabbing a packet capture of your network when the ESPHome devices hop on your wifi, using a tool such as Wireshark. This will allow you to compare the mDNS hostname registration and see whether it differs between esphome and other devices. If we weren't using mDNS, my best guess based on the symptoms would be that your network, like mine, is using dynamic DNS (very short TTLs for DHCP leases as well as DNS) and esphome isn't periodically updating the DNS server with its hostname once the IP lease expires. Since we are (supposedly) using mDNS, I guess it depends on whether your network has multicast enabled or dropped. In my case, my modem is the primary DNS server, and it is unable to resolve my esphome devices. I honestly don't know enough about mDNS to know how to diagnose it yet. None of my esphome devices resolve from within the docker container, but they do resolve from the command line on my linux server. Curious...

ssieb commented 1 month ago

esphome doesn't update the DNS server, which is the same as any other client that isn't configured for dynamic DNS updates. If there's to be a DNS update, it will be done by the DHCP server. But MDNS isn't DNS, so that's not even related.

DavesCodeMusings commented 3 weeks ago

For what it's worth, I am experiencing the same issue with ESPHome and 8266-based Sonoff S31 smart plugs. Here's what I'm seeing:

The smart plug shows online in ESPHome dashboard after it is first plugged in and powered up.
It shows up in the output of avahi-browse (e.g. avahi-browse -alrt | grep s31).
ESPHome can no longer resolve the smart plug name after a short time (maybe ten or fifteen minutes) when I try to fetch logs or update firmware, though it still shows as "online".
After that time, avahi-browse shows the device (cached?), but complains "failed to resolve" after a "timeout reached".
The smart plug is always controllable from home assistant. Automations and manual control work regardless of avahi-browse not resolving or ESPHome dashboard showing it as offline.
When the smart plug is offline in the ESPHome dashboard, it no longer appears in the avahi-browse output, not even with a "timeout reached".
All other non-ESPHome mdns devices show up in avahi-browse.

My setup:

I am running ESPHome in a docker container, like the author of this issue.
ESPHome and Home Assistant are both running with "network_mode: host"
ESPHome and Home Assistant containers are from :latest, pulled Oct 24, 2024.
Smart plugs updated to latest firmware as of Oct 24, 2024.
Networking gear is Ubiquiti Unifi.
Docker and avahi-browse are running on Alpine Linux x86_64 3.20.3 (apk updated on Oct 20, 2024.)
ESPHome, Home Assistant, and smart plugs are all on the same default VLAN.
No firewall between any of the devices.

After plugging in the smart plug, there is a short window of time (10 to 15 minutes) that I can interact with it normally in the ESPHome dashboard. After that...

The smart plug still shows as online in the ESPHome dashboard for a while longer (cached info, maybe?)
Attempting to fetch logs results in: "WARNING Can't connect to ESPHome API for s31-4.local: Error resolving IP address: [Errno -2] Name or service not known (APIConnectionError)"
Output from avahi-browse shows the device, but fails to resolve (see below.)

$ avahi-browse -alrt | grep s31
+   eth0 IPv4 s31-4                                         _esphomelib._tcp     local
Failed to resolve service 's31-4' of type '_esphomelib._tcp' in domain 'local': Timeout reached

Interesting aside concerning the avahi-browse output above... I have five Sonoff s31 smart plugs. Only the s31-4 that I recently unplugged and re-inserted shows up. None of the others are in the avahi-browse output; they show offline in the ESPHome dashboard, but are still controllable in Home Assistant.
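
To track this over time instead of eyeballing the output, the error lines can be picked out of the avahi-browse text. A minimal sketch, assuming the error line has exactly the wording quoted above; `unresolved_services` is a hypothetical helper name.

```python
import re

# Matches the error line avahi-browse printed in the output quoted in this thread.
FAILED = re.compile(
    r"Failed to resolve service '(?P<name>[^']+)' of type '(?P<type>[^']+)'"
    r" in domain '(?P<domain>[^']+)': (?P<reason>.+)"
)


def unresolved_services(output):
    """Return (name, type, reason) for every service that failed to resolve."""
    found = []
    for line in output.splitlines():
        match = FAILED.search(line)
        if match:
            found.append((match["name"], match["type"], match["reason"].strip()))
    return found
```

The input could come from something like `subprocess.run(["avahi-browse", "-alrt"], capture_output=True, text=True).stdout`, run every few minutes to see how long after power-up a plug starts to time out.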

Current workaround:

Unplugging the smart plug and re-powering seems to work and the device appears "online" long enough to get logs or OTA update the firmware.

Other thoughts:

This seems like it's related to the firmware on the device, though why Home Assistant is not affected, I'm not sure. Perhaps it caches last known addresses for a longer period of time than avahi-browse.

I do not remember this happening before my recent docker pull of the latest ESPHome container and subsequent firmware updates on the devices. Though I did both updates in a short time span, so I can't say for sure which it was, if either.

If there's anything you would like me to try, I'm happy to post results here and help get this resolved. Otherwise, just chiming in to confirm the original poster's statements, let you know it's happening on ESP8266 as well, and suggest a short-term workaround of unplugging/replugging.

DavesCodeMusings commented 3 weeks ago

Additional information and another workaround...

As a troubleshooting step, I rolled my ESPHome docker container back to image: 2024.9.2

Everything is looking good in the ESPHome dashboard. Devices are showing up as online. avahi-browse -alrt | grep s31 also shows all my Sonoff smart plug devices now.

I have not changed the device firmware at all, only the docker container image. I have not unplugged and re-plugged any devices. Why this would affect avahi-browse (running outside of the container), I have no idea.

Again, these are all 8266 devices. Perhaps if @tomm1ed were to roll back to an even earlier Docker container, it may fix the issue for their ESP32 devices.

UPDATE... it's not completely fixed. After the 10 or 15 minute interval, the devices go back to "Failed to resolve service" with "timeout reached", both in ESPHome's dashboard and in avahi-browse.

DavesCodeMusings commented 3 weeks ago

UPDATE: And now I'm back to using container image :latest with results similar to what I got by pinning it to 2024.9.2. So perhaps it is the act of restarting the container that makes the difference and not necessarily the version. Apologies to anyone who read this far expecting a solution.

tomm1ed commented 3 weeks ago

@DavesCodeMusings For me it is just the ESP32s; the ESP8266 devices always show up in ESPHome and in avahi on Linux, macOS and Windows. I have not had time for a deep dive yet (my workaround for the time being is to add the .local addresses of the ESPHome ESP32 devices to /etc/hosts so I can at least update them wirelessly and see the logs), but I am glad that I am not alone in this.
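
The /etc/hosts workaround can be scripted so the entries stay consistent across devices. This is only a sketch: `hosts_lines` is a hypothetical helper, the addresses in the usage note are the ones seen earlier in this thread, and static entries like these go stale if DHCP hands out different addresses.

```python
import ipaddress


def hosts_lines(devices):
    """Render 'IP<TAB>name.local' lines for a {name: ip} mapping."""
    lines = []
    for name, ip in sorted(devices.items()):
        # Validate the address so a typo does not end up in /etc/hosts.
        address = ipaddress.ip_address(ip)
        lines.append(f"{address}\t{name}.local")
    return "\n".join(lines)
```

For example, `hosts_lines({"btproxy-1": "192.168.12.13", "onju-voice-woonkamer": "192.168.12.16"})` produces two lines that can be appended to the hosts file of the machine or container running the ESPHome dashboard.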