esphome / issues

Issue Tracker for ESPHome
https://esphome.io/

Devices Show Offline Regardless of Real Status #4732

Closed: z3liff closed this issue 1 year ago

z3liff commented 1 year ago

The problem

As of update 2023.7.0, devices no longer show up as online. After I updated all their firmware to the newest version, every ESPHome device on my network shows as "offline", even though the devices are working: data is reported on schedule, and their logs come up immediately when I open them.

My ESPHome config hasn't changed. It is set to status_use_ping: true.

My network config hasn't changed, and I confirmed I can ping the devices. The problem appears to be related to ESPHome's status ping functionality.
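
Conceptually, `status_use_ping` swaps the dashboard's mDNS-based status check for an ICMP ping to each device. The sketch below is a hypothetical stand-alone approximation of such a check, not ESPHome's actual code; it assumes a Linux iputils-style `ping` that accepts `-c` (count) and `-W` (reply timeout in seconds). The helper names are made up for illustration.

```python
import subprocess


def build_ping_command(host: str, count: int = 1, timeout_s: int = 1) -> list[str]:
    # Assumed iputils flags: -c = number of echo requests, -W = reply timeout (s).
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_online(host: str) -> bool:
    """Return True if the host answered at least one echo request."""
    try:
        result = subprocess.run(
            build_ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except (FileNotFoundError, PermissionError):
        # No ping binary, or the process is not allowed to execute it.
        # In this sketch that is reported the same way as an unreachable device.
        return False
    return result.returncode == 0
```

For example, `is_online("10.10.21.10")` would ping the static address from the example config further down. Note that in a check shaped like this, "ping could not run at all" and "device did not answer" both surface as offline.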

Which version of ESPHome has the issue?

2023.7.0

What type of installation are you using?

Home Assistant Add-on

Which version of Home Assistant has the issue?

Home Assistant 2023.7.3

What platform are you using?

ESP32

Board

ESP32-POE-ISO

Component causing the issue

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

Applies to ESP32 and ESP8266 devices. Appears to be something specific to the status_use_ping setting.

DarwinData commented 1 year ago

I have exactly the same problem after updating to 2023.7.3

z3liff commented 1 year ago

The problem is consistent across reboots, etc. I confirmed it has nothing to do with the network setup: with the same network topology, merely updating to the newest HA and ESPHome versions causes the checks to fail.

sermayoral commented 1 year ago

Same problem here with all my ESP32s (esp32, esp32-poe). I have mDNS disabled because I have my own .lan domain defined in my router, and I have the status_use_ping property enabled. It had been working perfectly for years, until 2023.7.

The Show Logs button connects and shows the ESP's log fine. The problem is on the main dashboard, where they all appear offline.

Curiously, the only device that shows as online is my only ESP8266...


ckglobalroaming commented 1 year ago

Same problem here, same version. All devices have static IPs and mDNS disabled. No problem prior to this update. Wireless logging works for all devices.

andrewjswan commented 1 year ago

Same problem, ESPHome 2023.7.1

uefik commented 1 year ago

Same problem; the update itself went fine.

mwolter805 commented 1 year ago

Same problem, all ESPs except for one are offline. No issue with 2023.6.2, started with 2023.7.1. @jesserockz

[Screenshot 2023-08-04 at 8:54:40 AM]

Here is an example config:

substitutions:
  esp32_name: "ble-esp32"
  prefix: ble_esp32
  friendly_name: BLE ESP32

  # XIAOMI Mijia Bluetooth Thermometer 2 A4:C1:38:E9:9C:C4
  mijia1_friendly_name: Mijia Temp E99CC4

esphome:
  name: ${esp32_name}
  comment: BLE receiver for Xiaomi Temp Humidity Sensor and iTag

esp32:
  board: esp32-poe
  framework:
    type: esp-idf

preferences:
  # the default of 1min is far too short--flash chip is rated
  # for approx 100k writes.
  flash_write_interval: "48h"

ethernet:
  type: LAN8720
  mdc_pin: GPIO23
  mdio_pin: GPIO18
  clk_mode: GPIO17_OUT
  phy_addr: 0
  power_pin: GPIO12
  use_address: 10.10.21.10

mdns:
  disabled: true

# Enable logging, only one level should be enabled at a time
logger:
  level: warn  # normal level for logger
#  level: debug # only use this level when debugging an issue
#  level: VERY_VERBOSE # only use this level if debug is not producing enough info to debug an issue
  logs:
    esp32_ble_tracker: error # normal level for this function
#    esp32_ble_tracker: debug # enable when discovering BLE devices

# Enable Home Assistant API
api:
  encryption:
    key: !secret encryption_key

# Enable over-the-air flashing
ota:
  password: !secret ota_password

# web_server:

time:
  - platform: homeassistant

ble_client:
  # Replace with the MAC address of your device.
  - mac_address: <MAC_ADDY>
    id: itag_test

switch:
  - platform: restart
    name: "${friendly_name} ESP Restart"
    id: ${prefix}_restart_switch

# iTag Test - Enable BLE
  - platform: ble_client
    ble_client_id: itag_test
    id: itag_test_enable
    name: iTag Test Enable BLE
    icon: mdi:bluetooth-connect

sensor:
# This entry registers and awaits notifications for the
# characteristic that signals button presses. Each time
# a notification is received, the corresponding binary_sensor
# is briefly toggled.
  - platform: ble_client
    ble_client_id: itag_test
    type: characteristic
    id: itag_test_raw
#    name: "Test iTag btn"
    service_uuid: 'ffe0'
    characteristic_uuid: 'ffe1'
    notify: true
    update_interval: never
    on_notify:
      then:
        - binary_sensor.template.publish:
            id: test_button
            state: ON
        - binary_sensor.template.publish:
            id: test_button
            state: OFF

# This entry queries the battery level. Some tags may not
# support this characteristic, you will see 'Unknown' in the
# HA frontend.
  - platform: ble_client
    ble_client_id: itag_test
    type: characteristic
    name: "iTag Test Battery"
    service_uuid: '180f'
    characteristic_uuid: '2a19'
    icon: 'mdi:battery'
    unit_of_measurement: '%'
    filters:
      - throttle: 1min

# Uptime Sensor    
  - platform: uptime
    name: "${friendly_name} ESP Uptime"

# XIAOMI Mijia Bluetooth Thermometer 2 A4:C1:38:E9:9C:C4
  - platform: xiaomi_lywsd03mmc
    mac_address: A4:C1:38:E9:9C:C4
    bindkey: "05e04076be48f427f3d90e166d0fbd5e"
    temperature:
      name: "${mijia1_friendly_name} Temperature"
      id: ${prefix}_temperature
      filters:
        - or:
          - throttle: 1min
          - delta: .5
    humidity:
      name: "${mijia1_friendly_name} Humidity"
      id: ${prefix}_humidity
      filters:
        - or:
          - throttle: 1min
          - delta: 1
    battery_level:
      name: "${mijia1_friendly_name} Battery Level"
      filters:
        - or:
          - throttle: 1min
          - delta: 1

  # - platform: heapmon
  #   id: heapspace
  #   name: "${friendly_name} Free Space"

binary_sensor:
  - platform: template
    id: test_button
    name: "iTag Test Button"
    filters:
      delayed_off: 200ms

z3liff commented 1 year ago

Is this somehow related to the MQTT lookup features?

WhyDoYouMakeUsDoThis commented 1 year ago

Looking at a tcpdump from both the machine running my ESPHome dashboard and another machine on the same Wi-Fi, the devices do not respond to the mDNS queries. I have other devices (Volumio, a printer, etc.) that do respond, so it's not the network.

I have one ESP32 that responds initially after boot, but it stops responding to the queries very shortly afterwards.
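
To reproduce this kind of capture without the dashboard, one can send a query of the same shape by hand and watch whether the device answers. Below is a minimal sketch that builds a one-question mDNS query in standard DNS wire format (RFC 1035 layout, sent to the RFC 6762 multicast group); the hostname `ble-esp32.local` is only a placeholder, and `build_mdns_query` is a hypothetical helper, not ESPHome code.

```python
import struct

# Standard IPv4 mDNS multicast group and port.
MDNS_GROUP = ("224.0.0.251", 5353)


def build_mdns_query(hostname: str, qtype: int = 1) -> bytes:
    """Build a one-question mDNS query (qtype 1 = A record) for e.g. 'ble-esp32.local'."""
    # Header: ID=0 (mDNS convention), flags=0 (standard query),
    # 1 question, 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", 0, 0, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + qname + question
```

Sending it with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(build_mdns_query("ble-esp32.local"), MDNS_GROUP)` while tcpdump runs on port 5353 should show whether a response ever comes back.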

andrewjswan commented 1 year ago

the devices do not respond to the mDNS queries.

This does not explain the problem when the dashboard uses ping, the devices are on different networks, and mDNS is disabled on them. Everything worked before the update; now it doesn't.

DarwinData commented 1 year ago

the devices do not respond to the mDNS queries.

This does not explain the problem when the dashboard uses ping, the devices are on different networks, and mDNS is disabled on them. Everything worked before the update; now it doesn't.

Agreed, I also do not use mDNS.

WhyDoYouMakeUsDoThis commented 1 year ago

The symptom of the issue is identical. If a new ticket were opened, it would just be pointed to this one. How does "use_ping" get the address of the device to ping? Do you have "use_address" set for each device? (Just curious.)

andrewjswan commented 1 year ago

How does "use_ping" get the address of the device to ping?

A static IP address in the config (manual_ip).

DAVe3283 commented 1 year ago

How does "use_ping" get the address of the device to ping? Do you have "use_address" for each device?(just curious)

In my case, via DNS. The hostname + domain are read from the YAML, and a DNS lookup returns the IP address of the device.
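
As I understand ESPHome's config model (an assumption worth checking against the docs), the host the dashboard targets is `use_address` if it is set, and otherwise the node `name` followed by the network `domain`, which defaults to `.local`. A hypothetical helper capturing that rule, using the hostname from the capture in this comment:

```python
from typing import Optional


def device_address(name: str, domain: str = ".local", use_address: Optional[str] = None) -> str:
    """Return the host the dashboard would look up / ping for a device (assumed rule)."""
    if use_address:
        # An explicit use_address (e.g. a static IP) wins over name + domain.
        return use_address
    # Otherwise the FQDN is the node name with the domain suffix appended.
    return f"{name}{domain}"
```

For example, `device_address("network-closet-fan-controller", ".InternetOfShit.ReaperLegion.net")` yields the FQDN queried in the capture, and `socket.gethostbyname(...)` on that result is a quick way to confirm it resolves from inside the dashboard container.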

Looking at a packet capture, it appears the problem is ESPHome is not actually doing DNS lookups and pinging most of the devices. Of the 25 I have, it only tries 3, for example:

tcpdump -i tap102i0
...
21:23:04.091915 IP HAss.ReaperLegion.net.51198 > DC1.ReaperLegion.net.domain: 51756+ [1au] A? network-closet-fan-controller.InternetOfShit.ReaperLegion.net. (90)
21:23:04.091948 IP HAss.ReaperLegion.net.45283 > DC1.ReaperLegion.net.domain: 24879+ [1au] AAAA? network-closet-fan-controller.InternetOfShit.ReaperLegion.net. (90)
21:23:04.092183 IP DC1.ReaperLegion.net.domain > HAss.ReaperLegion.net.51198: 51756* 1/0/1 A 192.168.0.117 (106)
21:23:04.092229 IP DC1.ReaperLegion.net.domain > HAss.ReaperLegion.net.45283: 24879* 0/1/1 (141)
21:23:04.092487 IP HAss.ReaperLegion.net > network-closet-fan-controller.InternetOfShit.ReaperLegion.net: ICMP echo request, id 513, seq 1, length 64
21:23:04.093101 IP network-closet-fan-controller.InternetOfShit.ReaperLegion.net > HAss.ReaperLegion.net: ICMP echo reply, id 513, seq 1, length 64
21:23:04.093335 IP HAss.ReaperLegion.net.45283 > DC1.ReaperLegion.net.domain: 45558+ [1au] PTR? 117.0.168.192.in-addr.arpa. (55)
21:23:04.093646 IP DC1.ReaperLegion.net.domain > HAss.ReaperLegion.net.45283: 45558* 1/0/1 PTR network-closet-fan-controller.InternetOfShit.ReaperLegion.net. (130)

You can see it requests the IPv4 and IPv6 addresses of network-closet-fan-controller.InternetOfShit.ReaperLegion.net, gets an IPv4 address back (my ISP doesn't support IPv6, so I don't run it internally either), pings the device, gets a reply, does a reverse DNS lookup, and gets the same hostname back. This is all working as expected, and the device shows as online.

But then it just ignores the majority of the devices, so they show as offline in the dashboard. If I click Logs or update them over the air, it works perfectly. Just fails to do the lookup for the dashboard.
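
To quantify the "only 3 of 25" observation from a capture, one could compare the ICMP echo-request destinations against the list of configured devices. A minimal sketch; the function names are hypothetical, and the second hostname in the test data is made up for illustration.

```python
import re

# Matches tcpdump lines like "... IP src > dst: ICMP echo request, id 513, seq 1, length 64".
ECHO_REQUEST = re.compile(r"IP (\S+) > (\S+): ICMP echo request")


def pinged_hosts(tcpdump_lines):
    """Return the set of destinations that received at least one ICMP echo request."""
    hosts = set()
    for line in tcpdump_lines:
        match = ECHO_REQUEST.search(line)
        if match:
            hosts.add(match.group(2))
    return hosts


def never_pinged(configured, tcpdump_lines):
    """Configured device hostnames that were never pinged in the capture, sorted."""
    return sorted(set(configured) - pinged_hosts(tcpdump_lines))
```

Feeding it the output of `tcpdump -l` (line-buffered) together with the FQDNs of all configured devices lists the ones the dashboard skipped. This relies on tcpdump printing resolved hostnames, as in the capture above; with `-n` you would compare IP addresses instead.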

z3liff commented 1 year ago

I also do not use mDNS -- just DNS. Worked flawlessly before the update, hasn't worked at all for any of my devices since.

@DAVe3283's post seems to summarize the issue precisely.

jmarevans commented 1 year ago

Where does the 'status_use_ping' setting go? I have HA installed on docker on top of debian running on an Intel NUC. The following containers are also installed: portainer, mariadb, influxdb, adminer, and mosquitto.

randybb commented 1 year ago

@jmarevans esphome page is your friend :) https://esphome.io/guides/faq.html?highlight=status_use_ping#docker-reference

mwolter805 commented 1 year ago

Updated from 2023.7.1 to 2023.8.1 and ESPs are still offline. This is a container installation and all ESPs were online with version 2023.6.2. Still can ping the ESPs from the container console with no issues. Not a major issue as all devices still operate properly via MQTT or API.

mwolter805 commented 1 year ago

With ssieb's help on Discord I was able to get a little more info. I ran tcpdump and found that the esphome container sent no pings, except to one ESP. I checked the config of that ESP, and it does not have mDNS disabled. All the other ESPs have it disabled.

As soon as

mdns:
  disabled: true

is added to the ESP's config and the container is restarted no pings are sent by the esphome container and the ESP shows offline in the dashboard. This is with ESPHOME_DASHBOARD_USE_PING=true as an environment variable in the docker compose.

So it appears the dashboard either ignores ESPHOME_DASHBOARD_USE_PING=true or mishandles mDNS being disabled in the ESP config, and as a result never pings the device. After I removed the mDNS-disabling setting from the ESP config, the ESP was shown as online in the dashboard.
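
When chasing this, it helps to confirm what value the dashboard process actually sees for the flag. The helper below is a hypothetical approximation of reading a boolean environment variable; which spellings ESPHome itself treats as "true" is an assumption here, not taken from its source.

```python
import os

# Assumed set of accepted "true" spellings; ESPHome's own parsing may differ.
TRUTHY = {"1", "true", "yes", "y", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable such as ESPHOME_DASHBOARD_USE_PING as a bool."""
    value = os.environ.get(name)
    if value is None:
        # Variable not set at all: fall back to the default.
        return default
    return value.strip().lower() in TRUTHY
```

Running `env_flag("ESPHOME_DASHBOARD_USE_PING")` inside the container (e.g. via `docker exec ... python3 -c ...`) distinguishes "the variable never reached the process" from "the variable is set but the ping path is skipped".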

mwolter805 commented 1 year ago

Connected to the console of the container and edited the following file

nano /esphome/esphome/dashboard/dashboard.py

Removed lines 871 and 872, which were:

                    if entry.no_mdns is True:
                        continue

Restarted the container, and the ESPs that do not use mDNS now show as online.
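
A simplified model of why these devices vanish from the ping loop: with the `no_mdns` check in place, any entry whose config disables mDNS is skipped before it is ever pinged, which matches the tcpdump observation that only the mDNS-enabled ESP received echo requests. `Entry` and `ping_targets` are hypothetical stand-ins, not the real dashboard.py structures, and the `kitchen` device is made up.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical, stripped-down stand-in for a dashboard config entry.
    name: str
    address: str
    no_mdns: bool


def ping_targets(entries, skip_no_mdns: bool = True):
    """Addresses the status loop would ping.

    skip_no_mdns=True models the two lines quoted above (entries with
    mDNS disabled are skipped); False models the patched behaviour.
    """
    targets = []
    for entry in entries:
        if skip_no_mdns and entry.no_mdns is True:
            continue
        targets.append(entry.address)
    return targets
```

With `[Entry("ble-esp32", "10.10.21.10", True), Entry("kitchen", "kitchen.local", False)]`, the unpatched loop pings only `kitchen.local`, while the patched one pings both.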

ckglobalroaming commented 1 year ago

I can confirm that with all my devices on manual IPs, version 2023.8.1, and mdns: disabled: false, the dashboard does work. But I don't want to compile mDNS code into the binaries if I don't use it. Will this be fixed so that disabling mDNS is possible again?

Thanks. C.

jdwhite commented 1 year ago

esphome 2023.8.2 via podman (ghcr.io/esphome/esphome:latest) on Fedora 38 x86_64 did not fix the ping issue for me. My workaround:

> podman exec -ti esphome /bin/bash
root@eefbec05beb7:/config# ping 192.168.1.3
bash: /bin/ping: Operation not permitted
root@eefbec05beb7:/config# getcap /bin/ping
/bin/ping cap_net_raw=ep
root@eefbec05beb7:/config# setcap cap_net_raw+p /bin/ping
root@eefbec05beb7:/config# getcap /bin/ping
/bin/ping cap_net_raw=p
root@eefbec05beb7:/config# ping 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 56(84) bytes of data.
64 bytes from 192.168.1.3: icmp_seq=1 ttl=64 time=0.036 ms
64 bytes from 192.168.1.3: icmp_seq=2 ttl=64 time=0.070 ms
[...]

Within a couple seconds of running setcap my devices showed ONLINE in the dashboard. Problem solved.
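
A quick way to check from inside the container whether the process holds the raw-socket privilege (CAP_NET_RAW) that classic `ping` relies on is to try opening a raw ICMP socket. This is a diagnostic sketch, assuming the container's Python interpreter is available (the dashboard is a Python app); it is not part of ESPHome. Some `ping` builds can instead use unprivileged ICMP datagram sockets when the kernel's `net.ipv4.ping_group_range` allows it, so `False` here does not by itself prove that ping must fail.

```python
import socket


def can_open_raw_icmp() -> bool:
    """True if this process may open a raw ICMP socket (normally requires CAP_NET_RAW)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP):
            return True
    except OSError:
        # PermissionError (EPERM/EACCES) is the usual case in an unprivileged
        # container without CAP_NET_RAW; other OSErrors mean raw sockets are unavailable.
        return False
```

Running it before and after changing the container's privileges shows whether the capability actually reached the process.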

jdwhite commented 1 year ago

This issue persists in 2023.8.3 and appears to result from running the container unprivileged. If I add the --privileged flag to podman run, I don't need to tweak the file capabilities of /bin/ping. However, since the esphome container does not otherwise need to run privileged, and I start it with a systemd service file, I added this line to the esphome service file I had previously created with podman-generate-systemd:

ExecStartPost=/usr/bin/podman exec esphome /sbin/setcap cap_net_raw+p /bin/ping
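
To confirm that the ExecStartPost step took effect, the `getcap /bin/ping` output can be checked for the flags shown earlier in the thread (`cap_net_raw=ep` before, `cap_net_raw=p` after `setcap cap_net_raw+p`). A small hypothetical parser for that one-line format, e.g. for a health-check script; it only handles the `path caps=flags` form seen here, not every format `getcap` can print.

```python
def parse_getcap(line: str) -> dict[str, set[str]]:
    """Parse a getcap line such as '/bin/ping cap_net_raw=ep' into {cap: flags}."""
    _path, _, clause = line.strip().partition(" ")
    names, _, flags = clause.partition("=")
    # Comma-separated capability names share one set of flags (e, i, p).
    return {name: set(flags) for name in names.split(",") if name}


def has_flag(line: str, cap: str, flag: str) -> bool:
    """True if `cap` carries `flag` ('e', 'i' or 'p') in the getcap line."""
    return flag in parse_getcap(line).get(cap, set())
```

For instance, `has_flag("/bin/ping cap_net_raw=p", "cap_net_raw", "e")` is `False`, matching the working state reported above where the effective bit is no longer set.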

mwolter805 commented 1 year ago

Not sure how the permissions need to be set with podman, but running Docker version 24.0.2 with the following compose file, the setcap permissions do not need to be changed when running unprivileged.

x-logging: 
      &default-logging
      driver: local
      options:
        max-size: 5m
        max-file: 2

services:

  esphome:
    image: ghcr.io/esphome/esphome:2023.8.2
    container_name: esphome
    hostname: esphome
    logging: *default-logging
    environment:
      - ESPHOME_DASHBOARD_USE_PING=true
      - TZ=America/Los_Angeles
    restart: always
    volumes:
      - /config/.esphome/build # create a named volume for the build dir to reduce backup size and move it out of the config dir mapped to the host
      - ./config:/config
      - /cache # create a named volume for the cache to reduce backup size and move it out of the config dir mapped to the host

networks:
  default:
    name: traefik-network

jdwhite commented 1 year ago

@mwolter805 - Thanks for your response. Like the OP reported, something changed between the 2023.6.5 and 2023.7.0 containers, as pinging broke for me in 2023.7.0 as well. I've been using the same podman/systemd setup for almost a year with no issues until 2023.7.0. My esphome.service file:

# container-esphome.service
# autogenerated by Podman 4.1.1
# Mon Jul 11 13:20:31 CDT 2022

[Unit]
Description=Podman container-esphome.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
        --cidfile=%t/%n.ctr-id \
        --cgroups=no-conmon \
        --rm \
        --sdnotify=conmon \
        -d \
        --replace \
        -i \
        -it \
        -v /spool/containers/esphome:/config:Z \
        -e ESPHOME_DASHBOARD_USE_PING=true \
        -p 192.168.1.22:80:6052 \
        --name esphome \
        ghcr.io/esphome/esphome:latest dashboard /config
ExecStartPost=/usr/bin/podman exec esphome /sbin/setcap cap_net_raw+p /bin/ping
ExecStop=/usr/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

To add another oddity, I discovered today that with the 2023.6.5 container I also get the same "Operation not permitted" error from ping, yet the devices show as online!

> podman exec -ti esphome /bin/bash
root@a6cef2ceb336:/config# /bin/ping
bash: /bin/ping: Operation not permitted

So I'm still eyeing a change in the container between 2023.6.5 and 2023.7.0 as the culprit. I know how to work around it, but it would be nice if no workaround were needed, and I figure someone else must be using podman (though perhaps not with systemd unit files) to run esphome containers.

Plus, I'd like to know if I'm doing something fundamentally wrong here and somehow it's "just worked" prior to 2023.7.0.

To rule out systemd being a factor I ran the container like so from a root shell:

podman run --rm --replace -i -it -v /spool/containers/esphome:/config:Z -e ESPHOME_DASHBOARD_USE_PING=true -p 192.168.1.22:80:6052 --name esphome ghcr.io/esphome/esphome:latest dashboard /config

but the results are the same.

I think I can rule out using systemd as a factor. I will look more closely at what changed between 2023.6.5 and 2023.7.0.