Zigbee2mqtt High-Availability controller prototype

Introduction

High Availability controller prototype for two zigbee2mqtt instances with independent USB Zigbee dongles (specifically implemented for Sonoff ZBDongle-P). The final purpose is enabling that one of the zigbee2mqtt instances could play the "active" role, meanwhile the second one is in "stand-by" status for a single Zigbee network (enabling also an automated switchover between active and stand-by nodes in case of detecting problems in the active instance). Please note that Zigbee standard only allows to define a single coordinator node, so the only available high-availability model that could be used is active-stand by. Detailed features:

Automatic synchronization of zigbe2mqtt configuration and data files from active to stand-by instance, to ensure an up-to-date Zigbee and MQTT interactions in case of active to stand-by switch over is triggered.
Automatic synchronization of Sonoff ZDongle-P coordinator's Non-Volatile RAM, from active to stand-by instance, in each zigbee2mqtt node where the dongle is connected to. It enables a seamless change of Zigbee coordinator from active to stand-by node when required (avoiding any impact on the rest of Zigbee mesh nodes, not requiring to re-join them to the network).
Control point enabled in a third node, where MQTT broker is working, to detect issues with active zigbmee2mqtt instance and trigger the switchover to change the service to use the stand-by node. The prototype includes the automated management of this control point in a Home Assistant instance with MQTT component configured.

The overall design is shown in the following diagram:

zigbee2mqtt HA architecture

Environment details

Find below some relevant details about the prototype environment used. In case of using a different environment, it could be required to make some minor adaptations in the scripts provided:

Zigbee coordinators (2x): Sonoff ZBDongle-P, with Z-Stack-firmware (https://github.com/Koenkk/Z-Stack-firmware/tree/master/coordinator/Z-Stack_3.x.0/). In case of using a different one, it could involve to change the commands to read and write non-volatile memory in the dongles.
Zigbee2mqtt nodes (2x): usage of up-to-date Alpine Linux distribution to execute zigbee2mqtt service (https://www.zigbee2mqtt.io/). Zigbee2mqtt application is deployed and started and stopped using Alpine service commands defined for that purpose. In case of using a different distribution, some minor changes could be required for remote start and stop of zigbee2mqtt from controlling scripts.
- Example of Alpine service content of /etc/init.d/zigbee2mqtt to start and stop the application with rc-service command: https://github.com/chemadh/zigbee2mqtt_ha/blob/main/zigbee2mqtt_alpine_service_example . The zigbee2mqtt service should NOT be launched on node startup, otherwise the HA control prototype will not work properly.
- Alpine Linux user is expected to allow sudo, to allow access to Linux service scripts to start and stop zigbee2mqtt, as well as allowing zigpy-znp to access to dongle USB port.
MQTT broker node: Usage of Linux Debian distribution for High-Availability control script execution. Home assistant containing the MQTT broker component communicating with Zigbee2mqtt deployed in Docker mode in the same Linux node. No impact expected by using a different Linux distribution or directly Home Assistant OS. Some reference instructions:
- Home Assistant supervised setup: https://community.home-assistant.io/t/installing-home-assistant-supervised-using-debian-12/200253
- Installation of Mosquitto MQTT broker Home Assitant component: https://www.home-assistant.io/integrations/mqtt

Common components used in the nodes previously defined:

Linux packages: rsync, snmp (net-snmp-tools for Alpine; snmpd, snmp, libsnmp-dev, for Debian).
- In addition to this, specifically for zigbee2mqtt nodes: python3, py3-pip. Installation of zigpy-znp python component (pip install zigpy-znp). Aditional info in this link: https://github.com/zigpy/zigpy-znp/blob/0cacf7a51d205ac3a19acde10a8115cf5ac36ce1/TOOLS.md
NTPD or Timesyncd time synchronization service to be active in each node to ensure a correct synchronization of most recent files. It can be skipped if the nodes are virtualized and obtaining time reference from a hypervisor cluster (like Proxmox Virtual Environment - https://pve.proxmox.com/wiki/Main_Page -).
SNMP MIBs: The scripts notify about execution results using SNMP traps. The MIBs to be included in each Linux node using the proposed scripts are stored in https://github.com/chemadh/zigbee2mqtt_ha/tree/main/MIBs . In case of no monitoring system available in your installation, the scripts can be configured to not use SNMP, so these packages and MIB installation could be skipped.
Enable remote ssh connection between components (MQTT broker and zigbee2mqtt nodes) without interactive credentials. Some example instructions here: https://www.thegeekdiary.com/how-to-run-scp-without-password-prompt-interruption-in-linux/

Zigbee dongle configuration:

Zigbee IEEE address of active Zigbee coordinator (Sonoff ZBDongle-P) must be flashed as secondary IEEE address of the stand-by Zigbee coordinator. It is required to be identified as the same node by the rest of the Zigbee network when the service is switched-over between coordinators. Please note that some of the flashing tools only shows the primary IEEE address, but it doesn't mean that the secondary address is not effectively updated. Some guides below to update firmware, read and write IEEE address in Sonoff ZBDongle-P:

Scripts for synchronization of configuration files from active to stand-by zigbee2mqtt

Each zigbee2mqtt High-Availability instance needs to synchronize the configuration files from active to stand-by instances. The following scripts to be deployed in each zigbee2mqtt instance are shared to achieve it:

Both files include the same logic, with different example configuration parameters to define local or remote zigbee2mqtt node variables. The first lines in each script contains the configuration variables to be updated to each environment. Explanation of each parameter, below:

local_aux_dir: Path of the local zigbee2mqtt instance auxiliary directory where the up-to-date active configuration files will be stored. Example value: /home/zigbee/scripts/zigbee2mqtt_config/
remote_conf_dir: SSH path (including IP and username) to the remote zigbee2mqtt instance directory where the Zigbee2mqtt application reads and updates the configuration, when it is active. Example value: zigbee@192.168.34.124:/opt/zigbee2mqtt/data/
local_conf_dir: Path of the local zigbee2mqtt instance directory where the Zigbee2mqtt application reads and updates the configuration, when it is active. Example value: /opt/zigbee2mqtt/data/
local_user_name: Linux username to execute rsync command. It should be enabled to execute SSH over remote zigbee2mqtt node without requiring interactive credentials. Example value: zigbee
__snmp_server__: SNMP server where the script will send SNMP traps (v2) notifying about the result of the execution. If this functionality is not required, the value of the variable should be leaved empty. Example value: 192.168.34.103:1234
__snmp_host__: Source SNMP host of the trap, following the MIB defined for this purpose. It can be left empty if the SNMP functionality is not required. Example value: zigbee2mqtt1
__local_zigbee2mqtt_frontend_ip__: Zigbee2mqtt normally uses a web frontend to enable application management. Since the web frontend IP is defined in the configuration files, it is required to be updated when synchronizing the configuration files between different nodes. This parameter defines de local zigbee2mtt instance web frontend IP. Example value: 192.168.34.123
__remote_zigbee2mqtt_frontend_ip__: The same as the parameter above, but in this case, referring to the remote zigbee2mqtt instance web frontend IP. Example value: 192.168.34.124

The scripts are intended to be run with cron or timer services (i.e. each 5 minutes), to ensure an automatic zigbmee2mqtt configuration synchronization. Example of this type of config in Alpine Linux with root user (more instructions in https://lucshelton.com/blog/cron-jobs-with-alpine-linux-and-docker/):

include the following content in /etc/crontabs/root: /5 * run-parts /etc/periodic/5min
execute mkdir /etc/periodic/5min
include the script content in /etc/periodic/5min/syncZigbee2mqttConfig
- No .sh extension should be added to the service script.

Scripts for Zigbee2mqtt High-Availability centralized control

A third node will control the active to stand-by failover action between zigbee2mqtt coordinators (whose configuration is synchronized using the scripts defined above). It makes sense that this third node should be the MQTT broker node (like Mosqitto running in Home Assistant environment), since this element will be notified in case the communication with active zigbee2mqtt fails. The following set of scripts are provided for this purpose:

ping_mqtts.sh

Script to check connectivity from controller to both zigbee2mqtt instances. It is intended to check connectivity with both nodes before making an scheduled active to stand-by switchover. The first lines in the script contains the configuration variables to be updated for each environment. Explanation of each parameter, below:

__zigbee2mqtt1_IP__: IP of first zigbee2mqtt instance. Example value: 192.168.34.123
__zigbee2mqtt2_IP__: IP of second zigbee2mqtt instance. Example value: 192.168.34.124
__snmp_server__: SNMP server where the script will send SNMP traps (v2) notifying about the result of the execution. If this functionality is not required, the value of the variable should be leaved empty. Example value: 192.168.34.103:1234
__snmp_host__: Source SNMP host of the trap, following the MIB defined for this purpose. It can be left empty if the SNMP functionality is not required. Example value: homeassistant

The script can be directly executed from the Linux prompt. No command-line parameters are required.

stopZigbee2mqtt1.sh / stopZigbee2mqtt2.sh

Couple of scripts to stop remotely each zigbee2mqtt instance. Used by the controller node to initiate a manual switchover between active and stand-by nodes, when MQTT can detect service interruption in the active zigbee2mqtt node. The first lines in the script contains the configuration variables to be updated for each environment. Explanation of each parameter, below:

zigbee2mqtt1_remote_ssh / zigbee2mqtt2_remote_ssh (Depending on the script for each zigbee2mqtt instance): SSH user and IP to use for remote connection with the zigbee2mqtt node. The user should be previously configured to not require interactive login. Example value: "zigbee@192.168.34.123"
zigbee2mqtt_stop_cmd: Remote stop command to execute in the remote SSH session to stop the zigbee2mqtt instance. In the case of using an Alpine Linux distribution for zigbee2mqtt (like in this prototype), the following can be used: "sudo rc-service zigbee2mqtt stop"
__snmp_server__: SNMP server where the script will send SNMP traps (v2) notifying about the result of the execution. If this functionality is not required, the value of the variable should be leaved empty. Example value: 192.168.34.103:1234
__snmp_host__: Source SNMP host of the trap, following the MIB defined for this purpose. It can be left empty if the SNMP functionality is not required. Example value: homeassistant

The script can be directly executed from the Linux prompt. No command-line parameters are required.

activeZigbee2mqtt1.sh / activeZigbee2mqtt2.sh

Couple of scripts to perform zigbee2mqtt switchover, changing the active service in the first or second node respectively. Sequence of activities followed by each script:

activeZigbee2mqtt1.sh: it applies switchover to active the service in zigbee2mqtt first node.
- Stop zigbee2mqtt second node, that should be in active status.
- Stop zigbee2mqtt first node, that should be in stand-by status (to ensure it is not previously active).
- Extract Zigbee USB dongle coordinator NVRAM memory content from zigbee2mqtt second node, copying it to zigbee2mqtt first node.
- Load coordinator NVRAM memory content in zigbee2mqtt first node's dongle, load the synchronized zigbee2mqtt configuration files and start the first node instance.
- Restart zigbee2mqtt second node, to leave it clean in case of scheduling a new switchover further. Zigbee2mqtt should not start on system startup avoid interactions with this mechanism.
activeZigbee2mqtt2.sh: it applies switchover to active the service in zigbee2mqtt second node.
- Stop zigbee2mqtt first node, that should be in active status.
- Stop zigbee2mqtt second node, that should be in stand-by status (to ensure it is not previously active).
- Extract Zigbee USB dongle coordinator NVRAM memory content from zigbee2mqtt first node, copying it to zigbee2mqtt second node.
- Load coordinator NVRAM memory content in zigbee2mqtt second node's dongle, load the synchronized zigbee2mqtt configuration files and start the second node instance.
- Restart zigbee2mqtt first node, to leave it clean in case of scheduling a new switchover further. Zigbee2mqtt should not start on system startup avoid interactions with this mechanism.

As described above, both scripts define a similar logic, with different details to apply the switchover to a different node. The first lines in the script contains the configuration variables to be updated for each environment. Explanation of each parameter, below:

zigbee2mqtt1_remote_ssh: SSH user and IP to use for remote connection with the first zigbee2mqtt node. The user should be previously configured to disable interactive login. Example value:"zigbee@192.168.34.123"
zigbee2mqtt2_remote_ssh: SSH user and IP to use for remote connection with the second zigbee2mqtt node. The user should be previously configured to disable interactive login. Example value:"zigbee@192.168.34.124"
zigbee2mqtt_stop_cmd: Remote stop command to execute in the remote SSH session to stop the zigbee2mqtt instance. In the case of using an Alpine Linux distribution like in this prototype, the following can be used: "sudo rc-service zigbee2mqtt stop"
zigbee2mqtt_start_cmd: Remote start service command to execute in the remote SSH session to launch zigbee2mqtt instance. In the case of using an Alpine Linux distribution like in this prototype, the following can be used: "sudo rc-service zigbee2mqtt start"
__snmp_server__: SNMP server where the script will send SNMP traps (v2) notifying about the result of the execution. If this functionality is not required, the value of the variable should be leaved empty. Example value: 192.168.34.103:1234
__snmp_host__: Source SNMP host of the trap, following the MIB defined for this purpose. It can be left empty if the SNMP functionality is not required. Example value: homeassistant
coordinator_port_zigbee2mqtt1: USB port where Zigbee USB dongle is available in zigbee2mqtt first node. Example value: "/dev/ttyUSB0"
coordinator_port_zigbee2mqtt2: USB port where Zigbee USB dongle is available in zigbee2mqtt second node. Example value: "/dev/ttyUSB0"
zigbee2mqtt_nvram_path: Directory where the remote NVRAM dump file from USB coordinator dongle will be stored. Example value: "/home/zigbee/scripts/"
zigbee2mqtt_nvram_file: Name of the remote NVRAM dump json file from USB coordinator dongle. Example value: "backup_coordinator_nvram.json"
local_aux_dir: Path of the remote zigbee2mqtt node directory where the up-to-date active configuration files are be stored. Example value: /home/zigbee/scripts/zigbee2mqtt_config/
local_conf_dir: Path of the remote zigbee2mqtt node directory where the Zigbee2mqtt application reads and updates the configuration when it is active. Example value: /opt/zigbee2mqtt/data/

The scripts can be directly executed from the Linux prompt. No command-line parameters are required.

Usage of High-Availability control scripts from Home Assistant

This section assumes that the controller scripts defined previously are deployed in a Linux system where Home Assistant is also available. The two zigbee2mqtt remote nodes should be also running with automatic zigbee2mqtt config synchronization up and running in the same LAN.

In addition to this, there are some limitations related to the Home Assistant's shell script environment that needs to be addressed.

Environment preparation for Home Assistant installed in Docker

Running shell scripts in Home Assistant under Docker environment is problematic in case of requiring additional packages, like in our case, SNMP. The reason for this is the Home Assistant upgrade process: it will not preserve the custom packages installed by the user under the Home Assistant Docker environment. In addition to this, there is a maximum time of script execution defined in Home Assistant by design: 1 minute. The proposed workaround to avoid these issues is executing the scripts remotely, connecting from Docker to de host machine where Home Assistant Docker is running (The control scripts defined above should be stored there).

To follow this approach, the Home Assistant Docker environment should enable the next points (it should be accessible from host machine executing sudo docker exec -it homeassistant bash). This parametrization is mainly needed to enable SSH connection without interactive credentials again:

Generate the RSA public file for the Docker host machine user to connect to via SSH from Home Assistant Docker. In order to avoid accidental removal during updates, it is recommended to be stored in /config/.ssh/id_rsa.pub path of Home Assistant Docker.
Define a custom "known hosts" file also in a path where Home Assistant Docker won't remove it. It is recommended to be stored in /config/.ssh/known_hosts.
Creation of "sh" directory in root dir of Home Assistant Docker. Local Docker scripts will be defined there to link with the High Availability control scripts in the Docker Host machine.

The scripts to define inside the Home Assistant Docker are defined below. The Parameters to replace directly in the contents of these scripts are:

< user >: User of Home Assistant host machine
< ip >: IP of host machine
< script_path >: Path directory where the controller scripts are available in the host machine

ping_mqtts.sh

#!/bin/bash

ssh -o UserKnownHostsFile=/config/.ssh/known_hosts -i /config/.ssh/id_rsa <user>@<ip> '<script_path>/ping_mqtts.sh'

stopZigbee2mqtt1.sh

#!/bin/bash

ssh -o UserKnownHostsFile=/config/.ssh/known_hosts -i /config/.ssh/id_rsa <user>@<ip> '<script_path>/stopZigbee2mqtt1.sh'

stopZigbee2mqtt2.sh

#!/bin/bash

ssh -o UserKnownHostsFile=/config/.ssh/known_hosts -i /config/.ssh/id_rsa <user>@<ip> '<script_path>/stopZigbee2mqtt2.sh'

activeZigbee2mqtt1.sh

This is where the maximum Home Assistant shell script execution time of 1 minute should be avoided: the full fallback procedure usually takes more than this (mainly because of time required to extract and write NVRAM dumps in Zigbee USB dongles). In consequence, the script below is defined to execute switchover operations in second plane (using nohup parameter). The results are stored in a file, just if it needs to be checked (result.log in the same script path).

#!/bin/bash

ssh -o UserKnownHostsFile=/config/.ssh/known_hosts -i /config/.ssh/id_rsa -f <user>@<ip> 'rm <script_path>/result.log; nohup <script_path>/activeZigbee2mqtt1.sh ><script_path>result.log 2>&1 </dev/null &'

activeZigbee2mqtt2.sh

Same comment applicable from activeZigbee2mqtt1, above.

#!/bin/bash

ssh -o UserKnownHostsFile=/config/.ssh/known_hosts -i /config/.ssh/id_rsa -f <user>@<ip> 'rm <script_path>/result.log; nohup <script_path>/activeZigbee2mqtt2.sh ><script_path>result.log 2>&1 </dev/null &'

configuration.yaml additions in Home Assistant

Once the previous steps are completed, the following content can be added to configuration.yaml of Home Assistant in order to enable zigbee2mqtt as a sensor that can be monitored. The final purposes are:

Identification of the Zigbee2mqtt instance that is currently running, identified by its IP.
Identification of the active Zigbee2mqtt instance status. It allows Home Assistant automations to trigger the active to stand-by instance failover, when a transition to stop status is detected in the active Zigbee2mqtt node.

mqtt:
  binary_sensor:
    - name: "Zigbee2mqtt bridge status"
      state_topic: "zigbee2mqtt/bridge/state"
      unique_id: "zigbee2mtt bridge status"
      value_template: '{{ value_json.state}}'
      payload_on: "online"
      payload_off: "offline"
      device_class: running
  sensor:
    - name: "Zigbee2mqtt bridge info"
      unique_id: "zigbee2mtt bridge info"
      state_topic: "zigbee2mqtt/bridge/info"
      value_template: >
        "{% if "192.168.34.123" in value %}
           {{ "192.168.34.123" }}
         {% endif %}
         {% if "192.168.34.124" in value %}
           {{ "192.168.34.124" }}
         {% endif %}"

Please update 192.168.34.123 and 192.168.34.124 IPs with your deployment zigbee2mqtt instances IPs.

In addition to this, The following content should be also added to configuration.yaml file of Home Assistant, in order to make available to Home Assistant's automations the High-Availability zigbee2mqtt controller scripts (previously stored inside the Home Assistant Docker):

shell_command:
  failover_to_mqtt1: "bash ./sh/activeZigbee2mqtt1.sh"
  failover_to_mqtt2: "bash ./sh/activeZigbee2mqtt2.sh"
  ping_mqtts: "bash ./sh/ping_mqtts.sh"
  stop_mqtt1: "bash ./sh/stopZigbee2mqtt1.sh"
  stop_mqtt2: "bash ./sh/stopZigbee2mqtt2.sh"

Scripts' path should be updated in case of using a different location in Home Assistant storage (using Docker or directly OS distribution).

Home Assistant needs to be rebooted to reload configuration.yaml with new contents. If everything goes OK, the summary panel should show something similar to the following capture:

Home Assistant capture

Home Assistant Scripts

The following extract of Home assistant scripts.yaml integrates the zigbee2mqtt functionalities with the Home Assistant control system. Zigbee2mqtt nodes are identified by their IPs (192.168.34.123 / .124), so it should be updated to the IPs in use in each deployment.

The following entry checks the running status and applies the switchover to stand-by instance if the active is not in running status.

zigbee2mqtt_check:
alias: zigbee2mqtt check
sequence:
- variables:
  shellscriptres:
- alias: Control de estado zigbee2mqtt
if:
- condition: state
  entity_id: binary_sensor.zigbee2mqtt_bridge_status
  state: 'on'
then:
- service: notify.persistent_notification
  data:
    message: zigbee2mqtt running
    title: estado zigbee2mqtt
  alias: Notificamos zigbee2mqtt running
else:
- service: notify.persistent_notification
  data:
    message: zigbee bridge is not running
    title: zigbee2mqtt
  alias: Notification about zigbee2mqtt status is not running
- alias: If zigbee2mqtt node is 192.168.34.123, applying swichover to .124
  if:
  - condition: template
    value_template: '{{ ''192.168.34.123'' in states(''sensor.zigbee2mqtt_bridge_info'')  }} '
  then:
  - alias: Notification of failover to .124 zigbee2mqtt instance
    service: notify.persistent_notification
    data:
      message: 'Zigbee2mqtt node stopped: 192.168.34.123. Applying switchover to 192.168.124'
      title: zigbee2mqtt failover
  - delay:
      hours: 0
      minutes: 0
      seconds: 10
      milliseconds: 0
  - service: shell_command.failover_to_mqtt2
    data: {}
    response_variable: shellscriptres
- alias: If zigbee2mqtt node is 192.168.34.124, applying swichover to .123
  if:
  - condition: template
    value_template: '{{ ''192.168.34.124'' in states(''sensor.zigbee2mqtt_bridge_info'')  }} '
  then:
  - alias: notificamos failover a 123
    service: notify.persistent_notification
    data:
      title: zigbee2mqtt failover
      message: 'Zigbee2mqtt node stopped: 192.168.34.124. Applying switchover to 192.168.123'
  - delay:
      hours: 0
      minutes: 0
      seconds: 10
      milliseconds: 0
  - service: shell_command.failover_to_mqtt1
    data: {}
    response_variable: shellscriptres
mode: single
icon: mdi:home-clock

The following entry apply the switchover by stopping the active zigbee2mqtt node in a periodic manner. It should force the execution of the automation detecting that the active node is not running, and a switchover needs to be done.

periodic_zigbe2mqtt_switchover:
  alias: periodic zigbee2mqtt switchover
  sequence:
  - variables:
      ping_nodes_result:
      mqtt_change_result:
  - service: shell_command.ping_mqtts
    data: {}
    response_variable: ping_nodes_result
    alias: Call to script ping_mqtts
  - alias: Checking pings result
    if:
    - condition: template
      value_template: '{{ ping_nodes_result[''stdout''] == ''success'' }}'
    then:
    - service: notify.persistent_notification
      data:
        message: Ping OK
        title: Ping MQTTs result
    - alias: If zigbee2mqtt node is 192.168.34.123, making switchover to .124
      if:
      - condition: template
        value_template: '{{ ''192.168.34.123'' in states(''sensor.zigbee2mqtt_bridge_info'')  }} '
      then:
      - alias: Failover to .124 instance notification
        service: notify.persistent_notification
        data:
          message: 'identified node: 192.168.34.123. switching over to 192.168.124'
          title: zigbee2mqtt periodic switchover
      - service: shell_command.stop_mqtt1
        data: {}
        response_variable: mqtt_change_result
    - alias: If zigbee2mqtt node is 192.168.34.124, making switchover to .123
      if:
      - condition: template
        value_template: '{{ ''192.168.34.124'' in states(''sensor.zigbee2mqtt_bridge_info'')  }} '
      then:
      - alias: Failover to .123 instance notification
        service: notify.persistent_notification
        data:
          message: 'identified node: 192.168.34.124. switching over to 192.168.123'
          title: zigbee2mqtt periodic switchover
      - service: shell_command.stop_mqtt2
        data: {}
        response_variable: mqtt_change_result
    else:
    - service: notify.persistent_notification
      data:
        message: Ping Error
        title: Result of Ping MQTTs
  mode: single
  icon: mdi:arrow-oscillating

Home Assistant automations

Finally, the following Home Assistant automations.yaml snippets defines the upper layer of the high-availability Zigbee2mqtt control system.

Automation to detect the zigbee2mqtt status change leaving the running status, to trigger the switchover to stand-by node script.
id: '1702239851380' alias: zigbee2mqtt stop detection description: '' trigger:
- platform: state entity_id:
- binary_sensor.zigbee2mqtt_bridge_status from: 'on' to: 'off' condition: [] action:
- service: script.zigbee2mqtt_check data: {} mode: single
Automation to perform a periodic switchover between zigbee2mqtt instances, to ensure a healthy status of both instances and dumping a reasonable fresh version of active USB coordinator dongle NVRAM content (it could be required to manually flush a new instance of the coordinator in case of active system failure).

- id: '1702417248508'
  alias: Weekly zigbee2mqtt switch-over
  description: ''
  trigger:
  - platform: time
    at: 03:31:00
  condition:
  - condition: time
    weekday:
    - wed
    - sun
  action:
  - service: script.periodic_zigbe2mqtt_switchover
    data: {}
  mode: single

Validation scenario of the solution

The prototype is up and running during the last months applying two switchovers per week. during this time, the Zigbee network has been expanded from few nodes to use more than 20 devices between sensors and actuators of different types and brands. No issues found up to now.

It has also been tested by directly disconnecting the active USB dongle from the environment several times (spaced a reasonable time to ensure a full resynchronization). It has always leaded to a correct service interruption detection in Home Assistant and switchover to stand-by node (with the caveat of not being able to load the last up-to-date NVRAM dump from active node, but not causing a service disruption meanwhile the last dump would be reasonably recent -by previously applying scheduled switchovers-).

In order to achieve a full High Available deployment, Home Assistant itself (or the controller selected) should be deployed also in a fault tolerant manner. Since this component is not depending on hardware devices, it is easier to achieve than in the case of Zigbee2mqtt (that depends on the Zigbee coordinator hardware). In example, it can be achieved by deploying the solution on top of a redundant virtualized infrastructure, like Proxmox Virtual Environment.

Possible further improvements

Hopefully a similar functionality would be included natively in the future to Zigbee2mqtt and Mosqitto MQTT broker, accessible via configuration instead of requiring external scripts like these ones.
The main caveat of the present prototype approach is the lack of the functionality of extracting NVRAM dump file of USB Dongle without stopping the Zigbee2mqtt coordinator. IT seems to be due to Sonoff ZBDongle-P with Z-Stack-firmware / Zigbee2mqtt limitations that are not allowing to make any dump meanwhile Zigbee2mqtt is using the coordinator USB port. Unfortunately, it only leaves as alternative to schedule a periodic switchover to have a reasonable fresh dump to apply to the stand-by coordinator in case of active node failure, with the cost of an around 1 minute of Zigbee service interruption (required to complete the switchover).

chemadh / zigbee2mqtt_ha

readme

Zigbee2mqtt High-Availability controller prototype

Introduction

Environment details

Scripts for synchronization of configuration files from active to stand-by zigbee2mqtt

Scripts for Zigbee2mqtt High-Availability centralized control

ping_mqtts.sh

stopZigbee2mqtt1.sh / stopZigbee2mqtt2.sh

activeZigbee2mqtt1.sh / activeZigbee2mqtt2.sh

Usage of High-Availability control scripts from Home Assistant

Environment preparation for Home Assistant installed in Docker

ping_mqtts.sh

stopZigbee2mqtt1.sh

stopZigbee2mqtt2.sh

activeZigbee2mqtt1.sh

activeZigbee2mqtt2.sh

configuration.yaml additions in Home Assistant

Home Assistant Scripts

Home Assistant automations

Validation scenario of the solution

Possible further improvements