Danielhiversen / PyXiaomiGateway

PyXiaomiGateway
MIT License
152 stars 55 forks source link

3 gateways - after certain time cannot communicate #146

Open andriej opened 5 years ago

andriej commented 5 years ago

Home Assistant release with the issue (if applicable): since few releases, I'm currently running 0.87.1 Last working Home Assistant release (if known):

PyXiaomiGateway library version (if known): latest? installed by pip3 install together with homeassistant

Operating environment (Hass.io/Docker/Windows/etc.): Linux / venv / python 3.6.8

Traceback (if applicable): n/a Description of the bug: I have 3 gateways with properly configured keys and running in one LAN (no vlans or so) with 2-3 AP. I'd say it's rather stable conditions, at least for all other >15 tasmotas etc. running. No multicast filtering inside the network.

The problem: after some period of time (usually quite longer than at least half of day) some of gateways (2 of 3 or 3 all together) go 'out' of network, but partially. Messages are received from them (ngrep and also debug log) but HA can't interact with them and neither change their light status etc.

Read if from bottom to top (it's from log aggregation):

09:32:00 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'sensor_motion.aq2', 'sid': '158d000224xxx', 'short_id': 24341, 'data': '{"no_motion":"180"}'}
09:31:55 DEBUG (Thread-5) [xiaomi_gateway] MCAST (heartbeat) << {'cmd': 'heartbeat', 'model': 'switch', 'sid': '158d0001e8xxx', 'short_id': 57690, 'data': '{"voltage":3032}'}
09:31:54 DEBUG (Thread-5) [xiaomi_gateway] MCAST (heartbeat) << {'cmd': 'heartbeat', 'model': 'sensor_wleak.aq1', 'sid': '158d000237xxxx', 'short_id': 38708, 'data': '{"voltage":3035}'}
09:31:54 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'motion', 'sid': '158d00015xxxxx', 'short_id': 31824, 'data': '{"no_motion":"180"}'}
09:31:18 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'sensor_motion.aq2', 'sid': '158d0001e5xxxx', 'short_id': 27926, 'data': '{"no_motion":"180"}'}
09:31:14 ERROR (SyncWorker_12) [xiaomi_gateway] No data in response from hub None
09:31:14 DEBUG (SyncWorker_12) [xiaomi_gateway] write_ack << None
09:31:14 ERROR (SyncWorker_12) [xiaomi_gateway] Cannot connect to Gateway
09:31:10 ERROR (SyncWorker_1) [xiaomi_gateway] No data in response from hub None
09:31:10 DEBUG (SyncWorker_1) [xiaomi_gateway] write_ack << None
09:31:10 ERROR (SyncWorker_1) [xiaomi_gateway] Cannot connect to Gateway
09:31:04 ERROR (SyncWorker_4) [xiaomi_gateway] No data in response from hub None
09:31:04 DEBUG (SyncWorker_4) [xiaomi_gateway] write_ack << None
09:31:04 ERROR (SyncWorker_4) [xiaomi_gateway] Cannot connect to Gateway
09:31:04 DEBUG (SyncWorker_12) [xiaomi_gateway] _send_cmd >> b'{"cmd": "write", "sid": "34ce008c0674", "data": {"rgb": 0, "key": "313bea267644edab7bxxxxxxxxxxxxxx"}}'
09:31:00 DEBUG (SyncWorker_1) [xiaomi_gateway] _send_cmd >> b'{"cmd": "write", "sid": "7811dcdad7a3", "data": {"rgb": 0, "key": "d52a2dcf56b182865bfxxxxxxxxxxxxx"}}' 

During that there was no issue with Wifi (can't see any Wifi disconnections or DHCP messages) - I have logs alltogether combined for debug reasons. They would be shown if any other message appeared during the issue.

What's important - at least for now (few hours after first error in message): image sensors still report their values (temp, motion, door) and HA see changes immidiately, even while I still can't turn on/off the light. if I restart HA, I loose also sensor readings as all entities will become 'unavailable' and there will be error with setting up the Aqara component.

For the current moment I don't know any reliable way of getting it all back to work - restarting HA almost never helps. What usually helped was restarting (remove plug and plug back in), each of it which had problem and then restart HA. It's not always successfull but can get back situation back to normal for next xx hours.

Is it key exchange problem? Key rotation? Communication? Socket? Please, help me provide data for debug and get things solved.

andriej commented 5 years ago

All of 3 gateways 'went out' after almost 24h. When I came back home I removed each one of them from plug and put back in for power. It made all of them to work without HA restart. [read from bottom to top]

[working as light in HA again]
Feb 18 19:14:11 ha hass[25891]: 2019-02-18 19:14:11 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'gateway', 'sid': '34ce008c136a', 'short_id': 0, 'data': '{"rgb":0,"illumination":106}'}
Feb 18 19:14:11 router arpwatch: bogon 0.0.0.0 34:ce:0:8c:13:6a
Feb 18 19:14:10 router arpwatch: bogon 0.0.0.0 34:ce:0:8c:13:6a
Feb 18 19:14:10 router arpwatch: bogon 0.0.0.0 34:ce:0:8c:13:6a
Feb 18 19:14:10 router arpwatch: bogon 0.0.0.0 34:ce:0:8c:13:6a
Feb 18 19:14:09 router dnsmasq-dhcp[11539]: DHCPACK(br-lan) 192.168.x.x 34:ce:00:8c:13:6a Xiaomi-GW2_korytarz
Feb 18 19:14:09 router dnsmasq-dhcp[11539]: DHCPREQUEST(br-lan) 192.168.x.x 34:ce:00:8c:13:6a
Feb 18 19:14:09 router dnsmasq-dhcp[11539]: DHCPOFFER(br-lan) 192.168.x.x 34:ce:00:8c:13:6a
Feb 18 19:14:09 router dnsmasq-dhcp[11539]: DHCPDISCOVER(br-lan) 34:ce:00:8c:13:6a
Feb 18 19:14:09 ap1 hostapd: wlan1-1: STA 34:ce:00:8c:13:6a WPA: pairwise key handshake completed (RSN)
Feb 18 19:14:09 ap1 hostapd: wlan1-1: AP-STA-CONNECTED 34:ce:00:8c:13:6a
Feb 18 19:14:09 ap1 hostapd: wlan1-1: STA 34:ce:00:8c:13:6a IEEE 802.11: associated (aid 1)
Feb 18 19:14:09 ap1 hostapd: wlan1-1: STA 34:ce:00:8c:13:6a IEEE 802.11: authenticated
[gateway plugged back in]

of course while GW light wasn't working, all of the sensors connected to gateways were working without problem or delays, so gateway was connected to wifi

andriej commented 5 years ago

For third gateway same model (turning off and on) worked also. It seems like somewhere in the middle HA lost the message which made HA think there's no GW in network? or do they stop sending it...

Feb 18 19:20:40 ha hass[25891]: 2019-02-18 19:20:40 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'gateway', 'sid': '34ce008c0674', 'short_id': 0, 'data': '{"rgb":0,"illumination":417}'}
Feb 18 19:20:39 ap3 hostapd: wlan0: STA 64:a2:f9:81:4a:6a WPA: pairwise key handshake completed (RSN)
Feb 18 19:20:39 ap3 hostapd: wlan0: AP-STA-CONNECTED 64:a2:f9:81:4a:6a
Feb 18 19:20:39 ap3 hostapd: wlan0: STA 64:a2:f9:81:4a:6a IEEE 802.11: associated (aid 2)
Feb 18 19:20:39 ap3 hostapd: wlan0: STA 64:a2:f9:81:4a:6a IEEE 802.11: authenticated

will monitor situation further on.

andriej commented 5 years ago

Meanwhile - IGMP Snooping is OFF on every device I have (including managed switch). Gateways are cut-off from the internet via MAC address (access to internet didn't change behavior so I decided to let it be without).

ap1# swconfig dev switch0 show |grep igmp
        igmp_snooping: 0
        igmp_v3: 0
        igmp_snooping: 0
        igmp_snooping: 0
        igmp_snooping: 0
        igmp_snooping: 0
        igmp_snooping: 0
        igmp_snooping: 0
        igmp_snooping: 0
syssi commented 5 years ago

Please make sure there are periodic heartbeat messages in your logs. The heartbeat message (of the gateway) is a token update (key rotation). Furthermore there the gateway provides a heartbeat message per zigbee device. If HA doesn't receive heartbeat messages for a specific device it's marked as unavailable after a specific time. cp. https://github.com/home-assistant/home-assistant/pull/11631

andriej commented 5 years ago

None of entities went "unavailable" in HA (including gateway). So all sensors reports were working and gateways available. Only not possible to turn their light on/off (if lost while on).

Of course also all packets sent by sensors are in debug log with their states.

andriej commented 5 years ago

Also what I usually tried first was restart of HA and almost always it finished with all gateways unavailable and config error od component.

andriej commented 5 years ago

Breakthrough!

After some time I came back home again and 2 of 3 gateways were - again - dead (for light-change commands) but reporting sensors all the time.

What I did this time was nmap scan and...:

# nmap -sU -p 9898 192.168.7.x0

Starting Nmap 7.40 ( https://nmap.org ) at 2019-02-18 22:28 CET
Nmap scan report for Xiaomi-GW3.lan (192.168.x.x0)
Host is up (0.034s latency).
PORT     STATE         SERVICE
9898/udp open|filtered monkeycom
MAC Address: 78:11:DC:DA:D7:A3 (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 0.95 seconds
# nmap -sU -p 9898 192.168.7.x1

Starting Nmap 7.40 ( https://nmap.org ) at 2019-02-18 22:28 CET
Nmap scan report for Xiaomi-GW2.lan (192.168.x.x1)
Host is up (0.10s latency).
PORT     STATE SERVICE
9898/udp open  monkeycom
MAC Address: 34:CE:00:8C:13:6A (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 1.13 seconds
# nmap -sU -p 9898 192.168.7.x2

Starting Nmap 7.40 ( https://nmap.org ) at 2019-02-18 22:28 CET
Nmap scan report for Xiaomi-GW1.lan (192.168.x.x2)
Host is up (0.069s latency).
PORT     STATE         SERVICE
9898/udp open|filtered monkeycom
MAC Address: 34:CE:00:8C:06:74 (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 1.23 seconds

As you can easily guess - the only one working is the middle one! Which state is not filtered.

andriej commented 5 years ago

Maybe gateway get somehow bad token and start to filter incoming messages?

andriej commented 5 years ago

Those 2 gateways hace ICMP ping scheme of:

This GW still report it's door window sensor with no issues!

Feb 18 22:56:27 ha hass[25891]: 2019-02-18 22:56:27 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'sensor_magnet.aq2', 'sid': '158d0001b7b7f5', 'short_id': 38761, 'data': '{"status":"close"}'}
Feb 18 22:56:23 ha hass[25891]: 2019-02-18 22:56:23 DEBUG (Thread-5) [xiaomi_gateway] MCAST (report) << {'cmd': 'report', 'model': 'sensor_magnet.aq2', 'sid': '158d0001b7b7f5', 'short_id': 38761, 'data': '{"status":"open"}'}
andriej commented 5 years ago

Please make sure there are periodic heartbeat messages in your logs. The heartbeat message (of the gateway) is a token update (key rotation). Furthermore there the gateway provides a heartbeat message per zigbee device. If HA doesn't receive heartbeat messages for a specific device it's marked as unavailable after a specific time. cp. home-assistant/home-assistant#11631

Those kind of heartbeat?

Feb 18 23:24:44 ha hass[25891]: 2019-02-18 23:24:44 DEBUG (Thread-5) [xiaomi_gateway] MCAST (heartbeat) << {'cmd': 'heartbeat', 'model': 'sensor_wleak.aq1', 'sid': '158d000239ed27', 'short_id': 53619, 'data': '{"voltage":2815}'}
Feb 18 23:24:33 ha hass[25891]: 2019-02-18 23:24:33 DEBUG (Thread-5) [xiaomi_gateway] MCAST (heartbeat) << {'cmd': 'heartbeat', 'model': 'sensor_motion.aq2', 'sid': '158d0001e50f2a', 'short_id': 53313, 'data': '{"voltage":3015}'}
syssi commented 5 years ago

@andriej I guess your gateways crashs or drops out of the wifi network. In this case it resets to an initial state (undiscovered?) and needs to be discovered (by homeassistant) again. If you restart HA does it work again for some time without pluging off any other device of your network?

andriej commented 5 years ago

No, yesterday I had an oppoturnity to next crash to occur. I've switched my whole WiFi to 40MHz without forcing whole band (so it can drop back to 20MHz).

@syssi: Answering your question: restart of HA would not help, even after longer time. What happens when gateways go light-blind (and still reporting sensors, it think it's quite important) is after I restart HA, that gateway is not discovered anymore and makes it totally unavailable together with entities (as in yellow in lovelace). If I just leave it, sensors (movement, door/window etc.) will be updated but not the light. Some delay may occur.

From what I've checked yesterday, restart of wifi network on APs (down & up) makes gateways properly reconnect and rediscover in HA. Right now I'm wondering what could cause only gateways having problem, while all other WiFi devices work flawlessly. Also it's in my opinion not really wifi problem, as messages from gateway still appear on multicast.

For now: rebooting of gateway or making it re-connect to wifi launches some kind of routine which makes it available back again for next period of time (much longer than 2,5h described in other issues). Token keys are 101% proper, as I've didn't change them since quite some time and they worked flawlessly.

PaulAnnekov commented 5 years ago

We need to have all incoming and outcoming tcp and udp (tcp/udpdump) traffic to each gateways + periodic nmap. E.g. Mikrotik routers have Packet sniffer which can catch all incoming/outgoing traffic for specific IP. Regarding nmap, you can run it in a loop each 1-2 seconds on a separate machine. After that we will have enough data to find the cause.

andriej commented 5 years ago

@PaulAnnekov do you have by any chance scripts/commands on hand for that? So I tcpdump exactly what's wanted.

As it's content will have keys and other sensitive data, I'd rather like to share it with you other way than publishing here. Would like to have it solved not only for me, but for others which have same problem too - as I've found some community issues same as mine on reddit/forums.

Also:

PaulAnnekov commented 5 years ago

Write me to paul.annekov@gmail.com, I need to know your network structure and what APs do you have. Dump should be done on a machine/AP through which all traffic of your gateways goes.

andriej commented 5 years ago

New finding about 'my case'. During weekend I've moved one of the AP to different place just to see if there could be any type of interference. Situation didn't change a lot, gateways and other devices still have connectivity.

What I can see now tho is the situation, when some gateway becomes 'light'-unreachable (so reports all it's sensors connected but can't be turned on/off as the light) and also the same gateway can become available after some time again. So it seems like somehow there could be token-message lost in meanwhile or it's not being sent at all. HA version: 0.88.2

Also one of gateways started to become unavailable (grayed out), but it could be connected to wifi range, will investigate further.

andriej commented 5 years ago

The gateway that became unavailable ended in AP logs as:

hostapd: wlan1-2: STA 78:11:dc:xx:xx:xx IEEE 802.11: authenticated
hostapd: wlan1-2: STA 78:11:dc:xx:xx:xx IEEE 802.11: associated (aid 4)
hostapd: wlan1-2: STA 78:11:dc:xx:xx:xx IEEE 802.11: deauthenticated due to local deauth request

And didn't become online after that. Will check it's RSSI later on

andriej commented 5 years ago

I'm moving my concerns to AP auth and key renegotiations of gateway. Still can't find other issue to be repeatable.

andriej commented 5 years ago

Another update: I didn't really do a lot in meanwhile except of moving one AP to a bit far place from both 2 out of 3 gateways. Also I've changed key-renegotiations for 1/day (in WPA2-AES it's not needed every hour as it was before). Also updated to 0.89.1

For today I couldn't repeat the error, but will still monitor.

andriej commented 5 years ago

Yesterday I've encountered one gateway out of order. It was (and all it's sensors) available in HA, although no state changes made to wifi and further to HA. It went out from wifi connection in 30-40 mins before. What's the timeout it should go unavailable in HA? 2,5h? Could it be set as a configuration parameter for shorter period? Would be easier to debug.

andriej commented 5 years ago

Another update regarding thread: I was testing automation for reporting unavailable gateway and pointing which one is it. I've unplugged one of them and... they are still available in HA! Is there any possibility to poll gateway itself or any other activity than waiting: TIME_TILL_UNAVAILABLE = timedelta(minutes=150) ?

PaulAnnekov commented 5 years ago

Yep, there is a way to poll gateway and devices. Create an issue in https://github.com/home-assistant/home-assistant repo, because it should be done there.

andriej commented 5 years ago

I've migrated today to Conbee stick - couldn't longer accept situation when leak sensors could possibly not report successfully their status (and smoke sensors coming to installation too).