Jc2k / aiohomekit

asyncio for homekit
Apache License 2.0
52 stars 19 forks source link

Out of order pairing steps when using HomeSpan (async race condition?) #343

Open phybros opened 10 months ago

phybros commented 10 months ago

HomeSpan is a library for ESP32 devices to enable them to speak HomeKit. After a huge amount of tests, it seems that when pairing these devices with Home Assistant via the HomeKit Device integration, pairing intermittently fails in a variety of ways.

I've included a massive amount of logs in my original HomeSpan bug report, but the description of what's happening from the devs makes it sound like there's some sort of race condition in aiohomekit which is causing the pairing steps to happen in a random order (nondeterministically).

The HomeSpan issue link is here: https://github.com/HomeSpan/HomeSpan/issues/680

but the important part I think is here:

Thanks -- these details are helpful, though I'm not sure there is much I can do. As you can see from your first log output, the first thing HA is doing is asking for a pair-verify. That makes no sense - it first needs to set up pairing before it can open a secure paired channel. However, this won't cause a problem for HomeSpan as it just reports the error and waits for the next input. Based on your log file, HA does send the correct pair-setup request, but waits for over a minute before sending. I have no idea why HA would wait so long. As you can see, it correctly send the 3 steps for pairing in the correct order (M1, M3, and then M5) and the device properly pairs.

However, 10 seconds later HA seems to open a new connection and tries to pair again. Since the device is already paired, HomeKit responds with an error in accordance with the HAP protocol. Next, HA tries to create a secure connection using the steps for a pair-verify. Since the device is paired, this works and the verification completes. I do not know why HA does not think the device is paired since it has gone as far to open a secure connection.

As you noted, the second log file shows a correct pairing and verification as HA sent all the right steps in the correct order.

In the third log file, HA seems to be trying to verify a secure connection before it has finished all the pairing steps. HAP does not specify what is supposed to happen if a device interrupts a pair-setup with a misplaced pair-verify. I coded HomeSpan to reset the pairing procedure whenever that occurs so things can start fresh, but it looks like HA tries to continue the pairing setup by sending an M3 (which is the second step of pairing) --- by that time HomeSpan has already reset the pairing sequence and is expected to see M1 (which is the first step).

Based on the above logs, it almost seems as if HA is running two parallel processes to try and pair/verify, such that depending on the exact timing, if one process completes everything works, but if they interfere, the behavior is indeterminant.

Could this be a bug in aiohomekit? Or maybe the Home Assistant integration itself? HomeSpan devices always pair flawlessly with "real" HomeKit controllers (e.g. via the Home app) and claim to be fully compliant with the spec, so I'm not sure what's going on

Jc2k commented 10 months ago

It sounds like pairing worked once and then you reset the esp but not the home assistant integration. So the existing pairing in ha is trying to establish a connection, but now you are trying to pair it again. So the device sees the results of 2 concurrent tasks - one trying to use the previous (stale) pairing, and one trying to make a new one.

phybros commented 10 months ago

That would make sense but I also tested it after removing the HA integration and even restarting HA, along with totally wiping the ESP and starting "from scratch" with completely different device IDs etc. Even when doing that process multiple times it still produces these random results.

The "already paired" issue is only one of the ways it fails. The other main one is the issue shown here

image

Again, this can happen after resetting, rebooting, removing the integration and starting over etc.

On Sat, Nov 11, 2023 at 10:56 AM Jc2k @.***> wrote:

It sounds like pairing worked once and then you reset the esp but not the home assistant integration. So the existing pairing in ha is trying to establish a connection, but now you are trying to pair it again. So the device sees the results of 2 concurrent tasks - one trying to use the previous (stale) pairing, and one trying to make a new one.

— Reply to this email directly, view it on GitHub https://github.com/Jc2k/aiohomekit/issues/343#issuecomment-1806850844, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAPEU3FYQN7S3F2O3CX3FM3YD6NZJAVCNFSM6AAAAAA7HKZIDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBWHA2TAOBUGQ . You are receiving this because you authored the thread.Message ID: @.***>

phybros commented 10 months ago

Incidentally I did also try it on a completely different instance of HA on a different network at a different house and got the same issue the very first time trying to pair it.

Just to be sure, what are the steps you would suggest going through to get the HA integration to stop trying to contact the device after removing it from the integrations page?

Jc2k commented 10 months ago

The pair verify code path only happens when there is a pairing. It loads secrets from a config entry, and there is no config entry until pairing has happened.

It would probably fill your logs with exceptions if it tried.

That should mean that it'd be impossible to see pair verifys from ha if you removed the integration and then restarted ha to make sure it was gone. The restart shouldn't be needed but it's possible you've discovered an edge case where a faulty pairing can't be cleanly unloaded.

Jc2k commented 10 months ago

We know there are ~28k installs of this integration, and this is the first time I can remember this behaviour coming up, and you can recreate it in 2 distinct environments.

That makes me wonder if there is an environmental variable (eg we have had trouble with mdns repeaters in the past, and that's how paired/unpaired status is relayed, rather than via the api). We might want to examine logs from the zeroconf integration to make sure it's getting the right data there.

Duplicate device ids could also be a problem and cause this kind of confusion. Hopefully they are automated by homespan and fully random so I wouldn't expect that to be a problem.

phybros commented 10 months ago

Thanks for all this info. I also suspected network weirdness, which is why I tried it somewhere else with a super normal wifi/networking situation. And now that I'm thinking back maybe it did work fine the very first time in that environment and then had issues subsequent times.... hmm. I'll go away and try to do a more scientific set of tests and let you know!

phybros commented 10 months ago

I tried just un-pairing and re-pairing from within HA a few times. When unpairing, it all works well, the HomeSpan devices sets itself back to unpaired etc, the integration disappears from HA. The device is immediately re-discovered by HA and I can re-pair no problem (MOST of the time), however it does put two things into the HA logs:

Logger: homeassistant.components.homekit_controller.connection
Source: components/homekit_controller/connection.py:782
Integration: HomeKit Device (documentation, issues)
First occurred: 2:22:31 PM (1 occurrences)
Last logged: 2:22:31 PM

HomeKit device update skipped as previous poll still in flight: CF:97:67:3A:BC:A6

and

Logger: aiohomekit.utils
Source: /usr/local/lib/python3.11/site-packages/aiohomekit/utils.py:45
First occurred: 2:22:35 PM (1 occurrences)
Last logged: 2:22:35 PM

Failure running background task: Task-4125
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/utils.py", line 40, in _handle_task_result
    task.result()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/connection.py", line 590, in _reconnect
    return await self._connect_once()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/connection.py", line 698, in _connect_once
    request, expected = state_machine.send(response)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/protocol/__init__.py", line 596, in get_session_keys
    handle_state_step(response_tlv, TLV.M4)
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/protocol/__init__.py", line 90, in handle_state_step
    error_handler(tlv_dict[TLV.kTLVType_Error], f"step {expected_state}")
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/protocol/__init__.py", line 64, in error_handler
    raise AuthenticationError(stage)
aiohomekit.exceptions.AuthenticationError: step bytearray(b'\x04')

But as I said, the device is configured fine, and everything works as it should MOST of the time. Occasionally it will give the step bytearray(b'\x04') but if I try again 2 or 3 more times, getting that error in the UI each time, it eventually works

One intersting thing I noticed while looking at HomeSpan's debug logs is that even after I un-pair the device, and the integration vanishes from HA, the device is still receiving HTTP requests periodically like this:

2023-11-11 14:34:04 --> ** Client #12 Connected: (87 sec) 192.168.86.201 on Socket 3/16
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> <<<<<<<<< 192.168.86.201 <<<<<<<<<
2023-11-11 14:34:04 --> POST /pair-verify HTTP/1.1
2023-11-11 14:34:04 --> Host: 192.168.86.46
2023-11-11 14:34:04 --> Content-Length: 37
2023-11-11 14:34:04 --> Content-Type: application/pairing+tlv8
2023-11-11 14:34:04 --> ------------ END BODY! ------------
2023-11-11 14:34:04 --> STATE(1) 01
2023-11-11 14:34:04 --> PUBKEY(32) 7D278B1D4A959D2C111F91B1D4ABB04220BCF7CFA38E29A0B14F16BA70DBBE59
2023-11-11 14:34:04 --> -------
2023-11-11 14:34:04 --> ----- END TLVS! ------------
2023-11-11 14:34:04 --> In Pair Verify #12 (192.168.86.201)...
2023-11-11 14:34:04 --> *** ERROR: Device not yet paired!
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> >>>>>>>>>> 192.168.86.201 >>>>>>>>>>
2023-11-11 14:34:04 --> HTTP/1.1 200 OK
2023-11-11 14:34:04 --> Content-Type: application/pairing+tlv8
2023-11-11 14:34:04 --> Content-Length: 6
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> STATE(1) 02
2023-11-11 14:34:04 --> ERROR(1) 01
2023-11-11 14:34:04 --> ------------ SENT! --------------

Almost like HA is still doing things in the background even though the integration is gone. This still happens after I reboot the ESP device also. It wakes up and starts getting these requests. The IP ending 201 is my home assistant server BTW.

If I restart HA in this situation, the HTTP requests stop, and they don't start again once HA is back up.

After HA is back up, I try to pair and almost always get the "device is already paired", and I can see in HomeSpan's logs that HA is requesting to pair, pairing, and then requesting to pair a second time, resulting in the error.

The second set of tests I did also involved wiping the ESP device and having it re-generate a new unique device ID. Each time I did this, I also restarted my HA instance to get rid of any cached zeroconf discovery things. This seemed most successful when I wiped the ESP and also powered it totally off during the HA restart. Even then, sometimes there was a device "detected" by HA. Definitely seems like some mdns/zeroconf things maybe?

A couple of times I got this, if that helps.

Logger: aiohomekit.utils
Source: /usr/local/lib/python3.11/site-packages/aiohomekit/utils.py:45
First occurred: 2:16:39 PM (2 occurrences)
Last logged: 2:18:15 PM

Failure running background task: Task-2638
Failure running background task: Task-3279
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 140, in _ensure_connected
    await connection.ensure_connection()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/connection.py", line 313, in ensure_connection
    await asyncio.shield(self._connector)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 139, in _ensure_connected
    async with asyncio_timeout(10):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/utils.py", line 40, in _handle_task_result
    task.result()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 421, in _process_config_changed
    await self.list_accessories_and_characteristics()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 177, in list_accessories_and_characteristics
    await self._ensure_connected()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 146, in _ensure_connected
    raise AccessoryDisconnectedError(
aiohomekit.exceptions.AccessoryDisconnectedError: Timeout while waiting for connection to device 192.168.86.46:80
bdraco commented 10 months ago

It looks like HA isn't fully shutting down the connection when its unpaired.

phybros commented 10 months ago

Is there anything I can do to confirm that @bdraco ?

Jc2k commented 10 months ago

If it's still doing pair verify after removing the integration and stopping after a restart then it's pretty much confirmed.

It's probably not been a problem before because most users add it once and don't need to remove it and add it again.

phybros commented 10 months ago

OK should I report this to HA then? What's the next thing to do for this issue?

Jc2k commented 10 months ago

No reporting it there would just ping the same people that you are speaking to here :) Just need to wait for someone to have time to look into it.

phybros commented 10 months ago

Haha ok no problem! Thanks for all your responses

phybros commented 9 months ago

Hey there! I have definitively confirmed that this can happen if the device is brand new and has never been paired with HA. Also this was on a completely fresh HA that has never paired any devices before. So it seems like it's not just pair-unpair scenario from above.

phybros commented 9 months ago

To be clear I think that HA DOES continue to send messages to the device after unpairing it (sometimes), but it can ALSO fail on totally fresh pairing. I think they are related but separate bugs. The logs from the "fresh" scenario are identical to the ones I provided already - seems like multiple threads both trying to pair at the same time for some reason.

I've never had to debug anything this deep in HA before, but I'm going to try the dev env set up locally and see if I can do anything.