Jc2k / aiohomekit

asyncio for homekit
Apache License 2.0
52 stars 18 forks source link

Sleepy Thread devices #249

Open roysjosh opened 1 year ago

roysjosh commented 1 year ago

https://github.com/chrysn/aiocoap/issues/175#issuecomment-1310752470

I think a forum user is hitting this aiocoap issue. Average ping RTT to the sleepy accessory is ~2600ms, over the 2.0 aiocoap ACK_TIMEOUT.

2022-11-09 19:08:17.637 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:d350:15d9:7a39:fc10]:5683/2
2022-11-09 19:08:17.638 DEBUG (MainThread) [aiohomekit.protocol.tlv] sending [
6 (State): (1 bytes/<class 'bytearray'>) 0x01
3 (PublicKey): (32 bytes/<class 'bytes'>) b'\xed\xc4\n\xeb\xf2E\xbe\r\x19P\xf3\x912\xf4\x1c\x0cq\xc8U\x9c\xdfP\xb1\x80\\\xc6\xcb\xc6\xcd~\x01+'
]

2022-11-09 19:08:20.009 INFO (MainThread) [coap-server] Retransmission, Message ID: 37481.

2022-11-09 19:08:20.745 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable

Note that the retransmit triggers something (the BR?) to entirely kill the connection, probably since the first is still outstanding.

I'll put together a PR for aiocoap with a knob we'll have to adjust the CoAP controller here.

roysjosh commented 1 year ago

Well, changing ACK_TIMEOUT from 2.0 to 12.0 didn't seem to help this particular user. I need to order a few sleepy Thread devices so I can get this sorted out.

Jc2k commented 1 year ago

Wemo Stage and Eve Thermo are the sleepy HAP devices I'm testing with FWIW.

Jc2k commented 1 year ago

As I said in ^, bumping ACK_TIMEOUT to 10 has got rid of thousands of "Received Type.ACK from but could not match it to a running exchange" warnings in my logs.

Looking at https://github.com/chrysn/aiocoap/blob/419c88cb7bec693f3c33b188a8e394e6666983c5/aiocoap/messagemanager.py#L215, they were caused by original transmission and the retransmission sharing an entry in self._active_exchanges. If both are ACK'd, original will clear the entire in active exchanges first, then the retransmission won't be able to find itself in active exchanges any more, causing that log spam.

So I think even if its not the root cause for your user, its something we need to resolve as that error message is very common and comes up a lot in HA tickets.

jfroy commented 1 year ago

I need to read the Thread spec to see how much buffering is allowed for sleepy devices. Bumping the timeout for connections to sleepy Thread devices would probably be an OK fix. Handling re-transmit ACKs per the COAP protocol spec would also be a good fix.

Jc2k commented 1 year ago

I think we want both - it seems daft (and wasteful) to retransmit when we know sleepy thread devices will "timeout" half the time.

But equally if it's valid to ACK a retransmission and the original then aiocoap should be silently handling that.

Jc2k commented 1 year ago

(But yes, I plucked 10s out of the air, we should use something closer to the spec - I thought it was 5s but haven't checked)

Jc2k commented 1 year ago

This has been said in other tickets, but duping here.

Latest aiocoap should no longer RST in violation of spec and makes re-transmit dupe ACK's less noisy in the logs. They do however still generate noise, so we need to handle the timeouts and make them longer than the sleep interval of the device.

Sleepy devices I have do sleep for 5s, however it's set by a characteristic called "Sleep Interval". So we need to tune the retry interval per endpoint, respecting that char.

Jc2k commented 1 year ago

https://github.com/chrysn/aiocoap/pull/294 will let us avoid re-transmits during a sleepy devices sleep interval. We'll probably want a TransportTuning per connection, and we should use the "Sleep Interval" characteristic value if its available to configure it.