chrysn / aiocoap

The Python CoAP library

Silently handle ACKs for confirmable message retransmits #288

Closed · jfroy closed this 1 year ago

jfroy commented 1 year ago

This is a follow-up to https://github.com/home-assistant/core/issues/80133#issuecomment-1278797804 regarding warnings coming out of aiocoap as used by aiohomekit in Home Assistant core.

Logger: coap-server
Source: runner.py:119
First occurred: 12:39:34 (34 occurrences)
Last logged: 13:35:56

Received Type.ACK from <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, but could not match it to a running exchange.

These warnings are printed repeatedly with even a small number of Thread HomeKit accessories paired to Home Assistant. For Thread accessories advertised as _hap._udp mDNS services, CoAP over IPv6 is used.

I have a suspicion (but no proof) that aiocoap is not handling a network with multiple ULA address ranges advertised by various routers. In my case, I have a Nest Hub (with its built-in Thread border router), an Apple TV 4K 2nd gen (also with a built-in Thread border router), and my internet router (a UDM Pro with a custom dnsmasq config) all advertising prefixes. On top of that, the Apple TV 4K border router is advertising routes (RAs with RIO), a behavior introduced in the latest Thread spec.

I don't know if the warnings are harmless or what they mean, but since I am experiencing device disconnects, I am exploring potential root causes. I'm happy to provide more data and run experiments.

jfroy commented 1 year ago

Here are some debug logs.

2022-11-14 15:39:06.356 DEBUG (MainThread) [coap-server] Sending request - Token: a630, Remote: <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7]>
2022-11-14 15:39:06.357 DEBUG (MainThread) [coap-server] Sending message <aiocoap.Message at 0x7f57a111e0: Type.CON POST (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7]>, 44 byte(s) payload>
2022-11-14 15:39:06.358 DEBUG (MainThread) [coap-server] Exchange added.
  key: (<UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7]>, 10184)
  message: <aiocoap.Message at 0x7f57a111e0: Type.CON POST (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7]>, 44 byte(s) payload>

A little later, it gets transmitted (the first scheduled retransmission):

2022-11-14 15:39:09.135 INFO (MainThread) [coap-server] Retransmission, Message ID: 10184.

A little later, an ACK is received:

2022-11-14 15:39:10.939 DEBUG (MainThread) [coap-server] Incoming message <aiocoap.Message at 0x7f732ed840: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload>
2022-11-14 15:39:10.941 DEBUG (MainThread) [coap-server] Trying to remove exchange.
  key: (<UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 10184)
  message: <aiocoap.Message at 0x7f732ed840: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload>
2022-11-14 15:39:10.942 DEBUG (MainThread) [coap-server] Exchange removed, message ID: 10184.
2022-11-14 15:39:10.942 DEBUG (MainThread) [coap-server] Received Response: <aiocoap.Message at 0x7f732ed840: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload>
2022-11-14 15:39:10.942 DEBUG (MainThread) [coap-server] Response <aiocoap.Message at 0x7f732ed840: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload> matched to request <Pipe at 0x7f57818730 around <aiocoap.Message at 0x7f57a111e0: Type.CON POST (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7]>, 44 byte(s) payload> with 2 callbacks>

And a little later another ACK is received, but that one can't be matched to an exchange because the exchange has already been removed:

2022-11-14 15:39:10.964 DEBUG (MainThread) [coap-server] Incoming message <aiocoap.Message at 0x7f6dc72350: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload>
2022-11-14 15:39:10.965 DEBUG (MainThread) [coap-server] Trying to remove exchange.
  key: (<UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 10184)
  message: <aiocoap.Message at 0x7f6dc72350: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload>
2022-11-14 15:39:10.966 WARNING (MainThread) [coap-server] Received Type.ACK from <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, but could not match it to a running exchange.
2022-11-14 15:39:10.979 DEBUG (MainThread) [coap-server] Received Response: <aiocoap.Message at 0x7f6dc72350: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload>
2022-11-14 15:39:10.980 INFO (MainThread) [coap-server] Response <aiocoap.Message at 0x7f6dc72350: Type.ACK 2.04 Changed (MID 10184, token a630) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>, 50 byte(s) payload> could not be matched to any request
2022-11-14 15:39:10.980 INFO (MainThread) [coap-server] Response not recognized - sending RST.
2022-11-14 15:39:10.981 DEBUG (MainThread) [coap-server] Sending message <aiocoap.Message at 0x7f70a44a90: Type.RST EMPTY (MID 10184, empty token) remote <UDP6EndpointAddress [fdbc:107:de33:0:2943:3a14:157:8cf7] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>>

I have no idea why a second ACK is received. Is that potentially a bug in the device sending that ACK?

jfroy commented 1 year ago

Ah, I misunderstood the code.

2022-11-14 15:39:09.135 INFO (MainThread) [coap-server] Retransmission, Message ID: 10184.

indicates that the message was transmitted twice, which explains the two ACKs: both transmissions were received, and each was ACKed.

I am not sure if it is a bug for the receiver to ACK the same message twice.

As for the root cause, the only device where I see this happen is a battery-powered Thread device, which periodically wakes its radio to process messages and then sleeps for quite a long time (on a networking time scale). aiocoap's default ACK timeout of 2 seconds may be problematic for these kinds of devices.

jfroy commented 1 year ago

According to the CoAP spec (RFC 7252 §4.5, Message Deduplication), it is OK for the receiver to ACK each duplicate confirmable message within EXCHANGE_LIFETIME.

So perhaps aiocoap needs to accept an ACK for each transmission of a confirmable message, instead of the current behavior of sending a RST once the first ACK has closed the exchange.
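
For illustration, here is the exchange from the logs above as a timeline (times relative to the first send; the sleepy device answers both copies of the CON, which RFC 7252's deduplication explicitly permits):

```
t≈0.0 s   client ──CON POST (MID 10184)──▶ device    (device radio asleep)
t≈2.8 s   client ──CON POST (MID 10184)──▶ device    (first retransmission)
t≈4.6 s   client ◀──ACK 2.04 (MID 10184)── device    (matches exchange; exchange removed)
t≈4.6 s   client ◀──ACK 2.04 (MID 10184)── device    (duplicate; currently warns and sends RST)
```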

Jc2k commented 1 year ago

@jfroy @roysjosh is already looking at retransmissions for sleepy Thread devices. See https://github.com/chrysn/aiocoap/issues/175#issuecomment-1310752470 and https://github.com/Jc2k/aiohomekit/issues/249.

FWIW, bumping ACK_TIMEOUT has "fixed" the "but could not match it to a running exchange" errors that were filling my logs. I wasn't seeing the device disconnects, so no comment on that.
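
For the record, a minimal sketch of that kind of patch, assuming an aiocoap version that reads the module-level constant at send time (some code paths may bind the value at import time, so this is illustrative rather than a supported API):

```python
# Hedged sketch: raise aiocoap's retransmission timeout from the RFC 7252
# default of 2 s to 10 s, so a sleepy Thread device can wake and answer
# before the first retransmission goes out. Patch before creating any
# aiocoap context; whether this takes effect depends on the aiocoap version.
import aiocoap.numbers.constants as coap_constants

coap_constants.ACK_TIMEOUT = 10.0  # default: 2.0 seconds

# RFC 7252 draws the actual initial timeout uniformly from
# [ACK_TIMEOUT, ACK_TIMEOUT * ACK_RANDOM_FACTOR], i.e. 10-15 s here
# with the default ACK_RANDOM_FACTOR of 1.5.
```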

jfroy commented 1 year ago

I've changed the title of the issue to better reflect the behavior change I am suggesting. The CoAP spec allows receivers of confirmable messages to ACK each retransmit. For flows likely to experience long latencies due to batching or power saving, such as Thread networks, this is likely to happen. The ACKs for those retransmits should be handled silently, instead of answered with a RST as the current implementation does after receiving one ACK.

chrysn commented 1 year ago

This is tricky -- on the usability side, that is (the technical side is easy: we're receiving ACKs we don't care about, we do nothing, and all is fine).

There are legitimate cases of receiving these ACKs. This can be due to large RTTs (as is the case here: we send the request twice, it is ACKed twice, and all is fine thanks to CoAP's message deduplication), or due to packet duplication (which is rare in practice but never ruled out). In these cases, this is at most worth an info-level message.

But there are also cases of broken servers that send from different addresses (say, 2001:db8::4000 and 2001:db8::1000:4000). Then the log looks roughly like this:

Sending message to 2001:db8::4000
Received Type.ACK from <UDP6EndpointAddress [2001:db8::1000:4000] (locally ...)>, but could not match it to a running exchange. 
Request to 2001:db8::4000 timed out

And here, the log message saying we got something we couldn't match is highly valuable.

Sure, we could keep track of messages we've "recently" processed, look incoming ACKs up there, and silently demote the message from a warning to an info/debug message. But that would mean keeping state for a relatively long time ("just" a few seconds, but when sending requests at line speed that is still a lot of data) just for the benefit of cleaner logs.
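
A rough sketch of that bookkeeping (not aiocoap's actual code; the names and the lifetime value are made up for illustration):

```python
# Remember (remote, message ID) pairs of recently completed exchanges so a
# late duplicate ACK can be dropped quietly instead of triggering a warning
# and an RST. The cost is exactly the state discussed above: one entry per
# request, kept for RECENT_LIFETIME seconds.
import time
from collections import OrderedDict

RECENT_LIFETIME = 5.0  # illustrative; a real value would derive from EXCHANGE_LIFETIME


class RecentExchanges:
    def __init__(self):
        self._recent = OrderedDict()  # (remote, mid) -> completion time

    def add(self, remote, mid):
        self._recent[(remote, mid)] = time.monotonic()

    def was_recent(self, remote, mid):
        # Expire entries from the front; OrderedDict keeps insertion order,
        # so the oldest completion is always first.
        cutoff = time.monotonic() - RECENT_LIFETIME
        while self._recent and next(iter(self._recent.values())) < cutoff:
            self._recent.popitem(last=False)
        return (remote, mid) in self._recent
```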

Given that in this case the warning pointed to an RTT exceeding the retransmission timeout, I take that as a weak data point that this is generally good as a warning. And aiocoap's raw log output probably can't be expected to draw a perfectly sharp line between "all is OK" and "something went wrong".

Assuming there is no more to it than the above (and thanks to your earlier analysis on this issue, I don't see any indication that there is), is there anything you'd expect aiocoap to do differently?

Jc2k commented 1 year ago

The current warning-level output as it stands is a problem for Home Assistant, as it translates into lots of user support requests for something that is otherwise working well. So far, supporting HAP over CoAP/Thread has been 40% helping people configure IPv6, 40% "what is this Received Type.ACK message?", and 20% "oh, I'd better ask @roysjosh about this one". Users expect anything logged as a warning to be something that impacts them, whereas IMO an unmatched ACK is like a missed ICMP ping: it's probably harmless, and if it's a sign of a bigger problem you can probably tell at the application level anyway.

Note that the current ACK timeout is lower than the default Thread "sleepy end device" timeout, so with a handful of battery-powered Thread devices I see thousands of these warnings in Home Assistant in quite a short space of time. We have users with far more devices than that, and they get a lot of warnings. A user unfamiliar with CoAP who sees thousands of warnings in their logs is likely to ask for help 😄

IMO it feels weird to retransmit data when we know an endpoint is probably asleep. As a data point, I bumped the ACK timeout to something like 10 s and haven't had any of those errors since. To do this in Home Assistant "prod" we'd have to monkey-patch, which I wouldn't get past code review. I haven't yet been able to find anything in the OpenThread spec or issue tracker about this, but a somewhat related issue talked about notifying the sender when a message was queued so they didn't end up queueing the same message 5 times. I'll try to find a more concrete reference.

I personally would like this lowered to a debug log, or to have a way to set it at runtime, or both. As there are multiple integrations in Home Assistant using this library, if we were in control of the timeout we would need to avoid breaking other integrations; and since mains-powered devices are not sleepy, we would probably eventually want different timeouts for different endpoints.
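
As a downstream stopgap (not an aiocoap API), a standard logging filter could suppress just this one message while keeping every other warning; the substring match below is an assumption based on the log text quoted in this issue:

```python
import logging


class DropUnmatchedAckWarnings(logging.Filter):
    """Drop the duplicate-ACK warning while letting all other records through."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "could not match it to a running exchange" not in record.getMessage()


# aiocoap logs these warnings on the "coap-server" logger (see the excerpts above).
logging.getLogger("coap-server").addFilter(DropUnmatchedAckWarnings())
```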

Jc2k commented 1 year ago

The exact sleep interval is available over the HAP protocol through characteristic 0000023A-0000-1000-8000-0026BB765291. For all my battery-powered Thread devices it's 5 s, but that's probably not representative.

chrysn commented 1 year ago

On the timeout side, let's keep this at #175 (replying there too).

On the level side... I've taken the earlier comments as a hint to promote previously internal notes on log levels into documentation, which now says that log output is not on its own suitable for automated action. But users seeing this in their logs and taking up your time over it isn't really nice either -- and the number of reports you mention (and the absence of other errors) does indicate that it's overly strict.

So to resolve this, I'm now leaning toward just lowering this to info. This makes things harder for people with broken servers, but well, those will hopefully turn up their log level, and I'd rather make things practical for large downstreams than go out of my way to make life easy for those with broken peer implementations. You may still get other warnings that don't precisely indicate errors, but given you haven't complained about any others, chances are they occur at an acceptable rate :-)

jfroy commented 1 year ago

I see this is now closed, but are there any concerns with the RST packet that gets sent when an ACK arrives that can't be matched to an active exchange? Changing the log level is not going to address that (if it needs addressing), but changing the timeout would.

chrysn commented 1 year ago

Thanks, the RST is something I should still look into -- the "silently" in the title sent me on the wrong track here. Reopening for evaluation (and IIRC my implementation is just wrong).

The timeout issue is tracked in #175, which is still open.

chrysn commented 1 year ago

Now addressed in 96eeb675bd9a35fe04634c890b42a851b94145e8 (which had the wrong issue number in its message, which of course I only noticed when I saw what was auto-closed).

jfroy commented 1 year ago

Thanks! And sorry for the confusing issue title.

chrysn commented 1 year ago

I suppose you'll need that on pip, but can that wait for the time parameter issue?

Jc2k commented 1 year ago

I don't know how hard it is to release, but I suspect this and the logging commit would make a big difference to HA users. I'd also be happy to propose that they be included in a bug-fix release.

The timeouts will need more work and testing on our side, and I'd want those changes to go through the monthly QA process.

So from my POV it'd be worth it if there were a release now.

chrysn commented 1 year ago

Really it's mainly about whether I write the NEWS file now or later. 0.4.5 is out, and it's as much a bugfix release as any ever was. Have fun!

Jc2k commented 1 year ago

Thanks!