apple / HomeKitADK

Apache License 2.0

DNS Service Discovery #75

Open d4rkmen opened 3 years ago

d4rkmen commented 3 years ago

While developing accessories for ESP8266/ESP32 I have run into a problem with DNS-SD. Accessories go offline and come back online from time to time. When this happens during Pairing Setup, the accessory becomes paired but the controller does not know it, because it has already closed the pairing context (due to DNS-SD monitoring).

Investigating the problem, I traced all the mDNS packets and concluded that dropped UDP packets cause this. Slow WiFi interfaces can't handle all the UDP multicast packets from the local network. And bigger accessories produce more UDP traffic and more drops. This is normal for UDP, because there is no acknowledgement of packet receipt. That is why the DNS-SD standard was designed so that every record has a time to live (TTL) value. In our case it's 120 sec for the PTR record. This means the last received packet should be considered "true" during this TTL period. So if an accessory (commonly a weak, low-power device) misses some request, it has 120 seconds to remain present.

But in practice we see different behaviour. The controller may decide the accessory has gone offline after only a few missed requests. For fast wired LAN devices this is more or less OK, but certainly not for WiFi.

The only workaround I currently see is to keep the accessory constantly advertising itself so that it covers most of the mDNS queries from the controller (at least during the pairing process). But I understand this is not the way it should be done. Any suggestions welcome. Thanks, and sorry for the long read.

Supereg commented 3 years ago

In our case it's 120 sec for the PTR record

A spec-conformant implementation would not have a PTR record with a TTL of 120 seconds.

Citing RFC 6762, Section 10, "Resource Record TTL Values and Cache Coherency":

   As a general rule, the recommended TTL value for Multicast DNS
   resource records with a host name as the resource record's name
   (e.g., A, AAAA, HINFO) or a host name contained within the resource
   record's rdata (e.g., SRV, reverse mapping PTR record) SHOULD be 120
   seconds.

   The recommended TTL value for other Multicast DNS resource records is
   75 minutes.

   A querier with an active outstanding query will issue a query message
   when one or more of the resource records in its cache are 80% of the
   way to expiry.  If the TTL on those records is 75 minutes, this
   ongoing cache maintenance process yields a steady-state query rate of
   one query every 60 minutes.

   Any distributed cache needs a cache coherency protocol.  If Multicast
   DNS resource records follow the recommendation and have a TTL of 75
   minutes, that means that stale data could persist in the system for a
   little over an hour.  Making the default RR TTL significantly lower
   would reduce the lifetime of stale data, but would produce too much
   extra traffic on the network.  Various techniques are available to
   minimize the impact of such stale data, outlined in the five
   subsections below.

This means A, AAAA and SRV records SHOULD have a TTL of 120 seconds, while the service PTR and TXT records get 4500 seconds, i.e. 75 minutes.

This is needed to keep the rate of one query every 60 minutes working. With a TTL of 120 seconds on the PTR record, a lost packet can leave too little time to query again, and the services end up being removed.
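
To make the arithmetic concrete, here is a minimal sketch of the refresh timing for both TTL values. It assumes the refresh points from RFC 6762 Section 5.2 (80%, 85%, 90% and 95% of the TTL) and leaves out the small random jitter the RFC adds; the function names are made up for illustration and are not part of the ADK.

```python
# Recommended TTLs from RFC 6762 Section 10, in seconds.
HOST_TTL = 120         # A, AAAA, SRV (records tied to a host name)
SERVICE_TTL = 75 * 60  # other records, e.g. service PTR and TXT

# Percentages of the TTL at which a querier re-queries (RFC 6762 Section 5.2).
# The random jitter the RFC adds on top is omitted here.
REFRESH_PERCENT = (80, 85, 90, 95)


def refresh_schedule(ttl_seconds):
    """Seconds after receipt at which a querier would re-query."""
    return [ttl_seconds * p // 100 for p in REFRESH_PERCENT]


def retry_window(ttl_seconds):
    """Seconds between the first refresh query and record expiry."""
    return ttl_seconds - refresh_schedule(ttl_seconds)[0]


if __name__ == "__main__":
    for name, ttl in (("host", HOST_TTL), ("service", SERVICE_TTL)):
        print(name, refresh_schedule(ttl), retry_window(ttl))
```

With a 75-minute TTL the first refresh lands at 60 minutes and leaves a 15-minute window for retries. With a 120-second TTL the whole retry window shrinks to 24 seconds, so a few dropped packets are enough to let the record expire.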


The only workaround I currently see is to keep the accessory constantly advertising itself so that it covers most of the mDNS queries from the controller (at least during the pairing process).

Please never never never ever do this.

For one, citing RFC 6762, Section 8.3, "Announcing":

   A Multicast DNS responder MUST NOT send announcements in the absence
   of information that its network connectivity may have changed in some
   relevant way.  In particular, a Multicast DNS responder MUST NOT send
   regular periodic announcements as a matter of course.

Doing so completely defeats the point of multicast dns.

d4rkmen commented 3 years ago

Doing so completely defeats the point of multicast dns.

This is true, but what else can be done? Currently the TTL values are ignored by the controller, and a missed query answer makes it treat the service as offline.

Supereg commented 3 years ago

Currently the TTL values are ignored by the controller [...]

What do you mean by that? Didn't you say that the PTR TTL is set to 120 seconds?

a missed query answer makes it treat the service as offline

It shouldn't, as the record is only invalidated once the TTL runs out(?) (or at least it shouldn't with respect to Cache Flush on Failure Indication or Passive Observation Of Failures (POOF)).
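
For reference, a rough sketch of how a querier's cache could combine plain TTL expiry with POOF. It assumes a simplified reading of RFC 6762 Section 10.5: the record is flushed once two or more queries for it have been observed and ten seconds have passed since the first of them with no answer seen. The class and method names are hypothetical, and nothing in this thread establishes that HomeKit controllers implement exactly this logic.

```python
from dataclasses import dataclass, field

POOF_MIN_QUERIES = 2      # "two or more" observed queries without an answer
POOF_WINDOW_SECONDS = 10  # no expected answer seen within ten seconds


@dataclass
class CachedRecord:
    """One cached mDNS record as seen by a querier (hypothetical model)."""
    received_at: float
    ttl: float
    # Timestamps of queries seen on the network that this record should
    # have answered, but for which no answer has been observed yet.
    unanswered_queries: list = field(default_factory=list)

    def on_query_seen(self, now):
        self.unanswered_queries.append(now)

    def on_answer_seen(self, now, ttl):
        # A fresh answer restarts the TTL and clears the failure evidence.
        self.received_at = now
        self.ttl = ttl
        self.unanswered_queries.clear()

    def is_valid(self, now):
        if now - self.received_at >= self.ttl:
            return False  # ordinary TTL expiry
        pending = self.unanswered_queries
        if (len(pending) >= POOF_MIN_QUERIES
                and now - pending[0] >= POOF_WINDOW_SECONDS):
            return False  # flushed early by passive observation of failures
        return True
```

Under this reading, a record with a 75-minute TTL can still disappear roughly ten seconds after a couple of queries go unanswered, which is at least in the same range as the 10-30 seconds reported in this thread.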

d4rkmen commented 3 years ago

What do you mean by that? Didn't you say that the PTR TTL is set to 120 seconds?

Sorry, it's 75 min for PTR, you are right. But this changes nothing. In fact, the service appears offline after 10-30 sec of no answers. Maybe other users have a different experience.

d4rkmen commented 3 years ago

@Supereg what if we check for other hosts' responses matching our services when no query was detected during the preceding 2 sec? Then, even if we miss the original query (due to a massive flow of responses), we will still know "someone" is looking for our services. And if there are not many devices, we will never miss the original query.
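
A minimal sketch of that idea, to make the proposal easier to discuss. Everything here is hypothetical: the class, its method names and the way packets are reported to it are made up for illustration, and `_hap._tcp.local.` is only used as an example service type. Whether answering on this basis is acceptable under RFC 6762 is exactly the open question.

```python
QUIET_SECONDS = 2.0  # from the proposal: no query detected in the last 2 sec


class MissedQueryDetector:
    """Guess that a query was missed from other hosts' responses."""

    def __init__(self, our_service_types):
        self.our_service_types = set(our_service_types)
        self.last_query_at = None  # time of the last (real or implied) query

    def on_query(self, service_type, now):
        # A real query for one of our service types was received.
        if service_type in self.our_service_types:
            self.last_query_at = now

    def on_foreign_response(self, service_type, now):
        """Return True if we probably missed a query and should answer."""
        if service_type not in self.our_service_types:
            return False
        recently_queried = (
            self.last_query_at is not None
            and now - self.last_query_at < QUIET_SECONDS
        )
        if recently_queried:
            return False  # the query was seen, the normal path handles it
        # Treat the response as evidence of a query we missed. Recording the
        # time here means a burst of responses triggers only one answer.
        self.last_query_at = now
        return True
```

Usage would be to call `on_query` for every received query and `on_foreign_response` for every response from another host, and to send our own response only when the latter returns `True`.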