espressif / esp-homekit-sdk

562 stars 105 forks source link

SDK very unresponsive (even for simple Lightbulb) #9

Closed mgoeller closed 3 years ago

mgoeller commented 3 years ago

We have access to your commercial HomeKit SDK. When flashing your simple Lightbulb example, the Lightbulb gets when you switch it very fast on and off, after a few times unresponsive. After some time it can even completely lose the connection stating "No Response". Even if that is not the case, the spinning wheel is often there, and the state changes only deploy slowly to the Hardware.

This is not the case with the open source ESP HomeKit Library from Maxim Kulkin.

What is causing this?

mgoeller commented 3 years ago

I got an email notification that @sampo3k replied to this issue as follows:

"I was having a similar issue with a garage door where after some period of time, the homekit device would be listed as "no response". This homekit implementation appears to have an issue in homes with multiple controllers (HomePod + appletv, multiple appletvs, etc.) where connections will occasionally migrate between controllers. However, upon dropping a connection to one controller, this sdk force-closes all connections to the controll"

Why was that deleted?

shahpiyushv commented 3 years ago

@mgoeller, yes even I had got that email. @sampo3k, any specific reason why it was deleted? Even if it may not be directly related to this issue, you may share your observations here and/or add a separate issue for that.

Additionally, can you try these options to get additional debug information?

1) Call hap_http_debug_enable() before hap_start() 2) Set the esp idf log level to DEBUG instead of INFO.

The second option should be enabled only if the first doesn't give sufficient information, primarily because enabling too many debug prints can also affect the behavior.

sampo3k commented 3 years ago

I deleted it because i wasn't sure it was accurate, still debugging when i have time. Definitely see the number of open sessions drop to zero, and no mdns re-announce, and device will sit there in that state as no-response for some period of time. trying to figure out what triggers it coming out of that state (which it usually will within an hour or so), if that's mdns or something else.

sampo3k commented 3 years ago

been poking into this a bit more; seems like there's at least two issues:

  1. Adding SO_KEEPALIVE to the http session socket in esp_http_server results in a substantially more stable accessory. With this change, I can reliably get to about 12 hours of operation, and the device always instantly responds / update state in the home app.
  2. After about 12 hours with keepalive, the ESP32S2 connections drop. Sometimes you get 113 errors out of accept(), sometimes not. Seems like there's some sort of resource leak or something, not sure if it's an issue with the http server or the IP stack in general. Next time it fails i'll try to gather more data.
mgoeller commented 3 years ago

Thank you for looking into this! I wonder how hardware manufacturers can work with this official SDK. Will following your progress!

shahpiyushv commented 3 years ago

Hi @sampo3k, tcp keep alive is indeed a useful option in this case which I had used much earlier. However, in the earlier days of HomeKit, Apple had explicitly asked to remove that option because their explanation was that the iPhone cannot stay in sleep mode because of this and so it affects their power consumption. We had tried to explain that this does not seem to be the case, but we had to eventually remove that option anyways. I am not sure what their stand would be now, but unfortunately, this is something we can't just enable by default directly.

As to the other point, can you tell what esp idf version you are using?

sampo3k commented 3 years ago

I'm using ESP-IDF v4.3-dev-1720

This morning when the accessory was offline again I tried the following:

  1. ping device -> still works, so TCP is alive and happy
  2. telnet to port 80 -> works, httpd is alive and happy
  3. browse bonjour services from host mac, and it's missing! this is despite a re-announce when the last connection drops, so now i'm suspecting something in the mdns stack on the esp32. going to go add turn on / add debugging there and get back to you.
sampo3k commented 3 years ago

Verified that it's mDNS records going away. service browser output:

16:20:38.708 Add 2 12 local. _hap._tcp. Philips hue - 2503CA 16:20:38.724 Add 2 12 local. _hap._tcp. Smart Bridge 2 16:20:38.796 Add 3 12 local. _hap._tcp. VOCOlinc-SmartBar-0d49d9 16:20:38.796 Add 2 12 local. _hap._tcp. VOCOlinc-SmartBar-0fe03d 16:20:38.827 Add 2 12 local. _hap._tcp. MyQ-C7C 16:20:38.937 Add 2 12 local. _hap._tcp. GateModem DATE: ---Sat 21 Nov 2020--- 4:48:18.633 Rmv 0 12 local. _hap._tcp. GateModem 4:49:15.697 Add 2 12 local. _hap._tcp. GateModem 4:49:27.801 Rmv 0 12 local. _hap._tcp. GateModem 8:24:52.828 Add 2 12 local. _hap._tcp. GateModem 8:25:39.448 Rmv 0 12 local. _hap._tcp. GateModem 8:27:26.117 Add 2 12 local. _hap._tcp. GateModem 8:27:37.198 Rmv 0 12 local. _hap._tcp. GateModem

Turned on the debug spew in esp-idf/components/mdns/mdns.c, will see if there's anything interesting when it starts trying to announce and almost immediately time out the records (missing answer packets, etc.)

sampo3k commented 3 years ago

Fingers crossed, but turning off wifi power save with esp_wifi_set_ps(WIFI_PS_NONE); and cranking down the mDNS interval in menuconfig from 50 to 40ms, and bumping MDNS_PACKET_QUEUE_LEN to 32 in esp-idf has collectively resulted in >48 hrs of stable operation.

Just before the modem would drop off, mDNS on my mac would actively remove the service saying it had missed 4 questions in a row. There wasn't anything obvious from the esp32 side of the log, and i'm not particularly skilled with wireshark to investigate further, but this seems like a suitable workaround for now.

shahpiyushv commented 3 years ago

@sampo3k , thanks for all the debugging and analysis! This is indeed helpful. Meanwhile, while you are at it, can you also check the RSSI and give some rough information of your network and environment, like number of APs in the vicinity, traffic (especially mDNS) on your network, etc?

sampo3k commented 3 years ago

Network is roughly 25 wifi clients across 3 APs; there's an AP fairly close to the ESP32, which claims the strength has historically more or less not deviated from -60dBm which is pretty good. DTIM is set to 3, which means that when in power save, pings to the board are typically between 0 and ~300ms. Negotiated rate is ~70Mbps. There is a quite a bit of bursty mDNS traffic on the network; there are 5 appletvs (wired), 2 computers, 2 iPhones and an iPad, plus 4 or 5 other homekit clients, a printer and some other IoT things advertising various services. I'd say there's typically 3-4 messages per second, with storms up to 30-40 depending on which computer/ipad has recently woken up and decided to ask a bunch of questions about services.

shahpiyushv commented 3 years ago

@mgoeller

We have access to your commercial HomeKit SDK. When flashing your simple Lightbulb example, the Lightbulb gets when you switch it very fast on and off, after a few times unresponsive. After some time it can even completely lose the connection stating "No Response"

I tried this by flashing the lightbulb example as is and continuously turning off/on from Home app for 200+ times, but I did not face any such issues. Can you enable additional logging as I had mentioned in a comment above and see if you can get any additional information?

shahpiyushv commented 3 years ago

@mgoeller @sampo3k , can you check if the latest commits help in some way, in addition to the esp_wifi_set_ps(WIFI_PS_NONE) call (which is not part of the SDK).

mgoeller commented 3 years ago

@shahpiyushv setting the mDNS interval in menu config to 40ms (it has been 100ms) removed the constant spinning wheel when fast switching the lightbulb accessory on and off.

In the official and commercial SDK on gitlab (master branch) I don't see any recent commits. Can you do the commits for the commercial SDK as well?

We are currently long term testing the official SDK while using the esp_wifi_set_ps(WIFI_PS_NONE) option. So far so good, I will let you know if something goes unresponsive again.