Azure / iotedge

The IoT Edge OSS project
MIT License
1.47k stars 462 forks source link

EdgeHub is unable to reauthenticated connected clients #7163

Closed MattCosturos closed 12 months ago

MattCosturos commented 1 year ago

Expected Behavior

Modules should remain running and connected to the iot hub

Current Behaviora

During the periodic task to reauthenticate connected clients, the connection fails

Steps to Reproduce

Unable to reproduce on demand. Failures are seemingly random. Sometimes it will fail on the first renewal (after 1 hour) sometimes it will fail on the 2nd renewal (after 2), etc

Context (Environment)

Output of iotedge check

Click here ``` Configuration checks (aziot-identity-service) --------------------------------------------- √ keyd configuration is well-formed - OK √ certd configuration is well-formed - OK √ tpmd configuration is well-formed - OK √ identityd configuration is well-formed - OK √ daemon configurations up-to-date with config.toml - OK √ identityd config toml file specifies a valid hostname - OK √ aziot-identity-service package is up-to-date - OK ‼ host time is close to reference time - Warning Could not query NTP server caused by: Could not query NTP server caused by: could not receive NTP server response: Resource temporarily unavailable (os error 11) caused by: Resource temporarily unavailable (os error 11) √ preloaded certificates are valid - OK √ keyd is running - OK √ certd is running - OK √ identityd is running - OK √ read all preloaded certificates from the Certificates Service - OK √ read all preloaded key pairs from the Keys Service - OK √ check all EST server URLs utilize HTTPS - OK √ ensure all preloaded certificates match preloaded private keys with the same ID - OK Connectivity checks (aziot-identity-service) -------------------------------------------- √ host can connect to and perform TLS handshake with iothub AMQP port - OK √ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with iothub MQTT port - OK Configuration checks -------------------- √ aziot-edged configuration is well-formed - OK √ configuration up-to-date with config.toml - OK √ container engine is installed and functional - OK √ configuration has correct URIs for daemon mgmt endpoint - OK √ aziot-edge package is up-to-date - OK √ container time is close to host time - OK ‼ DNS server - Warning Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub. Please see https://aka.ms/iotedge-prod-checklist-dns for best practices. You can ignore this warning if you are setting DNS server per module in the Edge deployment. caused by: Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub. Please see https://aka.ms/iotedge-prod-checklist-dns for best practices. You can ignore this warning if you are setting DNS server per module in the Edge deployment. ‼ production readiness: logs policy - Warning Container engine is not configured to rotate module logs which may cause it run out of disk space. Please see https://aka.ms/iotedge-prod-checklist-logs for best practices. You can ignore this warning if you are setting log policy per module in the Edge deployment. caused by: Container engine is not configured to rotate module logs which may cause it run out of disk space. Please see https://aka.ms/iotedge-prod-checklist-logs for best practices. You can ignore this warning if you are setting log policy per module in the Edge deployment. √ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK √ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK √ Agent image is valid and can be pulled from upstream - OK √ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK Connectivity checks ------------------- √ container on the default network can connect to upstream AMQP port - OK √ container on the default network can connect to upstream HTTPS / WebSockets port - OK √ container on the default network can connect to upstream MQTT port - OK skipping because of not required in this configuration √ container on the IoT Edge module network can connect to upstream AMQP port - OK √ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to upstream MQTT port - OK skipping because of not required in this configuration 32 check(s) succeeded. 3 check(s) raised warnings. 2 check(s) were skipped due to errors from other checks. ```

Device Information

Runtime Versions

docker version ``` Client: Version: 20.10.18+azure-1 API version: 1.41 Go version: go1.18.6 Git commit: b40c2f6b5deeb11ac6c485c940865ee40664f0f0 Built: Thu Sep 8 08:19:02 UTC 2022 OS/Arch: linux/amd64 Context: default Experimental: true Server: Engine: Version: 24.0.7-1 API version: 1.43 (minimum version 1.12) Go version: go1.20.10 Git commit: 311b9ff0aa93aa55880e1e5f8871c4fb69583426 Built: Thu Oct 26 07:51:05 2023 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.5.13+azure-1 GitCommit: a17ec496a95e55601607ca50828147e8ccaeebf1 runc: Version: 1.1.4 GitCommit: 5fd4c4d144137e991c4acebb2146ab1483a97925 docker-init: Version: 0.19.0 GitCommit: de40ad0 ```

Please see edgeHub.txt for the edge hub logs showing all the connection failures.

Recovery happens when restarting the edgeHub module OR randomly, but possibly when I run iotedge check ( Perhaps the iot hub connection check restores the network connection for edgeHub? )

edgeAgent.txt azedged.txt azidentity.txt edgeHub.txt

nyanzebra commented 12 months ago

@MattCosturos from the logs it looks like what happens is there are connection issues and because of that, reauth cannot happen. Just to confirm, are you saying that it never recovers unless you restart edgeHub module / run iotedge check or that it does eventually restore on its own?

In fact, if it never recovers on its own, would you mind getting debug level logs if possible?

The iotedge check restoring connectivity is super weird and not expected... Would also be curious if any network connectivity works on the device when you see this problem, since you say you can restart EH I assume you can get to the device. So basically, if you see it happening, don't run iotedge check verify with curl or something that network still is okay and then see if EH recovers on its own.

MattCosturos commented 12 months ago

Hello @nyanzebra

Now this is getting weird. Device was in a disconnected state (has been disconnected for ~14 hours). I attempted to curl https://packages.microsoft.com/ubuntu/23.10/prod/pool/main/a/aspnetcore-runtime-6.0/aspnetcore-runtime-6.0_6.0.25-1_amd64.deb --output asp-runtime

It just hung (0 progress on the download) I tried several times.

Then I started logging packets with tcpdump and attempted to curl again. It suddenly worked, and edgeHub recovered on its own. I did nothing to restart iotedge daemons, or the system modules.

Everything I am debugging points to my network or network configuration being the cause of the issue. I am going to close this issue, I just wanted to get someone to look at the logs, and make sure there wasn't some edge runtime issue. But I am pretty sure every error in the logs stems from the initial connection issue.

nyanzebra commented 12 months ago

@MattCosturos sounds good, feel free to reopen anytime if have further questions or issues :)