edgeAgent hangs after a networking issue and cannot recover when the network is back. Restarting edgeAgent module fix the issue
Steps to Reproduce
I cannot reproduce manually it as it happens randomly (on the remote site). The remote site setup is: Edge on Ubuntu VM on Hyper-V on Windows 11 (default network switch, NAT).
Context (Environment)
Output of iotedge check
Click here
```
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
√ aziot-identity-service package is up-to-date - OK
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK
Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK
Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
√ aziot-edge package is up-to-date - OK
√ container time is close to host time - OK
‼ DNS server - Warning
Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: logs policy - Warning
Container engine is not configured to rotate module logs which may cause it run out of disk space.
Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
You can ignore this warning if you are setting log policy per module in the Edge deployment.
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
Data might be lost if the module is deleted or updated.
Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
Data might be lost if the module is deleted or updated.
Please see https://aka.ms/iotedge-storage-host for best practices.
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK
Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
31 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
```
Hi @Zurlys
Thank you for reporting the issue! Could you please open a support ticket via Azure portal and submit the support bundle along with the IcM ticket please?
Expected Behavior
edgeAgent hangs after a networking issue
Current Behavior
edgeAgent hangs after a networking issue and cannot recover when the network is back. Restarting edgeAgent module fix the issue
Steps to Reproduce
Context (Environment)
Output of
iotedge check
Click here
``` √ keyd configuration is well-formed - OK √ certd configuration is well-formed - OK √ tpmd configuration is well-formed - OK √ identityd configuration is well-formed - OK √ daemon configurations up-to-date with config.toml - OK √ identityd config toml file specifies a valid hostname - OK √ aziot-identity-service package is up-to-date - OK √ host time is close to reference time - OK √ preloaded certificates are valid - OK √ keyd is running - OK √ certd is running - OK √ identityd is running - OK √ read all preloaded certificates from the Certificates Service - OK √ read all preloaded key pairs from the Keys Service - OK √ check all EST server URLs utilize HTTPS - OK √ ensure all preloaded certificates match preloaded private keys with the same ID - OK Connectivity checks (aziot-identity-service) -------------------------------------------- √ host can connect to and perform TLS handshake with iothub AMQP port - OK √ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with iothub MQTT port - OK Configuration checks -------------------- √ aziot-edged configuration is well-formed - OK √ configuration up-to-date with config.toml - OK √ container engine is installed and functional - OK √ configuration has correct URIs for daemon mgmt endpoint - OK √ aziot-edge package is up-to-date - OK √ container time is close to host time - OK ‼ DNS server - Warning Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub. Please see https://aka.ms/iotedge-prod-checklist-dns for best practices. You can ignore this warning if you are setting DNS server per module in the Edge deployment. ‼ production readiness: logs policy - Warning Container engine is not configured to rotate module logs which may cause it run out of disk space. Please see https://aka.ms/iotedge-prod-checklist-logs for best practices. You can ignore this warning if you are setting log policy per module in the Edge deployment. ‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. ‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. √ Agent image is valid and can be pulled from upstream - OK √ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK Connectivity checks ------------------- √ container on the default network can connect to upstream AMQP port - OK √ container on the default network can connect to upstream HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to upstream AMQP port - OK √ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK 31 check(s) succeeded. 4 check(s) raised warnings. Re-run with --verbose for more details. 2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details. ```Device Information
Runtime Versions
Logs
edge-agent logs
``` <6> 2024-04-11 09:39:44.268 +00:00 [INF] - Edge agent attempting to connect to IoT Hub via Amqp_Tcp_Only... <6> 2024-04-11 09:40:17.614 +00:00 [INF] - Edge agent attempting to connect to IoT Hub via Amqp_WebSocket_Only... <4> 2024-04-11 09:40:50.959 +00:00 [WRN] - Error creating a device-to-cloud connection System.AggregateException: One or more errors occurred. (Transient network error occurred, please retry.) (Transient network error occurred, please retry.) ---> Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry. ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result) at Microsoft.Azure.Amqp.Transport.AmqpTransportInitiator.<>c.Additional Information