Azure / iotedge

The IoT Edge OSS project
MIT License

After a reboot, the edgeAgent won't start any module while using an unreliable connection #7318

Closed eanavalentin closed 2 months ago

eanavalentin commented 3 months ago

We have a couple hundred devices (Odroid C4) on a very bad and unreliable connection: at times up to 80% packet loss and ping replies of up to 2.5 seconds. The devices are provisioned through DPS with a symmetric key while on a reliable network. Once everything is up and running, they are switched to the unreliable network.

During normal operation, for reasons out of our control, the devices might reboot. When they come back online, the edgeAgent attempts to connect to IoT Hub, but while doing this, it does not start any of the modules.

The device remains non-functional during this time, since our modules are never started.

Sometimes, after starting, we see it trying to pull the containers again from the container registry and throwing timeout errors because of the slow connection. It is our understanding that no containers should be pulled once they have been created, unless a new deployment is received.

Additionally, on every restart, the module containers are stopped, removed, and recreated. Should this happen? Shouldn't the containers, once created, remain the same unless a new deployment is received?

Furthermore, the edgeAgent keeps logging "Empty edge agent config was received. Attempting to read config from backup (/tmp/edgeAgent/edgeAgent/backup.json) instead" every 10 seconds or so. There is no /tmp/edgeAgent/edgeAgent/backup.json file, even though the path is persisted and mounted inside the container, and the StorageFolder is set correctly.


Expected Behavior

The edgeAgent should start all modules regardless of whether it is able to connect to the cloud. Containers should not be recreated after a reboot.

Current Behavior

The edgeAgent fails to start the containers for a very long time, until the connection gets a little better or, in some cases, after several restarts. All containers besides the edgeAgent container are recreated (docker ps -a returns different container IDs).
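One way to confirm which containers were recreated is to compare the name-to-ID mapping before and after the reboot. The following is only a minimal sketch: it assumes the listing was captured with `docker ps -a --format '{{.Names}} {{.ID}}'`, and the function names, module names, and IDs in the sample data are hypothetical.

```python
def parse_ps(output: str) -> dict:
    """Map container name -> container ID.

    Expects one "<name> <id>" pair per line, as produced by
    docker ps -a --format '{{.Names}} {{.ID}}' (assumed capture format).
    """
    containers = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            name, container_id = parts
            containers[name] = container_id
    return containers


def recreated(before: dict, after: dict) -> list:
    """Return the names present in both listings whose ID changed."""
    return sorted(
        name for name in before
        if name in after and before[name] != after[name]
    )


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the attached logs.
    before = parse_ps("edgeAgent aaa111\nedgeHub bbb222\nmyModule ccc333\n")
    after = parse_ps("edgeAgent aaa111\nedgeHub ddd444\nmyModule eee555\n")
    print(recreated(before, after))
```

A container that keeps its name but gets a new ID was removed and created again; one that keeps both was only restarted.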

Steps to Reproduce


  1. Provision, configure and deploy manifest to device. Make sure all is working and device is connected to cloud.
  2. Use a tool like tc-net to simulate a bad connection
  3. Reboot the device
  4. Tail the logs of edgeAgent and run iotedge list to see which modules have been started
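For step 2, the degraded link can be approximated with the Linux `tc` netem queueing discipline. This is a sketch under assumptions, not the exact setup used on the devices: the interface name `eth0` is a guess, the 80% loss and 2500 ms delay come from the figures quoted in the description, and the helper functions are hypothetical. It prints the commands by default; running them for real requires root.

```python
import subprocess


def netem_command(interface: str, loss_percent: float, delay_ms: int) -> list:
    """Build the tc command that adds a netem qdisc with loss and delay."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    if delay_ms < 0:
        raise ValueError("delay_ms must not be negative")
    return [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "loss", f"{loss_percent:g}%",
        "delay", f"{delay_ms}ms",
    ]


def clear_command(interface: str) -> list:
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


def apply(command: list, dry_run: bool = True) -> None:
    """Print the command, or execute it when dry_run is False (needs root)."""
    if dry_run:
        print(" ".join(command))
    else:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # "eth0" is an assumed interface name; adjust for the device under test.
    apply(netem_command("eth0", loss_percent=80, delay_ms=2500))
    apply(clear_command("eth0"))
```

Apply the netem rule before rebooting in step 3, and remove it afterwards to restore normal connectivity. Note that a qdisc added this way does not survive the reboot by itself, so it has to be re-applied early in boot (or on an upstream router) for the test to be meaningful.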

Context (Environment)

Output of iotedge check

```
Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.7 but 1.4.8 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK
√ host can connect to and perform TLS handshake with DPS endpoint - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.27 but 1.4.33 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK

34 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
```

Device Information

Runtime Versions


Logs

aziot-edged logs
```
[edged.txt](https://github.com/user-attachments/files/16026538/edged.txt)
```

edge-agent logs
```
[edgeAgent.txt](https://github.com/user-attachments/files/16026542/edgeAgent.txt)
```

edge-hub logs
```
edgeHub is not starting
```

Additional Information

We understand from here that the containers should not be recreated after a restart.

eanavalentin commented 3 months ago

We also see no modules starting when we reboot our devices with the network cable disconnected, or when we restart VMs in a test environment with the NIC disabled. Is this expected behavior? It doesn't seem correct.

eanavalentin commented 3 months ago

After more testing, it seems the issue is caused by a missing or invalid backup.json. The file contains a base64-encoded string that makes no sense when decoded. However, the edgeAgent is sometimes able to use it to start all modules correctly, even with no connection; other times, the behavior initially reported is observed.
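A quick check of the backup file's state can help tell the "missing" case apart from the "present but unusable" case. The sketch below is hypothetical and makes two assumptions: that the file holds nothing but the base64 string, and that it is run against the host-side path of the persisted storage folder (the path shown is the in-container path from the log message). It does not try to interpret the decoded bytes; that they are unreadable may simply mean the backup is stored in a non-plaintext form, which this thread does not confirm.

```python
import base64
import binascii
from pathlib import Path


def inspect_backup(path: str) -> str:
    """Classify the backup file as 'missing', 'empty', 'not-base64' or 'base64'."""
    backup = Path(path)
    if not backup.is_file():
        return "missing"
    raw = backup.read_bytes().strip()  # ignore surrounding whitespace/newlines
    if not raw:
        return "empty"
    try:
        # validate=True rejects any character outside the base64 alphabet.
        base64.b64decode(raw, validate=True)
    except (binascii.Error, ValueError):
        return "not-base64"
    return "base64"


if __name__ == "__main__":
    # In-container path from the log message; the host path depends on the mount.
    print(inspect_backup("/tmp/edgeAgent/edgeAgent/backup.json"))
```

Running this before and after a reboot on an affected device would show whether the file disappears, is truncated, or is still present when the modules fail to start.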

bishal41 commented 3 months ago

@eanavalentin thanks for sharing the info, and glad you were able to get the issue sorted out. Do you have any further questions, or shall we close this?

bishal41 commented 2 months ago

@eanavalentin closing this issue. Please re-open it or let us know if you have any further questions.