Azure / iotedge

The IoT Edge OSS project
MIT License

While offline, edgeHub fails after automatic renewal of certificates #7321

Open sejonssonr opened 4 months ago

sejonssonr commented 4 months ago

Our company has a large number of IoT devices that rely on the offline capabilities of IoT Edge.

We recently discovered that devices can run offline for a maximum of ~25 days. The behaviour seems to be caused by the automatic renewal of device/workload certificates. The renewal interval can be specified by setting the edgeHub environment variable ServerCertificateRenewAfterInMs, but it maxes out at ~25 days (int32.max milliseconds).
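For reference, the ~25 day ceiling can be checked with a little arithmetic. This is a minimal sketch that assumes, as the "(int32.max)" note above suggests, that the value is limited to a signed 32-bit integer number of milliseconds; the variable names are only illustrative.

```python
from datetime import timedelta

# Largest value a signed 32-bit integer can hold (assumed cap on the setting).
INT32_MAX = 2**31 - 1

# Interpret that value as milliseconds and convert it to days.
max_interval = timedelta(milliseconds=INT32_MAX)
max_days = max_interval / timedelta(days=1)

print(f"{INT32_MAX} ms = {max_days:.2f} days")
```

So a device that stays offline longer than that will hit at least one renewal, whatever value is configured.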

When the certificate is renewed, the edgeHub is stopped and fails to start again if the device is offline. This causes both data loss and complete failure of downstream devices to run configured modules. The edgeHub does not recover when connectivity is restored.

In the documentation found here the following is stated: "While disconnected from IoT Hub, the IoT Edge device, its deployed modules, and any downstream devices can operate indefinitely."

Expected Behavior

IoT Edge modules including the edge hub can operate indefinitely in offline mode

Current Behavior

Edge Hub stops after being offline for ~25 days, which causes data loss at the devices

Steps to Reproduce


  1. Configure an edge device to run with specified CA certificates by updating config.toml:
       [edge_ca]
       cert = "file:///etc/pki/tls/certs/<mydeviceca>.full-chain.ca.cert.pem"
       pk = "file:///etc/pki/tls/private/<mydeviceca>.key.pem"
  2. For the edgeHub module set the environment variable ServerCertificateRenewAfterInMs to 60000 ms in order to enforce renewal every minute.
  3. When the device twin is downloaded by the device, simulate an offline situation by disconnecting the device from the internet.
  4. After a few minutes time, observe how the edgeHub fails to start after renewal of certificates.
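As a rough illustration of step 2, the snippet below builds the part of a deployment manifest that sets the environment variable on edgeHub. It is a sketch, not the exact manifest used for this report: the helper name `edgehub_env_patch` is hypothetical, and the other fields a real manifest needs for edgeHub (image settings, restart policy, and so on) are omitted.

```python
import json


def edgehub_env_patch(renew_after_ms: int) -> dict:
    """Return a manifest fragment that sets the edgeHub renewal interval.

    Hypothetical helper: only the env entry is produced, everything else
    required in a full deployment manifest is left out.
    """
    return {
        "modulesContent": {
            "$edgeAgent": {
                "properties.desired": {
                    "systemModules": {
                        "edgeHub": {
                            "env": {
                                # Environment values are given as strings.
                                "ServerCertificateRenewAfterInMs": {
                                    "value": str(renew_after_ms)
                                }
                            }
                        }
                    }
                }
            }
        }
    }


# 60000 ms forces a renewal every minute, as in step 2.
patch = edgehub_env_patch(60000)
print(json.dumps(patch, indent=2))
```

The fragment would need to be merged into the existing deployment before being applied to the device.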

Context (Environment)

Output of iotedge check

```
Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
× aziot-identity-service package is up-to-date - Error
    could not query https://aka.ms/latest-aziot-identity-service for latest available version
‼ host time is close to reference time - Warning
    Could not query NTP server
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
× host can connect to and perform TLS handshake with iothub AMQP port - Error
    Could not connect to .azure-devices.net : could not complete TLS handshake
× host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - Error
    Could not connect to .azure-devices.net : could not complete TLS handshake
× host can connect to and perform TLS handshake with iothub MQTT port - Error
    Could not connect to .azure-devices.net : could not complete TLS handshake

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
× aziot-edge package is up-to-date - Error
    Error while fetching latest versions of edge components: could not send HTTP request
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
× Agent image is valid and can be pulled from upstream - Error
    Failed to login to ta01iotcrd01.azurecr.io
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
× container on the default network can connect to upstream AMQP port - Error
    Container on the default network could not connect to .azure-devices.net:5671
× container on the default network can connect to upstream HTTPS / WebSockets port - Error
    Container on the default network could not connect to .azure-devices.net:443
× container on the default network can connect to upstream MQTT port - Error
    Container on the default network could not connect to .azure-devices.net:8883
× container on the IoT Edge module network can connect to upstream AMQP port - Error
    Container on the azure-iot-edge network could not connect to .azure-devices.net:5671
× container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - Error
    Container on the azure-iot-edge network could not connect to .azure-devices.net:443
× container on the IoT Edge module network can connect to upstream MQTT port - Error
    Container on the azure-iot-edge network could not connect to .azure-devices.net:8883

23 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.
12 check(s) raised errors. Re-run with --verbose for more details.
```

Device Information

Runtime Versions

Logs

aziot-edged logs [iotedge_system_logs.txt](https://github.com/user-attachments/files/16100719/iotedge_system_logs.txt)
edge-agent logs [edgeAgent_logs.txt](https://github.com/user-attachments/files/16100711/edgeAgent_logs.txt)
edge-hub logs [edgeHub_logs.txt](https://github.com/user-attachments/files/16100674/edgeHub_logs.txt)

Additional Information

Logs supplied as files due to max character limit.

VGDev1 commented 4 months ago

I've encountered the same issue on my devices.

vipeller commented 4 months ago

Hi @sejonssonr, sorry for the late answer, I started investigating this. I looked at the logs you attached. I see the periodic certificate renewal and a bunch of errors about not being able to connect to IoT Hub. I assume that is expected and caused by being offline. I also see a module connecting (roger-test-240424/dispatcher), although the related upstream connection fails - again, I assume this is expected as the upstream network connection was somehow disabled for this test.

The part I am not sure I understand is that the error description says "the edgeHub is stopped and fails to start again". However, in the logs I see edgeHub restarting every few minutes, and based on the logs, it accepts incoming connections.

I don't see this part in the log, but the expected behavior is that if that "dispatcher" module sends messages, those will be forwarded once edgeHub gets online again. Based on your description, "This causes both data loss and complete failure of downstream devices to run configured modules." - is this what does not happen?

To repeat, I want to clarify:

silvestropomestetrapak commented 4 months ago

@vipeller Sorry for the late reply. I'm a colleague of @sejonssonr. I'll answer point by point:

  • the problem is that edgeHub does not restart - but the attached logs show it restarting (and accepting connections, at least from that single module "dispatcher")

You can see that the edgeHub logs stop at 12:08:51, after one last attempt to start without connectivity. You can also see in the edgeAgent logs that it stops trying to restart the edgeHub module at 12:10:14, after this error: "Error getting edge agent twin from IoTHub". From 12:08:51 on, edgeHub remained stopped.

  • the problem is that messages sent during the offline session get lost
  • some other problem I missed (e.g. there are multiple modules and those cannot connect)

Messages get lost when edgeHub is down, of course, but the other important issue is that all IoT Edge downstream devices (nested child devices) fail to start when the edgeHub on the parent device is down. We expect a child device to be able to start the IoT Edge services even when the edgeHub on the parent device is not available.

It looks like, in this case, the Identity Service on child devices fails indefinitely. It's stuck in a restart loop.
This is from a child device in the same situation: [image]

We restored the system by manually restarting edgeHub on the parent once connectivity had been restored. This made the Identity Service on all child devices start again.

Lazer91 commented 3 months ago

Hi, is there any news on this?

bishal41 commented 3 months ago

@vipeller any update on this one?

sejonssonr commented 2 months ago

Any updates?

jlian commented 1 week ago

Hey folks, we are looking into prioritizing a fix. Currently we think it might be on the SDK side. The hypothesis is that when edgeAgent is offline, it makes a call into the SDK that gets stuck (never returns), and because of that it stops restarting stopped modules. And because edgeHub stops in order to renew its cert, it stays stopped. We still need to find some bandwidth on the team to get an isolated repro. Will update soon.
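To make the hypothesis concrete, here is a toy model of the failure mode. It is not edgeAgent code (edgeAgent is not written in Python) and not a proposed fix; every name in it is made up for illustration. It only shows how one awaited call that never returns would block everything after it in a loop, and how bounding that call with a timeout lets the loop move on to restarting stopped modules.

```python
import asyncio


async def hanging_sdk_call() -> None:
    """Stand-in for an SDK call that never returns while offline."""
    await asyncio.Event().wait()  # this event is never set


async def restart_stopped_modules(restarted: list) -> None:
    """Stand-in for the step that restarts stopped modules."""
    restarted.append("edgeHub")


async def reconcile_once(restarted: list, timeout_s: float) -> None:
    # Without a bound, awaiting hanging_sdk_call() would block forever and
    # the restart step below would never run.
    try:
        await asyncio.wait_for(hanging_sdk_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        pass  # give up on the stuck call and carry on
    await restart_stopped_modules(restarted)


restarted = []
asyncio.run(reconcile_once(restarted, timeout_s=0.1))
print(restarted)  # ['edgeHub']
```

Whether the real call can be bounded like this depends on where in the SDK it gets stuck, which is what the isolated repro would show.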