Open sejonssonr opened 4 months ago
I've encountered the same issue on my devices.
Hi @sejonssonr, sorry for the late answer. I have started investigating this and looked at the logs you attached. I see the periodic certificate renewal and a number of errors about not being able to connect to IoT Hub. I assume that is expected and caused by being offline. I also see a module connecting (roger-test-240424/dispatcher), although the related upstream connection fails. Again, I assume this is expected, as the upstream network connection was disabled for this test.
The part I am not sure I understand is that the error description says "the edgeHub is stopped and fails to start again". However, in the logs I see edgeHub restarting every few minutes and, based on the logs, it accepts incoming connections.
I don't see this part in the logs, but the expected behavior is that if that "dispatcher" module sends messages, they will be forwarded once edgeHub is online again. Based on your description, "This causes both data loss and complete failure of downstream devices to run configured modules.", is this what does not happen?
To repeat I want to clarify:
@vipeller Sorry for the late reply. I'm a colleague of @sejonssonr. I'll answer point by point:
- the problem is that edgeHub does not restart - but the attached logs show it restarting (and accepting connections, at least from that single module "dispatcher")
You can see that the edgeHub logs stop at 12:08:51, after one last attempt to start without connectivity.
You can also see in the edgeAgent logs that it stops trying to restart the edgeHub module at 12:10:14, after this error:
Error getting edge agent twin from IoTHub
From 12:08:51 on, the edgeHub remained stopped.
- the problem is that messages sent during the offline session get lost
- some other problem I missed (e.g. there are multiple modules and those cannot connect)
Messages get lost when edgeHub is down, of course, but the other important issue is that all IoT Edge downstream devices (nested child devices) fail to start when the edgeHub on the parent device is down. We expect a child device to be able to start the IoT Edge services even when the edgeHub on the parent device is not available.
It looks like, in this case, the Identity Service on child devices fails indefinitely. It's stuck in a restart loop.
This is from a child device in the same situation:
We restored the system by manually restarting edgeHub on the parent once connectivity had been restored. This made the Identity Service on all child devices start again.
Hi, is there any news on this?
@vipeller any update on this one ?
Any updates?
Hey folks, we are looking into prioritizing a fix. Currently we think the problem might be on the SDK side. The hypothesis is that when edgeAgent is offline, it makes a call into the SDK that gets stuck (never returns), and because of that it stops restarting stopped modules. And because edgeHub stops in order to renew its certificate, it stays stopped. We still need to find some bandwidth on the team to get an isolated repro. Will update soon.
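To illustrate the hypothesis above (not the actual edgeAgent or SDK code, which is not written in Python), here is a minimal sketch of how a supervisor loop can avoid being blocked by a call that never returns: the call is wrapped in a timeout so the loop still reaches the step that restarts stopped modules. All names here (`hanging_twin_fetch`, `reconcile_once`, the 0.05 s timeout) are hypothetical and chosen only for illustration.

```python
import asyncio


async def call_with_timeout(coro_factory, timeout_s: float, fallback=None):
    """Await coro_factory(); return fallback if it does not finish in time."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def hanging_twin_fetch():
    """Hypothetical stand-in for a cloud call that never returns while offline."""
    await asyncio.Event().wait()  # the event is never set, so this waits forever


async def reconcile_once(restarted: list) -> None:
    """One pass of a hypothetical supervisor loop."""
    twin = await call_with_timeout(hanging_twin_fetch, timeout_s=0.05)
    if twin is None:
        # Cloud unreachable: continue with whatever configuration is already known.
        pass
    # Hypothetical restart step; it is reached even though the fetch hung.
    restarted.append("edgeHub")


def demo() -> list:
    restarted: list = []
    asyncio.run(reconcile_once(restarted))
    return restarted
```

Without the timeout, `reconcile_once` would wait on `hanging_twin_fetch` forever and never reach the restart step, which matches the symptom described in this thread.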
Our company has a large number of IoT devices that rely on the offline capabilities of IoT Edge.
We recently discovered that devices can run offline for a maximum of ~25 days. The behaviour seems to be caused by the automatic renewal of device/workload certificates. The renewal interval can be specified by setting the edgeHub environment variable ServerCertificateRenewAfterInMs, but it maxes out at ~25 days (int32.max).
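As a quick check of the ~25 day figure, the arithmetic can be sketched as follows, assuming the setting is held as a signed 32-bit number of milliseconds (which is what "int32.max" in the report suggests; the function name is just illustrative):

```python
# Largest value a signed 32-bit integer can hold, read here as milliseconds.
INT32_MAX_MS = 2**31 - 1  # 2147483647

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day


def max_renew_after_days(limit_ms: int = INT32_MAX_MS) -> float:
    """Convert a millisecond limit into days."""
    return limit_ms / MS_PER_DAY


if __name__ == "__main__":
    print(f"{max_renew_after_days():.2f} days")  # prints "24.86 days"
```

So under that assumption the renewal cannot be postponed beyond roughly 24.86 days, which is consistent with the observed limit.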
When the certificate is renewed, the edgeHub is stopped and fails to start again if the device is offline. This causes both data loss and complete failure of downstream devices to run configured modules. The edgeHub does not recover when connectivity is restored.
In the documentation found here the following is stated: "While disconnected from IoT Hub, the IoT Edge device, its deployed modules, and any downstream devices can operate indefinitely."
Expected Behavior
IoT Edge modules including the edge hub can operate indefinitely in offline mode
Current Behavior
Edge Hub stops after being offline for ~25 days, which causes data loss on the devices
Steps to Reproduce
[edge_ca]
cert = "file:///etc/pki/tls/certs/<mydeviceca>.full-chain.ca.cert.pem"
pk = "file:///etc/pki/tls/private/<mydeviceca>.key.pem"
Context (Environment)
Output of `iotedge check`
```
Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
× aziot-identity-service package is up-to-date - Error
    could not query https://aka.ms/latest-aziot-identity-service for latest available version
‼ host time is close to reference time - Warning
    Could not query NTP server
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
× host can connect to and perform TLS handshake with iothub AMQP port - Error
    Could not connect to
```
Device Information
Runtime Versions
Logs
aziot-edged logs
[iotedge_system_logs.txt](https://github.com/user-attachments/files/16100719/iotedge_system_logs.txt)
edge-agent logs
[edgeAgent_logs.txt](https://github.com/user-attachments/files/16100711/edgeAgent_logs.txt)
edge-hub logs
[edgeHub_logs.txt](https://github.com/user-attachments/files/16100674/edgeHub_logs.txt)
Additional Information
Logs supplied as files due to max character limit.