Azure / iotedge

The IoT Edge OSS project
MIT License

edgeHub not starting with "Error: failed to deserialize state" error #5152

Open kvasdopil opened 3 years ago

kvasdopil commented 3 years ago

Expected Behaviour

After a hard reboot, the iotedge services start normally.

Current Behaviour

When the device starts after a hard reboot, the edgeHub module fails to start.

Steps to Reproduce

  1. Enable the MQTT broker
  2. Perform a hard reboot of the device
  3. Start the device

Context (Environment)

Output of iotedge check

```
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
× config.yaml has correct hostname - Error
    config.yaml has hostname zsolt-devbook but device reports hostname lexa-rpi.
    Hostname in config.yaml must either be identical to the device hostname or be a fully-qualified domain name that has the device hostname as the first component.
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
√ production readiness: identity certificates expiry - OK
‼ production readiness: certificates - Warning
    The Edge device is using self-signed automatically-generated development certificates.
    They will expire in 74 days (at 2021-09-06 12:22:20 UTC) causing module-to-module and downstream device communication to fail on an active deployment.
    After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs.
    Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: container engine - OK
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with DPS endpoint - OK
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

19 check(s) succeeded.
5 check(s) raised warnings. Re-run with --verbose for more details.
1 check(s) raised errors. Re-run with --verbose for more details.
```

Device Information

Runtime Versions

Logs

edge-hub logs

```
2021-06-24 08:30:10 +00:00 Starting Edge Hub
Jun 24 08:30:10.422 INFO watchdog: Starting Watchdog
Jun 24 08:30:10.422 INFO watchdog: Registering shutdown signal listener
Jun 24 08:30:10.425 INFO watchdog::child: Launched Edge Hub process with pid 10
Jun 24 08:30:10.429 INFO watchdog::child: Launched MQTT Broker process with pid 12
<6> 2021-06-24 08:30:10.444 +00:00 [INF] [mqttd::app::edgehub] - loading settings from a file /app/mqttd/broker.json
<6> 2021-06-24 08:30:10.448 +00:00 [INF] [mqttd::app::edgehub] - loading state...
<6> 2021-06-24 08:30:10.449 +00:00 [INF] [mqtt_broker::persist] - loading state from file /tmp/mqttd/state.dat.
Error: failed to deserialize state
Caused by:
    io error: failed to fill whole buffer
2021-06-24 08:30:10.626 +00:00 Edge Hub Main()
Jun 24 08:30:11.430 INFO watchdog::child: MQTT Broker process has stopped
Jun 24 08:30:12.426 INFO watchdog::child: Terminating Edge Hub process
Jun 24 08:30:12.426 INFO watchdog::child: Sending SIGTERM signal to Edge Hub process
Jun 24 08:30:13.426 INFO watchdog: Successfully stopped Edge Hub process
Jun 24 08:30:13.426 INFO watchdog: Successfully stopped MQTT Broker process
Jun 24 08:30:13.427 INFO watchdog: Stopped Watchdog process
```
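For readers unfamiliar with the message: `failed to fill whole buffer` is the text Rust's standard library reports when an exact-length read hits end-of-file early, which would be consistent with `/tmp/mqttd/state.dat` having been cut short by the hard reboot. The sketch below reproduces that failure mode in Python. The real layout of `state.dat` is not shown in this issue, so the length-prefixed format and the names `save_state`, `read_exact`, `load_state` and `demo` are assumptions made purely for illustration.

```python
import os
import struct
import tempfile


def save_state(path: str, payload: bytes) -> None:
    """Write a 4-byte big-endian length header followed by the payload.

    Hypothetical format; not the broker's actual on-disk layout.
    """
    with open(path, "wb") as f:
        f.write(struct.pack(">I", len(payload)))
        f.write(payload)


def read_exact(f, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the file ends first."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError("failed to fill whole buffer")
    return data


def load_state(path: str) -> bytes:
    """Read the header, then exactly as many payload bytes as it announces."""
    with open(path, "rb") as f:
        (length,) = struct.unpack(">I", read_exact(f, 4))
        return read_exact(f, length)


def demo() -> str:
    """Save a state file, truncate it mid-payload, and return the load error."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.dat")
        save_state(path, b"session-data" * 10)
        # Simulate power loss mid-write: keep the header and part of the payload.
        os.truncate(path, 20)
        try:
            load_state(path)
        except EOFError as err:
            return f"Error: failed to deserialize state: {err}"
        return "loaded"
```

Calling `demo()` returns an error string rather than the payload, because the header promises more bytes than the truncated file contains.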

Additional Information

The issue goes away after the edgeHub container is deleted.

kvasdopil commented 3 years ago

Not sure if the issue has anything to do with edgeHub storage not being persisted. At first glance, persisting the storage would have made things worse: the state would be corrupted permanently, and removing the container would not help.
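As background on how persistence code commonly guards against a half-written state file: a frequent pattern is to write to a temporary file, flush it to disk, and atomically rename it over the old file, and to fall back to an empty state if loading still fails. The sketch below illustrates that pattern in Python. It is not the broker's actual code; `save_atomic`, `load_or_default` and the JSON format are hypothetical choices for the example.

```python
import json
import os
import tempfile


def save_atomic(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Atomic replacement on POSIX: readers see the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_or_default(path: str) -> dict:
    """Load state; start empty if the file is missing or cannot be parsed."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError both derive from ValueError.
        return {}
```

Falling back to an empty state trades lost session data for a service that still starts. Whether that trade is acceptable for edgeHub is a design decision for the maintainers.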

dmolokanov commented 3 years ago

Hi @kvasdopil! Looks like we have a potential bug here. Thanks for sharing this!

github-actions[bot] commented 3 years ago

This issue is being marked as stale because it has been open for 30 days with no activity.