Azure / iotedge

The IoT Edge OSS project
MIT License
1.45k stars 457 forks source link

User module unable to connect to Hub with 'Communication_error' #7293

Closed chatnoir1982 closed 1 month ago

chatnoir1982 commented 1 month ago

We have a custom IoTEdge module that out of the blue has started experiencing loss of connectivity to the IoTHub, with _ConnectionStatus.Disconnected+ConnectionStatusChangeReason.Retry_Expired and ConnectionStatus.Disconnected+ConnectionStatusChangeReason.Communication_Error_. The loss of connection for custom modules is continuous, there are no intermittent connections to the hub. The IoTEdgeMetricsCollector we use is still online and working properly. This happens on only one gateway device, other devices are healthy.

One thing that stands out is that the EdgeHub module twin of the unhealthy device does not list the custom module in its reported properties' "clients" section, which seems to be the related to the outage. Only the IoTEdgeMetricsCollector module is listed, which is functioning normally.

}, "clients": { "$lastUpdated": "2024-05-22T06:41:40.9107753Z", "deviceId/IoTEdgeMetricsCollector": { "$lastUpdated": "2024-05-22T06:41:40.9107753Z", "status": { "$lastUpdated": "2024-05-22T06:41:40.9107753Z" }, "lastConnectedTimeUtc": { "$lastUpdated": "2024-05-22T06:41:40.9107753Z" } } }

From the EdgeHub module logs we see the following lines (our custom modules named "MyGatewayModule" and "MyHealthCheckModule": image

There is also an exception of interest in the EdgeAgent logs "Newtonsoft.Json.JsonSerializationException: Required property 'message' not found in JSON. Path '', line 1, position 1533.". I suspect however that this is unrelated to the issue. See logs below.

Context

  1. IoTEdge version 1.4.27, EdgeAgent and EdgeHub both at 1.4.35.
  2. We are using AMQP-WS to connect our IoTEdge device to IoTHub.
  3. The device running the module in question is using a 4G connection to connect to the outside world; signal strength, latency and bandwidth are healthy, no DNS issue
  4. Removing and re-adding the custom module to the device from the Azure portal, and deployed it through layered deployment as well, but it didn't fix the issue
  5. The "iotedge check" command returns all green with a couple warnings on package versions, no errors.
  6. Rebooting the gateway, restarting iotedge modules, running "iotedge system restart" does not seem to help, edgeHub is not picking up the custom modules.
  7. Deploying a new module on the device e.g. SimulatedTemperatureSensor correctly registers the module with EdgeHub's clients and the new module functions correctly

Has anyone experienced this issue before?

Thanks

Expected Behavior

Current Behavior

Our custom module is experiencing loss of connectivity to the IoTHub, with _ConnectionStatus.Disconnected+ConnectionStatusChangeReason.Retry_Expired and ConnectionStatus.Disconnected+ConnectionStatusChangeReason.Communication_Error_.

Steps to Reproduce

Not sure if/how this can be reproduced. I suspect accidental edits in edgeHub module twin (via Azure IoTHub explorer tool). What we're looking for is some hints to allow us to quickly recover the device.

Context (Environment)

  1. IoTEdge version 1.4.27, EdgeAgent and EdgeHub both at 1.4.35.
  2. We are using AMQP-WS to connect our IoTEdge device to IoTHub.
  3. The device running the module in question is using a 4G connection to connect to the outside world; signal strength, latency and bandwidth are healthy, no DNS issue
  4. Removing and re-adding the custom module to the device from the Azure portal, and deployed it through layered deployment as well, but it didn't fix the issue
  5. The "iotedge check" command returns all green with a couple warnings on package versions, no errors.
  6. Rebooting the gateway, restarting iotedge modules, running "iotedge system restart" does not seem to help.
  7. Reprovisioned device with "iotedge system reprovision"

Output of iotedge check

Click here ``` Configuration checks (aziot-identity-service) --------------------------------------------- √ keyd configuration is well-formed - OK √ certd configuration is well-formed - OK √ tpmd configuration is well-formed - OK √ identityd configuration is well-formed - OK √ daemon configurations up-to-date with config.toml - OK √ identityd config toml file specifies a valid hostname - OK ‼ aziot-identity-service package is up-to-date - Warning Installed aziot-identity-service package has version 1.4.7 but 1.4.8 is the latest stable version available. Please see https://aka.ms/aziot-update-runtime for update instructions. caused by: Installed aziot-identity-service package has version 1.4.7 but 1.4.8 is the latest stable version available. Please see https://aka.ms/aziot-update-runtime for update instructions. √ host time is close to reference time - OK √ preloaded certificates are valid - OK √ keyd is running - OK √ certd is running - OK √ tpmd is running - OK √ identityd is running - OK √ read all preloaded certificates from the Certificates Service - OK √ read all preloaded key pairs from the Keys Service - OK √ check all EST server URLs utilize HTTPS - OK √ ensure all preloaded certificates match preloaded private keys with the same ID - OK Connectivity checks (aziot-identity-service) -------------------------------------------- √ host can connect to and perform TLS handshake with iothub AMQP port - OK √ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with iothub MQTT port - OK √ host can connect to and perform TLS handshake with DPS endpoint - OK Configuration checks -------------------- √ aziot-edged configuration is well-formed - OK √ configuration up-to-date with config.toml - OK √ container engine is installed and functional - OK √ configuration has correct URIs for daemon mgmt endpoint - OK ‼ aziot-edge package is up-to-date - Warning Installed IoT Edge daemon has version 1.4.27 but 1.4.33 is the latest stable version available. Please see https://aka.ms/iotedge-update-runtime for update instructions. caused by: Installed IoT Edge daemon has version 1.4.27 but 1.4.33 is the latest stable version available. Please see https://aka.ms/iotedge-update-runtime for update instructions. √ container time is close to host time - OK √ DNS server - OK √ production readiness: logs policy - OK √ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK √ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK √ Agent image is valid and can be pulled from upstream - OK √ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK Connectivity checks ------------------- √ container on the default network can connect to upstream AMQP port - OK √ container on the default network can connect to upstream HTTPS / WebSockets port - OK √ container on the default network can connect to upstream MQTT port - OK √ container on the IoT Edge module network can connect to upstream AMQP port - OK √ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to upstream MQTT port - OK 37 check(s) succeeded. 2 check(s) raised warnings. ```

Device Information

Runtime Versions

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Logs

aziot-edged logs ``` ```
edge-agent logs ``` <6> 2024-05-22 06:20:21.560 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory] - Executing command: "Command Group: (\n [Stop module edgeHub]\n [Update module edgeHub]\n)" <7> 2024-05-22 06:20:21.572 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient] - Making a Http call to unix:///var/run/iotedge/mgmt.sock to stop module edgeHub <7> 2024-05-22 06:20:21.635 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Connecting socket /var/run/iotedge/mgmt.sock <7> 2024-05-22 06:20:21.636 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Connected socket /var/run/iotedge/mgmt.sock <7> 2024-05-22 06:20:21.638 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Sending request http://mgmt.sock/modules/edgeHub/stop?api-version=2022-08-03 <7> 2024-05-22 06:20:22.737 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Response received NoContent <7> 2024-05-22 06:20:22.751 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient] - Received a valid Http response from unix:///var/run/iotedge/mgmt.sock for stop module edgeHub <7> 2024-05-22 06:20:22.775 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient] - Making a Http call to unix:///var/run/iotedge/mgmt.sock to update module edgeHub <7> 2024-05-22 06:20:22.868 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Connecting socket /var/run/iotedge/mgmt.sock <7> 2024-05-22 06:20:22.870 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Connected socket /var/run/iotedge/mgmt.sock <7> 2024-05-22 06:20:22.870 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Sending request http://mgmt.sock/modules/edgeHub?api-version=2022-08-03 <7> 2024-05-22 06:20:24.268 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Uds.HttpUdsMessageHandler] - Response received Created <7> 2024-05-22 06:20:24.338 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient] - Error when getting an Http response from unix:///var/run/iotedge/mgmt.sock for update module edgeHub HTTP Response: Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.GeneratedCode.SwaggerException: Could not deserialize the response body stream as Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.GeneratedCode.ErrorResponse. Status: 201 Response: ---> Newtonsoft.Json.JsonSerializationException: Required property 'message' not found in JSON. Path '', line 1, position 1533. at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.EndProcessProperty(Object newObject, JsonReader reader, JsonObjectContract contract, Int32 initialDepth, JsonProperty property, PropertyPresence presence, Boolean setDefaultValue) at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateObject(Object newObject, JsonReader reader, JsonObjectContract contract, JsonProperty member, String id) at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue) at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue) at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent) at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType) at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType) at Newtonsoft.Json.JsonSerializer.Deserialize[T](JsonReader reader) at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.GeneratedCode.EdgeletHttpClient.ReadObjectResponseAsync[T](HttpResponseMessage response, IReadOnlyDictionary`2 headers, CancellationToken cancellationToken) --- End of inner exception stack trace --- at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.GeneratedCode.EdgeletHttpClient.ReadObjectResponseAsync[T](HttpResponseMessage response, IReadOnlyDictionary`2 headers, CancellationToken cancellationToken) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/generatedCode/EdgeletHttpClient.cs:line 1938 at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.GeneratedCode.EdgeletHttpClient.UpdateModuleAsync(String api_version, String name, Nullable`1 start, ModuleSpec module, CancellationToken cancellationToken) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/generatedCode/EdgeletHttpClient.cs:line 458 at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersioned.cs:line 152 <7> 2024-05-22 06:20:24.472 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory] - Executing command for operation ["update"] succeeded. <6> 2024-05-22 06:20:24.473 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory] - Executing command: "Start module edgeHub" ```
edge-hub logs ``` [edgeHub_output (2).txt](https://github.com/Azure/iotedge/files/15414591/edgeHub_output.2.txt) ```

Additional Information

Please provide any additional information that may be helpful in understanding the issue.

chatnoir1982 commented 1 month ago

Solved. The issue was caused due to erroneously deleting the base deployment that puts the system modules on the device, then following up with layered/manual deployments. The edgeAgent and edgeHub modules were working, but would not see any newly deployed custom modules. Strangely enough, edgeHub would happily register other non-custom modules from the masrketplace, like Simulated Temperature Sensor. Restoring the base deployment brought everything back to working order.