Azure / iotedge

The IoT Edge OSS project
MIT License
1.45k stars 457 forks source link

"Did not receive ack for message / System.TimeoutException: Message completion response not received" errors after environment upgrade #7286

Closed chriswaters78 closed 1 month ago

chriswaters78 commented 2 months ago

Expected Behavior

Modules should run without logging repeated errors in the edgeHub logs after upgrading from Ubuntu 18.04 to Ubuntu 22.04

Current Behavior

We are testing upgrading production devices from Ubuntu 18.04 LTS to Ubuntu 22.04 LTS. We have a test box which is an Azure hosted Ubuntu VM. This runs a set of mock modules which simulate the operation of our production modules which continuously poll modbus readings and respond to Cloud to Device commands.

Initially we upgraded this test box in place and moved to the latest version of the iotedge runtime and client libraries, including moving from .Net 6 to .Net 8 for the custom modules. This has resulted in continuous drop outs of our modules which have been running stably for a number of years with very good reliability. As a further test we provisioned a fresh Ubuntu 22.04 LTS VM, and performed a clean install of iotedge as per the documentation. We then deployed our modules again and are seeing the same behaviour, dropouts and edgeHub logs showing these repeated warnings:

<4> 2024-05-08 10:35:04.483 +00:00 [WRN] - Did not receive ack for message 6e1413bf-003b-4245-bd05-1bdc4918b339 from device/module XXXXX/ipswichb-mockmodbus
<4> 2024-05-08 10:35:04.484 +00:00 [WRN] - Error sending messages to module XXXXX/ipswichb-mockmodbus
System.TimeoutException: Message completion response not received
   at Microsoft.Azure.Devices.Edge.Hub.Core.Device.DeviceMessageHandler.SendMessageAsync(IMessage message, String input) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/device/DeviceMessageHandler.cs:line 498
   at Microsoft.Azure.Devices.Edge.Hub.Core.Routing.ModuleEndpoint.ModuleMessageProcessor.ProcessAsync(ICollection`1 routingMessages, IDeviceProxy dp, CancellationToken token) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/routing/ModuleEndpoint.cs:line 166

The test modules are in pairs, with one module listening for Cloud to Device messages, and routing that message on to another module which alters its simulated modbus readings in response. An example of a route from one module to another is:

FROM /messages/modules/stowmarketa-sitecontrol/outputs/commandOutput INTO BrokeredEndpoint("/modules/stowmarketa-mockmodbus/inputs/input")

The modules are intermittently working in that we are receiving the mock polling output at our IoT Hub, and the messages routed between modules are being sent and intermittently received, its just there are many drop outs and errors.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Provision new Ubuntu VM
  2. Install iotedge
  3. Deploy custom modules with cloud to device listener and inter module communication

Context (Environment)

Output of iotedge check

Click here ``` Configuration checks (aziot-identity-service) --------------------------------------------- √ keyd configuration is well-formed - OK √ certd configuration is well-formed - OK √ tpmd configuration is well-formed - OK √ identityd configuration is well-formed - OK √ daemon configurations up-to-date with config.toml - OK √ identityd config toml file specifies a valid hostname - OK √ aziot-identity-service package is up-to-date - OK √ host time is close to reference time - OK √ preloaded certificates are valid - OK √ keyd is running - OK √ certd is running - OK √ identityd is running - OK √ read all preloaded certificates from the Certificates Service - OK √ read all preloaded key pairs from the Keys Service - OK √ check all EST server URLs utilize HTTPS - OK √ ensure all preloaded certificates match preloaded private keys with the same ID - OK Connectivity checks (aziot-identity-service) -------------------------------------------- √ host can connect to and perform TLS handshake with iothub AMQP port - OK √ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with iothub MQTT port - OK Configuration checks -------------------- √ aziot-edged configuration is well-formed - OK √ configuration up-to-date with config.toml - OK √ container engine is installed and functional - OK √ configuration has correct URIs for daemon mgmt endpoint - OK √ aziot-edge package is up-to-date - OK √ container time is close to host time - OK √ DNS server - OK √ production readiness: logs policy - OK ‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. ‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. √ Agent image is valid and can be pulled from upstream - OK √ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK Connectivity checks ------------------- √ container on the default network can connect to upstream AMQP port - OK √ container on the default network can connect to upstream HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to upstream AMQP port - OK √ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK 33 check(s) succeeded. ```

Device Information

Runtime Versions

Logs

edge-agent logs ``` <6> 2024-05-08 10:35:25.616 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'plymouth-mockmodbus' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:25.616 +00:00 [INF] - Plan execution started for deployment 30 <6> 2024-05-08 10:35:25.617 +00:00 [INF] - Executing command: "Saving plymouth-mockmodbus to store" <6> 2024-05-08 10:35:25.619 +00:00 [INF] - Plan execution ended for deployment 30 <6> 2024-05-08 10:35:30.649 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'ipswicha-sitecontrol' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.649 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'stowmarketa-sitecontrol' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'stowmarketb-mockmodbus' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'stowmarketb-sitecontrol' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'crumlin-mockmodbus' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'stowmarketa-mockmodbus' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'ipswicha-mockmodbus' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'ipswichb-sitecontrol' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'plymouth-sitecontrol' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'crumlin-sitecontrol' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:35:30.650 +00:00 [INF] - Plan execution started for deployment 30 <6> 2024-05-08 10:35:30.650 +00:00 [INF] - Executing command: "Saving ipswicha-sitecontrol to store" <6> 2024-05-08 10:35:30.650 +00:00 [INF] - Executing command: "Saving stowmarketa-sitecontrol to store" <6> 2024-05-08 10:35:30.651 +00:00 [INF] - Executing command: "Saving stowmarketb-mockmodbus to store" <6> 2024-05-08 10:35:30.651 +00:00 [INF] - Executing command: "Saving stowmarketb-sitecontrol to store" <6> 2024-05-08 10:35:30.651 +00:00 [INF] - Executing command: "Saving crumlin-mockmodbus to store" <6> 2024-05-08 10:35:30.651 +00:00 [INF] - Executing command: "Saving stowmarketa-mockmodbus to store" <6> 2024-05-08 10:35:30.652 +00:00 [INF] - Executing command: "Saving ipswicha-mockmodbus to store" <6> 2024-05-08 10:35:30.652 +00:00 [INF] - Executing command: "Saving ipswichb-sitecontrol to store" <6> 2024-05-08 10:35:30.652 +00:00 [INF] - Executing command: "Saving plymouth-sitecontrol to store" <6> 2024-05-08 10:35:30.653 +00:00 [INF] - Executing command: "Saving crumlin-sitecontrol to store" <6> 2024-05-08 10:35:30.653 +00:00 [INF] - Plan execution ended for deployment 30 <6> 2024-05-08 10:35:30.809 +00:00 [INF] - Updated reported properties <6> 2024-05-08 10:35:35.963 +00:00 [INF] - Updated reported properties <6> 2024-05-08 10:36:01.051 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'edgeHub' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:36:01.051 +00:00 [INF] - Plan execution started for deployment 30 <6> 2024-05-08 10:36:01.051 +00:00 [INF] - Executing command: "Saving edgeHub to store" <6> 2024-05-08 10:36:01.052 +00:00 [INF] - Plan execution ended for deployment 30 <6> 2024-05-08 10:36:06.215 +00:00 [INF] - Updated reported properties <6> 2024-05-08 10:36:51.949 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'ipswichb-mockmodbus' as it has been running healthy for 00:10:00. <6> 2024-05-08 10:36:51.949 +00:00 [INF] - Plan execution started for deployment 30 <6> 2024-05-08 10:36:51.949 +00:00 [INF] - Executing command: "Saving ipswichb-mockmodbus to store" <6> 2024-05-08 10:36:51.950 +00:00 [INF] - Plan execution ended for deployment 30 <6> 2024-05-08 10:36:57.104 +00:00 [INF] - Updated reported properties <6> 2024-05-08 10:37:26.766 +00:00 [INF] - Received direct method call - GetModuleLogs <6> 2024-05-08 10:37:26.767 +00:00 [INF] - Received request GetModuleLogs with payload <6> 2024-05-08 10:37:26.785 +00:00 [INF] - Processing request to get logs for {"schemaVersion":"1.0","items":{"id":"^edgeAgent$","filter":{"tail":1500,"since":"5m","until":null,"loglevel":null,"regex":""}},"encoding":1,"contentType":1} <6> 2024-05-08 10:37:26.814 +00:00 [INF] - Initiating streaming logs for edgeAgent <6> 2024-05-08 10:37:26.925 +00:00 [INF] - Received 700 bytes of logs for edgeAgent <6> 2024-05-08 10:37:26.929 +00:00 [INF] - Successfully handled request GetModuleLogs <6> 2024-05-08 10:37:28.120 +00:00 [INF] - Received direct method call - GetModuleLogs <6> 2024-05-08 10:37:28.120 +00:00 [INF] - Received request GetModuleLogs with payload <6> 2024-05-08 10:37:28.120 +00:00 [INF] - Processing request to get logs for {"schemaVersion":"1.0","items":{"id":"^edgeAgent$","filter":{"tail":1500,"since":"5m","until":null,"loglevel":null,"regex":""}},"encoding":1,"contentType":1} ```

Edge Hub Logs

Additional Information

I can see from the logs that the 6 modules named [SITENAME]-sitecontrol are receiving Cloud to Device messages every 30 seconds and these should be routed to the modules names [SITENAME]-mockmodbus, however the receiving modules seem to be receiving these at 90s intervals so 2 out of 3 of these are never being received, presumably related to the large number of warnings and exceptions being logged by the edgeHub.

As mentioned this exact setup has been running very stably on Ubuntu 18.04 with periodic upgrades to the iotedge services and custom modules device clients, however with a fresh install of the latest version of everything on an Ubuntu 22.04 Azure VM this seems very unstable.

huguesBouvier commented 1 month ago

Have just upgraded ubuntu? not iotedge? It is very possible that the issue stems from the change of the .Net library. Is it possible for you to try again with the older .Net library?

chriswaters78 commented 1 month ago

For the logs displayed this was a new instance of everything (new VM with Ubuntu 22.04, new iot edge install), please see the version numbers above.

Strangely since posting this bug the system has settled down and is no longer throwing these errors or dropping out but it is making me incredibly nervous about attempting this upgrade on production if it is going to result in periods of instability and random drop outs.

ryanwinter commented 1 month ago

Hi @chriswaters78,

We havent been able to repro on our end, so I will close for now given this is not currently occuring. If this is still a concern or you see this happening again, can you please provide information on what client SDK and iot edge version your were running previously and which versions you upgraded too.