Azure / iotedge

The IoT Edge OSS project
MIT License
1.45k stars 457 forks source link

EdgeAgent does not restart crashed module if busy downloading new manifest 1.4.10 #7259

Open mathiash98 opened 3 months ago

mathiash98 commented 3 months ago

Expected Behavior

EdgeAgent should ALWAYS keep all modules running regardless of edgeAgent operation.

Current Behavior

If EdgeAgent is busy downloading new version of modules it will not ensure existing modules are running.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Slow internet setup
  2. Deploy new version deployment manifest
  3. Ensure EdgeAgent struggles downloading new versions (100 second timeout) -> repeats for a long time
  4. Crash one of the existing modules with fatal exception or similar
  5. Observe that EdgeAgent does not restart the crashing module

Context (Environment)

Deployments operating on slow satellite internet connection with high ping.

Deployment manifest has:

Output of iotedge check

Click here ``` Configuration checks -------------------- √ config.yaml is well-formed - OK √ config.yaml has well-formed connection string - OK √ container engine is installed and functional - OK × config.yaml has correct hostname - Error config.yaml has hostname PC2FD48A but device reports hostname pc2fd48a. Hostname in config.yaml must either be identical to the device hostname or be a fully-qualified domain name that has the device hostname as the first component. √ config.yaml has correct URIs for daemon mgmt endpoint - OK √ latest security daemon - OK √ host time is close to real time - OK √ container time is close to host time - OK ‼ DNS server - Warning Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub. Please see https://aka.ms/iotedge-prod-checklist-dns for best practices. You can ignore this warning if you are setting DNS server per module in the Edge deployment. ‼ production readiness: certificates - Warning The Edge device is using self-signed automatically-generated development certificates. They will expire in 204 days (at 2024-10-26 11:29:55 UTC) causing module-to-module and downstream device communication to fail on an active deployment. After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs. Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices. √ production readiness: logs policy - OK ‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. ‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. Connectivity checks ------------------- √ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK √ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK √ container on the default network can connect to IoT Hub AMQP port - OK √ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK √ container on the default network can connect to IoT Hub MQTT port - OK √ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK √ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK 17 check(s) succeeded. 4 check(s) raised warnings. Re-run with --verbose for more details. 1 check(s) raised errors. Re-run with --verbose for more details. ```

Device Information

Runtime Versions

Logs

edge-agent logs ``` <6> 2024-04-08 07:39:46.633 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 36 and reported properties version 321. <6> 2024-04-08 07:42:30.361 +00:00 [INF] - Starting compaction of stores <6> 2024-04-08 07:42:30.361 +00:00 [INF] - Starting compaction of store Metrics <6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store deploymentConfig <6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store default <6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store moduleState <6> 2024-04-08 07:43:13.374 +00:00 [INF] - Starting periodic operation Metrics Scrape... <6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping Metrics <6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics <6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics <6> 2024-04-08 07:43:13.385 +00:00 [INF] - Storing Metrics <6> 2024-04-08 07:43:13.389 +00:00 [INF] - Scraped and Stored Metrics <6> 2024-04-08 07:43:13.389 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape <4> 2024-04-08 07:43:48.781 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 07:53:53.787 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 08:03:58.789 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 08:14:03.794 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 08:24:08.798 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <6> 2024-04-08 08:28:48.987 +00:00 [INF] - Starting periodic operation refresh twin config... <6> 2024-04-08 08:28:51.738 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 36 and reported properties version 321. <6> 2024-04-08 08:28:51.740 +00:00 [INF] - Successfully completed periodic operation refresh twin config <4> 2024-04-08 08:34:13.801 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <6> 2024-04-08 08:43:13.393 +00:00 [INF] - Starting periodic operation Metrics Scrape... <6> 2024-04-08 08:43:13.393 +00:00 [INF] - Scraping Metrics <6> 2024-04-08 08:43:13.393 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics <6> 2024-04-08 08:43:13.394 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics <6> 2024-04-08 08:43:13.408 +00:00 [INF] - Storing Metrics <6> 2024-04-08 08:43:13.414 +00:00 [INF] - Scraped and Stored Metrics <6> 2024-04-08 08:43:13.414 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape <4> 2024-04-08 08:44:18.807 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 08:54:23.813 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 09:04:28.820 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 <4> 2024-04-08 09:14:33.823 +00:00 [WRN] - Agent reconcile concluded with errors. System.TimeoutException: Operation timed out at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271 ```
david-emakenemi commented 3 months ago

@jollyaakash can you help take a look at this?

vadim-kovalyov commented 2 months ago

Hey @mathiash98 thanks for reporting this. It looks like an existing problem where the pull operation happens synchronously and prevents EdgeAgent reacting to other module lifecycle events. I'll bring it up with the team to see how we can fix that. Meanwhile I can suggest a workaround (if applicable/possible) is to ssh to the device and manually pre-pull (docker pull) the images before applying the deployment. Also you can set imagePullPolicy: Never for that.

Just a side note about ModuleUpdateMode: WaitForAllPulls. It means that all images will be downloaded first, and then containers (re)started, but EA still makes pull synchronously.

mathiash98 commented 2 months ago

Hey @mathiash98 thanks for reporting this. It looks like an existing problem where the pull operation happens synchronously and prevents EdgeAgent reacting to other module lifecycle events. I'll bring it up with the team to see how we can fix that. Meanwhile I can suggest a workaround (if applicable/possible) is to ssh to the device and manually pre-pull (docker pull) the images before applying the deployment. Also you can set imagePullPolicy: Never for that.

Just a side note about ModuleUpdateMode: WaitForAllPulls. It means that all images will be downloaded first, and then containers (re)started, but EA still makes pull synchronously.

Thanks for the information, we will try to keep an eye out for the installation with poor internet connection. Manually polling on 100 devices is too cumbersome, but we will create some dashboard that will show the devices that IotEdge is reporting as failing to mitigate this

mathiash98 commented 2 weeks ago

@vadim-kovalyov our dashboard is helping us identifying the devices that need help. But it is an ongoing issue everytime we release new software for a week or two as these devices can have extremely slow internet for 1-2 weeks before the vessel is back in port.

Are there any progress on this issue?

bishal41 commented 1 week ago

@vadim-kovalyov any update on this issue?

bishal41 commented 6 hours ago

Hi @mathiash98, glad to hear that you have the dashboard that's working as a potential workaround to identify the devices that need help with. @vadim-kovalyov and I discussed this issue briefly. Unfortunately, this is a very rare case and fixing it will require a lot of effort including rework on the edgeAgent logic. We don't have any plans to fix this at the moment and will put a work item in our backlog to revisit this in the future.