Azure / iotedge

The IoT Edge OSS project
MIT License
1.46k stars 460 forks source link

Unable to start custom module #7106

Closed simondegheselle closed 1 year ago

simondegheselle commented 1 year ago

Description

When trying to run a custom Edge module via Azure IoT Edge runtime, it fails to start. However, when the module name is changed, the EdgeAgent is able to successfully start the module. This issue seems confined to Azure IoT as running the Docker container directly without Azure IoT runtime works as expected.

The manifest uses a local registry to pull the images. Images are verified to be stored correctly in the local registry. For example:

docker pull localhost:5000/edge-module-camera-consumer-pylon:latest
latest: Pulling from edge-module-camera-consumer-pylon
Digest: sha256:032b08b0e19ffc804a4f7f832eed55e9d4f27f3ee8681c5b9f4d70789b
d03024
Status: Image is up to date for localhost:5000/edge-module-camera-consume
r-pylon:latest
localhost:5000/edge-module-camera-consumer-pylon:latest

Error logs

Error logs from edgeAgent module

e.PlanRunner.OrderedRetryPlan
   at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlan
   at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(Cancel
   at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(Cancel
 ---> (Inner Exception #1) Microsoft.Azure.Devices.Edge.Agent.Edgelet.Edg
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagem
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Modul
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClie
--- End of stack trace from previous location ---
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClie
   at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.Loggi
   at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.Execu
   at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.Loggi
   at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlan
<6> 2023-09-13 17:44:33.566 +08:00 [INF] - Updated reported properties
<6> 2023-09-13 17:44:39.101 +08:00 [INF] - Plan execution started for dep
<6> 2023-09-13 17:44:39.101 +08:00 [INF] - Executing command: "[Null]"
<6> 2023-09-13 17:44:39.101 +08:00 [INF] - Executing command: "Saving cam
<6> 2023-09-13 17:44:39.104 +08:00 [INF] - Executing command: "Command Gr
<6> 2023-09-13 17:44:39.104 +08:00 [INF] - Executing command: "Command Gr
<6> 2023-09-13 17:44:39.781 +08:00 [INF] - Executing command: "Start modu
<3> 2023-09-13 17:44:56.989 +08:00 [ERR] - Executing command for operatio
Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException-
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagem
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Modul
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClie
--- End of stack trace from previous location ---
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClie
   at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.Loggi
<3> 2023-09-13 17:44:56.995 +08:00 [ERR] - Executing command for operatio
Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException-
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagem
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Modul
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClie
--- End of stack trace from previous location ---
   at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClie
   at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.Loggi
   at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.Execu
   at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.Loggi
<3> 2023-09-13 17:44:57.003 +08:00 [ERR] - Step failed in deployment 168, continuing execution. Failure when running command Command Group: (
  [Command Group: (
  [Prepare module cam-40348799]
  [Command Group: (
  [Stop module cam-40348799]
  [Update module cam-40348799]
)]
)]
  [Start module cam-40348799]

Azure iot edge system logs

Sep 13 17:46:19 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:19Z [INFO
Sep 13 17:46:19 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:19Z [INFO
Sep 13 17:46:20 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:20Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [ERR!
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [ERR!
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO] - Starting new listener for module cam-40341771
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO] - Failed to start module cam-40341771, error Failed to listen on workload socket: Is a directory (os error 21)
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [ERR!] - Could not wait on workload manager response, start of module: cam-40341771
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO] - --> 500 {"content-type": "application/json"}
Sep 13 17:46:21 GSAB177247 aziot-edged[28792]: 2023-09-13T09:46:21Z [INFO] - <-- GET /modules?api-version=2020-07-07 {"host": "2f7661722f72756e2f696f74656467652f6d676d742e736f636b:0"}

Manifest

Fails for manifest with following options

          "cam-40348799": {
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "version": "3",
            "settings": {
              "image": "${CONTAINER_REGISTRY_SERVER}/edge-module-camera-consumer-pylon:latest",
              "createOptions": {
                "Env": [
                  "CAMERA_TYPE=HARV_BASLER_RGB"
                ],
                "HostConfig": {
                  "NetworkMode": "host",
                  "Privileged": true,
                  "Binds": [
                    "/tmp:/tmp:rw",
                    "/etc/timezone:/etc/timezone:ro",
                    "/etc/localtime:/etc/localtime:ro"
                  ]
                },
                "Entrypoint": [
                  "python",
                  "-m",
                  "app",
                  "--produce_ids",
                  "depalletizer_2_rgb_camera"
                ],
                "NetworkingConfig": {
                  "EndpointsConfig": {
                    "host": {}
                  }
                }
              }
            }
          },

And for a manifest without any create options

  "cam-40348799": {
    "type": "docker",
    "status": "running",
    "restartPolicy": "always",
    "version": "3",
    "settings": {
         "image": "${CONTAINER_REGISTRY_SERVER}/edge-module-camera-consumer-pylon:latest"
       }
    },

Iot edge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
√ aziot-identity-service package is up-to-date - OK
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
√ aziot-edge package is up-to-date - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
31 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
…c-730ai@GSAB177247 ~/c/g/development_utils (development)
sudo iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
√ aziot-identity-service package is up-to-date - OK
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
√ aziot-edge package is up-to-date - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
31 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.

Troubleshooting

arsing commented 1 year ago

Failed to start module cam-40341771, error Failed to listen on workload socket: Is a directory (os error 21)

This is the problem. At some point something that isn't IoT Edge started the container. That meant the dedicated workload socket (/var/lib/aziot/edged/mnt/${module_id}.sock) that is mounted in every module container did not exist, so dockerd helpfully created the volume mount as a directory. Now when IoT Edge tries to start it it can't create the socket because there's already a directory there.

You can verify that that path is a directory, and you can fix it by deleting it.

The next release after 1.4.18 will automatically delete the directory when this happens.

arsing commented 1 year ago

The next release after 1.4.18 will automatically delete the directory when this happens.

https://github.com/Azure/azure-iotedge/releases/tag/1.4.20