Azure / azure-iot-operations

The official repo for Azure IoT Operations.

Error during installation on a K3s Cluster #44

Open geebinge opened 6 months ago

geebinge commented 6 months ago

Hi, I’m new to Azure Arc and Azure IoT Operations, but since IoT Edge may be retired sooner or later, I am now playing around with these technologies. I installed it on a Moxa device with Ubuntu 22.04 and the installation worked fine. I have now done the same installation on a K3s cluster with:

1 x UP Squared IoT Edge - Intel(R) Atom(TM) Processor E3950 @ 1.60GHz (4 Cores), 8 GB RAM
4 x Raspberry Pi 4 Model B Rev 1.2 - Cortex-A72 (ARM v8), 4 GB RAM

The installation with Arc was no issue, but the installation of Azure IoT Operations produced an error.

After running

az iot ops init \
    --subscription ${SUBSCRIPTION_ID}  \
    -g ${RESOURCE_GROUP} \
    --cluster $CLUSTER_NAME  \
    --kv-id /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/gebinger/providers/Microsoft.KeyVault/vaults/${KEYVAULTNAME} \
    --custom-location ${CLUSTER_NAME}-cl \
    --target ${CLUSTER_NAME}-target \
    --dp-instance ${CLUSTER_NAME}-processor \
    --simulate-plc \
    --mq-instance mq-instance-titanpi \
    --mq-mode auto

All checks passed

image

I got the error

(DeploymentFailed) At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
Code: DeploymentFailed
Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
Target: /subscriptions/51b5e5eb-e220-43a2-a9e8-adcc2d07aefa/resourceGroups/gebinger_azureiotoperations/providers/Microsoft.Resources/deployments/aziotops.init.9607b8452232490bb97335d36976a242
Exception Details:      (ResourceDeploymentFailure) The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.
        Code: ResourceDeploymentFailure
        Message: The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.
        Target: /subscriptions/51b5e5eb-e220-43a2-a9e8-adcc2d07aefa/resourceGroups/gebinger_azureiotoperations/providers/Microsoft.Kubernetes/connectedClusters/geazureiotoperations_titanpi/providers/Microsoft.KubernetesConfiguration/extensions/azure-iot-operations
        Exception Details:      (ExtensionOperationFailed) The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state Last resource not ready was azure-iot-operations/aio-cert-manager-cainjector Pod event from k8s: Error: ImagePullBackOff For more events check kubernetes events using kubectl events -n azure-iot-operations : Recommendation Please contact Microsoft support for further inquiries : InnerError [release azure-iot-operations failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: Please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.
                Code: ExtensionOperationFailed
                Message: The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state Last resource not ready was azure-iot-operations/aio-cert-manager-cainjector Pod event from k8s: Error: ImagePullBackOff For more events check kubernetes events using kubectl events -n azure-iot-operations : Recommendation Please contact Microsoft support for further inquiries : InnerError [release azure-iot-operations failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: Please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.
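The top-level message asks to list the deployment operations for details. A minimal sketch of doing that from the Target line, assuming the Azure CLI is logged in to the same subscription; parse_target and list_failed_operations are hypothetical helper names, not part of any CLI:

```shell
# parse_target: given an ARM deployment resource ID, print
# "<resource-group> <deployment-name>". Hypothetical helper.
parse_target() {
  printf '%s\n' "$1" \
    | sed -n 's#^.*/resourceGroups/\([^/]*\)/providers/Microsoft\.Resources/deployments/\([^/]*\)$#\1 \2#p'
}

# list_failed_operations: show only the failed operations of that deployment.
# Assumes `az login` has been done and the subscription is selected.
list_failed_operations() {
  # Re-use the positional parameters for resource group ($1) and name ($2).
  set -- $(parse_target "$1")
  az deployment operation group list \
    --resource-group "$1" --name "$2" \
    --query "[?properties.provisioningState=='Failed']"
}
```

Calling list_failed_operations with the Target value from the first error block should return the same ExtensionOperationFailed details as structured JSON, if the deployment record still exists.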

The az iot ops check shows

image

In the portal, the ARC Extensions show this:

image

The output of kubectl events -n azure-iot-operations is attached.

2024.03.21errors_during_installationv2.txt
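A long events dump like this can be narrowed down to the images that failed to pull. A minimal sketch, assuming the event text contains the containerd-style message Failed to pull image "..."; failed_images is a hypothetical helper name:

```shell
# failed_images: read event text on stdin and print each distinct image
# that appears in a 'Failed to pull image "<image>"' message.
failed_images() {
  grep 'Failed to pull image' \
    | sed -n 's/.*Failed to pull image "\([^"]*\)".*/\1/p' \
    | sort -u
}

# Example (requires access to the cluster):
#   kubectl events -n azure-iot-operations | failed_images
# or, against a saved file:
#   failed_images < 2024.03.21errors_during_installationv2.txt
```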

calebherbison commented 6 months ago

I see the following errors from kubectl events that may be causing the problem:

Failed to pull image "mcr.microsoft.com/cbl-mariner/base/cert-manager-controller:1.11.2": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/cbl-mariner/base/cert-manager-controller:1.11.2": no match for platform in manifest: not found

Failed to pull image "mcr.microsoft.com/cbl-mariner/base/cert-manager-cainjector:1.11.2": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/cbl-mariner/base/cert-manager-cainjector:1.11.2": no match for platform in manifest: not found

Failed to pull image "mcr.microsoft.com/cbl-mariner/base/cert-manager-webhook:1.11.2": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/cbl-mariner/base/cert-manager-webhook:1.11.2": no match for platform in manifest: not found

mcr.microsoft.com/cbl-mariner/base/cert-manager-controller:1.11.2
mcr.microsoft.com/cbl-mariner/base/cert-manager-cainjector:1.11.2
mcr.microsoft.com/cbl-mariner/base/cert-manager-webhook:1.11.2

I'm also getting these errors. My k3s cluster is running on an ARM64 device and I'm guessing an ARM64 build of the container image isn't available to us yet.
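That guess can be checked against the registry rather than assumed. A minimal sketch, assuming the Docker CLI is installed and can reach mcr.microsoft.com; list_architectures is a hypothetical helper name. If the tag is a single-platform image rather than a manifest list, the JSON may contain no architecture field and the helper prints nothing.

```shell
# list_architectures: read manifest JSON on stdin and print each distinct
# "architecture" value, one per line. Hypothetical helper, not part of any CLI.
list_architectures() {
  grep -o '"architecture"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort -u
}

# Example (requires Docker and registry access):
#   docker manifest inspect \
#     mcr.microsoft.com/cbl-mariner/base/cert-manager-cainjector:1.11.2 \
#     | list_architectures
```

If arm64 is missing from the output, that would be consistent with the "no match for platform in manifest" message seen on the ARM64 nodes.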

geebinge commented 6 months ago

Hi, thanks, that was very attentive! It helped me find this: https://www.reddit.com/r/k3s/comments/ttu4y3/can_you_mix_arm_and_x64_with_k3s_is_there_an_easy/

The UP Squared IoT Edge is an x64 machine. As far as I understand the thread, k3s should be able to deploy the right image to the right architecture. Or did I miss something in the Reddit question?

calebherbison commented 6 months ago

Yeah, a multi-arch k3s cluster is possible. I believe it requires multi-arch Docker images (or node affinity), and I suspect the cert-manager-* images aren't multi-arch. In your case, you have some ARM64 nodes, and when you deploy IoT Operations, it schedules some of its pods on those instead of the x64 node, where the image pull then fails. Can you confirm whether your pods are being scheduled on the ARM64 or x64 nodes?
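One way to answer that question is to join each pod's node with that node's reported architecture. A minimal sketch, assuming kubectl is configured for the cluster; join_pods_with_arch and pods_by_arch are hypothetical helper names:

```shell
# join_pods_with_arch: $1 is a file of "<node> <arch>" lines, $2 a file of
# "<pod> <node>" lines. Prints "<pod> <node> <arch>" for every pod.
join_pods_with_arch() {
  awk 'NR == FNR { arch[$1] = $2; next }
       { print $1, $2, (($2 in arch) ? arch[$2] : "unknown") }' "$1" "$2"
}

# pods_by_arch: collect both lists from the cluster and join them.
# The namespace defaults to azure-iot-operations.
pods_by_arch() {
  ns="${1:-azure-iot-operations}"
  nodes_file=$(mktemp)
  pods_file=$(mktemp)
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture' > "$nodes_file"
  kubectl get pods -n "$ns" --no-headers \
    -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' > "$pods_file"
  join_pods_with_arch "$nodes_file" "$pods_file"
  rm -f "$nodes_file" "$pods_file"
}
```

Any cert-manager pod listed with arm64 in the last column is on a node that cannot pull an amd64-only image.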

geebinge commented 6 months ago

Yes, for sure:

Successfully assigned azure-iot-operations/aio-orc-api-5d756f4688-p5hv2 to rhea
Successfully assigned azure-iot-operations/aio-orc-controller-manager-599c956bd4-khpt4 to rhea
Successfully assigned azure-iot-operations/aio-cert-manager-webhook-75f859b7c8-jsjpx to rhea
Successfully assigned azure-iot-operations/aio-cert-manager-57bd6f8778-7bm6j to dion
Successfully assigned azure-iot-operations/aio-cert-manager-cainjector-867c486556-qcc2n to pandora

rhea, dion and pandora are Raspberry Pi 4 nodes. In the meantime I have also checked the node labels:

kubectl get nodes --show-labels

NAME      STATUS   ROLES                  AGE   VERSION        LABELS
titan     Ready    control-plane,master   9d    v1.28.7+k3s1   beta.kubernetes.io/arch=amd64,... ,kubernetes.io/arch=amd64,kubernetes.io/hostname=titan,...
pan       Ready    worker                 9d    v1.28.7+k3s1   beta.kubernetes.io/arch=arm64,... ,kubernetes.io/arch=arm64,kubernetes.io/hostname=pan,...
dion      Ready    worker                 9d    v1.28.7+k3s1   beta.kubernetes.io/arch=arm64,... ,kubernetes.io/arch=arm64,kubernetes.io/hostname=dion,...
rhea      Ready    worker                 9d    v1.28.7+k3s1   beta.kubernetes.io/arch=arm64,... ,kubernetes.io/arch=arm64,kubernetes.io/hostname=rhea,...
pandora   Ready    worker                 9d    v1.28.7+k3s1   beta.kubernetes.io/arch=arm64,... ,kubernetes.io/arch=arm64,kubernetes.io/hostname=pandora, ... 

The labels are fine.