Azure / azure-iot-operations

The official repo for Azure IoT Operations.
MIT License
20 stars 16 forks source link

Error while deploying azure-iot-operations failed to create Processor with error " The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'. #29

Open ashwinimhaiskar opened 6 months ago

ashwinimhaiskar commented 6 months ago

I was trying to deploy Azure IOT operations from Azure portal. Deployment failed with below error.

ResourceDeploymentFailure","target":"/subscriptions//resourceGroups/**/providers/Microsoft.Kubernetes/connectedClusters//providers/Microsoft.KubernetesConfiguration/extensions/processor","message":"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.","details":["code":"ExtensionOperationFailed","message":"The extension operation failed with the following error: Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.","additionalInfo":[]]]

How to solve this issue ?

chgennar commented 6 months ago

Hi @ashwinimhaiskar,

Was step 9 followed from the documentation here? https://learn.microsoft.com/en-us/azure/iot-operations/get-started/quickstart-deploy?tabs=windows#deploy-azure-iot-operations

The az iot ops init command from this step needs to be run on the cluster where AIO will be deployed to initialize some dependencies.

ashwinimhaiskar commented 6 months ago

Yes. this step was followed. We also tried to deploy application from forked code by changing required details through git hub directly. Still, we are getting same issue. In Error details some generic message is coming so not able to understand what is missing. Coping Error message below.

{ "status": "Failed", "error": { "code": "ExtensionOperationFailed", "message": "The extension operation failed with the following error: Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.", "additionalInfo": [] } }

ruonke commented 6 months ago

@ashwinimhaiskar After you run init, check your kube log for errors:

kubectl get events -n azure-iot-operations -o json

You should find something like MountVolume.SetUp failed 403 error with message : Caller is not authorized to perform action on resource. Caller: appid="this is the id you need"

Go to azure portal, find the keyvault you used in the init command, select secrets, azure-iot-operations and then Access Control (IAM) Grant Key Vault Secrets Officer right to the appid above and run the init command with the appid parameter.

The instructions on the linked microsoft page are kinda bad, but when you run the init for first time with the --no-deploy parameter, the output says: KeyVault CSI Driver

Once you run the init with --no-deploy, it will generate your appid, the App (id) is for --sp-app-id parameter Now that you have the appid, you can configure the keyvault access policy via Azure Portal(as explained above) When you run the init again to deploy, you need to omit the keyvault definition and use the sp-app-id parameter.

BillmanH commented 6 months ago

Same issue exactly. All of the other components have installed.

My error in the deployment log: {\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"ExtensionOperationFailed\",\"message\":\"The extension operation failed with the following error: Helm Upgrade Failed : Timed out waiting for the resource to come to a ready/completed state : Recommendation Please contact Microsoft support for further inquiries : InnerError [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed stateStatefulSet is not ready: azure-iot-operations/aio-dp-reader-worker. 0 out of 1 expected pods are ready : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]], For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.\",\"additionalInfo\":[]}]}}

I have run the steps above, including keyvault access. You can tell that this has been completed because the other components are installed. image Only the processor component did not install.

On running: kubectl get events -n azure-iot-operations -o json

I get the response:

{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": ""
    }
}
ashwinimhaiskar commented 6 months ago

@ruonke azure IOT operations were deployed properly and started working for us on 22nd dec till 28th Dec ,on 29th morning cluster was offline and not connected to Azure ,so i deleted everything and tried deploying it again but till now it's not happening .

I followed above steps also which you mentioned but still processor deployment is failing .

On running: kubectl get events -n azure-iot-operations -o json

In logs i am getting below details. { "apiVersion": "v1", "count": 201, "eventTime": null, "firstTimestamp": "2024-01-05T05:22:38Z", "involvedObject": { "apiVersion": "v1", "kind": "PersistentVolumeClaim", "name": "runner-local-aio-dp-runner-worker-0", "namespace": "azure-iot-operations", "resourceVersion": "9357", "uid": "55288e01-05ef-4d30-bb01-bc1be14dd94b" }, "kind": "Event", "lastTimestamp": "2024-01-05T06:12:39Z", "message": "no persistent volumes available for this claim and no storage class is set", "metadata": { "creationTimestamp": "2024-01-05T05:22:39Z", "name": "runner-local-aio-dp-runner-worker-0.17a75c04a29524d6", "namespace": "azure-iot-operations", "resourceVersion": "591259", "uid": "ddd9efcb-6565-4dd9-b5ba-6d62e411c170" }, "reason": "FailedBinding", "reportingComponent": "", "reportingInstance": "", "source": { "component": "persistentvolume-controller" }, "type": "Normal" } Error on azure portal after deployment failed . { "status": "Failed", "error": { "code": "ExtensionOperationFailed", "message": "The extension operation failed with the following error: Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.", "additionalInfo": [] } } Also attached Deployment failed image

image

Mzamankhan commented 5 months ago

@ashwinimhaiskar Are you still blocked on this issue?

Can you please provide a full support bundle ( https://learn.microsoft.com/en-us/cli/azure/iot/ops?view=azure-cli-latest ).

digimaun commented 5 months ago

Hi @ashwinimhaiskar and @BillmanH are you deploying on an Ubuntu environment? If so do you have nfs-common installed as outlined in the cluster setup section?

What is the permission model of the keyvault you are using for the CSI Driver setup step? It must be set to "Vault Access Policy" and not Azure RBAC. image

BillmanH commented 5 months ago

@digimaun , yes that seems to have done it. I must have missed this in the setup docs. Thanks!

digimaun commented 5 months ago

Excellent @BillmanH , glad to hear and thank you for reporting back on what worked.

ashwinimhaiskar commented 5 months ago

@digimaun Yesterday during IST time Zone. I tried to deploy once again, and it got deployed without doing any changes.

mahomedalid commented 5 months ago

I am having this issue today. Tried deploying with bicep into westus2, and using the cli az iot ops init, same issue.

The data processor part fails:

image

This is when trying with bicep:


{
  "status": "Failed",
  "error": {
    "code": "DeploymentFailed",
    "target": "/subscriptions/a6cdfc52-917e-4744-8785-031f64901c9b/resourceGroups/rg-aio/providers/Microsoft.Resources/deployments/aio-deployment-aio-deployment-1641",
    "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
    "details": [
      {
        "code": "ResourceDeploymentFailure",
        "target": "/subscriptions/a6cdfc52-917e-4744-8785-031f64901c9b/resourceGroups/rg-aio/providers/Microsoft.Resources/deployments/aio-rg-aio-20240109222244",
        "message": "The resource write operation failed to complete successfully, because it reached terminal provisioning state Failed.",
        "details": [
          {
            "code": "DeploymentFailed",
            "target": "/subscriptions/a6cdfc52-917e-4744-8785-031f64901c9b/resourceGroups/rg-aio/providers/Microsoft.Resources/deployments/aio-rg-aio-20240109222244",
            "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
            "details": [
              {
                "code": "ResourceDeploymentFailure",
                "target": "/subscriptions/a6cdfc52-917e-4744-8785-031f64901c9b/resourceGroups/rg-aio/providers/Microsoft.Kubernetes/connectedClusters/cluster-408/providers/Microsoft.KubernetesConfiguration/extensions/azure-iot-operations",
                "message": "The resource write operation failed to complete successfully, because it reached terminal provisioning state Failed.",
                "details": [
                  {
                    "code": "ExtensionOperationFailed",
                    "message": "The extension operation failed with the following error:  Failed to resolve the extension version from the given values. Please refer https://aka.ms/k8s-extension-type-versions to find the correct version for your installation, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG.",
                    "additionalInfo": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

The message from az iot ops init:

(DeploymentFailed) At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
Code: DeploymentFailed
Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
Target: /subscriptions/a6cdfc52-917e-4744-8785-031f64901c9b/resourceGroups/rg-aio/providers/Microsoft.Resources/deployments/aziotops.init.b70a195b67a04c8e9afc4267494ad991
Exception Details:      (ResourceDeploymentFailure) The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.
        Code: ResourceDeploymentFailure
        Message: The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.
        Target: /subscriptions/a6cdfc52-917e-4744-8785-031f64901c9b/resourceGroups/rg-aio/providers/Microsoft.Kubernetes/connectedClusters/cluster-408/providers/Microsoft.KubernetesConfiguration/extensions/processor
        Exception Details:      (ExtensionOperationFailed) The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed stateStatefulSet is not ready: azure-iot-operations/aio-dp-reader-worker. 0 out of 1 expected pods are ready : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.
                Code: ExtensionOperationFailed
                Message: The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed stateStatefulSet is not ready: azure-iot-operations/aio-dp-reader-worker. 0 out of 1 expected pods are ready : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.
digimaun commented 5 months ago

@mahomedalid have tried the suggestions here https://github.com/Azure/azure-iot-operations/issues/29#issuecomment-1881703064 ?

meenag16 commented 5 months ago

@mahomedalid following up, did the suggestions @digimaun mentioned work out?

ashwinimhaiskar commented 5 months ago

@ruonke azure IOT operations were deployed properly and started working for us on 22nd dec till 28th Dec ,on 29th morning cluster was offline and not connected to Azure ,so i deleted everything and tried deploying it again but till now it's not happening .

I followed above steps also which you mentioned but still processor deployment is failing .

On running: kubectl get events -n azure-iot-operations -o json

In logs i am getting below details. { "apiVersion": "v1", "count": 201, "eventTime": null, "firstTimestamp": "2024-01-05T05:22:38Z", "involvedObject": { "apiVersion": "v1", "kind": "PersistentVolumeClaim", "name": "runner-local-aio-dp-runner-worker-0", "namespace": "azure-iot-operations", "resourceVersion": "9357", "uid": "55288e01-05ef-4d30-bb01-bc1be14dd94b" }, "kind": "Event", "lastTimestamp": "2024-01-05T06:12:39Z", "message": "no persistent volumes available for this claim and no storage class is set", "metadata": { "creationTimestamp": "2024-01-05T05:22:39Z", "name": "runner-local-aio-dp-runner-worker-0.17a75c04a29524d6", "namespace": "azure-iot-operations", "resourceVersion": "591259", "uid": "ddd9efcb-6565-4dd9-b5ba-6d62e411c170" }, "reason": "FailedBinding", "reportingComponent": "", "reportingInstance": "", "source": { "component": "persistentvolume-controller" }, "type": "Normal" } Error on azure portal after deployment failed . { "status": "Failed", "error": { "code": "ExtensionOperationFailed", "message": "The extension operation failed with the following error: Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state : Recommendation Please contact Microsoft support for further inquiries : InnerError [release processor failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: please reach out to Microsoft tech care if you need help with debugging the problems with your extension resource.", "additionalInfo": [] } } Also attached Deployment failed image

image

@ruonke I have started facing same issue again ..

ashwinimhaiskar commented 5 months ago

@ruonke We are still facing issue in deployment. we are completely stuck for deployment. Could please let me know what could be the reason for above issue . I am attaching Logs of kubectl get events -n azure-iot-operations -o json event.json

ruonke commented 5 months ago

@ashwinimhaiskar Are you deploying to a host with enough resources? This sounds much like the attempts I had when I was installing to a host with 8Gb ram, I had to increase the VM size to 16Gb to get the system to install reliably. It seems to run with 8Gb, using about 5-6Gb after you get the install process to complete, I've encountered few out-of-memory instances but I have other application pods deployed to the same instance which are using resources as well.

You mentioned that you removed the deployment and ran it again, at expected clean state, check kubectl get pv and kubectl get pvc -n azure-iot-operations to see if the presistent volumes and the claims are actually removed. The failure in logs sounds like there is some remaining volume or claim to one.

ashwinimhaiskar commented 4 months ago

@ruonke We are using below configuration VM Standard D4s v3 (4 vcpus, 16 GiB memory) with Windows (Windows Server 2022 Datacenter Azure Edition).I deleted VM and recreated and tried deploying IOT operations still I am getting same issue. kubectl get pvc after running this command I am getting below issue. No resources found.

kubectl get pvc -n azure-iot-operations NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE aio-dp-msg-store-js-aio-dp-msg-store-0 Pending 3d2h data-aio-dp-nfs-server-provisioner-0 Pending 3d2h nfs-provisioner Pending nfs 12m refdatastore-local-aio-dp-refdata-store-0 Pending 3d2h runner-local-aio-dp-runner-worker-0 Pending 3d2h

ruonke commented 4 months ago

In our tests WSL hasn't been a platform I'd recommend for this. We managed to get iot ops running for one person out of four who tried installing it. The setup works much better on ubuntu host.

BillmanH commented 4 months ago

Would it be possible (as it's in Preview and we know it's somewhat limited on OS). Could you just tell us the VM to use in the docs? You already have great docs on the memory limitations for the VM. But it might be easier to say "This SKU. This OS. Nothing else on it. This is the VM that we are testing on"

I was able to get it working on the actual Edge with an Intel NUC 11 - 16gb ram - 512gb ssd. However a lot of prototypers will likely use a VM just to test the application. Every time I try to install on a VM, I fail. The error message is always some reason other than the memory, but the troubleshooting always comes back to "wrong VM". Maybe even just give us the ARM template of a working VM that we can use in experiment?