Azure / service-fabric-mesh-preview

Service Fabric Mesh is Service Fabric's serverless offering, which lets developers deploy containerized applications without managing infrastructure. Service Fabric Mesh, aka project “SeaBreeze”, is currently available in private preview. This repository will be used for tracking bugs and feature requests as GitHub issues and for maintaining the latest documentation.
MIT License

Error when trying to publish mesh application, "Unable to edit or replace 'test': previous deployment is still active" #266

Open MikeChristensen opened 6 years ago

MikeChristensen commented 6 years ago

I'm attempting to use Service Fabric Mesh to deploy a container-based app. I'm new to Mesh, but I did get the sample 'Hello App' working. Now I'm trying to deploy an app based on a container I have stored in Azure. I used the following command to deploy this:

az mesh deployment create --resource-group MikeMesh --template-file test.json

The command says 'Deploying . . .' for about 10 minutes, and then says:

Unable to edit or replace deployment 'test': previous deployment from '8/20/2018 9:55:28 PM' is still active (expiration time is '8/27/2018 9:55:27 PM'). Please see https://aka.ms/arm-deploy for usage details.

Note this is a brand new resource group, and I've never deployed anything here before. Also note I can deploy this same container using normal Service Fabric and it works fine.

The link in the error message doesn't contain anything helpful. If I go to the 'Deployments' tab, the status says 'Deploying' but it seems stuck (it's been nearly two hours). If I click on 'Related events', I get 3 events but all of them say 'Succeeded' or 'Accepted'. Under 'Quick Insights' if I click 'Failed deployments', I see a 'Validate deployment' operation that failed 2 hours ago, but I'm not clear if this has anything to do with this deployment. The error message for that is the same as what I saw on the command line.

The test.json is as follows (with some sensitive information removed):

{
  "$schema": "http://schema.management.azure.com/schemas/2014-04-01-preview/deploymentTemplate.json",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "apiVersion": "2018-07-01-preview",
      "name": "LimeadeMeshNetwork",
      "type": "Microsoft.ServiceFabricMesh/networks",
      "location": "eastus",
      "dependsOn": [],
      "properties": {
        "addressPrefix": "10.0.0.4/22",
        "ingressConfig": {
          "layer4": [
            {
              "publicPort": "80",
              "applicationName": "LimeadeApp",
              "serviceName": "WebsiteService",
              "endpointName": "WebsiteListener"
            }
          ]
        }
      }
    },
    {
      "apiVersion": "2018-07-01-preview",
      "name": "LimeadeApp",
      "type": "Microsoft.ServiceFabricMesh/applications",
      "location": "eastus",
      "dependsOn": [
        "Microsoft.ServiceFabricMesh/networks/LimeadeMeshNetwork"
      ],
      "properties": {
        "description": "Limeade Mesh Application",
        "services": [
          {
            "type": "Microsoft.ServiceFabricMesh/services",
            "location": "eastus",
            "name": "WebsiteService",
            "properties": {
              "description": "Limeade Website Service",
              "osType": "windows",
              "codePackages": [
                {
                  "name": "WebsiteCode",
                  "image": "web:20180720103109",
                  "imageRegistryCredential": {
                    "server": "<container>.azurecr.io",
                    "username": "<username>",
                    "password": "<password>"
                  },
                  "endpoints": [
                    {
                      "name": "WebsiteListener",
                      "port": "80"
                    }
                  ],
                  "resources": {
                    "requests": {
                      "cpu": "1",
                      "memoryInGB": "1"
                    }
                  }
                }
              ],
              "replicaCount": "1",
              "networkRefs": [
                {
                  "name": "[resourceId('Microsoft.ServiceFabricMesh/networks', 'LimeadeMeshNetwork')]"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}
MikeChristensen commented 6 years ago

More info

MikeChristensen commented 6 years ago

Referencing @MicahMcKittrick-MSFT

mimckitt commented 6 years ago

@BharatNarasimman would you be able to help get someone to assist @MikeChristensen?

I worked with him on a doc-related question (#13830), but I was not seeing the same behavior, so I think he should work directly with all of you to get it sorted out in case it is an issue with Mesh itself.

mikkelhegn commented 5 years ago

@vaishnavk

MikeChristensen commented 5 years ago

Tried this again on Oct 18 - Still having same issue. Anyone want to help with this?

jeffj6123 commented 5 years ago

Can you try using the -n flag for the az mesh deployment create command, to see if you still have conflicts with existing deployments?

MikeChristensen commented 5 years ago

Is -n the same as --name? If so, I'm already using that parameter. My command is:

az mesh deployment create --resource-group MikeMeshTest --name TestDeployment --template-file .\test.json

I've tried deploying into two resource groups, both are brand new and have never had any deployments, so nothing to conflict with. The first deployment finally finished after 2 hours and 21 seconds, and gives me the error:

{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.",
  "details": [
    {
      "code": "RequestTimeout",
      "message": "{\r\n  \"error\": {\r\n    \"code\": \"ResourceDeploymentFailure\",\r\n    \"message\": \"The resource provision operation did not complete within the allowed timeout period. Please see https://aka.ms/arm-deploy for usage details.\"\r\n  }\r\n}"
    }
  ]
}

The second deployment is still in progress (it's been about an hour and a half). I cannot find any information as to what exactly it's doing or where it's getting stuck. Let me know if you want any other logs or info.
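The inner "message" in the error above is itself a JSON document encoded as a string, which makes it hard to read. A minimal sketch in Python that unwraps it, using the payload exactly as pasted above (the helper name unwrap_details is made up for illustration):

```python
import json

# Error payload copied from the comment above. The raw string keeps the
# backslash escapes intact so the JSON parser sees them unchanged.
RAW_ERROR = r'''
{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.",
  "details": [
    {
      "code": "RequestTimeout",
      "message": "{\r\n  \"error\": {\r\n    \"code\": \"ResourceDeploymentFailure\",\r\n    \"message\": \"The resource provision operation did not complete within the allowed timeout period. Please see https://aka.ms/arm-deploy for usage details.\"\r\n  }\r\n}"
    }
  ]
}
'''


def unwrap_details(error):
    """Return (code, message) pairs, decoding detail messages that are JSON strings."""
    results = []
    for detail in error.get("details", []):
        message = detail.get("message", "")
        try:
            inner = json.loads(message)
        except (TypeError, ValueError):
            inner = None  # plain-text message, keep it as is
        if isinstance(inner, dict) and isinstance(inner.get("error"), dict):
            results.append((inner["error"].get("code"), inner["error"].get("message")))
        else:
            results.append((detail.get("code"), message))
    return results


inner_errors = unwrap_details(json.loads(RAW_ERROR))
for code, message in inner_errors:
    print(f"{code}: {message}")
```

Here the nested error only restates the timeout, so it does not reveal where provisioning got stuck.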

jeffj6123 commented 5 years ago

Deployment names are not resource group specific but subscription specific, which can cause issues if you have multiple resource groups with ongoing deployments with the same name. Could you cancel any ongoing deployments you have over all of the resource groups and then see if there are still deployment conflicts?
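One way to look for deployments that are still in progress is to walk every resource group in the subscription and filter on provisioning state. The following Python sketch rests on several assumptions: the az CLI is installed and logged in, az group deployment list is available (newer CLI versions expose the same listing as az deployment group list), and each returned item carries properties.provisioningState. The helper names are made up for illustration.

```python
import json
import subprocess

# Assumed terminal states; anything else (for example "Running" or
# "Accepted") is treated as still active.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def az_json(*args):
    """Run an az CLI command and parse its JSON output.

    Requires a logged-in az CLI. On Windows the executable may need to be
    spelled "az.cmd".
    """
    completed = subprocess.run(
        ["az", *args, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def active_deployments(deployments):
    """Keep the deployments whose provisioningState is not terminal."""
    return [
        d for d in deployments
        if d.get("properties", {}).get("provisioningState") not in TERMINAL_STATES
    ]


def find_active_deployments():
    """Return (resource group, deployment name) pairs that are still in progress."""
    found = []
    for group in az_json("group", "list"):
        deployments = az_json(
            "group", "deployment", "list", "--resource-group", group["name"]
        )
        found.extend((group["name"], d["name"]) for d in active_deployments(deployments))
    return found
```

Calling find_active_deployments() returns the pairs to inspect; anything it reports can then be cancelled from the portal or with the CLI's deployment cancel command (az deployment group cancel in current CLI versions).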

MikeChristensen commented 5 years ago

Do you mean Mesh deployments? Or Service Fabric deployments? Or deployments of any sort?

Also, I can use a GUID for my deployment name if that would help.

jeffj6123 commented 5 years ago

Any deployments that are still running and share that name, across all resource groups in that subscription. A GUID should help avoid name conflicts.
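A GUID-based deployment name can be generated instead of typed by hand. A minimal sketch in Python, which only builds the argument list and does not run it; the function names are hypothetical, and the flags are the ones already used in the commands in this thread:

```python
import uuid


def unique_deployment_name():
    """Return a GUID as 32 hex characters with no dashes."""
    return uuid.uuid4().hex


def mesh_deploy_command(resource_group, template_file, name=None):
    """Build the argument list for `az mesh deployment create` without running it."""
    return [
        "az", "mesh", "deployment", "create",
        "--resource-group", resource_group,
        "--name", name or unique_deployment_name(),
        "--template-file", template_file,
    ]


command = mesh_deploy_command("MikeMeshTest", "test.json")
print(" ".join(command))
```

Passing the list to subprocess.run would start the deployment; printing it first makes it easy to confirm the generated name.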

MikeChristensen commented 5 years ago

I doubt there's any deployment with that same name. I'll try again and use a GUID.

MikeChristensen commented 5 years ago

Running with command:

az mesh deployment create --resource-group MikeMeshTest --name 52d74104a43d40f691bc56400085716e --template-file .\test.json

MikeChristensen commented 5 years ago

Same error:

Unable to edit or replace deployment '52d74104a43d40f691bc56400085716e': previous deployment from '10/18/2018 11:52:08 PM' is still active (expiration time is '10/25/2018 11:51:51 PM'). Please see https://aka.ms/arm-deploy for usage details.
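The message carries the deployment name plus the start and expiration timestamps of the deployment it considers active. A sketch in Python that extracts them, assuming the wording and the US-style timestamp format stay exactly as printed above (the time zone is not stated in the message, so none is assumed here):

```python
import re
from datetime import datetime

ERROR_PATTERN = re.compile(
    r"deployment '(?P<name>[^']+)': previous deployment from '(?P<started>[^']+)' "
    r"is still active \(expiration time is '(?P<expires>[^']+)'\)"
)
# Assumed format, matching timestamps such as 10/18/2018 11:52:08 PM.
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_active_deployment_error(message):
    """Extract name, start, expiry and their difference, or None if the text does not match."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    started = datetime.strptime(match["started"], TIMESTAMP_FORMAT)
    expires = datetime.strptime(match["expires"], TIMESTAMP_FORMAT)
    return {
        "name": match["name"],
        "started": started,
        "expires": expires,
        "window": expires - started,
    }


message = (
    "Unable to edit or replace deployment '52d74104a43d40f691bc56400085716e': "
    "previous deployment from '10/18/2018 11:52:08 PM' is still active "
    "(expiration time is '10/25/2018 11:51:51 PM'). "
    "Please see https://aka.ms/arm-deploy for usage details."
)
info = parse_active_deployment_error(message)
print(info["name"], info["started"], info["expires"], info["window"])
```

The start time here falls within the same run, which suggests the "previous" deployment being reported is the one that was just submitted.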

markheath commented 5 years ago

I get this same error every time I try to use az mesh deployment create to scale up one of my containers from 1 replica to 3 (based on the steps shown in the tutorial here). The command takes 10 minutes or so to time out and reports:

Unable to edit or replace deployment 'sfmesh-example-voting-app': previous deployment from '10/23/2018 1:40:18 PM' is still active (expiration time is '10/30/2018 1:40:02 PM'). Please see https://aka.ms/arm-deploy for usage details.

The "previous" deployment appears to be referring to the first deployment where I created the app in the first place which already completed successfully. az deployment list returns nothing.

Strangely, while the deployment is in progress I do see (from the portal) that the target service scales out to 3 replicas, and I can communicate with at least 2 different replicas. However, after az mesh deployment create fails, az mesh service list still shows 3 replicas for that service, yet I can see both from the portal and from my own app's responses that there is now only 1 replica running again.

Not sure what else to try - this is happening every time, and with a few different test applications.
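For a template that inlines replicaCount the way test.json earlier in this thread does, the scale-out step amounts to changing that one value and redeploying. A minimal sketch in Python under that assumption (the tutorial's template may instead expose the count as a parameter, and set_replica_count is a hypothetical helper):

```python
import copy
import json


def set_replica_count(template, service_name, replicas):
    """Return a copy of the template with replicaCount changed for one service."""
    updated = copy.deepcopy(template)
    changed = False
    for resource in updated.get("resources", []):
        if resource.get("type") != "Microsoft.ServiceFabricMesh/applications":
            continue
        for service in resource.get("properties", {}).get("services", []):
            if service.get("name") == service_name:
                # test.json stores the count as a string, so keep that convention.
                service["properties"]["replicaCount"] = str(replicas)
                changed = True
    if not changed:
        raise KeyError(f"service {service_name!r} not found in template")
    return updated


# Trimmed-down template with the same shape as test.json earlier in the thread.
template = {
    "resources": [
        {
            "type": "Microsoft.ServiceFabricMesh/applications",
            "name": "LimeadeApp",
            "properties": {
                "services": [
                    {"name": "WebsiteService", "properties": {"replicaCount": "1"}}
                ]
            },
        }
    ]
}

scaled = set_replica_count(template, "WebsiteService", 3)
print(json.dumps(scaled, indent=2))
```

Writing the result to a file and passing it to --template-file reproduces the scale-out attempt without hand-editing the JSON.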

MikeChristensen commented 5 years ago

Yea I'm starting to think this is a generic error that hides some real error that's going on underneath. I think the first thing they need to fix is their error handling, so we can see what's actually going on.

markheath commented 5 years ago

I've tried making my initial deployment have a replica count of 3 instead of starting at 1 and trying to scale up, and that results in a deployment that times out. One of the containers in the service has started OK, while the other 2 are stuck in a continually restarting loop. It's a Windows container image from Docker Hub, if that's relevant. I also tried a different data center this time (eastus) to see if that made a difference.

markheath commented 5 years ago

OK, I've managed to narrow my issue down further. I get the timeout if I have >1 replica of a service with an endpoint. Scaling out services without endpoints to multiple replicas does appear to work, although the az mesh deployment create command still hangs indefinitely.

markheath commented 5 years ago

Finally worked out how to see the actual error I'm getting (az mesh app show). It is to do with having >1 replica of a service with an endpoint. Any suggestions on what I'm doing wrong here? (I was trying to base my application on this tutorial.)

There was an error during CodePackage activation.System.Fabric.FabricException (-2147017731)
Failed to start Container. ContainerName=sf-11-9f71ba05-53c3-40a6-90fd-e867e12a8b72_86e97206-aec9-4fb7-9e51-a1924dc3c384, ApplicationId=SingleInstance_7_App11, ApplicationName=fabric:/votingApp. 
DockerRequest returned StatusCode=InternalServerError with ResponseBody={\"message\":
\"failed to create endpoint sf-11-9f71ba05-53c3-40a6-90fd-e867e12a8b72_86e97206-aec9-4fb7-9e51-a1924dc3c384 on network nat: HNS failed with error : You were not connected because a duplicate'"
jeffj6123 commented 5 years ago

There is a bug that was causing the "unable to edit previous deployment" error, and a fix is being worked on. That error does not actually stop the deployment, though. You can also use this command for the same effect: https://docs.microsoft.com/en-us/cli/azure/group/deployment?view=azure-cli-latest#az-group-deployment-create. It won't check for the IP addresses of the mesh applications, though.

As for your last issue, @mattrowmsft would you know what could be going wrong with the container failing?

srikanthshaps commented 5 years ago

I am facing the same issue when trying to deploy a Linux container. When I use az mesh app show, it shows: 'There was an error during CodePackage activation.Service host failed to activate. Error:E_FAIL'
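Both reports above found the underlying error through az mesh app show rather than through the deployment output. A sketch in Python that runs that command and pulls out the fields most likely to hold the diagnosis. The field names in DIAGNOSTIC_KEYS are assumptions and should be checked against the JSON your CLI version actually returns; the helper names are hypothetical.

```python
import json
import subprocess

# Assumed field names; verify them against real `az mesh app show` output.
DIAGNOSTIC_KEYS = {"status", "statusDetails", "healthState", "unhealthyEvaluation"}


def collect_diagnostics(node, path=""):
    """Walk nested JSON and return {path: value} for non-empty diagnostic fields."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in DIAGNOSTIC_KEYS and value not in (None, "", [], {}):
                found[child] = value
            else:
                found.update(collect_diagnostics(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.update(collect_diagnostics(item, f"{path}[{index}]"))
    return found


def show_app_diagnostics(resource_group, app_name):
    """Run `az mesh app show` and return its diagnostic fields.

    Requires a logged-in az CLI with the mesh extension. On Windows the
    executable may need to be spelled "az.cmd".
    """
    completed = subprocess.run(
        ["az", "mesh", "app", "show", "--resource-group", resource_group,
         "--name", app_name, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return collect_diagnostics(json.loads(completed.stdout))
```

Calling show_app_diagnostics with the resource group and application name prints only the status-related fields, which is quicker to scan than the full resource description.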