CodeDeploy error resulting in inconsistent state

jarreds commented 6 years ago

Observations

We were receiving notifications that one of our SAM stacks was failing CloudFormation updates. Looking back through the event logs, we found that this was the first error, and it seems to be the one that broke the stack:

16:58:19 UTC-0800 UPDATE_FAILED AWS::Lambda::Alias  GetAllContactAliaslive  A conflicting deployment is in progress

After this error occurred, subsequent updates of this stack began failing within CodeDeploy with the following error. Note: it takes about 30 minutes for this stack to roll back on failure, so a timeout also seems to be involved.

Error:
Instance ID is Missing.

Deployment Failed:
The deployment failed because the AppSpec file that specifies the AWS Lambda deployment configuration is missing or has an invalid configuration. The Lambda function alias version does not match the current version in AppSpec file. (Error code: INVALID_LAMBDA_CONFIGURATION)

We were stumped. Looking at the sam.yaml configuration, we noticed one thing that could potentially cause the conflicting CodeDeploy deployment for the GetAllContactAliaslive alias:

  GetOneContact:
    Type: AWS::Serverless::Function
    Properties:
      Handler: consumer.GetContact::handleRequest
      Runtime: java8
      CodeUri: ../consumer/getContact/deploy.jar
      MemorySize: 512
      Policies: AWSLambdaBasicExecutionRole
      Timeout: 20
      AutoPublishAlias: live
      DeploymentPreference:
        Type: AllAtOnce
      Events:
        GetApi:
          Type: Api
          Properties:
            Path: /v3.0/consumer/contact/{contact_id}
            Method: get
            RestApiId: !Sub ${API}

  GetAllContact:
    Type: AWS::Serverless::Function
    Properties:
      Handler: consumer.GetContact::handleRequest
      Runtime: java8
      CodeUri: ../consumer/getContact/lambda_deploy.jar
      MemorySize: 512
      Policies: AWSLambdaBasicExecutionRole
      Timeout: 20
      AutoPublishAlias: live
      DeploymentPreference:
        Type: AllAtOnce
      Events:
        GetApi:
          Type: Api
          Properties:
            Path: /v3.0/consumer/contact
            Method: get
            RestApiId: !Sub ${API}

These are two separate Lambda functions that are identical except for the events that trigger them. It seems possible that this is what's tripping CodeDeploy up.

We're going to change to the following, with one Lambda definition and multiple events, to see if the conflict goes away:

  GetContact:
    Type: AWS::Serverless::Function
    Properties:
      Handler: consumer.GetContact::handleRequest
      Runtime: java8
      CodeUri: ../consumer/getContact/lambda_deploy.jar
      MemorySize: 512
      Policies: AWSLambdaBasicExecutionRole
      Timeout: 20
      AutoPublishAlias: live
      DeploymentPreference:
        Type: AllAtOnce
      Events:
        GetAllApi:
          Type: Api
          Properties:
            Path: /v3.0/consumer/contact
            Method: get
            RestApiId: !Sub ${API}
        GetOneApi:
          Type: Api
          Properties:
            Path: /v3.0/consumer/contact/{contact_id}
            Method: get
            RestApiId: !Sub ${API}

Questions

Is it possible that the "A conflicting deployment is in progress" error from CodeDeploy is triggered by the former configuration? If so, this seems like a configuration that should succeed.
The 30-minute CloudFormation rollback seems like a bug.
The inconsistent state the stack is left in after the rollback appears to be a bug as well.

sanathkr commented 6 years ago

The two functions are entirely different from CodeDeploy's perspective because they have different ARNs, so this doesn't matter.

Can you give me a few more details so I can dig deeper:

AWS Region where this happened
When this happened?
Did this start happening after you added deployment preference?
Did you happen to create Lambda versions outside of the stack using API/CLI/Console?

jarreds commented 6 years ago

AWS Region where this happened

us-west-2

When this happened?

The conflict error first occurred at 16:58:19 UTC-0800 on Feb-07-18.

Did this start happening after you added deployment preference?

No, this stack was originally created with this DeploymentPreference.

Did you happen to create Lambda versions outside of the stack using API/CLI/Console?

No, we did not. All updates are done via CloudFormation.

sanathkr commented 6 years ago

@jarreds Let's take this offline. Can you shoot me an email sanathkr [at] amazon.com?

jarreds commented 6 years ago

On it.

vinkris commented 6 years ago

@sanathkr @jarreds

I saw this same issue more than twice last week. I get the following errors when trying to deploy a SAM stack using CloudFormation:

...
20:06:41 UTC-0700 | UPDATE_FAILED | AWS::Lambda::Alias | UserLogoutAliaslive | Already chained. Call unchain() to get rid of the previous chaining.
20:03:48 UTC-0700 | UPDATE_FAILED | AWS::Lambda::Alias | UserReturnAliaslive | Already chained. Call unchain() to get rid of the previous chaining.
20:03:29 UTC-0700 | UPDATE_FAILED | AWS::Lambda::Alias | DisconnectAliaslive | Already chained. Call unchain() to get rid of the previous chaining.
19:38:09 UTC-0700 | UPDATE_FAILED | AWS::Lambda::Alias | KinesisScalingAliaslive | Resource update cancelled
19:38:06 UTC-0700 | UPDATE_FAILED | AWS::Lambda::Alias | AdminReturnAliaslive | A conflicting deployment is in progress
...

This doesn't happen all the time. However, once the stack reaches this state, new updates no longer succeed. We had to delete and recreate the stack.

We have about 13 Lambda functions in the SAM template, with a global setting like this:

Globals:
  Function:
    Runtime: nodejs6.10
    MemorySize: 256
    Timeout: 10
    AutoPublishAlias: live
    DeploymentPreference:
      Type: AllAtOnce

Resources:
  AdminReturn:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub ${AWS::StackName}-${EnvName}-adminreturn
      CodeUri:
        Bucket: !Ref BootstrapS3Bucket
        Key: !Ref AzureADFunctionCodeS3Key
      Handler: handler.adminReturn
      Policies:
        - AWSLambdaVPCAccessExecutionRole
        - DynamoDBCrudPolicy:
            TableName: !Ref TenantsTable
        - KMSDecryptPolicy:
            KeyId: !Ref AzureADKMSKeyId
      VpcConfig:
        SubnetIds: 
          !Split
            - ','
            - 'Fn::If': [ GentooConsumer, !Sub "${AzureADFunctionSubnetsIds}", 'Fn::ImportValue': !Sub "${BaseVPCStackName}-${EnvName}-SubnetsPrivateAppIds"]
        SecurityGroupIds:
          !Split
            - ','
            - 'Fn::If': [ GentooConsumer, !Sub "${AzureADFunctionSecurityGroupIds}", 'Fn::ImportValue': !Sub "${BaseVPCStackName}-${EnvName}-AzureAdSG"]
      Environment:
        Variables:
          TENANTSTATE_TABLE: !Ref TenantsTable
          DEP_ENVIRONMENT: !Ref EnvName
          ESS_URL: !Ref AzureADESSUrl
          CLIENT_ID: !Ref AzureADMSClientId
          CLIENT_SECRET: !Ref AzureADMSClientSecret
      Events:
        Consent:
          Type: Api
          Properties:
            Path: /adminreturn
            Method: get
            RestApiId:
              !Ref AzureADSyncAPI

Did we find the root cause for this? Any help here would be appreciated. Thanks!

terma commented 6 years ago

I have the same problem. Could you please post a fix or solution if one exists? Thanks.

harishyarlagaddas commented 5 years ago

We also encountered the same issue. Can someone post the solution, if there is one?

lanefelker commented 5 years ago

We also encountered this issue. Are there any recommendations on how to resolve this?

revolutionisme commented 5 years ago

Did the above offline discussion bear any result? We encountered this too and are more or less stuck!

revolutionisme commented 5 years ago

One crude hack, which may or may not work depending on your setup (we found it after wasting a complete day), is to temporarily change the alias in the "AutoPublishAlias" config to something else, let CloudFormation fix your pipeline, and then revert to the original alias.
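A minimal sketch of that workaround, reusing the GetAllContact function from the original report. The temporary alias name `livetmp` is a hypothetical placeholder; any name other than the current one should do. SAM appears to derive the alias resource's logical ID from the function and alias names (for example GetAllContactAliaslive in the error above), so renaming the alias presumably makes CloudFormation create a fresh AWS::Lambda::Alias resource instead of updating the stuck one.

```yaml
  GetAllContact:
    Type: AWS::Serverless::Function
    Properties:
      Handler: consumer.GetContact::handleRequest
      Runtime: java8
      CodeUri: ../consumer/getContact/lambda_deploy.jar
      # Step 1: deploy with a temporary alias name ("livetmp" is hypothetical).
      AutoPublishAlias: livetmp
      # Step 2: once that update succeeds, change the value back to "live"
      # and deploy again.
      DeploymentPreference:
        Type: AllAtOnce
```

Between the two deployments the `live` alias is likely to be absent, so anything outside the stack that invokes the function through that alias may fail until the second deployment completes.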

badfun commented 4 years ago

I have the same issue.

UPDATE_FAILED AWS::Lambda::Alias

I have tried many things related to AWS::Lambda::Version and AWS::Lambda::Alias but have been unable to make it work except by changing the name of the alias, as mentioned by @revolutionisme.

I have two stacks set up in similar ways: one I forked from AWS (https://github.com/aws-samples/aws-serverless-app-sam-cdk), and one I am adapting for my own use. The forked version does not have this problem. I assumed it must be something in my code, but I have tried stripping it right down to the basics and the problem remains.

Interestingly, I am not changing the actual lambda code at all, just the template.yaml file. I notice that both projects change the lambda version even though the code has not changed, so that the original forked project lambda is now at version 8 or so (after several experiments), and my own project's lambda is now at version 50! This despite tearing down and completely deleting the stack multiple times. This seems to go against the idea behind using Lambda versioning and aliases, or at least my understanding of it.

It seems to me there is something wrong with AutoPublishAlias, but to remove it means having to remove the DeploymentPreference option as well, which is a major setback.

badfun commented 4 years ago

I have solved the alias issue for my case. The problem was that the PreTraffic hook function was failing on subsequent builds. I'm still not sure why, but I think I will start using LocalStack to mock up the behaviour before pushing it. To solve it, I put in the generic code from the examples and got the whole pipeline to run.
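For context, a minimal sketch of how a PreTraffic hook is wired into a function's DeploymentPreference in SAM. The resource names, handlers, and paths (MyFunction, PreTrafficHook, ./src, ./hooks) are hypothetical and not taken from any template in this thread, and the wildcard policy resource is only for brevity.

```yaml
Resources:
  MyFunction:                        # hypothetical function being deployed
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler         # hypothetical
      Runtime: nodejs6.10            # runtime used in this thread; substitute a supported one
      CodeUri: ./src                 # hypothetical path
      AutoPublishAlias: live
      DeploymentPreference:
        Type: AllAtOnce
        Hooks:
          # CodeDeploy invokes this function before shifting traffic.
          # If the hook reports Failed, the deployment fails.
          PreTraffic: !Ref PreTrafficHook

  PreTrafficHook:                    # hypothetical hook function
    Type: AWS::Serverless::Function
    Properties:
      Handler: hook.handler          # hypothetical
      Runtime: nodejs6.10
      CodeUri: ./hooks               # hypothetical path
      # The managed CodeDeploy role for Lambda can only invoke functions
      # whose names start with CodeDeployHook_.
      FunctionName: CodeDeployHook_preTraffic
      # The hook itself should not go through a CodeDeploy deployment.
      DeploymentPreference:
        Enabled: false
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              # The hook must report its result back to CodeDeploy.
              Action: codedeploy:PutLifecycleEventHookExecutionStatus
              Resource: "*"
```

If the hook throws or reports Failed, the alias update fails with it, which would match the UPDATE_FAILED on AWS::Lambda::Alias seen above.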

As for the versioning issue, it seems that using AutoPublishAlias creates a new Lambda version on every deploy, regardless of whether the Lambda code has changed. I have a hard time with that one, since before trying these automated canary deployments my Lambda versions would match actual code changes. Now I am on version 67 of a function that hasn't had a single change. Some folks have reached deployment package limits because of all the unused versions. Confusing.

More info here: https://hackernoon.com/mind-the-75gb-limit-on-aws-lambda-deployment-packages-163b93c8eb72

https://serverless.com/framework/docs/providers/aws/guide/functions#versioning-deployed-functions

https://stackoverflow.com/questions/54140748/aws-serverless-application-model-sam-is-there-a-versionfunctions-equivalent

jfuss commented 2 years ago

It's not clear to me what action is needed here, but this hasn't had any responses in over a year.

Going to close this out. If there is something that wasn't answered, please open a new issue (we typically do not track closed issues).

aws / serverless-application-model

CodeDeploy error resulting in inconsistent state #291