apex / up

Deploy infinitely scalable serverless apps, apis, and sites in seconds to AWS.
https://up.docs.apex.sh
MIT License

Deployment Fails with "ResourceConflictException" in Lambda #833

Closed RickCogley closed 3 years ago

RickCogley commented 3 years ago

Prerequisites

Description

Please see: https://apex-dev.slack.com/archives/C65P0GAV8/p1631749067003000 ... and: https://aws.amazon.com/blogs/compute/coming-soon-expansion-of-aws-lambda-states-to-all-functions/

I have this issue too, but it was first reported by Ben Nichols on the Slack #up channel.

Whether you deploy via the CLI (up staging or up production) or, as in my case, via GitHub Actions, you get an error like:

Error: deploying: <region>: updating function code: ResourceConflictException: The operation cannot be performed at this time. An update is in progress for resource: arn:aws:lambda:<region>:<arn_id>:function:my_func

... and the deployment fails.

Steps to Reproduce

Make a visible change in one of your branches and do up staging or up production as appropriate, or git push to the branch and have your CI run it. Either way, you get an error like the above.

As Ben Nichols mentioned, you can add aws:states:opt-out as the lambda description, to bypass the problem, but it's reportedly going to stop working as of 1st Oct 2021.

This feels like something other up users are suddenly going to experience, so it's my hope that someone can figure out how to change the code to fix this problem urgently.

Slack

Join us on Slack https://chat.apex.sh/

t1bb4r commented 3 years ago

I was about to make the same post.

I even tried a fresh app from the README. It deploys the first time, but after that I can't do any more deployments. I can't figure out why it would suddenly stop working.

~/Workspace/my-app$ up

     build: 5 files, 6.8 MB (678ms)
     deploy: staging (version 1) (24.576s)
     stack: complete (20.248s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/
~/Workspace/my-app$ up deploy -v staging
     4ms     DEBU up version 1.7.0-pro (os: linux, arch: amd64)
     0s      DEBU inferred runtime type=node
   ⠋ 0s      DEBU 1 regions from config
     4.329s  DEBU 1 regions from config
     0s      DEBU event deploy map[commit: stage:staging]
     0s      DEBU event platform.build map[]
     0s      DEBU hook prebuild is not defined
     0s      DEBU event hook map[hook:[] name:build]
     1ms     DEBU hook "build" command ""
     0s      DEBU event hook.complete map[duration:1.521584ms hook:[] name:build]
     0s      DEBU injecting proxy
     237ms   DEBU loading env vars
     166ms   DEBU loaded env vars duration=237
     0s      DEBU open
     0s      DEBU filtered .git – 4096
     0s      DEBU add _proxy.js: size=3609 mode=-rwxr-xr-x
     0s      DEBU add app.js: size=100 mode=-rwxrwxr-x
     259ms   DEBU add main: size=13813177 mode=-rwxrwxr-x
     1ms     DEBU add up-env.json: size=2 mode=-rwxr-xr-x
     0s      DEBU add up.json: size=86 mode=-rwxr-xr-x
     0s      DEBU stats dirs_filtered=1 files_added=5 files_filtered=0 size_uncompressed=14 MB
     14ms    DEBU close
     0s      DEBU event platform.build.zip map[duration:677.298639ms files:5 size_compressed:6827405 size_uncompressed:13816974]
     5ms     DEBU removing proxy
     0s      DEBU hook postbuild is not defined
     0s      DEBU event platform.build.complete map[duration:683.539072ms]
     0s      DEBU hook predeploy is not defined
     0s      DEBU hook deploy is not defined
     4.528s  DEBU checking for role
     0s      DEBU found existing role
     337ms   DEBU updating role policy
     0s      DEBU set role to arn:aws:iam::***:role/my-app-function
     0s      DEBU event platform.deploy map[commit: region:eu-west-1 stage:staging]
     4.504s  DEBU fetching function config region=eu-west-1
     5.574s  DEBU ensuring s3 bucket exists name=up-***-eu-west-1
     6.05s   DEBU uploading function to bucket up-***-eu-west-1 key my-app/staging/1631861611-ndAEJ3t5oTlWTDEw.zip
     288ms   DEBU updating function
     319ms   DEBU updating function code
     0s      DEBU event platform.function.update map[commit: region:eu-west-1 stage:staging]
     0s      DEBU event platform.deploy.complete map[commit: duration:16.735614612s region:eu-west-1 stage:staging version:]
   ⠦ 0s      DEBU event deploy.complete map[commit: duration:22.285139868s stage:staging]
     Error: deploying: eu-west-1: updating function code: ResourceConflictException: The operation cannot be performed at this time. An update is in progress for resource: arn:aws:lambda:eu-west-1:***:function:my-app
{
  RespMetadata: {
    StatusCode: 409,
    RequestID: "cab54633-8476-459b-ba54-cad7362b8dd5"
  },
  Message_: "The operation cannot be performed at this time. An update is in progress for resource: arn:aws:lambda:eu-west-1:***:function:my-app",
  Type: "User"
}

RickCogley commented 3 years ago

It's possible that destroying the stack each time will let you deploy, but that means several minutes where there is no website.

@t1bb4r Did you try the temporary fix of putting aws:states:opt-out in the description? That fixed it for me, but will only work thru 1st Oct.

image (7)

t1bb4r commented 3 years ago

@RickCogley That worked for me, thanks a lot!

RickCogley commented 3 years ago

Sure thing @t1bb4r. I tried this with a couple more sites and I'm getting the same error consistently, with different up setups.

RickCogley commented 3 years ago

The article that Ben found mentions that Lambda permissions can be added to a service role used by CloudFormation (see the "Updating CloudFormation’s service role" section of https://aws.amazon.com/blogs/compute/coming-soon-expansion-of-aws-lambda-states-to-all-functions/). I know that up uses CloudFormation, but I am not sure if or how it's using a service role, or how to prove it either way. I did try adding lambda:GetFunction (https://docs.aws.amazon.com/lambda/latest/dg/API_GetFunction.html) to the IAM user you make for up to use, but it did not make a difference.

benkauffman commented 3 years ago

We are also experiencing this issue

tj commented 3 years ago

Hey guys, sorry for the delay, taking a look at this. I read the announcement post but I'm a bit confused about how it would influence Up, since the recommended policy for running Up (https://apex.sh/docs/up/credentials/#iam_policy_for_up_cli) already has `lambda:Get*`.

It sounds a bit like simply updating the SDK will work, I'll try that today and update here (and push a release if it's fine).

tj commented 3 years ago

I'm not having any luck reproducing it actually, I'm still able to deploy my apps with 1.7.0-pro and I tried doing a few fresh application stacks as well. Are you guys seeing any particular pattern or is it across all of your apps?

benkauffman commented 3 years ago

I'm having it across all existing up apps ... if I create a new stack (destroy and re-create an existing one) it will work.

The way I've hacked around this is by running the following before an up deploy:

aws lambda update-function-configuration --function-name $(node -p "require('./up.json').name") --description "aws:states:opt-out"

This makes sure the lambda description is set to "aws:states:opt-out" for existing lambda functions, which is the opt-out described in the article.
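For anyone who would rather script that workaround than shell out to the AWS CLI, here is a minimal Python sketch of the same step. It assumes a boto3 Lambda client is passed in (boto3 is not used anywhere in this thread) and that the Lambda function name matches the "name" field in up.json, as the CLI command above does. The helper name opt_out_of_states is made up for illustration, and per the AWS article the opt-out is only temporary.

```python
import json
from pathlib import Path

# Description value the AWS article documents as the temporary opt-out marker.
OPT_OUT = "aws:states:opt-out"


def opt_out_of_states(lambda_client, up_json_path="up.json"):
    """Set the function description to the opt-out marker and return the function name.

    lambda_client is assumed to be a boto3 Lambda client, for example
    boto3.client("lambda"); update_function_configuration is its real method.
    """
    # Same lookup as the node one-liner: read "name" from up.json.
    name = json.loads(Path(up_json_path).read_text())["name"]
    lambda_client.update_function_configuration(
        FunctionName=name,
        Description=OPT_OUT,
    )
    return name
```

Usage would be something like opt_out_of_states(boto3.client("lambda")) right before up deploy runs in CI.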

RickCogley commented 3 years ago

Thanks for looking into it @tj. I had tried it on a few sites which were built on AWS "sub-organizations" underneath our master account. (not sure what they are really called) All of those failed with the error, and each of their IAM users does have the right permissions, it appears.

I just tried it on one on our master account, and it succeeded. So I tried another one on our master, and that failed.

FYI

RickCogley commented 3 years ago

Not sure if it makes any difference, but the apps I am deploying are just static sites, either hand coded HTML files and a few assets in a "html" folder, or, Hugo generated into its usual "public" folder.

josenriq commented 3 years ago

We started experiencing this issue today as well. This workaround allowed us to do deployments though:

you can add aws:states:opt-out as the lambda description, to bypass the problem, but it's reportedly going to stop working as of 1st Oct 2021.

tj commented 3 years ago

I was reading in the docs that it actually recommends:

If a function is stuck in the Pending state for more than six minutes, call one of the following API operations to unblock it:

So it seems like they actually anticipate being in a stuck state which is a bit odd, it’s like they’re admitting it’s broken. Do you guys use it in a VPC? Mine aren’t in a VPC, that could explain why I’m not really seeing it.

There might not be anything I can really do there, I wish any new deploy would simply override the previous, but it looks like that’s not really how they wrote the system.

RickCogley commented 3 years ago

hi @tj as for us, no, we're not using it in a VPC.

t1bb4r commented 3 years ago

I created an app 5 days ago from the README and was experiencing this issue. It's now working. No changes to the lambda description, AWS account, up version or app code, and it's just working.

I deployed a few times (5 days ago this was a 100% failure):

~/Workspace/my-app$ up deploy staging

     build: 5 files, 6.8 MB (861ms)
     deploy: staging (version 3) (12.105s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/

~/Workspace/my-app$ up deploy staging

     build: 5 files, 6.8 MB (958ms)
   ⠧ deploy: staging
   ⠦ deploy: staging
   ⠋ deploy: staging
   ⠼ deploy: staging
     deploy: staging (version 4) (13.61s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/

~/Workspace/my-app$ up deploy staging

     build: 5 files, 6.8 MB (861ms)
     deploy: staging (version 5) (13.856s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/                                                                                                               

~/Workspace/my-app$ up deploy staging                                                                                                                              

     build: 5 files, 6.8 MB (825ms)
     deploy: staging (version 6) (11.657s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/                                                                                                               

~/Workspace/my-app$ up deploy staging

     build: 5 files, 6.8 MB (923ms)
     deploy: staging (version 7) (14.029s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/

~/Workspace/my-app$ up deploy staging

     build: 5 files, 6.8 MB (865ms)
     deploy: staging (version 8) (10.654s)
     endpoint: https://bawza8mlwc.execute-api.eu-west-1.amazonaws.com/staging/

The only conclusion that I can make is that AWS made some changes to cause this, but then fixed it again. Is anyone still experiencing this issue right now?

RickCogley commented 3 years ago

I've got a hugo site that consistently works, and a static site in an "html" folder that consistently fails. Just re-confirmed that neither site has the aws:states:opt-out lambda description workaround set. Both sites are using a GitHub Action to deploy, which was working fine before this problem reared its head, and I'm getting the same error running up locally as well.

The only real difference between the settings is that the (succeeding) hugo site has setup and build steps whereas the (failing) html site is just a literal file copy. There was an "endpoint:regional" setting in the up.json in the failing static site, which I removed (https://github.com/RickCogley/cogley.info/commit/0a6256e83d087fa23ab7388977d589aab3c7f566) but this made no difference; a re-run still failed.

In AWS console, lambda page for the failing static site:

This comment https://github.com/claudiajs/claudia/issues/226#issuecomment-921883467 mentions that they are using terraform and updated a version ...

It's a hail mary (as is the above sequence of voodoo majick testing) but @tj, as you mentioned maybe a recompile would actually help? Who knows...

RickCogley commented 3 years ago

Ok, found something else @tj: this forum post https://forums.aws.amazon.com/thread.jspa?messageID=995863&tstart=0 says you "need to put a check for the function state in between the update_function_code and the publish version calls. Make sure the state is active before proceeding https://docs.aws.amazon.com/lambda/latest/dg/functions-states.html"

And, someone else mentions: "I also noticed that the ci/cd tool is using an old version of the AWS SDK (1.11.834), and if I deploy the code using AWS CLI (2.2.37) it works. Could this be related?"

tj commented 3 years ago

@RickCogley ahhh interesting, that sounds like a reasonable fix. I guess there's always room for a race condition after doing the request for the status as well since it's not atomic, but if we can assume it's deploying in a CI or just one person at a time it should be ok.

I guess in that case we'd just have to keep polling until it's done, which sounds like it can be several minutes according to the docs. I'll try and get that in on Monday, I still couldn't reproduce that state but I'll make sure they deploy normally and hopefully that'll fix it in your cases

Jonnx commented 3 years ago

We are also seeing this issue. Setting aws:states:opt-out as the function description seems to have gotten us going again, but it's definitely a temporary fix that will break once AWS decides to force lifecycles on everyone.

RickCogley commented 3 years ago

Thanks @tj !

tj commented 3 years ago

yikes so I guess you need to poll/wait before UpdateFunctionCode, UpdateFunctionConfiguration, and PublishVersion by the looks of it haha.. good old AWS, making things slow and difficult. I'll have to add some reasonable limit for now when it comes to the wait so it doesn't hang forever, but ideally it's configurable
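To make the "poll/wait before each call, with a limit" idea concrete, here is a rough Python sketch. It is not Up's actual implementation (Up is written in Go and its fix isn't shown in this thread). It assumes a boto3 Lambda client is passed in, and the defaults of 30 attempts with a 5s delay are only borrowed from the -v log output posted further down. The helper names wait_until_ready and deploy are hypothetical.

```python
import time


def wait_until_ready(client, name, attempts=30, delay=5.0, sleep=time.sleep):
    """Poll until the function is Active and no update is in progress.

    Returns the attempt number that succeeded, or raises TimeoutError once
    the attempt limit is reached, so a stuck function cannot hang a deploy.
    """
    for attempt in range(1, attempts + 1):
        cfg = client.get_function_configuration(FunctionName=name)
        state = cfg.get("State")
        status = cfg.get("LastUpdateStatus")
        if state == "Active" and status != "InProgress":
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"{name} still not ready after {attempts} attempts")


def deploy(client, name, bucket, key, **wait_opts):
    """Wait before each mutating call, then publish and return the new version."""
    wait_until_ready(client, name, **wait_opts)
    client.update_function_code(FunctionName=name, S3Bucket=bucket, S3Key=key)
    # The code update itself moves the function to InProgress, so wait again.
    wait_until_ready(client, name, **wait_opts)
    return client.publish_version(FunctionName=name)["Version"]
```

As noted earlier in the thread, the check and the update are not atomic, so two concurrent deploys can still race. The sketch only covers the one-deploy-at-a-time case, and it keeps polling on any state other than Active until the limit runs out.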

tj commented 3 years ago

re-opening until you guys can confirm the fix since I can't reproduce it. It'll take about 20m to get the releases built/uploaded. I guess the worst-case is some of them are actually getting stuck in that pending state

tj commented 3 years ago

Ok if you up upgrade you should get v1.7.1-pro now with 0b09440, and if you run with -v you should see a bunch of logs mentioning checking and waiting for the state to change, curious to know how long it's actually stuck in a pending state if that is what's going on

RickCogley commented 3 years ago

Confirmed I get the latest version and it works on the site that was failing. Thanks!

Edit: I mean I got the latest version automatically when deploying via GH actions. Also, running up upgrade from my $HOME upgraded showing a progress bar, then gave a message "Updated 1.7.0 Pro to 1.7.1 Pro".

RickCogley commented 3 years ago

@tj trying to run up in up.json with a -v to get more verbose logs. Is there a way to specify switches?

...
    - name: Deploy via Apex Up
      env:
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          UP_CONFIG: ${{ secrets.UP_CONFIG }}
      uses: apex/actions/up@v0.5.1
      with:
        stage: production
    - name: Check folder contents
      run: |
          ls
          echo "====== PUBLIC ======"
          ls public

RickCogley commented 3 years ago

ok, ran up -v production successfully from local on another site that had been failing. The relevant part of the log:

     3.68s   DEBU uploading function to bucket up-2799900195-ap-northeast-1 key rickcogley-logr/production/1639990239-ci8oFg9iVx768Z2t.zip
     0s      DEBU updating function
     43ms    DEBU checking if function is pending (attempt 1 of 30)
     156ms   DEBU function is in state "Active" / "Successful"
     0s      DEBU updating function code
     47ms    DEBU checking if function is pending (attempt 1 of 30)
     5.004s  DEBU function is in state "Active" / "InProgress", trying again in 5s
     48ms    DEBU checking if function is pending (attempt 2 of 30)
     5.846s  DEBU function is in state "Active" / "Successful"
     44ms    DEBU alias production to 91
     45ms    DEBU alias production-previous to 90
     65ms    DEBU alias commit-d752006 to 91
     29ms    DEBU alias production-previous to 90
     0s      DEBU event platform.function.update map[commit:d752006 region:ap-northeast-1 stage:production]
     124ms   DEBU event platform.deploy.complete map[commit:d752006 duration:16.035353527s region:ap-northeast-1 stage:production version:91]
     0s      DEBU event platform.deploy.url map[url:https://q24o3id8m2.execute-api.ap-northeast-1.amazonaws.com/production/]
     0s      DEBU hook postdeploy is not defined
     0s      DEBU event hook map[hook:[up -v prune -s production -r 10] name:clean]
     5.05s   DEBU hook "clean" command "up -v prune -s production -r 10"
     0s      DEBU event hook.complete map[duration:5.050488581s hook:[up -v prune -s production -r 10] name:clean]
     0s      DEBU event deploy.complete map[commit:d752006 duration:30.677725571s stage:production]
     0s      DEBU track "Deploy" map[actions_count:0 alerts_count:0 app_name_hash:91ffc84307999f162a4a47a171ee29 arch:amd64 ci:false dns_zone_count:0 duration:30964 environment_count:0 has_cors:false has_error_pages:true has_logs:true has_profile:true header_rules_count:1 inject_rules_count:0 is_git:true lambda_accelerate:false lambda_memory:1024 os:darwin plan:pro proxy_timeout:15 redirect_rules_count:0 regions:[ap-northeast-1] stage:production stage_count:3 stage_domain_count:2 type:static version:1.7.1-pro]
   ⠼ 515ms   DEBU flushing analytics
     536ms   DEBU flushing analytics
   ⠴ 0s      DEBU flushing analytics
user=5.59s system=1.82s cpu=23% total=31.621

Hth

bennichols commented 3 years ago

It's fixed for me as well. Here's my log:

Screen Shot 2021-09-28 at 2 46 09 PM

tj commented 3 years ago

awesome thanks guys! I'll close for now 😄

marcos-pricefy commented 2 years ago

Hi! Even using "aws:states:opt-out" I have got problems. Does anyone have any ideas?

RickCogley commented 2 years ago

Did you upgrade per the above?

marcos-pricefy commented 2 years ago

Yeah! It has been updated ever since AWS recommended it. It was working, but yesterday it stopped!

jandson-oliveira commented 2 years ago

Hi guys, same thing here: everything is on the latest version and we get the same error, even after putting the flags in the optional fields.