aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release and operate production ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

"ECS Deployment Circuit Breaker was triggered'." seemingly not possible to deploy again #5296

Open hderms opened 1 year ago

hderms commented 1 year ago

deploying like this:

copilot svc deploy --name upc-ner -e test

environments/app were created with the defaults

name: test
type: Environment

# Import your own VPC and subnets or configure how they should be created.
# network:
#   vpc:
#     id:

# Configure the load balancers in your environment, once created.
# http:
#   public:
#   private:

# Configure observability for your environment resources.
observability:
  container_insights: false

svc I am deploying:

deployment:
  rolling: 'default'

# storage:
  # readonly_fs: true       # Limit to read-only access to mounted root filesystems.

# Optional fields for more advanced use-cases.
#
variables:                    # Pass environment variables as key value pairs.
  LOG_LEVEL: info
  PG_USER: postgres
  PG_PASS: redacted
  PG_PORT: 5432

#secrets:                      # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
#  GITHUB_TOKEN: GITHUB_TOKEN  # The key is the name of the environment variable, the value is the name of the SSM parameter.

# You can override any of the values defined above by environment.
environments:
  production:
    count: 1
    variables:
      PG_HOST: redacted
  test:
    count: 1
    deployment:            # The deployment strategy for the "test" environment.
      rolling: 'default' # Stops existing tasks before new ones are started for faster deployments.
    variables:
      PG_HOST: redacted

Deploying a broken image that won't start causes the ECS deployment circuit breaker to trigger, and makes deploying again with a fixed image seemingly impossible.

Using the --force cli flag doesn't cause it to force the deploy, though I wasn't under the impression that would actually "force" it in the way I'd want, after reading some github issues.

"Cancel update stack" on the resulting cloudformation "stack" doesn't seem to have any effect, at least in the short term (15 minutes or so).

I don't want to turn the circuit breaker off, because I'd like to make it as difficult as possible to deploy broken code that won't stabilize. Similarly, I saw a workaround of turning the replicas down to 0 before deploying again, which is supposed to let you bypass the circuit breaker. That didn't seem to let me deploy either, and I'd like to avoid it if at all possible because I don't want a failed deploy to cause an outage.

I'm trying to figure out if I'm doing something wrong, because based on my experience I'd expect a lot more GitHub issues/discussion around this behavior. It's very difficult to use Copilot if deploying broken code causes deployments to fail for hours.

I know this is probably an issue with ECS, but I've read through everything I can find on Google and I simply can't figure out how to circumvent this issue, and resources that even mention the ECS deployment circuit breaker are so scarce that I'm left scratching my head.

Related: Gitter

Lou1415926 commented 1 year ago

Deploying a broken image that won't start causes the ECS deployment circuit breaker to trigger, and makes deploying again with a fixed image seemingly impossible.

This should be possible after the circuit breaker is done rolling back your service - typically this happens quite a while after the deployment, perhaps 10-ish to even 15-ish minutes. That is, if you have deployed a broken image, typically you would have the following experience:

  1. Wait for the circuit breaker to complete its work (10-15 minutes, maybe more), until the rollback is completed. OR, better, press ctrl+c to trigger the rollback early instead of waiting for the circuit breaker. (A quick way to check whether the rollback has finished is sketched after this list.)
  2. Deploy again with the fixed image.
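
(Not part of the original exchange, but as a rough sketch: you can check whether the rollback has finished either with Copilot or by querying ECS directly. The cluster and ECS service names below are placeholders to substitute with your own.)

# Copilot's own view of the service:
copilot svc status --name upc-ner --env test

# Or ask ECS directly for the state of the current deployments/rollback:
aws ecs describe-services \
  --cluster <your-environment-cluster> \
  --services <your-ecs-service> \
  --query 'services[0].deployments[].{status:status,rolloutState:rolloutState,reason:rolloutStateReason}'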

To help me better understand the experience that you had, did you mean that:
a) You weren't able to do "2. deploy again" anymore. Your service is now stuck in limbo, completely un-updatable (maybe even hours after the deployment).
b) You wanted to be able to do "2. deploy again" earlier, preferably during the 10-15 minute window when you were supposed to be waiting for the circuit breaker.

hderms commented 1 year ago

@Lou1415926 yeah, you are correct with statement "a". Once I trigger the deployment circuit breaker it will not permit further deployments of anything, including totally new versions of the code, for what seems like 3 hours. What's weird is that the CloudFormation stack in question is in 'UPDATE_COMPLETE' state and there's no evidence it's doing anything at all at the moment:

(screenshot: CloudFormation stack events showing the stack in UPDATE_COMPLETE state)

but if I try to deploy to the same service (with a different image) it will continuously fail with the 'ECS Deployment Circuit Breaker triggered' error message immediately after it pushes the new image to ECR:

at 03:33:54 PM ➜ copilot svc deploy --name upc-ner -e test --tag upc_ner_d64534c
Login Succeeded
[+] Building 6.2s (13/13) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                             0.0s
 => => transferring dockerfile: 32B                                                                                                                                                                              0.0s
 => [internal] load .dockerignore                                                                                                                                                                                0.0s
 => => transferring context: 2B                                                                                                                                                                                  0.0s
 => [internal] load metadata for docker.io/library/python:3.11                                                                                                                                                   5.9s
 => [base 1/8] FROM docker.io/library/python:3.11@sha256:8488a4b1a393b0b2cb479a2da0a0d11cf816a77c0f9278205015148adadf9edf                                                                                        0.0s
 => => resolve docker.io/library/python:3.11@sha256:8488a4b1a393b0b2cb479a2da0a0d11cf816a77c0f9278205015148adadf9edf                                                                                             0.0s
 => [internal] load build context                                                                                                                                                                                0.2s
 => => transferring context: 995B                                                                                                                                                                                0.2s
 => CACHED [base 2/8] RUN pip install -U pgcli                                                                                                                                                                   0.0s
 => CACHED [base 3/8] WORKDIR /app                                                                                                                                                                               0.0s
 => CACHED [base 4/8] COPY requirements.txt .                                                                                                                                                                    0.0s
 => CACHED [base 5/8] RUN pip install --no-cache-dir -r requirements.txt                                                                                                                                         0.0s
 => CACHED [base 6/8] COPY api/*.py api/                                                                                                                                                                         0.0s
 => CACHED [base 7/8] RUN useradd user                                                                                                                                                                           0.0s
 => CACHED [base 8/8] COPY output/model-best output/model                                                                                                                                                        0.0s
 => exporting to image                                                                                                                                                                                           0.0s
 => => exporting layers                                                                                                                                                                                          0.0s
 => => writing image sha256:5a2c39fc439f55aed2da8c0804005a4d85dfbd38611802adfa1c600b9c9fe277                                                                                                                     0.0s
 => => naming to 042357577846.dkr.ecr.us-east-1.amazonaws.com/upc-ner/upc-ner:latest                                                                                                                             0.0s
 => => naming to 042357577846.dkr.ecr.us-east-1.amazonaws.com/upc-ner/upc-ner:upc_ner_d64534c                                                                                                                    0.0s
The push refers to repository [042357577846.dkr.ecr.us-east-1.amazonaws.com/upc-ner/upc-ner]
300de1089802: Layer already exists
b9c1d4e7e563: Layer already exists
12ffc9c8506c: Layer already exists
a7d8f628332c: Layer already exists
27a106bf01b8: Layer already exists
51bd17e29374: Layer already exists
e423ba6a7623: Layer already exists
49df279faf6c: Layer already exists
f2c0489561b5: Layer already exists
4831c7caec2d: Layer already exists
d2b487de5a01: Layer already exists
8fe5334a79c9: Layer already exists
acd413ce78f8: Layer already exists
1a26fac01f32: Layer already exists
b8544860ba0b: Layer already exists
latest: digest: sha256:024c52d50f6ce6fa12a96cc355d55f4b3fcee840df7b533008fbd7874cbd89ca size: 3472
The push refers to repository [042357577846.dkr.ecr.us-east-1.amazonaws.com/upc-ner/upc-ner]
300de1089802: Layer already exists
b9c1d4e7e563: Layer already exists
12ffc9c8506c: Layer already exists
a7d8f628332c: Layer already exists
27a106bf01b8: Layer already exists
51bd17e29374: Layer already exists
e423ba6a7623: Layer already exists
49df279faf6c: Layer already exists
f2c0489561b5: Layer already exists
4831c7caec2d: Layer already exists
d2b487de5a01: Layer already exists
8fe5334a79c9: Layer already exists
acd413ce78f8: Layer already exists
1a26fac01f32: Layer already exists
b8544860ba0b: Layer already exists
upc_ner_d64534c: digest: sha256:024c52d50f6ce6fa12a96cc355d55f4b3fcee840df7b533008fbd7874cbd89ca size: 3472
- No new infrastructure changes for stack upc-ner-test-upc-ner
Note: Set --force to force an update for the service.
deploy service: change set with name copilot-c99069d9-1c9d-45e7-9c9f-1e480afbbbb1 for stack upc-ner-test-upc-ner has no changes: Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken: a3ab2120-1821-f83c-3adc-23d519cacdfe, HandlerErrorCode: GeneralServiceException)

hderms commented 1 year ago

I should add that I'm not sure precisely how long it takes before I'm able to deploy again, but it's definitely longer than an hour as far as I can tell. I can try getting a more precise measurement.

Lou1415926 commented 1 year ago

Ah, the stack events screenshot and the error message logs that you provided were really helpful!!

One final thing to confirm before I suggest my theory - I saw copilot svc deploy --name upc-ner -e test --tag upc_ner_d64534c. I assume that in the last deployment you ran the same command with --tag upc_ner_d64534c as well? That is, the tag value was not changed?

hderms commented 1 year ago

@Lou1415926 I had added the tag flag because I wanted to see if it would change the behavior (like if the circuit breaker was specific to a given tag or something), and I had forgotten to stop adding that flag as part of my continued deployment attempts.

Also, I changed the tag a bunch of times while the circuit breaker was triggered and it had no effect on allowing me to deploy.

Lou1415926 commented 1 year ago

I see! So here is my theory of what happened:

  1. You deployed with copilot svc deploy --name upc-ner -e test --tag upc_ner_d64534c one time, successfully. You mentioned in this gitter thread that you had one success (the response with "even though my most recent deploy was a success"); I suspect this was run with --tag upc_ner_d64534c.
  2. You ran copilot svc deploy --name upc-ner -e test --tag upc_ner_d64534c again. Now, because you provided the same --tag value, and there was no other change in your manifest, Copilot produced a template that was exactly the same as the last time you ran svc deploy in step 1. (Why was the template exactly the same? See "Why did you get 'change set with name contains no changes'?" below.)
  3. Copilot submitted this template to CloudFormation. CloudFormation returned an error that said "change set contains no changes", because the template literally didn't change.
  4. Copilot got the error message. It then read your CloudFormation stack for the latest failure stack event, which happened to be the one with "ECS Deployment Circuit Breaker triggered".
  5. Copilot wrongly appended this failure stack event to the "change set with name contains no changes" error.

This created the illusion that it was "ECS Deployment Circuit Breaker triggered" that stopped you from deploying, while really, what prevented you from deploying was "change set contains no changes".

Why did you get "change set with name contains no changes"?

Copilot builds and pushes your image to ECR, and passes the image URI into the CloudFormation template. There are two ways to refer to an image:

  1. By "repo:tag"
  2. By "repo@sha"

When you simply run copilot svc deploy, without --tag, Copilot will use the "repo@sha" as the URI - the sha changes when your application code changes. Therefore, the generated CloudFormation template would at least contain the image URI change.

When you run copilot svc deploy --tag, Copilot will use "repo:tag" as the URI. If your --tag value doesn't change, the URI doesn't change. When there is no other configuration change in your manifest, Copilot produces the same template as before. This was why you got the "change set with name contains no changes" error.
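
To illustrate with the values from your own push output above (the digest here is just the one reported in that output, used for illustration), the two URI forms look like:

# "repo:tag" - identical as long as the tag doesn't change, so the template doesn't change:
042357577846.dkr.ecr.us-east-1.amazonaws.com/upc-ner/upc-ner:upc_ner_d64534c

# "repo@sha" - changes whenever the image content changes:
042357577846.dkr.ecr.us-east-1.amazonaws.com/upc-ner/upc-ner@sha256:024c52d50f6ce6fa12a96cc355d55f4b3fcee840df7b533008fbd7874cbd89ca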

What you can do

Try running copilot svc deploy --name upc-ner -e test, without --tag. If using --tag is important for you, adding --force should help (see the sketch below). I know you mentioned that:

Using the --force cli flag doesn't cause it to force the deploy, though I wasn't under the impression that would actually "force" it in the way I'd want, after reading some github issues.

We can get to that in detail, if you are interested.
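
For concreteness, a quick sketch of the two suggestions above, reusing the service, environment, and tag values from your earlier commands:

# Option 1: drop --tag so Copilot pins the image by digest and the template changes with the code:
copilot svc deploy --name upc-ner -e test

# Option 2: keep the fixed --tag but force a new deployment even if the template is unchanged:
copilot svc deploy --name upc-ner -e test --tag upc_ner_d64534c --force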

What Copilot should improve on

Copilot should stop appending the latest failure stack event to the "change set with name contains no changes" error. The initial motivation was to surface the actual error that was causing the stack failure, but in this case it just added to the confusion.

hderms commented 1 year ago

@Lou1415926 I had changed the tag a bunch of times, though, while still experiencing the same error, so I feel like there's something still missing from this explanation. I can try reproducing this again and see if changing the tag arbitrarily makes it work.

Lou1415926 commented 1 year ago

@hderms I tried to deploy once with --tag 1, and then again with --tag 2, and the "change set contains no changes" error did not occur. It is entirely possible that I was missing something from the picture though. If the issue persists, definitely let me know 😊

hderms commented 1 year ago

@Lou1415926 here's what I did:

  1. make a change that prevents code from stabilizing (returns 500 in health check)
  2. deploy using copilot svc deploy --name upc-ner -e test --tag foobar1 and let it go to the full 10 failed stabilizations so the circuit breaker triggers
  3. commit a fix (returns 200s again in health check)
  4. attempt to deploy with copilot svc deploy --name upc-ner -e test --tag foobar1 (failed)
  5. attempt to deploy with copilot svc deploy --name upc-ner -e test --tag foobar2 (failed)
  6. attempt to deploy with copilot svc deploy --name upc-ner -e test (failed)

Here are the error messages I got (the first deploy, which had the 10 failed stabilizations, doesn't have an error message logged on my laptop because I got a "connection reset by peer" error):

step 4:

✘ Proposing infrastructure changes for stack upc-ner-test-upc-ner
✘ deploy service upc-ner to environment test: deploy service: stack upc-ner-test-upc-ner is currently being updated and cannot be deployed to: Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken: a3ab2120-1821-f83c-3adc-23d519cacdfe, HandlerErrorCode: GeneralServiceException)

step 5:

✘ Proposing infrastructure changes for stack upc-ner-test-upc-ner
✘ deploy service upc-ner to environment test: deploy service: stack upc-ner-test-upc-ner is currently being updated and cannot be deployed to: Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken: a3ab2120-1821-f83c-3adc-23d519cacdfe, HandlerErrorCode: GeneralServiceException)

step 6:

✘ Proposing infrastructure changes for stack upc-ner-test-upc-ner
✘ deploy service upc-ner to environment test: deploy service: stack upc-ner-test-upc-ner is currently being updated and cannot be deployed to: Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken: a3ab2120-1821-f83c-3adc-23d519cacdfe, HandlerErrorCode: GeneralServiceException)

I then made another commit and tried to deploy it with no --tag flag, and it still failed.

hderms commented 1 year ago

Also, copilot svc deploy --name upc-ner -e test --force still failed with the circuit breaker message.

@Lou1415926 do you know if the ECS/CloudFormation circuit breaker is shared by all deployments of a specific service? Based on my experience, once you hit the circuit breaker you can't deploy that same service for hours, regardless of whether the tag changes, etc...

hderms commented 1 year ago

Also, a minor correction: it definitely doesn't seem to be hours this time; it was a more reasonable amount of time, like 30 minutes.

Lou1415926 commented 1 year ago

Ah got it. Given this GH issue and the Gitter thread, I think I've observed two different (perhaps independent) issues that you've encountered.

  1. You weren't able to deploy the fixed image, for 30 min ~ 3 hrs, because of the "stack upc-ner-test-upc-ner is currently being updated and cannot be deployed to" error. The first issue was largely discussed in the Gitter thread, and you reproduced it in the ⬆️ recent response.
  2. After one successful deployment, you weren't able to deploy the service again because of the "change set with name contains no changes" error. This issue was reported in an earlier response in this issue, and was also mentioned in one of your responses in the Gitter thread.

I think my explanation & solution covers the second issue - why you couldn't deploy again after one successful deployment, and what you could do to get out of it. I think these were the discussions in the Gitter thread related to this second issue:

Even though my most recent deploy was a success... Then when I finally got it working, I deployed again, it went all the way through because the service could actually stabilize this time. And then subsequent deploys failed w/ circuit breaker triggered. So I'm guessing it must be something to do with the failed deploys before that. Though the circuit breaker evidently had not been triggered because I was able to deploy the working container. so one of them kept running or something in the background until it triggered the circuit breaker?

I will now discuss the first issue.


In the Gitter thread, the first error persisted for 3 hours. In the GitHub response above, it persisted for 30 minutes. It seems like in both cases, your stack eventually landed in UPDATE_ROLLBACK_COMPLETE state. Is that true? You can confirm this by going through the Events tab on the AWS CloudFormation console.
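
(As a side note beyond the console: the same check can be run from a terminal, using the stack name from your logs.)

aws cloudformation describe-stacks --stack-name upc-ner-test-upc-ner --query 'Stacks[0].StackStatus'
aws cloudformation describe-stack-events --stack-name upc-ner-test-upc-ner --max-items 20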

I am especially concerned with a state called UPDATE_ROLLBACK_FAILED. I assume you don't have that state, because if you did, you wouldn't have been able to make any deployment from that point (not even after 3 hrs) without extra effort.

I am asking about the state, because landing in UPDATE_ROLLBACK_FAILED or UPDATE_ROLLBACK_COMPLETE could lead to completely different theories and solutions.

If both deployments eventually landed you in UPDATE_ROLLBACK_COMPLETE, I guess it was just because the CB happened to be taking 3 hours to roll back. Then the question for us to investigate would be "why was the ECS rollback taking so much time?".

is it expected behavior that once the deployment circuit breaker triggers that you can't redeploy anything to that service for 3 hours?

Like I said above, I am not sure about 3 hours, but 30 minutes is certainly expected as the amount of time for ECS and CloudFormation to complete the rollback. I did some experiments myself too, and for me the whole stack update (from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE/FAILED) took 1 hour in total: ~30 minutes for ECS to decide that the deployment was a failure and trigger the circuit breaker, and another 30 minutes for CloudFormation to roll back the stack.

Please let me know the answer to the bold question ⬆️ above, and in the meantime, I'll make more attempts to see if I can get the CB to hang for 3 hrs, to understand what was going on during that time.

hderms commented 1 year ago

@Lou1415926 thank you for looking into it more.

We've never gotten into UPDATE_ROLLBACK_FAILED as far as I can see; it always gets to UPDATE_ROLLBACK_COMPLETE once the rollback is initiated. It seems like once UPDATE_ROLLBACK_IN_PROGRESS starts, it usually takes about 5-10 minutes to get to UPDATE_ROLLBACK_COMPLETE.

My theory on why it has taken considerably longer at times, which may be completely unfounded, is that once one of the rollbacks finished, that perhaps allowed a previously started deploy to actually run to completion. If that deploy also failed to stabilize, it would end up blocking deploys for another hour or so. That might explain why it seemed like I couldn't deploy for longer than expected: it was due to multiple events occurring rather than just a single one.

If stabilizing the CloudFormation rollback takes about an hour, then I guess I'm just experiencing a known issue due to CloudFormation being slower than desired. If this is expected behavior then I guess I can close the issue.

Lou1415926 commented 1 year ago

@hderms Sorry for the belated response!

If stabilizing the CloudFormation rollback takes about an hour, then I guess I'm just experiencing a known issue due to CloudFormation being slower than desired. If this is expected behavior then I guess I can close the issue.

It would usually take a lot less than an hour, probably a couple to ten-ish minutes, if it doesn't involve any ECS rollback! But once the ECS service needs to roll back, then the time spent is CloudFormation plus ECS trying to spin up tasks, making health check calls, and eventually deciding whether the tasks are stable or not, which can take longer than usual.

One way to speed up the ECS part of the process is to use deployment.rolling: 'recreate' (also see the blog post). You can use this configuration for non-production environments to speed up the feedback loop.
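
For reference, a minimal sketch of what that could look like in the service manifest, assuming you only want the faster strategy in the "test" environment (mirroring the environments override section you already have):

environments:
  test:
    deployment:
      rolling: 'recreate'   # Stops existing tasks before new ones are started; faster, but briefly reduces availability, so best kept to non-production.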