aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release and operate production ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

[Bug]: `svc deploy` does not update ECS Service #5808

Open rluisr opened 1 month ago

rluisr commented 1 month ago

Description:

svc deploy with "Load Balanced Web Service" uploads a new container image to ECR but ECS Service is not updated.

related: https://github.com/aws/copilot-cli/issues/3343

Details:

copilot: 1.33.2

svc deploy

svc deploy --app app --name api --env prd --tag prd --force does not update ECS Service.

#11 exporting to image
#11 exporting layers
#11 exporting layers 1.8s done
#11 writing image sha256:done
#11 naming to done
#11 naming to  done
#11 DONE 1.9s
api:latest
api:prd
- Updating the infrastructure for stack app-prd-api                       [update in progress]  [29.6s]
  - An autoscaling target to scale your service's desired count               [not started]         
  - A custom resource returning the ECS service's running task count          [update complete]     [4.6s]
  - An ECS service to run and maintain your tasks in the environment cluster  [update in progress]  [18.6s]
    Deployments                                                                                     
               Revision  Rollout      Desired  Running  Failed  Pending                             
      PRIMARY  2         [completed]  5        5        0       0                                   
- Updating the infrastructure for stack app-prd-api                       [update complete]   [51.0s]
  - An autoscaling target to scale your service's desired count               [not started]       
  - A custom resource returning the ECS service's running task count          [update complete]   [4.6s]
  - An ECS service to run and maintain your tasks in the environment cluster  [update complete]   [35.7s]
    Deployments                                                                                   
               Revision  Rollout      Desired  Running  Failed  Pending                           
      PRIMARY  2         [completed]  5        5        0       0                                 
✔ Deployed service api.
Recommended follow-up action:
  - Your service is accessible at https://api over the internet.

deploy

but deploy --app app --name api --env prd --tag prd --force works perfectly.

#11 exporting to image
#11 exporting layers
#11 exporting layers 1.8s done
#11 writing image sha256:done
#11 naming to api:latest done
#11 naming to api:prd done
#11 DONE 1.8s
api:latest
api:prd
- Updating the infrastructure for stack app-prd-api                       [update complete]   [13.6s]
  - An autoscaling target to scale your service's desired count               [not started]       
  - A custom resource returning the ECS service's running task count          [update complete]   [2.1s]
  - An ECS service to run and maintain your tasks in the environment cluster  [not started]       
✔ Deployed service api.
Recommended follow-up action:
  - Your service is accessible at https://api over the internet.

manifest

image:
  build: Dockerfile

cloudformation

Screenshot 2024-05-07 22 05 19

Observed result:

svc deploy does not update ECS Service.

Expected result:

svc deploy should update ECS Service.

Debugging:

iamhopaul123 commented 1 month ago

Sorry @rluisr. Did you actually mean deploy --force doesn't update your ECS service? Because from the description svc deploy --force worked well.

rluisr commented 1 month ago

Hi @iamhopaul123.

No, just the opposite. svc deploy --force doesn't update the ECS service, but deploy --force worked well.

I couldn't see any events on ECS Service like increment task definition, deploy, etc. with svc deploy --force

iamhopaul123 commented 1 month ago

But from the screenshot you posted and the log, when you did svc deploy --force there was a CFN event for ECS service update, and update info in the progress tracker for your CLI.

rluisr commented 1 month ago

Yes, we can see from the CFn logs and CLI output that the ECS service was reported as updated, but nothing actually happened. svc deploy --force just uploads a new container image to ECR.

iamhopaul123 commented 1 month ago

I've been trying to reproduce the issue, and this is what I tried.

copilot svc deploy --force

demo git:(main) ✗ copilot svc deploy --force
Found only one service, defaulting to: frontend
Only found one option, defaulting to: test
Login Succeeded
[+] Building 0.6s (7/7) FINISHED                                                                                                                                                               docker:desktop-linux
 => [internal] load build definition from Dockerfile                                                                                                                                                           0.0s
 => => transferring dockerfile: 413B                                                                                                                                                                           0.0s
 => [internal] load metadata for public.ecr.aws/nginx/nginx:latest                                                                                                                                             0.6s
 => [internal] load .dockerignore                                                                                                                                                                              0.0s
 => => transferring context: 2B                                                                                                                                                                                0.0s
 => [internal] load build context                                                                                                                                                                              0.0s
 => => transferring context: 126B                                                                                                                                                                              0.0s
 => [1/2] FROM public.ecr.aws/nginx/nginx:latest@sha256:0d6cee50bcf761ecf2d42024566522d1f299c50a578847a4c45c88477aa637d5                                                                                       0.0s
 => CACHED [2/2] COPY index.html /usr/share/nginx/html                                                                                                                                                         0.0s
 => exporting to image                                                                                                                                                                                         0.0s
 => => exporting layers                                                                                                                                                                                        0.0s
 => => writing image sha256:ef7703770b211f018e444afc5e9663bcf2b7986a9379c81bc2f14df45ebed1e0                                                                                                                   0.0s
 => => naming to 403971813171.dkr.ecr.us-west-2.amazonaws.com/demo/frontend:latest                                                                                                                             0.0s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/vxrqheq1375op0pqkkz8u2rl0

What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview
The push refers to repository [1234567890.dkr.ecr.us-west-2.amazonaws.com/demo/frontend]
495fc3d2ec25: Layer already exists 
2a4e0f85c473: Layer already exists 
d672eb98862f: Layer already exists 
285cf1fde295: Layer already exists 
25aa6aa4ec97: Layer already exists 
7e028cfbf374: Layer already exists 
39c300f46f6c: Layer already exists 
8560597d922c: Layer already exists 
latest: digest: sha256:1e07389691e8cd8a73df3a312d442200a8b5fc67ab2c0f4d1cd8262bd0acd352 size: 1986
- No new infrastructure changes for stack demo-test-frontend
✔ Forced an update for service frontend from environment test.
✔ Deployed service frontend.
Recommended follow-up action:
  - Your service is accessible at https://v1.copilot.penghaoh.com over the internet.

It seems to work as expected: it builds the image, uploads it to ECR, and then forces an update of the service.

copilot deploy --app demo --env test --name frontend --tag prod --force

I got the exact same output as above. And I used v1.33.3.

iamhopaul123 commented 1 month ago

Also, I just wanted to clarify: when you use --force, we call ECS's update service API directly under the hood, so you should not expect to see any CFN infrastructure changes. As long as you see new ECS tasks spinning up and old tasks draining, it is working as expected.
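One way to check the "new tasks spinning up, old tasks draining" condition without relying on CloudFormation events is to compare two snapshots of the service taken before and after the deploy. The sketch below assumes each snapshot is one entry of the `services` list returned by the ECS DescribeServices API (for example via boto3's `describe_services`); the helper names are hypothetical and not part of Copilot.

```python
from typing import Optional


def primary_deployment(service: dict) -> Optional[dict]:
    """Return the PRIMARY deployment of one DescribeServices 'services' entry."""
    for deployment in service.get("deployments", []):
        if deployment.get("status") == "PRIMARY":
            return deployment
    return None


def forced_deploy_took_effect(before: dict, after: dict) -> bool:
    """True if a new PRIMARY deployment appeared between the two snapshots.

    A forced deployment creates a new deployment id even when the task
    definition revision is unchanged, so ids are compared, not revisions.
    """
    old, new = primary_deployment(before), primary_deployment(after)
    if new is None:
        return False
    return old is None or old["id"] != new["id"]


def is_draining(service: dict) -> bool:
    """True while an older (ACTIVE) deployment still has running tasks."""
    return any(
        d.get("status") == "ACTIVE" and d.get("runningCount", 0) > 0
        for d in service.get("deployments", [])
    )
```

If `forced_deploy_took_effect` returns False for snapshots taken around a `svc deploy --force`, the ECS service did not start a new deployment, which would match the behavior reported at the top of this issue.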

ssyberg commented 2 weeks ago

I'm also experiencing this, though maybe I'm not totally following. If I make changes to count and deploy without the --force flag, are you saying the expected behavior is for copilot to not update the autoscaling group?

ssyberg commented 2 weeks ago

Something funky is going on here. Below is the output of my latest deploy, which I ran with --force; you can see that the autoscaling target still shows "not started". I can confirm that the deploy was otherwise successful despite a failed health check: 5 new tasks from the new revision are up and healthy.

- Updating the infrastructure for stack staging-backend                               [update complete]   [924.9s]
  - An autoscaling target to scale your service's desired count                               [not started]       
  - A custom resource returning the ECS service's running task count                          [update complete]   [1.9s]
  - An ECS service to run and maintain your tasks in the environment cluster                  [update complete]   [904.2s]
    Deployments                                                                                                   
               Revision  Rollout      Desired  Running  Failed  Pending                                           
      PRIMARY  354       [completed]  5        5        0       0                                                 
    Latest 1 stopped task                                                                                         
      TaskId    CurrentStatus  DesiredStatus                                                                      
      8902c9f1  STOPPED        STOPPED                                                                            

    ✘ Latest 1 task stopped reason                                                                                
      - [8902c9f1]: Task failed ELB health checks in (target-group arn:aws:ela                                    
        sticloadbalancing:us-east-1:<redacted>:targetgroup/st-Targe-GM7E                                    
        UBPMPEPC/cb93bb3c1316f2fb)                                                                                

    Troubleshoot task stopped reason                                                                              
      1. You can run `copilot svc logs --previous` to see the logs of the last stopped task.                      
      2. You can visit this article: https://repost.aws/knowledge-center/ecs-task-stopped.                        

    ✘ Latest failure event                                                                                        
      - (service staging-backend-Service-hLRCOqmgmw2F) (port 80) is un                                    
        healthy in (target-group arn:aws:elasticloadbalancing:us-east-1:<redacted>:targetgroup/st-Targe-GM7EUBPMPEPC/cb93bb3c1316f2fb) due to                                     
        (reason Health checks failed with these codes: [502]).                                                    
  - An ECS task definition to group your containers and run them on ECS                       [delete complete]   [0.0s]
✔ Deployed service backend.
Recommended follow-up action:
  - Your service is accessible at https://<redacted>/ over the internet.

ssyberg commented 1 week ago

This seems like it might be a more pervasive issue. I'm noticing changes to my environment manifest in other projects coming back with "no proposed changes", even with a force. I'm also seeing more instances of deploy completing "successfully" with line items still set to "not started". Should I create a separate ticket?

Lou1415926 commented 1 week ago

I'm noticing changes to my environment manifest in other projects coming back with "no proposed changes", even with a force.

Is this when you are running copilot env deploy? What changes did you make to your environment manifest? Have you tried copilot env deploy --diff and see what it says?

I'm also seeing more instances of deploy completing "successfully" with line items still set to "not started".

Can you go to the CloudFormation console "Events" tab and take a look at the events there (whose start time aligns with when you ran the deploy) - is there anything unexpected there (like a red status)?
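The same check can be scripted rather than done in the console. This is a minimal sketch, assuming the events come from the CloudFormation DescribeStackEvents API (the `StackEvents` list, where `Timestamp` is a timezone-aware datetime); the function name and the 30-minute default window are illustrative choices, not anything Copilot defines.

```python
from datetime import datetime, timedelta, timezone


def suspicious_events(events, deploy_started, window=timedelta(minutes=30)):
    """Return events inside the deploy window whose status signals a failure or rollback."""
    deploy_ended = deploy_started + window
    flagged = []
    for event in events:
        # Skip events that fall outside the window around this deploy.
        if not (deploy_started <= event["Timestamp"] <= deploy_ended):
            continue
        status = event.get("ResourceStatus", "")
        # e.g. UPDATE_FAILED, CREATE_FAILED, UPDATE_ROLLBACK_IN_PROGRESS
        if status.endswith("_FAILED") or "ROLLBACK" in status:
            flagged.append(
                (
                    event["Timestamp"],
                    event["LogicalResourceId"],
                    status,
                    event.get("ResourceStatusReason", ""),
                )
            )
    return sorted(flagged)
```

An empty result for the window around a deploy suggests the stack update itself did not fail, so a "not started" line item more likely means that resource simply had no change in that update.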

Should I create a separate ticket?

It does look like a different issue than this one. If you can, a separate ticket would be great!

ssyberg commented 1 week ago

It's different on different projects, but I'll focus on the one more closely related to this ticket, which is a svc deploy. The service is showing update complete on the 17th even though I've repeatedly done a force deployment since then:

image

When I click into the stack I see a lot of weird statuses that would indicate it is not actually complete:

image

ssyberg commented 1 week ago

Another interesting data point: we noticed that changes to our auto scaling policy weren't taking effect in a different environment. We simply moved the count section from under a production environment override to the top level of the service manifest, and it suddenly updated the scaling policy, but it's still not respecting the staging environment override. I suppose I could swap them, deploy, and hope that it leaves production alone this time but updates staging.
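To reason about which `count` each environment should end up with, it can help to compute the expected result by hand. The sketch below is a simplified model that assumes an `environments` override is overlaid recursively on the top-level manifest fields; it is not Copilot's actual merge implementation, and the function names and manifest values are hypothetical.

```python
import copy


def effective_manifest(manifest: dict, env: str) -> dict:
    """Overlay manifest['environments'][env] on the top-level fields (simplified model)."""
    base = {k: copy.deepcopy(v) for k, v in manifest.items() if k != "environments"}
    override = (manifest.get("environments") or {}).get(env) or {}
    return _overlay(base, override)


def _overlay(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override values win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical manifest, already parsed from YAML into a dict.
manifest = {
    "count": {"range": "1-10", "cpu_percentage": 70},
    "environments": {
        "production": {"count": {"range": "5-20"}},
        "staging": {"count": 2},
    },
}
for env in ("production", "staging"):
    print(env, effective_manifest(manifest, env)["count"])
```

Comparing this expected value with the template printed by `copilot svc package` for the same environment shows whether the override is being dropped before the CloudFormation update or during it.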