aws-actions / amazon-ecs-deploy-task-definition

Registers an Amazon ECS task definition and deploys it to an ECS service.
MIT License

Deploy taking incredibly long suddenly. #102

Open digitlninja opened 4 years ago

digitlninja commented 4 years ago

How can I troubleshoot why the deploy is taking incredibly long all of a sudden?

my task definition


{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::185944984862:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/identity-backend",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 3001,
          "protocol": "tcp",
          "containerPort": 3001
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 1024,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": [
        {
          "valueFrom": "xxx",
          "name": "IoTBackend-Staging"
        }
      ],
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "185944984862.dkr.ecr.eu-west-2.amazonaws.com/identity:4f177c4240adda5b3bf8f5f83f7b766e490e2775",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "identity-backend"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": "arn:aws:iam::185944984862:role/ecsTaskExecutionRole",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:eu-west-2:185944984862:task-definition/identity-backend:3",
  "family": "identity-backend",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.environment-variables"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 3,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
allisaurus commented 4 years ago

@digitlninja as a first step, you can check the ECS service events in the AWS ECS web console to see whether your tasks are flapping (coming up and going down) or whether some specific step (like becoming healthy in the load balancer) is taking a while to stabilize. Or, if you have CloudWatch Logs enabled, you can check the task logs in the ECS or CloudWatch Logs web console to see if specific containers are hanging on a particular startup step. Also, if your container image recently increased in size or changed location (like a different region), it might be taking longer to pull.

Sorry this is rather generic advice, but I can't think of a specific reason this action would cause deployments to take longer. If it just recently started happening without any changes on your end, we could continue to investigate. LMK if any of the above is helpful.
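
One way to capture the service events mentioned above without opening the console is to print them from the workflow itself. This is only a sketch: it assumes the AWS CLI and credentials are already set up in the job, and that `ECS_CLUSTER` and `ECS_SERVICE` are environment variables defined in your own workflow (they are placeholders here, not something this action provides).

```yaml
      # Hypothetical extra step, placed after the deploy step.
      - name: Show recent ECS service events
        if: always()   # run even when the deploy step failed or timed out
        run: |
          # Print the ten most recent service events (timestamp and message)
          aws ecs describe-services \
            --cluster "$ECS_CLUSTER" \
            --services "$ECS_SERVICE" \
            --query 'services[0].events[:10].[createdAt,message]' \
            --output text
```

The events usually show whether tasks are being stopped and restarted, or whether the service is waiting on load balancer health checks or connection draining.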

allisaurus commented 4 years ago

Closing due to lack of response, but feel free to reopen if you find reason to believe this action is affecting your deployment times.

JefferyHus commented 3 years ago

I can confirm this. The ECS task is running & the new updates have been applied, but the GitHub action is still loading. It takes up to 10 minutes to finish and sometimes more.

SunnGHubX commented 3 years ago

Same for me. When deploying task definitions to ECS via GitHub Actions, the step hangs for up to 25 minutes, and I had to stop it. I investigated and verified from the Events tab that the old tasks are still not shutting down while the new ones are trying to deploy. I assume this is a safety feature, like a blue/green deployment, in case the new deployment is bad. But in my case the deployment didn't error out, it just hung. Is there a fix for this?

allisaurus commented 3 years ago

@JefferyHus can you tell me more about the status of your service when the GH action is hanging? Has a new revision successfully been deployed and stabilized, or has a rollback occurred? Any output you have from the service's events (tab visible in the ECS console or via ecs:describe-services) or errors from the GH action itself would be helpful.

@SunnGHubX what you describe sounds similar to https://github.com/aws-actions/amazon-ecs-deploy-task-definition/issues/113#issuecomment-717465045 . Can you let me know if adjusting the deployment preferences fixes your issue?

JefferyHus commented 3 years ago

@allisaurus No errors whatsoever coming from either the ECS logs or GH Actions. The GH action just hangs there loading for minutes before moving on to the next and final step; the build, however, is successful. The only thing I can think of is that the GH action is waiting for a status and ECS doesn't return one until Fargate swaps the old container for the new one.

mcsrk commented 3 years ago

I'm facing the same situation using GH Actions. In my case, it happened like this:

I used ECS to deploy a Django image from ECR: a simple Fargate cluster and a simple task definition, NOT using a load balancer. When I was messing around to test my deployment, I triggered my actions several times and the average time was 4 minutes, 90% of which was consumed by the "Deploy Amazon ECS task definition" step (shown in the image).

image

Due to my requirements, my implementation must have a static IP, so I did my research and restructured the whole thing, adding a Network Load Balancer that uses an Elastic IP to the ECS service. I then made multiple commits to test the result, and this time the "Deploy Amazon ECS task definition" step took 12 minutes each time I ran the actions. I don't fully understand why adding a load balancer would make it take longer.

image

PS: Apart from the LB, the thing I did differently the second time was creating the ECS service from the "task def." tab in the ECS console, instead of going into a cluster and clicking "Create Service" or "Run task", which was how I did it on my first try.

damusix commented 2 years ago

I get timed out at 30 minutes. Everything deploys, but the GitHub Actions run keeps going until it fails.

baranberkay96 commented 2 years ago

Is there any update?

samlachance commented 2 years ago

I am experiencing this as well. The container appears to be deployed and functioning but github actions just spins. I also use a load balancer for what it's worth.

chihiros commented 2 years ago

My deploys take a long time too. How can I shorten them?

image

amalsgit commented 2 years ago

It takes around 12 mins for me to update my Fargate service 😢

aencalado commented 2 years ago

Same problem, any update?

grommir commented 2 years ago

I think it's not an issue with the action but with ECS. I took a look at the ECS service console and found that the status stays at "2/1 Tasks running" for a long time.

image

grommir commented 2 years ago

The problem is the deregistration_delay parameter, whose default value is 300 seconds.

I tried setting it to 5 seconds, and now deploying the task definition takes about 2 minutes.

image

aencalado commented 2 years ago

I found that if you disable the task stability check then it takes only a few seconds/minutes to deploy

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@de0132cf8cdedb79975c6d42b77eb7ea193cf28e
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: false # <--- many example workflows set this to true

ghost commented 2 years ago

The same task, unchanged, will suddenly take 20-30 minutes, or fail altogether. Seems like an issue worth checking out. Screen Shot 2022-06-22 at 16 00 47

Deep1144 commented 2 years ago

Facing the same issue, is there any update?

naarkhoo commented 2 years ago

Same issue: the same task from two months ago now takes forever. I thought it was about RAM/CPU, but hey, I am using an elastic container ...

0Lucifer0 commented 1 year ago

Same for me. What surprised me is that a lot of ECS instances fail to start and are killed until one finally starts successfully. It is likely due to AWS ECS and not this action.

STOPPED (ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-west-2.amazonaws.com/": dial tcp 52.119.174.83:443: i/o timeout)

willisplummer commented 1 year ago

I had the same experience as @0Lucifer0 — would be nice if the github action could fail loudly or provide more insight into why the task isn't getting deployed correctly

nosachamos commented 1 year ago

Also hitting the same problem. Seems random - sometimes finishes in 2 mins, sometimes in 20. Deploying using aws-actions/amazon-ecs-deploy-task-definition@v1 as everyone else here.

This is a major, major pain and eats up billable minutes in github actions. Please take a look, AWS.

DLoBoston commented 1 year ago

I can confirm that the hanging happens during the stability check. I'm not a fan of turning this off, but I am doing so in order to save on billable minutes. You can use other automated health checks to check on the service and keep this action limited to build and deploy.

Excerpt of debug info:

image

@aencalado's solution of turning off wait-for-service-stability worked for me.
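
If the stability wait is turned off in the deploy step and you still want a bounded check in CI, one possible option is a separate, time-boxed step. The sketch below assumes the AWS CLI and credentials are available in the job and that `ECS_CLUSTER` and `ECS_SERVICE` are environment variables from your own workflow; the 10-minute cap is an arbitrary example value.

```yaml
      # Hypothetical follow-up step; not part of this action.
      - name: Wait for ECS service stability (time-boxed)
        timeout-minutes: 10   # GitHub Actions cancels the step after this
        run: |
          # Polls the service until it reports a stable state
          aws ecs wait services-stable \
            --cluster "$ECS_CLUSTER" \
            --services "$ECS_SERVICE"
```

This keeps the deploy step itself short while still failing the job if the service does not settle within the cap.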

ddaniel27 commented 1 year ago

The solution given by @grommir worked for me. image

0Lucifer0 commented 1 year ago

@ddaniel27 can you give more details on how to achieve this? How and where did you change the deregistration_delay property?

ddaniel27 commented 1 year ago

@0Lucifer0 If you are in the AWS console, go to EC2 > target groups > your target group > attributes > Edit. There you just need to change the 300 to something like 5 seconds and that's all. I don't know whether this affects anything else in the ECS behavior, but for this issue it works.
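
For anyone who prefers the command line to the console, the same attribute can presumably be changed with the AWS CLI. A sketch as a workflow step, assuming the AWS CLI and credentials are configured and that `TARGET_GROUP_ARN` is a placeholder environment variable you define yourself; the value 5 is just the example from this thread.

```yaml
      # One-off change; it could equally be run from a local shell.
      - name: Shorten target group deregistration delay
        run: |
          # Lower the connection draining time from the 300-second default
          aws elbv2 modify-target-group-attributes \
            --target-group-arn "$TARGET_GROUP_ARN" \
            --attributes Key=deregistration_delay.timeout_seconds,Value=5
```

A very short delay gives in-flight requests less time to finish before the old task is removed, so the right value depends on your traffic.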

0Lucifer0 commented 1 year ago

I guess that won't work for me, as for some reason there are no target groups 😢

RazGvili commented 1 year ago

I'm having the same issue with deregistration_delay of 10 sec

SmashingQuasar commented 1 year ago

I can confirm this is still an issue. When a deployment is failing, it seems AWS does not answer anything which leads to an extremely long deployment time that ultimately ends up in timeout.

nosachamos commented 1 year ago

For what it's worth, my long deployment time was due to using the wrong VPC somewhere, which was causing a delay and then a timeout.

sombriks commented 1 year ago

Yes, the issue is still here: the deployment failed and the action hung.

SebastianDix commented 1 year ago

Guys, that's because it is not just deploying the task definition, it is also waiting for service stability. Service stability means the number of required tasks equals the number of active tasks, plus health checks passing. If it takes 30 minutes (the default), it's because it waited 30 minutes for the health checks to pass and they didn't. You can set the timeout to a lower value than 30 minutes.
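
As a sketch of that suggestion, the deploy step could cap the wait like this. It assumes the action's `wait-for-minutes` input and reuses the step posted earlier in this thread; the value 10 is only an example.

```yaml
      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10   # example value; fail sooner than the 30-minute default
```

A shorter cap does not make a healthy deployment faster; it only makes an unhealthy one fail earlier.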

erwan-joly commented 1 year ago

> Guys, that's because it is not just deploying the task definition, it is also waiting for service stability. Service stability means the number of required tasks equals the number of active tasks, plus health checks passing. If it takes 30 minutes (the default), it's because it waited 30 minutes for the health checks to pass and they didn't. You can set the timeout to a lower value than 30 minutes.

The issue is not really that it times out after 30 minutes; it is that it takes a long time to start successfully in some cases (25 minutes is quite long even if it does not time out). Sadly, the error seems more like an underlying issue with AWS than with this specific GitHub action.

https://github.com/aws-actions/amazon-ecs-deploy-task-definition/issues/102#issuecomment-1266750550

rahulbhanushali commented 8 months ago

Has anybody figured out a solution for this? We face this every so often, and it is very annoying.

We have a staging environment where we have set desired count to 1. When deployment gets stuck, we face service downtime.

I can see in the task logs that the service has started, but for some reason the ECS deployment still shows as deploying and the ELB won't route to the new task instance.

carolzbnbr commented 7 months ago

Same here :(

JGSweets commented 3 months ago

I looked into the configuration of the wait timer and my thoughts are as follows:

As a result, stability is checked early and will almost always default to 120 seconds since early attempts will fail.

That means once stability is reached (depending on your AWS settings of the target group / ecs health check delays etc) an additional 120 seconds seems to be tacked on top of the stability.


Solutions to fix:

safwanshamsir99 commented 2 months ago

image

I faced the same issue. Usually, it takes about 4-5 mins.

Yangeok commented 1 month ago

Same issue with python 3.11 poetry environment.

Jepkosgei3 commented 1 week ago

I found that if you disable the task stability check then it takes only a few seconds/minutes to deploy

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@de0132cf8cdedb79975c6d42b77eb7ea193cf28e
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: false # <--- many example workflows set this to true

this solved mine