aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[Fargate] [Bug]: multiple container dependencies in firelens task causes the deployment to never leave PENDING state #1070

Open dekimsey opened 4 years ago

dekimsey commented 4 years ago

Tell us about your request

I am attempting to deploy firelens support to my existing ECS Fargate services using the 1.4.0 platform.

Which service(s) is this request for? Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Adding a depends_on to my main application task appears to prevent the ECS deployment from ever leaving the PENDING state. Deployments never succeed. I also ended up filing this as Support Case #7364610971.

Are you currently working around this issue? depends_on is disabled.

Additional context The error handling here is ... not great. I cannot tell why my service will not start.

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

PettitWesley commented 4 years ago

@dekimsey Can you please show us your task definition and any custom config for Fluent Bit/Fluentd?

dekimsey commented 4 years ago

The following is an excerpt (from v20 and v21 of the task definitions). I have verified (repeatedly testing via Terraform) this is the only change between the two definitions.

Works

 "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "secretOptions": null,
        "options": {
          "Port": "9200",
          "Host": "...",
          "Name": "es"
        }
      },
      "dependsOn": [],
      "firelensConfiguration": null,
      "name": "app"
    },
    {
      "name": "log-router",
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {}
      },
      "image": "906394416424.dkr.ecr.us-east-2.amazonaws.com/aws-for-fluent-bit:2.6.1",
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "stg-pa",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "saml-response-service"
        }
      }
    }
]

Doesn't work:

 "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "secretOptions": null,
        "options": {
          "Port": "9200",
          "Host": "...",
          "Name": "es"
        }
      },
      "dependsOn": [
        {
          "containerName": "log-router",
          "condition": "START"
        },
      "firelensConfiguration": null,
      "name": "app"
    },
    {
      "name": "log-router",
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {}
      },
      "image": "906394416424.dkr.ecr.us-east-2.amazonaws.com/aws-for-fluent-bit:2.6.1",
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "stg-pa",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "saml-response-service"
        }
      }
    }
]

PettitWesley commented 4 years ago

@dekimsey I'm pretty sure the ECS Agent adds a START dependency on the FireLens container for all containers which use the awsfirelens log driver. This is done under the hood; you don't see it. Hence, adding that START dependency should be a no-op. It shouldn't change what's happening under the hood.
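
The no-op claim can be illustrated with a minimal sketch. This is only a model of the behavior described in this comment, not the ECS agent's code: the `Container` and `Dependency` types and the `addImplicitFirelensDeps` function are hypothetical names made up for the example.

```go
package main

import "fmt"

// Dependency and Container are hypothetical types for illustration only;
// they are not the ECS agent's actual structs.
type Dependency struct {
	ContainerName string
	Condition     string
}

type Container struct {
	Name      string
	LogDriver string
	DependsOn []Dependency
}

// dependsOnContainer reports whether c already declares a dependency on name.
func dependsOnContainer(c Container, name string) bool {
	for _, d := range c.DependsOn {
		if d.ContainerName == name {
			return true
		}
	}
	return false
}

// addImplicitFirelensDeps returns a copy of containers in which every
// container using the awsfirelens log driver depends on the FireLens
// container. A container that already declares that dependency is unchanged.
func addImplicitFirelensDeps(containers []Container, firelensName string) []Container {
	out := make([]Container, len(containers))
	for i, c := range containers {
		deps := append([]Dependency(nil), c.DependsOn...)
		if c.LogDriver == "awsfirelens" && !dependsOnContainer(c, firelensName) {
			deps = append(deps, Dependency{ContainerName: firelensName, Condition: "START"})
		}
		c.DependsOn = deps
		out[i] = c
	}
	return out
}

func main() {
	router := Container{Name: "log-router", LogDriver: "awslogs"}
	// "Works" variant: app declares no dependency.
	implicit := []Container{{Name: "app", LogDriver: "awsfirelens"}, router}
	// "Doesn't work" variant: app declares START on log-router explicitly.
	explicit := []Container{{
		Name: "app", LogDriver: "awsfirelens",
		DependsOn: []Dependency{{ContainerName: "log-router", Condition: "START"}},
	}, router}

	fmt.Println(addImplicitFirelensDeps(implicit, "log-router")[0].DependsOn)
	fmt.Println(addImplicitFirelensDeps(explicit, "log-router")[0].DependsOn)
}
```

Under this model both variants end up with the same single dependency for `app`, which is why the explicit entry would be expected to change nothing.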

I'm also very sure we tested adding custom container dependencies when FireLens launched... I remember doing that. So if your report is correct this is a regression.

dekimsey commented 4 years ago

I'll confer with Support since I have an open case and see what they say/see. But I was definitely able to toggle the behavior. I'll double-check my work tomorrow and confirm here once I do.

PettitWesley commented 4 years ago

Unless overridden with our Container Dependency feature, ECS ensures that the FireLens container starts first and stops last. (More precisely, it will start before any containers that use the awsfirelens log driver, and will stop after any containers that use the awsfirelens log driver).

https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

dekimsey commented 4 years ago

Good to know, I didn't see that documented in the firelens docs. I initially went down this path to enable the ECS healthcheck behavior in the logging sidecar since that is not available by default. I did double check to verify healthcheck configuration is not related, as the issue happens with a clean-slate amazon image.

I confirmed the issue exists by setting a depends_on to START or HEALTHY. The task never leaves the PENDING state and ECS doesn't report any issues. It is simply stuck, not doing anything.

fierlion commented 4 years ago

https://github.com/aws/amazon-ecs-agent/blob/master/agent/api/task/task.go#L1020-L1021 <- the ecs-agent does add a START dependency (see @PettitWesley's comment above). It also looks like the agent checks for an existing dependency:

for _, container := range task.Containers {
  if !container.DependsOnContainer(firelensContainer.Name) {
    // add START dependency
  }
}

A small question: Is the above "Doesn't work:" taskdef an exact copy? (It's missing a closing ] in the dependsOn field.)
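
A quick way to answer that kind of question for a pasted excerpt is to run it through a JSON parser. The sketch below is an assumption-laden illustration: `checkExcerpt` is a hypothetical helper, and the two strings are shortened stand-ins for the excerpts above, not the full task definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkExcerpt wraps a task definition excerpt in braces so it forms a
// complete JSON object, then returns the first syntax error, if any.
func checkExcerpt(excerpt string) error {
	var v map[string]interface{}
	return json.Unmarshal([]byte("{"+excerpt+"}"), &v)
}

func main() {
	ok := `"containerDefinitions": [{"name": "app", "dependsOn": []}]`
	// Same shape as the pasted excerpt: dependsOn is opened but never closed.
	broken := `"containerDefinitions": [{"name": "app", "dependsOn": [` +
		`{"containerName": "log-router", "condition": "START"},` +
		`"firelensConfiguration": null}]`

	fmt.Println(checkExcerpt(ok))     // <nil>
	fmt.Println(checkExcerpt(broken)) // a non-nil syntax error
}
```

A parse error here only says the pasted text is malformed; it says nothing about whether the registered task definition had the same problem.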

PettitWesley commented 4 years ago

Hey @dekimsey I tried to repro this using the task def here: https://gist.github.com/PettitWesley/ef8eadd213a28c22beb63278c087fbe6

The task proceeded to running; it works.

PettitWesley commented 4 years ago

A helpful person pointed out that I was probably accidentally testing on Fargate Platform Version 1.3.0... which was a good hunch... I was.

However, I just tested the same task def again on 1.4.0, and it still works; the task proceeds to running.

PettitWesley commented 4 years ago

@dekimsey AWS Support sent me your full Task Definition... I realized the one I used to try to reproduce this is very simple in comparison. It is possible that there is a more complicated bug in 1.4.0 with container dependencies and FireLens that is only triggered by Task Defs like yours; we can't rule that out until we try repro-ing with something similar.

dekimsey commented 4 years ago

So maybe the thinking is that there's some subtle bug with having a more complex dependency tree comprising several different containers. I'd buy it. If I recall correctly, in my examples the log-router and my consul container both sit at the root, and the "app" container explicitly depends on both (and some other sidecars). Now I'm wondering if I should just toggle off other sidecars and their dependencies until it maybe starts deploying.
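
Before toggling sidecars one by one, it may help to rule out a mistake in the dependency graph itself. The following is a minimal sketch, not anything ECS or Fargate runs: `startOrder` is a hypothetical helper that applies Kahn's topological sort to a map of container name to the names it depends on, and "envoy" is an invented stand-in for the unnamed other sidecars.

```go
package main

import (
	"fmt"
	"sort"
)

// startOrder returns one valid start order for a dependency graph given as
// container name -> names it depends on. It returns an error if a dependency
// names an undefined container or if the graph contains a cycle.
func startOrder(deps map[string][]string) ([]string, error) {
	remaining := make(map[string]int, len(deps)) // unmet dependency counts
	dependents := make(map[string][]string)      // reverse edges
	for name, ds := range deps {
		remaining[name] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("%s depends on undefined container %s", name, d)
			}
			dependents[d] = append(dependents[d], name)
		}
	}

	// Containers with no dependencies can start immediately.
	var ready []string
	for name, n := range remaining {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready) // sorted only to make the output deterministic

	var order []string
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		var next []string
		for _, dep := range dependents[name] {
			remaining[dep]--
			if remaining[dep] == 0 {
				next = append(next, dep)
			}
		}
		sort.Strings(next)
		ready = append(ready, next...)
	}
	if len(order) != len(deps) {
		return nil, fmt.Errorf("dependency cycle: only %d of %d containers can start", len(order), len(deps))
	}
	return order, nil
}

func main() {
	// log-router and consul sit at the root, as described above; "envoy" is a
	// hypothetical stand-in for the other sidecars.
	deps := map[string][]string{
		"log-router": nil,
		"consul":     nil,
		"envoy":      {"consul"},
		"app":        {"log-router", "consul", "envoy"},
	}
	fmt.Println(startOrder(deps)) // [consul log-router envoy app] <nil>
}
```

If a graph like this yields a valid order, the task definition's dependencies are at least satisfiable, which would point back toward the platform rather than the configuration.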

If there's anything I can provide to help or if you'd like to ask questions regarding the details off list, please feel free to do so. I'd be happy to provide whatever I can.

PettitWesley commented 4 years ago

We've made some attempts to reproduce; it seems likely that there is some sort of bug with more complicated container dependencies.

PettitWesley commented 3 years ago

We have rolled out a new version of Fargate platform version 1.4 which we believe fixes this issue. Please re-launch your tasks on 1.4 and let us know if the issue has been resolved (and then we will close this ticket).