aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

ECS Service Discovery not respecting TTL when updating service #343

Open matthewduren opened 5 years ago

matthewduren commented 5 years ago

Summary

When updating a service or otherwise scaling out ECS tasks for a service that uses Service Discovery, tasks are being stopped before reaching the TTL of the service discovery record(s).

Description

To reproduce: create an ECS service from a simple "hello world" type task definition that runs forever and does nothing. Set minimum healthy percent to 100, maximum percent to 200, and desired count to 1. Set up service discovery and create a DNS record with a long TTL, say 300s. Update the service to use a new revision of the task definition (no changes to the task definition are needed), and note that the old tasks are stopped before the TTL has elapsed.
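
For anyone who wants to reproduce this from the CLI, here is a rough sketch of the setup above (all names, ARNs, and IDs are placeholders, and network configuration is omitted):

# 1. Cloud Map service with a long-TTL A record in a private DNS namespace
aws servicediscovery create-service --name hello \
  --dns-config 'NamespaceId=ns-EXAMPLE,RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=300}]'

# 2. ECS service registered against it (desired count 1, min healthy 100%, max 200%)
aws ecs create-service --cluster demo --service-name hello \
  --task-definition hello:1 --desired-count 1 \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100 \
  --service-registries registryArn=arn:aws:servicediscovery:us-east-1:111122223333:service/srv-EXAMPLE

# 3. Roll to a new revision and watch the old task stop well before the 300s TTL has passed
aws ecs update-service --cluster demo --service hello --task-definition hello:2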

Expected Behavior

The ECS agent should remove the Route 53 record(s), then wait for the TTL duration to elapse before stopping the tasks.

Observed Behavior

ECS Agent does not wait any additional time when stopping tasks for services that use service discovery.

Environment Details

Supporting Log Snippets

himberjack commented 5 years ago

The current behavior is outrageous; it's hard to believe such a flaw exists in an AWS service. It forces us to use an ELB/ALB, which adds unnecessary cost on top of the performance impact.

AndrewLugg commented 5 years ago

There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, with more than just the TTL as a grace period: requests can arrive right at the end of the TTL, leaving no time to process them.

matthewduren commented 5 years ago

Right, when a task starts draining the Route 53 record should be removed immediately. After the TTL has elapsed, the normal SIGTERM should be sent to the container, followed by SIGKILL 30 seconds later if the task is still up, just like tasks that don't use service discovery behave.
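
For illustration, the ordering I'm asking for could be approximated by hand today, although ECS manages the Cloud Map registration itself, so treat this only as a sketch (IDs and names are placeholders):

# 1. Take the task's record out of DNS
aws servicediscovery deregister-instance --service-id srv-EXAMPLE --instance-id ECS_TASK_ID
# 2. Let the record TTL expire so cached answers age out
sleep 300
# 3. Only then stop the task (ECS sends SIGTERM, then SIGKILL after the stop timeout)
aws ecs stop-task --cluster demo --task ECS_TASK_ID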

melbourne2991 commented 5 years ago

Was hoping to avoid using an LB and this came to mind; sad to see it's an unresolved issue :(

nikushi commented 5 years ago

I'm hitting the same issue.

I am building infrastructure for gRPC services using ECS Fargate and its service discovery feature, without an ELB. Communication between services goes through Envoy proxies, and each Envoy listener is resolved via ECS service discovery.

I get gRPC "unavailable" errors while updating the service. An Envoy forwarding a request to another Envoy can lose all of its upstream connections, because there is a window in which it only knows the old containers' IP addresses, which have already been killed via SIGTERM. As a workaround I set the DNS TTL to a very short value such as 3s, but I still see errors for a short period (about 10 seconds).
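
For reference, that short-TTL workaround can be applied to an existing Cloud Map service roughly like this (the service ID is a placeholder; only the record TTL is being changed):

aws servicediscovery update-service --id srv-EXAMPLE \
  --service 'DnsConfig={DnsRecords=[{Type=A,TTL=3}]}'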

I hope that the issue will be resolved.

kuongknight commented 5 years ago

I hope that the issue will be resolved too :(

afawaz2 commented 5 years ago

+1.

holstvoogd commented 4 years ago

FYI: if you set minimum healthy to 0% and maximum to 100% (i.e. stop everything before starting new instances), your service is unreachable for several minutes due to negative DNS caching. I've been experimenting with a service that I really only ever want one instance of running; this is what a restart looks like:

Event                Time in seconds since stop
task stop            0
task start           22
dns gone             26
service listening    35
ecs ready            82
dns back             270

For roughly 4 minutes the service is ready to accept connections, but DNS returns NXDOMAIN. So don't try to use service discovery for this purpose. Also note that the VPC DNS resolver does not honor the 24h TTL set in the SOA record of the service discovery DNS zone; you cannot change that TTL anyway, so I guess we should be happy it is ignored and the service is not unreachable for 24h.

Thought I'd mention this caveat here since this is where I ended up while researching service discovery TTLs.
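
A timeline like the one above is easy to capture from a host inside the VPC with a simple resolution loop; the hostname below is a placeholder for your own Cloud Map service and namespace:

while true; do
  # log a timestamp and whatever A records the VPC resolver currently returns
  echo "$(date +%s) $(dig +short myservice.local A)"
  sleep 5
done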

victor-paddle commented 4 years ago

We are facing the same issue; any updates on this?

thanks 😄

joeke80215 commented 4 years ago

We are facing the same issue; any updates on this?

thanks 😄

2mositalebi commented 3 years ago

The issue still exists.

awsiv commented 3 years ago

There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, with more than just the TTL as a grace period: requests can arrive right at the end of the TTL, leaving no time to process them.

In addition to this, the instance health should change to "unhealthy" so that API-based calls do not see the instance as healthy, similar to the "deregistration delay" in target groups. Also discussed here: https://github.com/aws/containers-roadmap/issues/473

Related:
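
For comparison, the target-group draining behaviour mentioned above is a single attribute on the target group (the ARN below is a placeholder):

aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/example/0123456789abcdef \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30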

rilutham commented 3 years ago

Having the same issue with our gRPC server running as an ECS service with service discovery. We are also using Spot instances for the service. The gRPC clients cannot call the gRPC server when there is a Spot instance interruption, even though ECS has spawned a new task before the current task stopped.

Hope this issue will be fixed soon.

kuongknight commented 3 years ago

We are facing the same issue :(

hgsgtk commented 3 years ago

We are facing the same issue :)

hgsgtk commented 3 years ago

Reading through the linked issue, that bug is related to not respecting TTLs. The bug we fixed in ECS was an ordering issue where some tasks may be stopped before new tasks are actively visible in DNS.

https://github.com/aws/aws-app-mesh-roadmap/issues/151

CraigHead commented 3 years ago

We are facing the same issue. I created a tool that lets us graph the behavior. Basically, I've seen the HTTP 503 errors show up AFTER ECS is done deploying new tasks and after the old tasks are shut down. The Y-axis in the graph below is the HTTP status code; ignore the fact that my service was returning a 403, I wasn't providing a token, but that is unrelated to this point. [graph of HTTP status codes over time omitted]

CraigHead commented 3 years ago

I noticed that during an ECS Fargate deployment, service discovery will return an empty array for a short period of time, e.g.:

aws servicediscovery discover-instances --namespace-name my-namespace --service-name MyService

{
    "Instances": []
}
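
A quick way to catch that window is to poll the same call while a deployment is in progress; a minimal sketch reusing the namespace and service names from the command above:

while true; do
  # log how many instances Cloud Map returns every couple of seconds
  count=$(aws servicediscovery discover-instances \
    --namespace-name my-namespace --service-name MyService \
    --query 'length(Instances)' --output text)
  echo "$(date +%T) instances=$count"
  sleep 2
done
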
kiranmeduri commented 2 years ago

@CraigHead can you provide details on your service's configuration, specifically:

  1. desiredCount
  2. maximumPercent
  3. minimumHealthyPercent

I can confirm that when desiredCount=1, maximumPercent=100, minimumHealthyPercent=0 there will be 5xx as expected. Note that this is not a recommended setting to use when working with ECS service discovery.
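
For example, a configuration that keeps the old task running while the new one starts can be applied to an existing service without a new task definition (the cluster and service names are placeholders):

aws ecs update-service --cluster demo --service hello \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100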

CraigHead commented 2 years ago

@kiranmeduri I spoke with my AWS SA and at his request I opened a support case in May. Without going into exhaustive detail, a flaw was found in the integration between CloudMap and API Gateway for ECS service resolution during deployments. That fix was deployed last week and I confirmed HTTP 5xx errors are no longer happening for about 40 seconds AFTER a deployment occurs and ECS stabilizes.

marc-costello commented 2 years ago

This is still a problem (on EC2-backed ECS at least; I haven't tested Fargate). Is there any progress? As it stands it makes this feature unviable for production, which is a real shame 😢

pgeler commented 2 years ago

The same problem exists on Fargate as well.

false-vacuum commented 2 years ago

Any update on this? I have been dealing with this issue on Fargate for years.

Why doesn't ECS service discovery kill the DNS records for draining instances?

pablodiegoss commented 2 years ago

Any update on this? I have been dealing with this issue on Fargate for years.

Why doesn't ECS service discovery kill the DNS records for draining instances?

I can only reply to this question with the very first comment on this issue.

The current behavior is outrageous; it's hard to believe such a flaw exists in an AWS service. It forces us to use an ELB/ALB, which adds unnecessary cost on top of the performance impact.

I've been dealing with this issue on Fargate for a few months and just now discovered that it is an old problem 😠

chrisburrell commented 2 years ago

Is there any chance this can be fixed? It's really the only thing stopping us from using service discovery with ECS.

I assume EKS does its own service discovery, which is why this is such a low priority.

donaltuohy commented 2 years ago

Still facing this issue. We have to move away from service discovery as we can't cycle our instances without errors.

kocou-yTko commented 1 year ago

not fixed yet :(

will3942 commented 1 year ago

This is a blocker for us to continue using service discovery on ECS.

chaudharydeepak commented 1 year ago

Is this still an issue? I was planning on using this feature; guess I'll need to go down the ALB route.

KlemenKozelj commented 1 year ago

The issue was opened in 2018 😢 hopefully a resolution will follow soon... Because of this we also opted for the ECS and private ALB combo.

a0s commented 1 year ago

I don't believe this problem will ever be fixed. This is pushing us to switch to EKS, which is much more expensive.

herrhound commented 1 year ago

@a0s - have you looked at ECS Service Connect? It's an evolution of service discovery and it supports connection draining that helps ensure tasks are not stopped while there are active connections. Additionally, it resolves DNS in the Service Connect Agent, with a rather fast change propagation.

nathanpeck commented 1 year ago

Hello all. We have launched ECS Service Connect which is intended to be a drop-in replacement for Cloud Map DNS based service discovery. You can keep using the same DNS names, but it no longer uses DNS propagation. Instead Service Connect uses an Envoy Proxy sidecar which is configured by monitoring the ECS control plane for task launches and stops. This means there is far less propagation delay. Additionally, the sidecar automatically retries and redirects your request to a different task in the rare case that a task has crashed in the short interval between the sidecar receiving the latest updated config from ECS.

Please try it out and let us know if this solves your problem. Service Connect is designed to feel the same as DNS based service discovery, but overall much more featureful and doesn't have the same DNS propagation timing issues.
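
For anyone evaluating it, here is a minimal sketch of switching an existing service over; the cluster, service, namespace, and port names are placeholders, and "portName" must match a named port mapping in the task definition:

aws ecs update-service --cluster demo --service hello \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "my-namespace",
    "services": [{
      "portName": "http",
      "clientAliases": [{"port": 80, "dnsName": "hello.my-namespace"}]
    }]
  }'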

ranman commented 1 year ago

Is there a workaround for regions that don't have service connect available yet?

herrhound commented 1 year ago

@ranman, no, not at the moment, but we are working on region builds at high priority. Which region(s) do you have in mind?

ranman commented 1 year ago

govcloud (us-gov-west-1)

herrhound commented 1 year ago

Thanks, @ranman. us-gov-west-1 is definitely on our roadmap.

ranman commented 1 year ago

That's great to hear! The question for me is the estimated timeline (which I know is very difficult for AWS to provide), i.e. do I need to invent around this for now, or can I advise my customer to wait a few months and then deploy the new version of their application leveraging Service Connect?

prince367gro commented 1 year ago

Hello all. We have launched ECS Service Connect which is intended to be a drop-in replacement for Cloud Map DNS based service discovery. You can keep using the same DNS names, but it no longer uses DNS propagation. Instead Service Connect uses an Envoy Proxy sidecar which is configured by monitoring the ECS control plane for task launches and stops. This means there is far less propagation delay. Additionally, the sidecar automatically retries and redirects your request to a different task in the rare case that a task has crashed in the short interval between the sidecar receiving the latest updated config from ECS.

Please try it out and let us know if this solves your problem. Service Connect is designed to feel the same as DNS based service discovery, but overall much more featureful and doesn't have the same DNS propagation timing issues.

Are we able to use Service Connect to connect from EC2 or Lambda to ECS services, as we used to do with service discovery?

prince367gro commented 1 year ago

@ranman @nathanpeck

brafales commented 1 year ago

People beware of the "Deployment order" gotcha: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-connect.html

Existing tasks can't resolve and connect to the new endpoint. Only new Amazon ECS tasks that have a Service Connect configuration in the same namespace and that start running after this deployment can resolve and connect to this endpoint. For example, an Amazon ECS service that runs a client application must be redeployed to connect to a new endpoint. Start that deployment after the deployment completes of the server that makes the endpoint that the client connects to.

Deployment order matters. If you have a Service A under servicea.internal and then deploy a Service B under serviceb.internal, Service A will not be able to talk to Service B (a DNS resolution error will occur) until the Service A containers are restarted. I find this behaviour quite irritating, as dynamically creating new ECS services then requires restarting other services if you want them to be able to "discover" the new one. Our use case requires us to dynamically create new test environments under different domains, and we are now forced to restart a whole bunch of consuming services each time we create a new test environment.
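
For what it's worth, the consumer redeploy can at least be forced without a task definition change; a sketch with placeholder names:

aws ecs update-service --cluster demo --service service-a --force-new-deployment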

In my opinion, allowing customers to configure AWS Service Discovery with:

would have solved this use case.

I would rather have seen AWS improve an existing service with what looks like a fairly simple and useful feature, instead of being given a new, "improved" service which comes with other issues and which forces you to update all your existing ECS services to take advantage of it.

imdkbj commented 3 weeks ago

Ah! More than 4 long years and we are still forced to use an ALB.