matthewduren opened this issue 6 years ago
The current behavior is outrageous; it's hard to believe such a flaw exists in an AWS service. It is forcing us to use an ELB/ALB and to take on unnecessary cost, besides the performance impact.
There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, and for the task to be kept around longer than just the TTL: requests can still come in right at the end of the TTL, and there would not be enough time to process them.
Right. When a task starts draining, the Route 53 record should be removed immediately. After the TTL has elapsed, the normal SIGTERM signal should be sent to the container, followed by SIGKILL 30 seconds later if the task is still up, just like tasks that don't use service discovery behave.
Was hoping to avoid using an LB and this came to mind; sad to see it's an unresolved issue :(
I'm hitting the same issue.
I am building infrastructure for gRPC services using ECS Fargate and its service discovery feature, without an ELB. Communication between services goes through Envoy proxies, and each Envoy listener is reached via ECS service discovery.
I got gRPC "unavailable" errors while updating the service. The Envoy that forwards a request to another Envoy can lose all of its upstream connections, since there is a window in which it only knows the old containers' IP addresses, which are already dead from SIGTERM. As a workaround I configured the DNS TTL to a very short value such as 3s, but I still got errors for a short period of time (about 10 seconds).
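For reference, the TTL lives on the Cloud Map service's DNS record, so the workaround amounts to something like the sketch below (the namespace ID and service name are placeholders, not from this thread):

```
# Register a Cloud Map service whose A record has a very short TTL (3s).
# ns-xxxxxxxxxxxxxxxx and my-grpc-service are placeholder values.
aws servicediscovery create-service \
  --name my-grpc-service \
  --namespace-id ns-xxxxxxxxxxxxxxxx \
  --dns-config "NamespaceId=ns-xxxxxxxxxxxxxxxx,RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=3}]"
```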
I hope that the issue will be resolved.
I hope that the issue will be resolved too :(
+1.
FYI: if you set minimum healthy to 0% and maximum to 100%, i.e. stop everything before starting new instances, your service is unreachable for several minutes due to negative DNS caching. I've been experimenting with a service that I really only ever want one instance of running; this is what a restart looks like:
| Component | Event | Time (seconds since stop) |
|---|---|---|
| task | stop | 0 |
| task | start | 22 |
| dns | gone | 26 |
| service | listening | 35 |
| ecs | ready | 82 |
| dns | back | 270 |
For roughly 4 minutes the service is ready to accept connections, but DNS returns NXDOMAIN. So don't try to use Service Discovery for this purpose. Also note that the VPC DNS resolver does not honor the 24h TTL set in the SOA record of the service discovery DNS zone; you cannot change that TTL anyway, so I guess we should be happy the resolver ignores it and the service isn't unreachable for 24 hours.
Thought I'd mention this caveat here since this is where I ended up while researching service discovery TTLs.
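If you want to watch that negative-caching window yourself, a minimal sketch (assuming a placeholder name myservice.local, run from an instance or task inside the VPC so queries hit the VPC resolver):

```
# Poll the service discovery name every 5 seconds and log what the VPC resolver returns.
# An empty answer during the window corresponds to the NXDOMAIN described above.
while true; do
  echo "$(date +%T) $(dig +short myservice.local A | tr '\n' ' ')"
  sleep 5
done
```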
We are facing the same issue, any updates on this?
thanks 😄
> We are facing the same issue, any updates on this?
> thanks 😄
The issue still exists.
> There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, and for the task to be kept around longer than just the TTL: requests can still come in right at the end of the TTL, and there would not be enough time to process them.
In addition to this, the instance's health status should change to "unhealthy" so that API-based lookups do not see the instance as healthy, similar to the "deregistration delay" in target groups. Also discussed here: https://github.com/aws/containers-roadmap/issues/473
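For what it's worth, DiscoverInstances already accepts a health-status filter; a minimal sketch (namespace and service names are placeholders), which would only help once Cloud Map actually marks draining tasks as unhealthy:

```
# Ask Cloud Map for healthy instances only; today a draining task may still be returned here.
aws servicediscovery discover-instances \
  --namespace-name my-namespace \
  --service-name MyService \
  --health-status HEALTHY
```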
Related:
Having the same issue with our gRPC server in an ECS service with service discovery. We also use Spot instances for the service. The gRPC clients cannot call the gRPC server when there is a Spot instance interruption, even though ECS spawned a new task before the current task stopped.
Hope this issue will be fixed soon.
We are facing the same issue :(
We are facing the same issue :)
Reading through the linked issue, that bug is related to not respecting TTLs. The bug we fixed in ECS was an ordering issue where some tasks may be stopped before new tasks are actively visible in DNS.
We are facing the same issue. I created a tool that lets us graph the behavior. Basically, I've seen the HTTP 503 errors show up AFTER ECS has finished deploying the new tasks and the old tasks have been shut down. The Y-axis in the graph below is the HTTP status code. Ignore the fact that my service was returning a 403; I wasn't providing a token, but that is unrelated to this point.
I noticed that during an ECS Fargate deployment, service discovery will return an empty array for a short period of time, e.g.:
```
aws servicediscovery discover-instances --namespace-name my-namespace --service-name MyService
{
    "Instances": []
}
```
@CraigHead can you provide details on your service's configuration specifically?
I can confirm that with desiredCount=1, maximumPercent=100, minimumHealthyPercent=0 there will be 5xx errors, as expected. Note that this is not a recommended configuration when working with ECS service discovery.
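For comparison, the commonly recommended setting keeps old tasks running until their replacements are up; a minimal sketch with placeholder cluster and service names:

```
# Allow up to 200% capacity during a deployment and keep 100% healthy,
# so old tasks are not stopped before their replacements are running.
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100
```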
@kiranmeduri I spoke with my AWS SA and, at his request, opened a support case in May. Without going into exhaustive detail, a flaw was found in the integration between Cloud Map and API Gateway for ECS service resolution during deployments. That fix was deployed last week, and I confirmed that HTTP 5xx errors are no longer happening in the roughly 40-second window AFTER a deployment occurs and ECS stabilizes.
This is still a problem (at least with the EC2 launch type; I haven't tested Fargate). Is there any progress? As it stands, it makes this feature unviable for production, which is a real shame 😢
The same problem exists on Fargate as well.
Any update on this? I have been dealing with this issue on Fargate for years.
Why doesn't ECS service discovery kill the DNS records for draining instances?
> Any update on this? I have been dealing with this issue on Fargate for years.
> Why doesn't ECS service discovery kill the DNS records for draining instances?
I can only reply to this question with the very first comment on this issue.
> The current behavior is outrageous; it's hard to believe such a flaw exists in an AWS service. It is forcing us to use an ELB/ALB and to take on unnecessary cost, besides the performance impact.
I've been dealing with this issue on Fargate for a few months and just now discovered that it is an old problem 😠
Is there any chance this can be fixed? It's really the only thing stopping us from using service discovery with ECS.
I assume EKS does its own service discovery, which is why this is such a low priority.
Still facing this issue. We have to move away from service discovery as we can't cycle our instances without errors.
not fixed yet :(
This is a blocker for us to continue using service discovery on ECS.
Is this still an issue? I was planning on using this feature - guess I will need to go down the ALB route.
The issue was opened in 2018 😢 hopefully a resolution will follow soon... Because of this we also opted for the ECS and private ALB combo.
I don't believe this problem will ever be fixed. This is pushing us to switch to EKS, which is much more expensive.
@a0s - have you looked at ECS Service Connect? It's an evolution of service discovery and it supports connection draining that helps ensure tasks are not stopped while there are active connections. Additionally, it resolves DNS in the Service Connect Agent, with a rather fast change propagation.
Hello all. We have launched ECS Service Connect which is intended to be a drop-in replacement for Cloud Map DNS based service discovery. You can keep using the same DNS names, but it no longer uses DNS propagation. Instead Service Connect uses an Envoy Proxy sidecar which is configured by monitoring the ECS control plane for task launches and stops. This means there is far less propagation delay. Additionally, the sidecar automatically retries and redirects your request to a different task in the rare case that a task has crashed in the short interval between the sidecar receiving the latest updated config from ECS.
Please try it out and let us know if this solves your problem. Service Connect is designed to feel the same as DNS based service discovery, but overall much more featureful and doesn't have the same DNS propagation timing issues.
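For anyone evaluating it, here is a rough sketch of what enabling Service Connect on a service looks like; the cluster, namespace, task definition, and the named container port "http" are all placeholders, and the port mapping in the task definition must carry that name:

```
# Clients in the same namespace reach this service as myservice.internal:80
# through the Service Connect proxy rather than via DNS propagation.
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-service \
  --task-definition my-task:1 \
  --desired-count 2 \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "my-namespace",
    "services": [
      {
        "portName": "http",
        "discoveryName": "myservice",
        "clientAliases": [ { "port": 80, "dnsName": "myservice.internal" } ]
      }
    ]
  }'
```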
Is there a workaround for regions that don't have service connect available yet?
@ranman, no, not at the moment, but we are working on region builds at high priority. Which region(s) do you have in mind?
govcloud (us-gov-west-1)
Thanks, @ranman. us-gov-west-1 is definitely on our roadmap.
That's great to hear! The question for me is the estimated timeline (which I know is very difficult for AWS to provide), i.e. do I need to work around this for now, or can I advise my customer to wait a few months and then deploy the new version of their application leveraging Service Connect?
> Hello all. We have launched ECS Service Connect which is intended to be a drop-in replacement for Cloud Map DNS based service discovery. You can keep using the same DNS names, but it no longer uses DNS propagation. Instead Service Connect uses an Envoy Proxy sidecar which is configured by monitoring the ECS control plane for task launches and stops. This means there is far less propagation delay. Additionally, the sidecar automatically retries and redirects your request to a different task in the rare case that a task has crashed in the short interval between the sidecar receiving the latest updated config from ECS.
> Please try it out and let us know if this solves your problem. Service Connect is designed to feel the same as DNS based service discovery, but overall much more featureful and doesn't have the same DNS propagation timing issues.
Are we able to use Service Connect to connect from EC2 or Lambda to ECS services, as we used to do with service discovery?
@ranman @nathanpeck
People beware of the "Deployment order" gotcha: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-connect.html
> Existing tasks can't resolve and connect to the new endpoint. Only new Amazon ECS tasks that have a Service Connect configuration in the same namespace and that start running after this deployment can resolve and connect to this endpoint. For example, an Amazon ECS service that runs a client application must be redeployed to connect to a new endpoint. Start that deployment after the deployment completes of the server that makes the endpoint that the client connects to.
Deployment order matters. If you have a Service A under servicea.internal and then deploy a Service B under serviceb.internal, Service A will not be able to talk to Service B (a DNS resolution error will occur) until the Service A containers are restarted. I find this behaviour quite irritating, as dynamically creating new ECS services then requires restarting other services if you want them to be able to "discover" the new one. Our use case requires us to dynamically create new test environments under different domains, and now we are forced to restart a whole bunch of consuming services each time we create a new test environment.
In my opinion, allowing customers to configure AWS Service Discovery with:
would have solved this use case.
I would rather have seen AWS improve an existing service with what looks like a fairly simple and useful feature, instead of being given a new, "improved" service that comes with other issues and forces you to update all your existing ECS services to take advantage of it.
Ah! More than 4 long years and we are forced to use ALB.
@nathanpeck How do we solve the use case of an EC2 VM that wants to talk to an ECS container that uses service discovery? ECS Service Connect only works with ECS clusters; regular EC2 VMs cannot use it.
If ECS could be told to first update the DNS, then wait for the service discovery DNS TTL, before bringing any containers in a deployment down, it would solve so many problems.
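Until that exists, one partial mitigation is to keep the container alive past the record TTL after SIGTERM, so lookups cached before deregistration can still complete. A hedged sketch of such an entrypoint is below; the app path and the 20-second delay are placeholders, the delay should be at least the TTL, and the task's stopTimeout must be larger than the delay. This does nothing for clients that resolve the stale record right at the end of the TTL.

```
#!/bin/sh
# Entrypoint sketch: delay shutdown past the service discovery TTL on SIGTERM.

/usr/local/bin/my-app &       # placeholder for the real service process
app_pid=$!

term_handler() {
  sleep 20                    # >= the DNS record TTL; must stay below ECS stopTimeout
  kill -TERM "$app_pid"       # now let the app shut down gracefully
}
trap term_handler TERM

wait "$app_pid"               # interrupted if SIGTERM arrives
wait "$app_pid"               # reap the app after the delayed forwarding above
```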
Summary
When updating a service or otherwise scaling out ECS tasks for a service that uses Service Discovery, tasks are being stopped before reaching the TTL of the service discovery record(s).
Description
When updating a service or otherwise scaling out ECS tasks for a service that uses Service Discovery, tasks are being stopped before reaching the TTL of the service discovery record(s).
To reproduce: create an ECS service from a simple "hello world" type task definition that runs forever and does nothing. Set minimum healthy to 100, maximum to 200, and desired count to 1. Set up service discovery and create a DNS record with a long TTL, say 300s. Update the service to use a new revision of the task definition (no changes to the task definition are needed), and note that the old tasks are stopped before the TTL has elapsed.
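A rough CLI sketch of those reproduction steps, with all names, IDs, and ARNs as placeholders and launch type/network configuration omitted for brevity:

```
# 1. Cloud Map service with a 300s A record.
aws servicediscovery create-service \
  --name hello \
  --namespace-id ns-xxxxxxxxxxxxxxxx \
  --dns-config "RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=300}]"

# 2. ECS service registered against it (100% min healthy, 200% max, desired count 1).
aws ecs create-service \
  --cluster my-cluster \
  --service-name hello \
  --task-definition hello-world:1 \
  --desired-count 1 \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100 \
  --service-registries registryArn=arn:aws:servicediscovery:us-east-1:123456789012:service/srv-xxxxxxxxxxxxxxxx

# 3. Force a new deployment and compare when old tasks stop vs. when their records disappear.
aws ecs update-service --cluster my-cluster --service hello --force-new-deployment
```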
Expected Behavior
ECS Agent should remove the Route 53 record(s) and then wait until the TTL duration has elapsed before stopping the tasks.
Observed Behavior
ECS Agent does not wait any additional time when stopping tasks for services that use service discovery.
Environment Details
Supporting Log Snippets