Open · muzfuz opened this issue 6 months ago
Thanks for this request. We'd like to look into the behavior you saw more thoroughly. Could you share your support case ID so we can dig into your specific setup?
@kshivaz thank you for looking at this. The case ID is 171276418200173.
Thanks @muzfuz.
This issue unfortunately defeats the purpose of using service connect.
Case ID 171998743200404 shows that CloudMap also doesn't check the health check result.
@kshivaz I can confirm the same is happening randomly for us during scale in/out or restarts
@bmariesan : Please open a support case with the details of your environment / configuration and error logs, so our team can look into it.
We face the same issue. We have multiple applications (mostly Java and JRuby) deployed that communicate via service connect. During container startup, we frequently see requests hitting a task whose application container is not ready yet.
As a way to prevent this from occurring, an additional container can be added to the task definition with a dependency on the application container to be marked as HEALTHY (this means that there must be a health check defined for the application container). The container should be marked non-essential and designed to exit.
This works because ECS only transitions a task to the RUNNING state once every container in the task has started, so the dependency keeps the task from reaching RUNNING until the application container is healthy.
I tested this approach using a container which intentionally sleeps for 60 seconds before starting the webserver process, plus an additional non-essential alpine container. Without the additional container, 503s are returned as expected during a deployment; with it, no 503s are observed.
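A note on the "designed to exit" part: the stock alpine image works here because its default command (/bin/sh) exits immediately when no TTY is attached, so the extra container starts, terminates, and never consumes resources. If you prefer to make that explicit, a minimal sketch (the container name, image tag, and command below are illustrative placeholders, not part of the original suggestion):

{
   "name":"startup-hold",
   "image":"public.ecr.aws/docker/library/alpine:latest",
   "essential":false,
   "command":["sh","-c","exit 0"],
   "dependsOn":[
      {
         "containerName":"<---NAME OF THE MAIN CONTAINER--->",
         "condition":"HEALTHY"
      }
   ]
}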
Thanks @jenademoodley for posting the workaround!
We (thanks to @rishabhpar) have also validated that this workaround works and is effective. Please find the workaround guideline below.
Steps to mitigate the problem:
- Identify the container definition that has the long spin-up time in the task definition.
- Add a container health check to the identified container definition, for example:
  "healthCheck":{
     "command":[
        "CMD-SHELL",
        "curl -f http://localhost/ || exit 1"
     ],
     "interval":30,
     "timeout":5,
     "retries":3,
     "startPeriod":60
  }
  This is adjustable to your preferences.
- Add a second container to the list of container definitions. This is a dummy container designed to exit and not consume resources; it will only spin up once the main container is healthy. See the dependsOn section:
  {
     "name":"serviceconnecthold",
     "image":"public.ecr.aws/docker/library/alpine:edge",
     "cpu":0,
     "portMappings":[],
     "essential":false,
     "environment":[],
     "environmentFiles":[],
     "mountPoints":[],
     "volumesFrom":[],
     "dependsOn":[
        {
           "containerName":"<---NAME OF THE MAIN CONTAINER--->",
           "condition":"HEALTHY"
        }
     ],
     "systemControls":[]
  }
- Register the new task definition revision.
- Update the service with the new task definition revision (a CLI sketch follows below).
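One way to run those last two steps from the AWS CLI (the file, cluster, and service names here are placeholders; adjust to your environment):

aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task-family

Updating the service to the new revision kicks off a deployment, which is when you can check that the 503s are gone.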
I can also confirm that the workaround above works like a charm
Also confirm this workaround works. Would be great if AWS could implement a proper fix though
@muzfuz and all, is the service-connect container being marked as "unhealthy" in your case?
I have several ECS tasks running on EC2 using ECS service connect for internal communication. Sometimes, during new deployments, the ECS service connect container linked to these tasks becomes unhealthy, preventing the deployment from succeeding. This issue doesn't occur with every deployment.
These ECS tasks are GPU-based and take some time to start. I don't have any health check configured for the task definitions.
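If you try the workaround above, the slow-starting container will need a health check defined for the dummy container to depend on. A sketch of what that could look like for a slow-starting GPU task (the endpoint, port, and timings are assumptions; startPeriod accepts values up to 300 seconds, which helps cover model load time):

"healthCheck":{
   "command":[
      "CMD-SHELL",
      "curl -f http://localhost:8080/health || exit 1"
   ],
   "interval":30,
   "timeout":5,
   "retries":3,
   "startPeriod":300
}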
I can confirm this is very much still an issue, cc @thiagoscodelerae.
I'll use the above mentioned ~hack~ workaround for now, thank you @jenademoodley 😄
Tell us about your request
Service Connect does not support application health checks. This means it attempts to route traffic to containers before they're ready.
We would like Service Connect to have configurable health checks similar to ALBs, or to respect the Docker healthchecks which are configured in the task definition.
Which service(s) is this request for?
Fargate - specifically Service Connect options.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We run several "big" services which have a long startup time (10 to 60 seconds). These services communicate privately using Service Connect.
We noticed that we were getting served 503s during deploys or container restarts.
After some back and forth with AWS Support we were able to establish the following sequence of events:
I received the following guidance on this from AWS Support:
From our POV we would like one of two things to be true here.
The fact that it is currently simply routing traffic to a task as soon as the Envoy sidecar becomes healthy means we need to do some pretty aggressive retries in the client applications, which works to paper over the cracks but can still lead to failure.
Are you currently working around this issue?
Yes. A combination of aggressive retries and long Docker health checks has proven effective.
We received the following guidance from AWS Support:
This solution "works" but is merely a sticking plaster - it can still lead to failed requests and needlessly extends deploy / restart times.
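For what it's worth, this is the shape of the retry workaround, shown with curl purely as an illustration (a real client would use its HTTP library's retry/backoff settings; the URL is a placeholder):

curl --retry 5 --retry-delay 2 --retry-all-errors http://my-service.internal/

curl already treats HTTP 503 as a transient, retryable error with --retry; --retry-all-errors additionally retries connection failures.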