aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release, and operate production-ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

Issue with Service Discovery when updating backend services #3287

Closed ampiy closed 1 year ago

ampiy commented 2 years ago

I have a load balanced service with nginx container running as service (say nginx-app) and backend service nodejs app (say BES_APP). The nginx configuration is pointed to the service discovery of backend service. Everything works fine initially and the BES_APP paths are accessible via internet.

However, when the backend service is deployed with new changes, the paths become unavailable from the internet. But if I execute a curl command against the service discovery path from the nginx app, it is able to reach the backend service. It only becomes unreachable from the internet.

If I force deploy the nginx app, everything starts working fine again. So every time I update and deploy the backend service, I have to redeploy the nginx service as well. This becomes a huge problem once we start adding multiple backend services.

Any ideas what is causing the issue, and how to fix it?

The version of copilot I'm using is v1.15.0

huanjani commented 2 years ago

Hello, @ampiy! Would you mind pasting the manifest for your backend service here so we can try to solve this? Thank you!

ampiy commented 2 years ago

> Hello, @ampiy! Would you mind pasting the manifest for your backend service here so we can try to solve this? Thank you!

Hi @huanjani,

Here is the backend service manifest file:

huanjani commented 2 years ago

Hi, @ampiy. Thank you for sending that.

We think what you're experiencing may be due to a known issue with Service Discovery (see https://github.com/aws/containers-roadmap/issues/343) where the TTL is not respected before stopping old tasks 🙇🏼 .

Until that problem is resolved, I wonder if it's possible to configure nginx to use a smaller TTL than 10s (https://aws.amazon.com/blogs/containers/load-balancing-amazon-ecs-services-with-a-kubernetes-ingress-controller-style-approach/) with the valid=1s directive so that route53 is queried more frequently to resolve the endpoint 🤔 .
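For illustration only, a minimal sketch of what that nginx change could look like, assuming a hypothetical service discovery name `backend.test.app.local` and the VPC DNS resolver at `10.100.0.2` (neither taken from the actual manifest). Proxying through a variable makes nginx re-resolve the name per request (subject to `valid=`) instead of caching the IP it resolved at startup:

```nginx
# Sketch only -- names, addresses, and ports are illustrative.
server {
    listen 80;

    # VPC DNS resolver; refresh cached answers every second instead of
    # honoring the record's 10s TTL.
    resolver 10.100.0.2 valid=1s;

    location /api/ {
        # A variable in proxy_pass forces nginx to resolve at request time
        # via the resolver above, rather than once when the config is loaded.
        set $backend http://backend.test.app.local:10000;
        proxy_pass $backend;
    }
}
```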

Alternatively, maybe you can place a private APIGW in front of the backend service using addons/: https://github.com/g-grass/aws-copilot-backend-without-natgateway/blob/main/resources/copilot/backend-api/addons/apigateway.yml#L36-L74 so that there is a reliable endpoint.

Please let us know if either of those works. We're sorry about this inconvenience!

ampiy commented 2 years ago

Hi @huanjani,

I think the resolver's TTL was the problem. I added a resolver 10.100.0.2 valid=5s block to the configuration of the nginx gateway app (the load balanced web service Copilot app) and this works. Thanks a lot!!!


The valid value could probably be increased to even 60s; the only downside would be that whenever the backend services update, there might be up to 60 seconds of unreachability for the updated backend services.

ampiy commented 2 years ago

Update: it doesn't work. I set it to even valid=1s but still see the same issue.

Lou1415926 commented 2 years ago

I wonder if adding an API Gateway in front of the service would get around the TTL issue. Here is an example. The hope is that perhaps APIGW is able to handle the case when the service discovery destination is unreachable.

On a side note, how long is your endpoint unreachable between deployments?

I am sorry for the churn!

ampiy commented 2 years ago

I ended up using another EC2 instance with a webhook server, just to redeploy the nginx container every time there's an update in any of the backend services.

Create GitHub webhooks (for all backend service repos) --> webhook receiver server on EC2 --> on any push to the deployment branch --> start a forced redeployment of the nginx container after a 6-minute delay.

Additionally, all Copilot services have their own AWS CodePipeline pipelines, so the delay before the nginx redeployment is based on those pipelines.
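The core of that delayed redeploy step might look roughly like the sketch below; the 6-minute delay comes from the description above, while the service/environment names and the use of `copilot svc deploy --force` as the forced-redeploy command are assumptions:

```sh
#!/bin/sh
# Rough sketch of the webhook-triggered step described above.
# Service and environment names are hypothetical.
sleep 360   # wait ~6 minutes for the backend's pipeline to finish
copilot svc deploy --name nginx-app --env test --force   # force-redeploy the nginx gateway
```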

The pitfalls are:

Lou1415926 commented 2 years ago

Hello @ampiy!! I apologize in advance for this lengthy response. I am sorry for the difficulty that you have right now (4-5 min unreachability, 12-20 min update time), and I'd like to help you as much as I can, as well as understand the issue behind this. The response is lengthy because I want to provide enough clarity so that we can hopefully remove the pitfalls you described as soon as possible.


I tried to reproduce the issue by setting up a backend service named backend and an LBWS named frontend. They used the service discovery endpoint for communication.

Contrary to our expectation (knowing that there is this resolver TTL issue), my LBWS was able to consistently receive responses from the backend service, even during deployment. To further understand this, we did the experiment below.


From the frontend container, I ran a script that runs dig +short backend.test.app.local every second and logs the output to a file. Then I redeployed the backend service, waited until everything was stable, and looked at the log file to see how the IP info returned by the dig command changed during the deployment.
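The polling script was roughly along these lines (a sketch; only the dig command itself is from the comment above, the log path is arbitrary):

```sh
#!/bin/sh
# Log the A records for the service discovery name once per second.
while true; do
  echo "$(date -u +%H:%M:%S) $(dig +short backend.test.app.local | tr '\n' ' ')" >> /tmp/dig.log
  sleep 1
done
```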

Let's say the old backend task's IP is 10.0.1.1, and the new task's IP is 10.0.0.20.

Below are my observations:

  1. Around 20 seconds after the new task started, both 10.0.1.1 and 10.0.0.20 were returned from dig.
  2. Within 30 seconds (just a rough estimate; it could be a lot shorter) after the old task starts de-provisioning, 10.0.1.1 disappeared from the output, and I received only 10.0.0.20 from dig.

Also, there is this short description of how the DNS resolver chooses between 10.0.1.1 and 10.0.0.20.

This explains why my application worked. Between the moment when the old task de-provisions and the moment when 10.0.1.1 is removed from the DNS resolver (say it's 10 seconds, because we use 10s for the TTL), the DNS resolver:

  1. Tries 10.0.1.1
  2. Finds out that it failed
  3. Turns to try 10.0.0.20
  4. Succeeds because 10.0.0.20 is the new running task.

Given the ⬆️ experiments, we have a guess:

Missing Health Check

I see that you don't have a health check for your backend service. Without a health check, the agent doesn't know when the task should be considered steady; the result is that the agent could add the new task's IP to Route53 before the task should be considered steady.

In the ⬆️ example, that'd mean 10.0.0.20 may have been added to the record before it's ready to handle any requests. The result would be that the DNS resolver fails to route requests to either 10.0.1.1 or 10.0.0.20 until 10.0.0.20 finally becomes stable. How long it takes for the new task to become stable would depend on your backend implementation.

Suggestion

Would you mind adding a health check for your backend service and see if it helps?
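As an illustration, a container health check in a Copilot backend service manifest could look something like the sketch below (the port, path, and timings are placeholders, not taken from the actual manifest):

```yaml
# copilot/backend/manifest.yml (excerpt) -- values are illustrative.
image:
  build: Dockerfile
  port: 10000
  healthcheck:
    command: ["CMD-SHELL", "curl -f http://localhost:10000/ || exit 1"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s   # grace period before failures count, so the app has time to boot
```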


Besides that, I have a clarifying question:

> It only becomes unreachable from the internet.

By this ⬆️ , did you mean that the LBWS wasn't able to send requests to the backend through service discovery?


Again, I am sorry that this is lengthy and hard to read 😢 From the experiments, though, I believe we see a possibility of fixing the issue you've encountered.

Thank you and sorry for the churn!

ampiy commented 2 years ago

Hi @Lou1415926. Thank you for the detailed reply and for taking the time to analyze the issue.

Your suggested solution 1 -> Missing Health Check: I tried adding a healthcheck but can't seem to find the proper healthcheck command. My backend service is at nodeapp.test.test-copilot-app.local:10000, so I tried the following:

- command: ["CMD-SHELL", "curl -f http://localhost:10000 || exit 1"] 
- command: ["CMD-SHELL", "curl -f nodeapp.test.test-copilot-app.local:10000 || exit 1"] 
- path: '/'

All of these yielded failed deployments; I'm unable to get the right healthcheck command for backend services.

To give a better understanding of the problem: I have separate repos for the frontend (nginx-gateway) and the backend (nodeapp). Here are the sample repos:

Frontend (nginx-gateway): https://github.com/ampiyofficial/aws-copilot-nginx-gateway-service
Backend (nodeapp): https://github.com/ampiyofficial/aws-copilot-nodeapp-microservice

Whenever I update the nodeapp backend service's repo, the backend service becomes unavailable from the frontend. The only way to solve this is to force redeploy the frontend.

When I exec into the frontend and perform a dig, it yields the updated IP of the backend, but for some reason the LBWS can't seem to pick up the backend services' new IP.

Steps to recreate the problem:

- Deploy the backend service from the nodeapp repo [https://github.com/ampiyofficial/aws-copilot-nodeapp-microservice]
- Deploy the frontend service from the nginx-gateway repo [https://github.com/ampiyofficial/aws-copilot-nginx-gateway-service]
- Make some changes to the backend service and deploy it
- The backend will become unreachable (the only way to solve this is to force redeploy the frontend service)

Please let me know how to solve this multi-repo deployment problem.

ampiy commented 2 years ago

Turns out the hacky way is to reload the nginx configuration via copilot exec. That solves the issue.
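For anyone landing here, that workaround might look roughly like this (service and environment names are hypothetical):

```sh
# Reload the nginx configuration inside the running load balanced web service task.
copilot svc exec --name nginx-app --env test --command "nginx -s reload"
```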

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no response activity. Remove the stale label, add a comment, or this will be closed in 14 days.

github-actions[bot] commented 1 year ago

This issue is closed due to inactivity. Feel free to reopen the issue if you have any further questions!