codesuki / ecs-nginx-proxy

Reverse proxy for AWS ECS. Lets you address your docker containers by sub domain.
MIT License

nginx 503 temporarily unavailable #12

Closed by julieproductops 7 years ago

julieproductops commented 7 years ago

Thanks for creating this project.

I am using ecs-nginx-proxy deployed in our dev environment ecs cluster with about 77 tasks running. Intermittently, various users report getting the nginx 503 temporarily unavailable error message.

I have been using a load test with 100 clients hitting container URLs over a 15-second period in order to reliably reproduce this and dig into some logs. While I was troubleshooting, I would periodically check the AWS console page for ECS to look at the task definition for the proxy, and got "Failed to describe private-ecs-nginx-proxy - Rate exceeded".

While examining the code for ecs-gen, it became clear that describeTaskDefinition is being used for each request in order to route to the correct container (because that is where you would find the VIRTUAL_HOST env var, of course). But it seems that if the API returns an error, the code returns nil and nothing is logged. I have a few questions:

a) Have you ever run into this rate limit before in your usage or testing?
b) Do you have any plans to produce some logging for troubleshooting in ecs-gen :) ?
c) Can ecs-gen perhaps cache the list of task definitions for a short period of time so that we can avoid hitting this limit?

It's a bit annoying that AWS does not publish these rate limits, so I realize it's hard to gauge how long to wait before invalidating the cache. Perhaps it can be a config variable.

Many thanks, Julie

codesuki commented 7 years ago

Hi Julie,

Great find, and happy to hear you are using the project successfully.

a) I haven't been limited by the rate limit yet, so this is very helpful info to make ecs-gen more stable.
b) Yes, that absolutely needs to be in there.
c) I definitely want to implement that. I am swamped with other work, but I will try to find some spare minutes to look at the problem.

Thanks for the report and the solution approach!

codesuki commented 7 years ago

I just had a look at the code. Every error path I followed ends in a log call. Maybe the AWS SDK returns some info in the response value that I am not checking. This needs more investigation.

julieproductops commented 7 years ago

Hi codesuki, I did look deeper into the code. You are printing the errors that occur, and eventually I did find them in the logs. We were able to mitigate the AWS rate limit issue by simply using the --frequency param (modifying the CMD in the Dockerfile that launches ecs-gen). So, thank you for providing that option! I also noticed a --run-once option, with which we could cobble together a solution that kicks off ecs-gen on demand, e.g. when a new deployment occurs.

But the above are workarounds. You could implement the exponential back-off algorithm that AWS recommends (which sounds like a pain...). I would consider my issue closed.

codesuki commented 7 years ago

Glad you fixed your problem. Thanks for the feedback. I am planning to implement some kind of quota handling in the near future.