ricetj closed this 4 years ago
What specifically are we looking to do here? The web servers are in an ASG now, and the worker nodes need performance tuning before we even look at ASG tuning. Is this an improvement ticket, or a ticket to implement auto scaling using some other toolset?
Well, this came from a conversation with @wyattwalter about vets-api not being auto-scaled. We can scale it up manually, but that doesn't happen automatically right now, and automating it may help us in the future.
Looks like I also opened #169, which has a somewhat better description. They are in an ASG, but the current thread limits prevent the scaling from ever kicking in before the service starts having issues. I think just adding some additional threads to puma and enabling this option would fix what we need: https://github.com/department-of-veterans-affairs/devops/blob/master/ansible/deployment/DEPLOYMENT_OPTIONS.md#deployment_add_cpu_scaling_policy
Closing this in favor of the other ticket.
Problem Statement
The vets-api service doesn't have a great auto-scaling story to date. As the platform scales, we should strive for a fairly simple but direct way to auto-scale at this tier. The instances are likely not tuned well individually right now, especially since a lot of traffic can end up sitting idle while waiting on an upstream node.
AC
Notes from the past: Ran into an issue today that highlighted some needs on this:
- we're taking defaults on puma and sidekiq worker threads (puma is 16, sidekiq 25)
- database connection pool size is set to 16 in the shared connection template
- we have metrics via statsd currently for puma thread workers, but not the other two items

Because of these inconsistencies, sidekiq tried to continue using more threads, but seemingly ran out of db connections in the pool. This caused the worker queue to back up until someone manually upped the ASG size to accommodate, and it then worked through the queue pretty quickly.
We need to sync up these values and get metrics in order to effectively auto-scale this particular tier of the app.
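To make the mismatch concrete, here is a minimal sketch of the check implied by the notes above. It is not part of vets-api or the devops repo. It assumes each worker thread may hold one database connection at a time, so a process needs a pool at least as large as its thread count. The names `ProcessConfig` and `pool_shortfalls` are hypothetical; the numbers are the ones quoted in the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessConfig:
    """Hypothetical description of one process type and its thread limit."""
    name: str
    threads: int  # maximum worker threads per process


def pool_shortfalls(pool_size, processes):
    """Return {name: missing_connections} for each process whose thread
    count exceeds the per-process DB connection pool size.

    Assumption: every thread may need its own connection at the same time.
    """
    return {
        p.name: p.threads - pool_size
        for p in processes
        if p.threads > pool_size
    }


# Values reported in the notes above.
DB_POOL_SIZE = 16
PROCESSES = [ProcessConfig("puma", 16), ProcessConfig("sidekiq", 25)]

if __name__ == "__main__":
    for name, missing in pool_shortfalls(DB_POOL_SIZE, PROCESSES).items():
        print(f"{name}: up to {missing} threads may wait for a DB connection")
```

Under that assumption, puma's 16 threads fit the pool of 16, while sidekiq's 25 threads can leave up to 9 waiting, which matches the queue backup described above. A check like this could run at deploy time once the three values are defined in one place.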