Research Scheduled Scaling Strategies

dpb587 commented 10 years ago

Research and test different strategies for scaling during business hours while maintaining data.

dpb587 commented 10 years ago

With some thought and research, I think this is how we should approach it. First the core concepts:

- Allocation Awareness - this lets us set "properties" on each node that elasticsearch will respect when allocating and balancing shards. Initially I thought we should have two properties (zone and schedule), but I think our scaling operations will be more efficient with a single property whose values are daytime, fulltime_euwest1a, and fulltime_euwest1b. Elasticsearch will avoid allocating copies of the same shard to multiple nodes that share the same "awareness properties". I don't think we should care where the daytime nodes are running, but the full-time nodes should be mirrored across AZs.
- Dynamic Replica Configuration - we can dynamically change how many replicas we want the cluster to keep as scaling needs change. Because the replica count always matches what we expect, we can continue to monitor the green/yellow/red statuses that elasticsearch reports.
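To make those two concepts a bit more concrete, here is a minimal sketch of the configuration they imply, assuming the single-property variant. The attribute name `placement` is hypothetical (the thread does not name the property), and the `node.<attribute>` form in elasticsearch.yml is the one used by older Elasticsearch releases. The functions only build the config line and request bodies; they do not talk to a cluster.

```python
import json

# Hypothetical attribute name; the thread does not name the single property.
AWARENESS_ATTRIBUTE = "placement"

# Values proposed in the thread for the single property.
PLACEMENT_VALUES = ("daytime", "fulltime_euwest1a", "fulltime_euwest1b")


def node_config_line(value):
    """Return the elasticsearch.yml line that tags a node with its placement."""
    if value not in PLACEMENT_VALUES:
        raise ValueError("unknown placement value: %r" % (value,))
    return "node.%s: %s" % (AWARENESS_ATTRIBUTE, value)


def awareness_settings():
    """Body for PUT /_cluster/settings that enables awareness on the attribute."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": AWARENESS_ATTRIBUTE
        }
    }


def replica_settings(replicas):
    """Body for PUT /_settings that changes the replica count of all indices."""
    if replicas < 0:
        raise ValueError("replica count must not be negative")
    return {"index": {"number_of_replicas": replicas}}


if __name__ == "__main__":
    print(node_config_line("daytime"))
    print(json.dumps(awareness_settings()))
    print(json.dumps(replica_settings(3)))
```

Keeping the replica count as an explicit, dynamic index setting is what allows cluster health to stay meaningful: the cluster only reports green when every requested replica is actually assigned.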

Putting those concepts together, we'll specify cluster.routing.allocation.awareness.attributes as zone,schedule and we'll inject those properties into each node. During non-business hours I think we should continue with our 2-replica setting (keeping data replicated across AZs). About an hour before our scale-up deadline, we can update the settings to request 3 replicas and set cluster.routing.allocation.disable_allocation to true. Then we can start up the extra nodes; once they're all online, we can set cluster.routing.allocation.disable_allocation back to false to let them sync up and chat about who gets to handle the third replica. At the end of business, we scale down by disabling allocation again, terminating the nodes, updating the replica setting back down to 2, and re-enabling allocation.
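The scale-up and scale-down sequences described above can be sketched as ordered lists of steps. This is only an illustration under a few assumptions: the helper names are hypothetical, starting and terminating nodes is represented by a placeholder step because it happens outside Elasticsearch, and `apply_step` assumes a cluster reachable over plain HTTP at a base URL you supply. The setting name is the one used in this thread; later Elasticsearch versions replaced it with cluster.routing.allocation.enable.

```python
import json
import urllib.request

DISABLE_ALLOCATION = "cluster.routing.allocation.disable_allocation"


def allocation_step(disabled):
    """Step that pauses (True) or resumes (False) shard allocation."""
    return ("PUT", "/_cluster/settings", {"transient": {DISABLE_ALLOCATION: disabled}})


def replicas_step(count):
    """Step that sets the replica count on all indices."""
    return ("PUT", "/_settings", {"index": {"number_of_replicas": count}})


def node_step(action):
    """Placeholder for work done outside Elasticsearch (e.g. by Auto Scaling)."""
    return ("NODES", action, None)


def scale_up_plan(replicas=3):
    """Order from the thread: raise replicas, pause, add nodes, resume."""
    return [
        replicas_step(replicas),
        allocation_step(True),
        node_step("start daytime nodes and wait until all have joined"),
        allocation_step(False),
    ]


def scale_down_plan(replicas=2):
    """Order from the thread: pause, remove nodes, lower replicas, resume."""
    return [
        allocation_step(True),
        node_step("terminate daytime nodes"),
        replicas_step(replicas),
        allocation_step(False),
    ]


def apply_step(base_url, step):
    """Send one Elasticsearch step; node steps must be handled elsewhere."""
    method, path, body = step
    if method == "NODES":
        raise NotImplementedError(path)
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Expressing each sequence as data makes the ordering easy to review and test before anything is pointed at a real cluster.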

These procedures won't necessarily be lightweight operations, but this is something we're interested in at this time, so we can at least give it some time and experimentation.

sopel commented 10 years ago

I haven't fully contemplated your proposal yet, and this would only address the schedule aspect, but incidentally AWS has just announced that CloudFormation supports Auto Scaling Scheduled Actions (long overdue actually, as they have existed for quite a while already):

AWS CloudFormation now supports Auto Scaling scheduled actions [...]

[...] With support for scheduled actions, you can now model Auto Scaling schedules in CloudFormation templates. If you have a predictable traffic pattern, you can scale Auto Scaling groups using scheduled actions. We have created a sample template to show you how.

Given my mindset and my preference for handling everything from within the available self-contained automation layer, my thinking once again tends towards inversion of control:

- handle the schedule via CloudFormation's Auto Scaling support, so all scheduling and scaling dynamics use the same layer, both for execution and for documentation
- push/pull the additional configuration via CloudFormation's Mustache template support, so only the modifiable parameters (the Mustache context) would be exposed as CloudFormation parameters and everything
- orchestrate the allocation/replication via on-instance scripts (obviously the most complex part)
  - alternatively/additionally, parts of this could eventually be done via CloudFormation Custom Resources, if need be or more appropriate. I'm aware that's a ceterum censeo of sorts, so please disregard until I insist (and argue) that it is in fact a superior solution for a specific issue ;)
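As a rough idea of what the first point could look like, the sketch below builds a CloudFormation template fragment with two `AWS::AutoScaling::ScheduledAction` resources. Everything specific in it is an assumption for illustration: the logical IDs, the `DaytimeGroup` Auto Scaling group, the capacity of 3, and the cron times (in UTC) are all hypothetical, and the scale-up time would need to sit about an hour before business hours as discussed above.

```python
import json


def scheduled_action(group_logical_id, recurrence, capacity):
    """Build one scheduled action that pins the group to a fixed capacity."""
    return {
        "Type": "AWS::AutoScaling::ScheduledAction",
        "Properties": {
            "AutoScalingGroupName": {"Ref": group_logical_id},
            "MinSize": capacity,
            "MaxSize": capacity,
            "DesiredCapacity": capacity,
            # Cron expression, evaluated in UTC by Auto Scaling.
            "Recurrence": recurrence,
        },
    }


def daytime_schedule(group_logical_id="DaytimeGroup", daytime_nodes=3):
    """Template fragment: scale up on weekday mornings, down in the evening."""
    return {
        "Resources": {
            # Hypothetical times; adjust to the real business hours.
            "DaytimeScaleUp": scheduled_action(
                group_logical_id, "0 6 * * 1-5", daytime_nodes
            ),
            "DaytimeScaleDown": scheduled_action(
                group_logical_id, "0 18 * * 1-5", 0
            ),
        }
    }


if __name__ == "__main__":
    print(json.dumps(daytime_schedule(), indent=2, sort_keys=True))
```

Note that the schedule only changes how many instances exist; the allocation and replica changes would still have to be orchestrated separately, for example by the on-instance scripts mentioned in the last point.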

Orchestrating the allocation/replication via on-instance scripts might be just what you had in mind though? Let's discuss this during the upcoming hangout ...

dpb587 commented 10 years ago

I've created a PR (#326) with the relevant code changes. Currently I consider the implementation to be primarily a "proof of concept" - I've tested it with smaller datasets.

cityindex-attic / logsearch

Research Scheduled Scaling Strategies #313