Analyze/Implement Auto Scaling for Elasticsearch

sopel commented 10 years ago

This has been extracted from #149: Being a distributed restful search and analytics service, Elasticsearch is obviously built to scale horizontally out of the box.

While this is easy to achieve via manually scaling, it isn't without challenges to do so in a highly dynamic or even auto scaling fashion - e.g. Rotem Hermon argues Auto scaling doesn't make a lot of sense with ElasticSearch:

Shard moving and re-allocation is not a light process, especially if you have a lot of data. It stresses IO and network, and can degrade the performance of ElasticSearch badly. (If you want to limit the effect you should throttle cluster recovery using settings like cluster.routing.allocation.cluster_concurrent_rebalance, indices.recovery.concurrent_streams, indices.recovery.max_size_per_sec . This will limit the impact but will also slow the re-balancing and recovery).

This seems to highlight that we need to further refine the goals first (see #149 for a high level summary), insofar it isn't (necessarily) desired to facilitate Auto Scaling for its utmost goal of frequently scaling in and out automatically, rather use the Auto Scaling platform to easily apply manual adjustments at appropriate times or with slowly but ever increasing load (we are doing this already to scale vertically by changing the instance type via CoudFormation for example - currently this implies some downtime of course).

dpb587 commented 10 years ago

I agree with Hermon in that scaling elasticsearch is not and should not be as dynamic as our, for example, logstash parsers. I do think we can still use auto scaling concepts to at least start out, but any auto-scaling behavior should be restricted to manual operation or long-term trending (unlike our 5 minute alarms for logstash). As I understand it, there are a couple levels that we can scale elasticsearch:

data - elasticsearch data nodes are primarily responsible for storing and retrieving data
- primary scaling metric:
- disk % usage - very low suggests we scale-in, very high suggests scale-out
- disk I/O - very low suggests we scale-in, very high suggests scale-out
- automation blockers:
- rebalancing shards does indeed require a lot of network and disk I/O; that would probably need to be manual, at least to start out until we gather better parameters. Given our primarily quiet usage of prior days however, this is probably not as difficult as in most setups with mostly hot indices.
master / client - I'm not entirely sure the difference between these two operation modes aside from "master" contains the global cluster state and "client" is responsible for aggregating results from multiple shards; neither needs to house the data itself. I believe all masters are clients but not all clients are masters. I don't know the distinguishing metrics, but for at least "clients":
- primary scaling metric:
- cpu - used to aggregate results together and calculate true results
- automation blockers: none

In architecture diagrams, I've frequently seen masters as the endpoint for pushing data into the cluster, and masters or clients for getting data out. I think scaling masters is more important with larger or frequently changing cluster nodes?

All this discussion about auto-scaling is moot though until we discuss our backup strategies. We currently cannot scale beyond two nodes due to our simplistic snapshot and recovery techniques. By switching to 3+ nodes, data is no longer guaranteed to live on a single disk.

sopel commented 10 years ago

@dpb587 - thanks for the analysis/summary; I'm trying to wrap my brain around the (partially evolving) Elasticsearch architecture and scaling concepts:

Disclaimer

As per the opening statement and the length of my analysis below, I might very well be entirely mistaken about some aspects discussed here, or just not see the wood for all the trees ;)

Preface

As well known and illustrated on the ES Overview, ES clusters are resilient and built to scale horizontally out of the box - accordingly we need to analyze what aspects of our solution are blocking us from fully utilizing that potential right now (apparently).

Horizontal scaling is supposed to address different but overlapping aspects in different ways (emphasis mine):

capacity by increasing the number of primary shards - relevant aspects:
- You can specify fewer or more primary shards to scale the number of documents that your index can handle. You cannot change the number of primary shards in an index, once the index is created.
- this implies scaling out the capacity (and persistence layer) being conceptually impossible, because one needs a new index in case, i.e. effectively requiring to replace a cluster instead (see 4. below for more on this)
performance/resilience by increasing the number of replicas - relevant aspects:
- By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
- this would allow to add replica shards via auto scaling indeed

Questions/Comments

0) Is there any good high level overview of the ES architecture aside from what's promoted on the overview page?

1) Regarding master/client you presumably refer to what is touched within master election (or are there any other references)?

2) It appears to me that it might be reasonable to start with splitting data and http nodes for separation of concerns - if I understand this correctly, this would imply easing of auto scaling aspects already, insofar these nodes would perform relevant parts of the search operations still:

The benefit of using that is first the ability to create smart load balancers. These "non data" nodes are still part of the cluster, and they redirect operations exactly to the node that holds the relevant data. The other benefit is the fact that for scatter / gather based operations (such as search), these nodes will take part of the processing since they will start the scatter process, and perform the actual gather processing.

This relieves the data nodes to do the heavy duty of indexing and searching, without needing to process HTTP requests (parsing), overload the network, or perform the gather processing.

3) Speaking of load balancers, these are also the correct approach to address the topics of handling fixed IP addresses and scaling the reverse proxy (as mentioned in https://github.com/cityindex/logsearch/issues/149#issuecomment-30308645 and https://github.com/cityindex/logsearch/issues/271#issuecomment-30436499).

this obviously requires clients to register their IP address with the load balancer automatically, which is trivial to to when adding an Elastic Load Balancer to our using Auto Scaling groups (see e.g. Auto Scaling Group with LoadBalancer, Auto Scaling Policies, and CloudWatch Alarms)

4) If I read things correctly, ES used to have the concept of shared gateways, which would have made full cluster recovery/replication easier by depending on the infinitely scalable S3 backend for example, but those are all deprecated in favor of the local gateway, which implies that full cluster recovery/replication (of an actually distributed cluster, which we do not use yet) requires a copy of each nodes local storage, making some things more complex in fact?:

Note, to backup/snapshot the full cluster state it is recommended that the local storage for all nodes be copied (in theory not all are required, just enough to guarantee a copy of each shard has been copied, i.e. depending on the replication settings) while disabling flush. Shared storage such as S3 can be used to keep the different nodes' copies in one place, though it does comes at a price of more IO.

Do you happen to know the reasoning behind deprecating such a helpful concept as S3 backups (aside from the mentioned I/O and thus cost) or am I missing the point somehow?

Either way, this seems to hint on #272 being the better approach eventually, i.e. rather than recovering/replicating a cluster from an existing persisted index, we could deploy a new cluster and ingest all currently relevant messages again to create a new index from scratch (would also enable zero downtime aka blue green deployments)?

Preliminary Conclusion

We seem to be currently using ES very conservatively, almost like a database with a single read replica only (also evidenced by only ever scaling up rather than out so far). The reason for this seems to be relying on the one backing EBS volume being the single source of truth regarding persistence, which defeats the distributed aspects of ES and thus our ability to scale.

:question: Writing this, I wonder whether it is actually possible to take that to the extreme and configure ES to use a single node with primary shards only, yet have multitude of read replica nodes (not that I think this is smart, just an interesting edge case)? Section shared allocation filtering seems to suggest it is ...
:interrobang: It seems to be possible to (semi-) automatically scale out the query middle tier (documented) and the read replica search backend (only if isolation from primary shards would indeed be possible) one way or another, addressing one of two performance issues on file (namely #217)?!
:exclamation: Either way we apparently need to address our persistence concept differently to remove the implied blocker for scaling out (rather than just up as before)!

dpb587 commented 10 years ago

You cannot change the number of primary shards in an index, once the index is created.

While technically possible to recreate an index with more shards, it is quite inconvenient. The "shard" aspect only refers to how "scaled" the index can be. For example, we currently use 4 shards which means that index can be split across, at most, 4 instances of elasticsearch. This excludes replicas which could be on an additional 4 separate instances. If we decide we want more scaling, it is still true that it's easiest to expand the primary shard number to new indices only.

0) ES architecture 1) master/client

You might find the following articles from my browser history interesting:

I agree and is something I mentioned in my scaling blog post a while ago; it would simplify DNS requirements if we're letting a local elasticsearch node discover itself via security groups.

3) load balancer

Not sure if you're referring to kibana or elasticsearch traffic. Elasticsearch traffic can be "load balanced" by using a local http elasticsearch instance as in (2) if that's what you were referring to. It's an alternative to managing active elasticsearch IPs

4) shared gateways

Shared has been deprecated due to the I/O requirements. Here's the summary issue created for it. I don't agree with #272 necessarily being the correct solution to this. It would be much more inefficient and much slower than recovering from a snapshot of the elasticsearch state.

We seem to be currently using ES very conservatively...

Agreed.

single primary shard node, multitude replica nodes

Possible, yes. Useful only as long as we'd be able to scale the two classes separately since eventually the primary shard could get overloaded having too much writing vs reading.

scale out query middle tier

Yes, either by elasticsearch client/proxy instances, or an HTTP load balancer to elasticsearch

sopel commented 10 years ago

@dpb587 - Always funny how one might cycle around himself over time, I've actually explored the local vs. shared gateway aspects myself in https://github.com/cityindex/logsearch/issues/25#issuecomment-20009594 already, linking that very issue you referenced as well ;)

the comment also references a few alternatives, amongst them the new Snapshot/Restore API meanwhile started via https://github.com/elasticsearch/elasticsearch/issues/3826 - incidentally/ironically phase I will only support shared file system repository and S3 repository ...

And thanks for the pointer/reminder regarding your thorough and insightful Labs Team scaling post (assuming you are referring to 'LogSearch: thoughts for possible improvements'), which I always planned to circle back to after my vacation, but never did (other than reading it) - if I had done this, I could have saved me (and you) most of the time for reading/discussing this, sorry for the repetition ...

sopel commented 10 years ago

As discussed in https://github.com/cityindex/logsearch-config/issues/59, we are going to tackle this via separate issues for separation of concerns, see above(/below).

dpb587 commented 10 years ago

Merged with #309.

cityindex-attic / logsearch