Closed by dpb587 10 years ago
I'd appreciate whatever feedback you might have on the PR description or implementation. Particularly @sopel, if you have any feedback from the AWS/Auto-scaling perspective.
I also prefer the Dedicated Instance option. Let's go for that.
Great progress on the Auto Scaling topics :)
Regarding the DNS topic, I'd pretty much prefer using an ELB, because it provides such an easy integration with Auto Scaling and CloudWatch (and is one less component to be concerned with). We are pretty dependent on AWS anyway right now, with CloudFormation and Auto Scaling in particular, and while all of it could in theory be replaced by custom solutions, that would be a lot of currently unjustified work (IMHO); ironically, the ELB would be the tier most easily replaced by something else (e.g. a proxy filling the same role).
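For what it's worth, the ELB / Auto Scaling integration really is just a matter of the group referencing the load balancer. A minimal sketch, assuming illustrative resource names (FrontendELB, FrontendGroup, FrontendLaunchConfig) that are not taken from our actual templates:

```json
{
  "FrontendELB": {
    "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties": {
      "AvailabilityZones": { "Fn::GetAZs": "" },
      "Listeners": [
        { "LoadBalancerPort": "80", "InstancePort": "80", "Protocol": "HTTP" }
      ]
    }
  },
  "FrontendGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
      "AvailabilityZones": { "Fn::GetAZs": "" },
      "LaunchConfigurationName": { "Ref": "FrontendLaunchConfig" },
      "LoadBalancerNames": [ { "Ref": "FrontendELB" } ],
      "MinSize": "1",
      "MaxSize": "4"
    }
  }
}
```

Instances launched by the group register with the ELB automatically, and its DNS name stays stable as the group scales.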
That being said, a separate dedicated instance is something I've contemplated already as well, which means your preferred solution might be reused for other things too; here's why (this also applies to the EBS volume topic explored in detail by @dpb587 in his two internal RFCs):
I had already mentioned CloudFormation Custom Resources before, which are special AWS CloudFormation resources that provide a way for a template developer to include resources in an AWS CloudFormation stack that are provided by a source other than Amazon Web Services.
Their usage has become much easier with the advent of the aws-cfn-resource-bridge, a custom resource framework for AWS CloudFormation. It was introduced at re:Invent and documentation is still pretty sparse, but the following resources provide sufficient details:
Of the latter 5 aws-cfn-custom-resource-examples, the following 3 are particularly noteworthy and likely useful down the road:
The latter two require a custom resource backend running somewhere - not surprisingly CloudFormation templates are included, but the service should also be runnable elsewhere conceptually, e.g. as a Cloud Foundry app.
@dpb587 - I reckon this framework and your scripting skills might yield quite a few new options for tailoring CloudFormation usage here and elsewhere; I think custom resources are heavily underutilized so far.
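To make that a bit more concrete, a custom resource declaration in a template boils down to something like the following minimal sketch; the Custom::ElasticsearchEndpoint type and the CustomResourceTopic SNS topic are hypothetical names, and the ServiceToken just points at whatever topic the aws-cfn-resource-bridge backend is listening on:

```json
{
  "ElasticsearchEndpoint": {
    "Type": "Custom::ElasticsearchEndpoint",
    "Properties": {
      "ServiceToken": { "Ref": "CustomResourceTopic" },
      "StackName": { "Ref": "AWS::StackName" }
    }
  }
}
```

The backend handles the create/update/delete requests and can report attributes back, which other resources in the stack can then consume via Fn::GetAtt.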
Regardless of my own opinion, there are differing views here, and this is probably something we should discuss and decide tomorrow. I'm really not heavily against ELB, but I did have an additional question about it: with a dedicated instance, it can have nginx on port 80 and elasticsearch on 9200, with only 80 being public. Would ELB require multiple balancers to achieve that split security? Additionally, with one ELB I think it'd require putting kibana/nginx back on the individual elasticsearch servers (which ultimately isn't that big of a deal).
If you want to load balance port 9200 as well, it is indeed not possible to achieve split security with an ELB, at least not in EC2-Classic. It might be possible in a VPC by means of closing a port in the additional Network ACLs filtering layer, but I haven't explored that yet (see also Deploy Elastic Load Balancing in Amazon VPC) - anyway, we are not there yet VPC-wise.
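Conversely, on a dedicated instance the split is a plain security group affair; a rough sketch (the ElasticsearchGroup reference is illustrative), opening port 80 publicly for nginx while keeping 9200 reachable only from the elasticsearch nodes' own group:

```json
{
  "RouterSecurityGroup": {
    "Type": "AWS::EC2::SecurityGroup",
    "Properties": {
      "GroupDescription": "nginx public on 80, elasticsearch HTTP on 9200 internal only",
      "SecurityGroupIngress": [
        { "IpProtocol": "tcp", "FromPort": "80", "ToPort": "80", "CidrIp": "0.0.0.0/0" },
        { "IpProtocol": "tcp", "FromPort": "9200", "ToPort": "9200",
          "SourceSecurityGroupName": { "Ref": "ElasticsearchGroup" } }
      ]
    }
  }
}
```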
Either way, I didn't want to promote switching to ELBs, given you have a working solution and both you and @mrdavidlaing are in favor of a dedicated instance (which might get reused for custom resource backends), just wanted to record my preference for posterity ;)
Regarding Kibana I maintain that I'd rather run it as a separate tier for separation of concerns, preferably just from S3 (plus CloudFront eventually), it's an entirely static website after all (see #271) - but that's really for another day (i.e. only relevant once all the complex/costly stuff is addressed).
Ah, thanks for clarifying.
For me the clincher is that with ElasticSearch acting as the router on the dedicated instance, new and retired nodes will be detected more quickly (in theory). So I'm still for the original dedicated instance option.
It's probably also worth noting that nginx has become an important part of our HTTP pipeline into ES; it has turned out to be a surprisingly useful place to restrict / modify ES traffic (e.g. the `?timeout=10` addition, blocking of `_all` searches, etc).
That's correct, and Nginx is frequently used as the frontend/proxy/edge tier like so; it's worth noting, though, that the current implementation via `node-kibana-default.template` once again introduces a single point of failure and load contention (albeit one likely to be less fragile than the backends, of course).
I maintain that single dedicated instances simply have no place in a truly elastic, cloud-ready architecture (at least for the application tier; persistence is the tricky part, of course), so that should be an Auto Scaling group instead (even if it only runs a single instance for whatever reason; scaling a stateless proxy is comparatively trivial).
@dpb587 - Shouldn't we be able to run the ES instance in proxy mode via an Auto Scaling group too (at least that's what I've figured from previous conversations around the ES scaling architecture, see last paragraph in https://github.com/cityindex/logsearch/issues/270#issuecomment-30753943)?
It's true that this retains a SPOF, but I believe it's less fragile than our current setup. In theory, the node-kibana-default instance could be run as an auto-scaling group quite trivially (see the sketch below).
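A minimal sketch of what that could look like (assuming a hypothetical KibanaLaunchConfig launch configuration; the names aren't from the current templates):

```json
{
  "KibanaGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
      "AvailabilityZones": { "Fn::GetAZs": "" },
      "LaunchConfigurationName": { "Ref": "KibanaLaunchConfig" },
      "MinSize": "1",
      "MaxSize": "1"
    }
  }
}
```

With MinSize and MaxSize pinned to 1 it's still a single instance, but Auto Scaling would replace it automatically if it became unhealthy.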
I think Hipchat's ES experience on AWS favours the "Dedicated ES master / router" instance we're proposing too.
Specifically, from http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html:
> Automated failover on Amazon is problematic because of the network unreliability issues. In a cluster it can cause an election to occur errantly.
> Ran into this problem with ElasticSearch. Originally had 6 ES node running as master electable. A node would run out of memory or hit a GC pause and on top of that a network loss. Then others could no longer see the master, hold an election, and declare itself a master. Flaw in their election architecture that they don’t need a quorum. So there’s a brain split. Two masters. Caused a lot of problems.
> Solution is to run ElasticSearch masters on dedicated nodes. That’s all they do is be the master. Since then there have been no problems. Masters handle how shard allocation is done, who is primary, and maps where replica shards are located. Makes rebalancing much easier because the masters can handle all the rebalancing with excellent performance. Can query from any node and will do the internal routing itself.
At the core, we can now use AWS Auto Scaling to manage our elasticsearch cluster. Instead of duplicating hefty chunks of code in our main CloudFormation template to add a node, we now simply update a single parameter to adjust the size of the cluster. While it's technically easy to automatically add or remove a node, it's logistically difficult to automatically rebalance the data across a new cluster size (so that's still a manual optimization step). It's pretty fun to be able to quickly add a new node and see it magically appear on an elasticsearch dashboard. The result is a new CloudFormation template called `group-elasticsearch-default.template`.
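To illustrate the single-parameter point, here's a hedged sketch of the shape of that group; the names (ClusterSize, ElasticsearchLaunchConfig) are illustrative and not necessarily what `group-elasticsearch-default.template` actually uses:

```json
{
  "Parameters": {
    "ClusterSize": {
      "Type": "Number",
      "Default": "2",
      "Description": "Number of elasticsearch nodes in the Auto Scaling group"
    }
  },
  "Resources": {
    "ElasticsearchGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": "" },
        "LaunchConfigurationName": { "Ref": "ElasticsearchLaunchConfig" },
        "MinSize": { "Ref": "ClusterSize" },
        "MaxSize": { "Ref": "ClusterSize" },
        "DesiredCapacity": { "Ref": "ClusterSize" }
      }
    }
  }
}
```

Resizing the cluster is then just a stack update with a new ClusterSize value; rebalancing the shards afterwards remains the manual step mentioned above.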
This auto-scaling does present a new problem, however. Previously, our main CloudFormation template created DNS records which pointed to one of the elasticsearch nodes. With an auto-scaling group, those IPs are no longer directly accessible from within CloudFormation. Basically, we need a reliable elasticsearch endpoint for:
I thought of a couple alternatives:
I picked the Dedicated Instance option as my preferred implementation though because:
The result is a new CloudFormation template called `node-kibana-default.template` and a minor adjustment to disable kibana on the auto-scaling template. I'm not a huge fan of adding an additional node for cost reasons, which is why I'm pushing for the related discussion on creating reliable spot instances and their significant cost savings.