This issue confuses me. It seems like two issues:
In terms of those two ideas...
This can indeed be a bit confusing (which is probably compounded by my strong preference for handling everything with Auto Scaling - that's why I mention it everywhere, despite it not being mandatory for every discussed subject ;) - the following is based on information/deductions from Updating AWS CloudFormation Stacks:
I still managed to miss the point of your question, I think, which highlights my admittedly confusing specs in the issue description - a corrective proposal:
I was thinking about splitting up the CloudFormation templates into decomposed components. For example, put the different nodes into their own template with their own WaitCondition, Instance, Alarm.
```
node-common::parameters {
    AvailabilityZone = "us-east-1",
    ClusterName,
    KeyName,
    InstanceProfile = "",
    InstanceType,
    SecurityGroup = "default",
}
```
* node-es-ebs-default
* node-es-ephemeral-default
* node-broker-default
The node templates could be loaded via CloudFormation on their own (tweaked as an Auto Scaling Group/Node whatever), or they could be nested within a smarter, composite template equivalent to the current ppe-cluster template.
* secgrp-multi
* node-es-ebs-default
* node-es-ephemeral-default
* node-broker-default
* index-lag-alarm
* queue-size-alarm
* r53-zone
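To make the first option concrete, here's a rough sketch (using the AWS CLI; bucket, stack, and parameter names are hypothetical and merely mirror the node-common parameters above) of launching one decomposed node template on its own - the composite variant would instead reference the same sub-template URLs from `AWS::CloudFormation::Stack` resources in a parent template:

```bash
# Sketch only - template file, bucket, stack and parameter names are hypothetical.
# Upload the decomposed sub-template so it is reachable via an S3 URL ...
aws s3 cp node-es-ebs-default.template s3://my-template-bucket/node-es-ebs-default.template

# ... then launch it standalone, passing the node-common style parameters.
aws cloudformation create-stack \
  --stack-name es-ebs-node-test \
  --template-url https://s3.amazonaws.com/my-template-bucket/node-es-ebs-default.template \
  --parameters \
    ParameterKey=ClusterName,ParameterValue=ppe-cluster \
    ParameterKey=KeyName,ParameterValue=my-keypair \
    ParameterKey=InstanceType,ParameterValue=m1.large \
    ParameterKey=SecurityGroup,ParameterValue=default
```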
@sopel, what do you think about the approach?
Absolutely, I'd like to explore this path as well - it should make things more manageable and, ideally, more reusable (providing reusable components has been one goal of the - currently somewhat stalled - StackFormation project).
Operating the various templates manually is certainly an option too, but I think it would be preferable (if not required) to retain the 'build in one click' functionality. While this could also be done via a script, that would forgo some of the dependency management (and, as of today, parallel processing) capabilities of CloudFormation. Therefore I'd appreciate an attempt at composite template usage (though, as mentioned, I lack experience as to whether this is feasible/useful for complex composition scenarios).
To be clear about that, I'm certainly open to operating parts of the solution separately; as discussed, not everything is necessarily tied to a particular stack deployment - for example, one might want to deploy canonical sub-domains and/or Elastic IP addresses separately and only associate them with a specific deployment at runtime to allow hot environment switching. I'm mainly aiming for a concerted deployment of the strongly connected tiers.
Considered Done due to goal 1. (CloudFormation stack updates) being available as such.
I have a question about this... I was in the process of testing this and a lot of things didn't go as I expected. I...
1) created and waited for a new CloudFormation stack based off ci-r53,
2) started a logstash task from my local machine to continuously push logs,
3) verified logs were going through,
4) initiated an Update Stack (via the web console) and changed `ElasticsearchEbs0InstanceType` from the default `m1.large` to `c1.medium`.
Behavior confused me when... `/mnt/app-data` was missing, causing normal startup to fail with the missing symlink; I was expecting a nice, easy stack update... but it wasn't so. Sad.
@sopel, can you tell me any more, in case I'm missing some knowledge or understanding? Or, if you have any free time, try the process yourself and see if it matches up.
Actually, I'm foolish and have to recant much of that. I think I touched the `InstancePostScript` parameter, which affects the metadata, which I presume would cause restarts. Don't waste time on this; I'll try and test this again more cleanly tomorrow. Whatever the case, there are already some additional changes that this prompts.
@dpb587 - either way, your questions highlight that this isn't actually done yet, thanks for pointing that out (slight misunderstanding regarding you setting this to Done and me only doing partial tests rather than the obvious one you attempted now); while I expect some of the points you mention to be caused by touching the `InstancePostScript` indeed, others, like the instance IP addresses mismatching with `/app/.env`, are probably systemic and might require integration of on-instance update handling via `cfn-hup` hooks, which is one of the more complex aspects hinted at above.
Whether we should actually do the latter (vs. just documenting how to manually adjust things in case) depends on your findings tomorrow, which I'll await accordingly before touching this again ;)
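For reference, a minimal sketch of what such on-instance update handling could look like, written as provisioning shell - the stack/region variables, the `ElasticsearchEbs0Instance` resource name and the `reconfigure.sh` script are hypothetical placeholders:

```bash
# Sketch only: have cfn-hup poll for stack metadata changes and re-run
# configuration after an update. STACK_NAME/AWS_REGION are assumed to be
# provided by the provisioning environment.
mkdir -p /etc/cfn/hooks.d

cat > /etc/cfn/cfn-hup.conf <<EOF
[main]
stack=${STACK_NAME}
region=${AWS_REGION}
interval=5
EOF

cat > /etc/cfn/hooks.d/instance-reconfigure.conf <<EOF
[instance-reconfigure]
triggers=post.update
path=Resources.ElasticsearchEbs0Instance.Metadata
action=/app/bin/reconfigure.sh
runas=root
EOF

# restart cfn-hup so the new hook is picked up (init mechanism may differ per AMI)
/etc/init.d/cfn-hup restart
```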
Logstash DNS Caching Issue: https://logstash.jira.com/browse/LOGSTASH-760
@dpb587 - nice find, I wasn't aware of this JVM default, Jordan Sissel's comment is very much to the point:
> Can you try disabling the jvm's dns caching? The default is to cache forever, which is a horrible tragedy of a default.
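For reference, a sketch (not verified against this deployment) of two ops-side ways to cap the JVM's DNS cache - the first assumes the logstash service wrapper forwards `JAVA_OPTS` to the JVM, and the `java.security` path is a placeholder that varies by JVM packaging:

```bash
# Sketch only - assumes an Oracle/OpenJDK-style JVM.

# Option 1: Sun-specific system property, passed at JVM launch
# (only works if the service script forwards JAVA_OPTS to java).
export JAVA_OPTS="$JAVA_OPTS -Dsun.net.inetaddr.ttl=60"

# Option 2: the standard security property, set JVM-wide by appending to the
# JRE's java.security file (exact path depends on the JVM installation).
echo 'networkaddress.cache.ttl=60' >> /path/to/jre/lib/security/java.security
```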
I've been able to scale up es-ebs, memory-wise, and it works automatically. Frontend/backend realize the new endpoints within two minutes without manual intervention and log events continue flowing.
When scaling down elasticsearch nodes, memory-wise, they start having problems - for example, going from an `m1.large` to a `c1.medium`. It seems to be related to the hard-coded `ES_HEAP_SIZE` in `/app/.env`, which is written during provisioning based on the instance type mapping. By changing it manually and restarting the service, it comes back up without a problem. The following is the error log message:
```
Error occurred during initialization of VM
Could not reserve enough space for object heap
```
When scaling es-ephemeral, it always runs into issues due to `/mnt/app-data` being non-existent. That directory is created during the initial provisioning, but for a scaled/new instance it's no longer present.
When scaling the broker, it works automatically. As long as the local tunnel continues to re-attempt connection to the DNS name, it picks up the change within two minutes and the local shipper re-attempts failed messages.
So, the known problems are now:

* `ES_HEAP_SIZE` is written during provisioning so it's configurable. I propose that instead of writing that config, our logsearch Rakefile dynamically calculates and exports it (based on ~45% of installed memory) if it's not already defined (rough sketch below).
* `$APP_DATA_DIR` needs to be checked, creating the directory if it doesn't exist and accounting for both directories and symlinks.

@dpb587 - thanks for the detailed tests/summary, an interesting experience - your proposed refactorings/workarounds sound good to me!
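For illustration, a rough shell sketch of the two adjustments proposed above (the real change would presumably live in the logsearch Rakefile / startup scripts):

```bash
# Sketch only - the actual logic would live in the Rakefile / startup scripts.

# 1) Derive ES_HEAP_SIZE (~45% of installed memory, in MB) unless already set.
if [ -z "$ES_HEAP_SIZE" ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  export ES_HEAP_SIZE="$(( total_kb * 45 / 100 / 1024 ))m"
fi

# 2) Ensure $APP_DATA_DIR exists, handling both plain directories and symlinks
#    (e.g. a symlink pointing at /mnt/app-data on ephemeral storage).
if [ -L "$APP_DATA_DIR" ]; then
  mkdir -p "$(readlink -f "$APP_DATA_DIR")"
elif [ ! -d "$APP_DATA_DIR" ]; then
  mkdir -p "$APP_DATA_DIR"
fi
```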
@dpb587 - with regard to the `networkaddress.cache.ttl` issue and my comment 3924156, OpenJDK does indeed seem to default to caching for 30 seconds as long as a security manager is not set, at least as per the (fairly random) diff of changeset 5543 in OpenJDK 7:
```diff
+# The Java-level namelookup cache policy for successful lookups:
+#
+# any negative value: caching forever
+# any positive value: the number of seconds to cache an address for
+# zero: do not cache
+#
+# default value is forever (FOREVER). For security reasons, this
+# caching is made forever when a security manager is set. When a security
+# manager is not set, the default behavior in this implementation
+# is to cache for 30 seconds.
+#
+# NOTE: setting this to anything other than the default value can have
+#       serious security implications. Do not set it unless
+#       you are sure you are not exposed to DNS spoofing attack.
+#
+#networkaddress.cache.ttl=-1
```
The encountered and documented behavior would thus imply that a security manager is in place, which I doubt? Or are the former tests too disparate/inconclusive to deduce this?
I created a ci-r53 stack, shipping local logs to it once complete. I ran the following sequence of stack updates, waiting between each step to ensure everything came back online:
1) `ElasticsearchEphemeral0InstanceType`: `m1.large` → `c1.medium`
2) `Broker0InstanceType`: `c1.medium` → `m1.large`
3) `ElasticsearchEbs0InstanceType`: `m1.large` → `m1.xlarge`
4) `ElasticsearchEphemeral0InstanceType`: `c1.medium` → `m1.medium`
5) `Broker0InstanceType`: `m1.large` → `m1.small`
6) `ElasticsearchEbs0InstanceType`: `m1.xlarge` → `c1.medium`
All components recovered automatically and without errors. Depending on instance types (i.e. slower instances took slightly longer), it averaged about 150 seconds of "downtime" (i.e. real-time data not streaming) until everything was streaming through in a timely manner. It's great when things work happily.
@sopel, very interesting findings. I re-tested my earlier remarks by reverting the TTL commit on the broker of the stack I was just using, restarting the `app-logstash_redis` service, and initiating a DNS-changing update. It seemed to work. I can only suppose that I must not have pushed my change in 57df1f8 to S3 by the time I started that test stack (which has higher DNS TTLs than I was waiting for... but I'm pretty sure I waited ~15 minutes before assuming the failure). If I don't think of an alternative explanation by morning, I'll revert ac5a634 since I can't seem to reproduce it.
@dpb587 - that's great news, thanks for the thorough tests! This means we can now scale vertically on demand, which is a major achievement already :)
Let's see whether we can extend that to automatic horizontal scaling as well via #39 - due to the stack complexity there are probably some subtleties involved ...
This details #115 and depends on #116 (and probably #39) - it should be possible to update the cluster with a single command by means of updating the stack via the existing CFN template, e.g. to adjust instance sizes or replace unhealthy instances.
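For example, a single-command resize could look roughly like this (the stack name is hypothetical, the parameter names are the ones used in this thread, and every parameter that isn't being changed has to be carried over, e.g. via UsePreviousValue):

```bash
# Sketch only: resize the EBS-backed Elasticsearch node by updating the stack
# in place, reusing the existing template and all other parameter values.
aws cloudformation update-stack \
  --stack-name my-logsearch-cluster \
  --use-previous-template \
  --parameters \
    ParameterKey=ElasticsearchEbs0InstanceType,ParameterValue=m1.xlarge \
    ParameterKey=ElasticsearchEphemeral0InstanceType,UsePreviousValue=true \
    ParameterKey=Broker0InstanceType,UsePreviousValue=true
```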