Closed: dpb587 closed this issue 10 years ago.
Manually: SSH into the instances, run `crontab -e`, comment out the relevant entries with '#', then save and exit.
mrdavidlaing
which user?
dpb587
Or at the end of the post-provision script, something like: `touch /tmp/empty && sudo -H -u ubuntu crontab /tmp/empty`
The user is `ubuntu`.
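For reference, roughly what that workaround looks like on an instance (the address is a placeholder):

```sh
# the metric-pushing jobs live in the ubuntu user's crontab on each instance
ssh ubuntu@<instance>

# either comment out the relevant entries interactively...
crontab -e    # prefix the CloudWatch push lines with '#', then save and exit

# ...or wipe the ubuntu user's crontab entirely (e.g. from the post-provision script)
touch /tmp/empty && sudo -H -u ubuntu crontab /tmp/empty
```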
I think this stems from us not having a concept of "production" vs "test".
Agreed that the lack of a clear separation between 'environments' is the underlying issue here - we have already considered that for various aspects of the stack, and e.g. https://github.com/cityindex/logsearch/issues/161#issuecomment-23940725 introduces a corresponding naming template from a collectd/Librato point of view, which should be flexible enough.
That template is mostly sufficient for both separating and correlating related metrics across environments within Librato, because one can use wildcards to match metric substrings and thus include or exclude related metrics as needed.
Handling it in a similar way for CloudWatch would solve the related issues there, I think. However, it does not solve the StatusPage issue, I'm afraid: the integration is in fact based on Librato metrics, but unlike Librato itself, StatusPage doesn't seem to provide substring/wildcard matching for metrics; rather, it requires a dedicated metric, which we would lose with this approach.
Consequently, some sort of deployment-triggered automatic reconfiguration seems to be in order (so this is somewhat related to #147). I'm not sure yet which component is the best candidate for that, but my gut feeling is to keep clearly identified, separate metrics as the 'source of truth' and to apply dedicated customizations only where needed for derived things like the StatusPage metrics.
So... afternoon hackathon to make a StatusPage alternative with graphite as the backend where we can lookup metrics with wildcards? :)
Sounds good conceptually :) (if only there were enough spare afternoons for that kind of fun ;)
As an interim workaround we might also take advantage of another significant, not yet properly covered new CloudFormation feature, namely Condition Declarations, which allow writing a template that creates a resource or assigns a property value only if a specified condition is met:

> At stack creation or stack update, AWS CloudFormation evaluates all the conditions in your template before creating any resources. Any resources that are associated with a true condition are created. Any resources that are associated with a false condition are ignored.

Not surprisingly, their sample template includes an `EnvType` input parameter where you can specify `prod` to create a stack for production or `test` to create a stack for testing. We could likewise create the metric for one environment only, or keep one continuous metric for production and a distinct one for every other deployment (the former might still not be desirable, e.g. regarding the dangling alarms - just to complete the picture here).
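A minimal sketch along the lines of the AWS sample, just to illustrate the mechanism - the `EnvType` parameter mirrors their example, and the namespace, metric name and alarm properties below are purely illustrative, not what our templates actually use:

```json
{
  "Parameters": {
    "EnvType": {
      "Type": "String",
      "Default": "test",
      "AllowedValues": ["prod", "test"],
      "Description": "Illustrative parameter mirroring the AWS sample template"
    }
  },
  "Conditions": {
    "CreateProdResources": { "Fn::Equals": [{ "Ref": "EnvType" }, "prod"] }
  },
  "Resources": {
    "ElasticsearchDocsAlarm": {
      "Type": "AWS::CloudWatch::Alarm",
      "Condition": "CreateProdResources",
      "Properties": {
        "AlarmDescription": "Illustrative alarm, only created when EnvType=prod",
        "Namespace": "logsearch",
        "MetricName": "ElasticsearchDocsCount",
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold"
      }
    }
  }
}
```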
How about an afternoon hackathon to publish graphite metrics into StatusPage via their "Submit data for a custom metric" API call - see http://doers.statuspage.io/api/v1/metrics/, near the bottom.
Or CloudWatch metrics. Extra points if you can get StatusPage to take over maintaining it :)
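Roughly what such a bridge might boil down to - a hypothetical sketch only; `PAGE_ID`, `METRIC_ID`, `API_KEY` and the `fetch_metric_somehow` helper are placeholders, and the exact endpoint and parameter names should be verified against the StatusPage docs linked above:

```sh
#!/bin/sh
# Hypothetical graphite/CloudWatch -> StatusPage bridge sketch.
# PAGE_ID, METRIC_ID and API_KEY are placeholders; fetch_metric_somehow stands
# in for however the value would actually be looked up.
VALUE=$(fetch_metric_somehow)

curl -s -X POST \
  "https://api.statuspage.io/v1/pages/${PAGE_ID}/metrics/${METRIC_ID}/data.json" \
  -H "Authorization: OAuth ${API_KEY}" \
  -d "data[timestamp]=$(date +%s)" \
  -d "data[value]=${VALUE}"
```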
Due to my shifted focus for the (start of the) week, I've added this to the current sprint again, because it is an impediment to the test deployments I'm about to tackle (I just feel uncomfortable interfering with the production system metrics, even if that can, sort of, be manually remedied).
I'll only pursue the test environment isolation for now, since the recently released New Stack Management and Template Features in AWS CloudFormation provide a new option to address exactly this kind of difference - see the example use case in the section 'Language features for writing versatile templates':

> You can now write a template that creates a resource or assigns a property value only if a specified condition is met. For example, you could use the same template for both a production and a development environment that would create a CloudWatch alarm only in the production environment. Learn more about Conditions.

I'd like to test that feature anyway and will switch the CloudWatch metric names to be stack-specific, at least for test environments, thereby addressing the interference issue (but leaving aside the metrics aggregation across production cluster updates).
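For illustration, reusing the `EnvType` parameter and condition from the sketch above, the stack-specific naming could be expressed with `Fn::If` - again just a template fragment, with an illustrative `logsearch` namespace rather than our real metric names:

```json
{
  "Conditions": {
    "IsProduction": { "Fn::Equals": [{ "Ref": "EnvType" }, "prod"] }
  },
  "Outputs": {
    "CloudWatchNamespace": {
      "Description": "Shared namespace in production, stack-specific everywhere else (illustrative)",
      "Value": {
        "Fn::If": [
          "IsProduction",
          "logsearch",
          { "Fn::Join": ["-", ["logsearch", { "Ref": "AWS::StackName" }]] }
        ]
      }
    }
  }
}
```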
Hmm, an extra thought on this... one of the reasons we don't change the `ClusterName` is because the cluster name is used in the elasticsearch data path `/app/data/elasticsearch/{ClusterName}`, and we like creating new environments based on previous snapshots. What if we were to add a CloudFormation parameter flag to have it automatically rename the directory in `/app/data/elasticsearch/` to whatever the deployed cluster name is? I believe I did that a long time ago when I needed to adjust cluster names, and elasticsearch picked up on it as long as it was restarted. Or, we could also add the data directory rename to the cluster-specific post-provision script or post-provision parameter.
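Something along these lines in the post-provision step, as a hypothetical sketch - `OLD_CLUSTER_NAME` and `NEW_CLUSTER_NAME` are placeholders that would need to be passed in (e.g. as CloudFormation parameters), and the `elasticsearch` service name is assumed:

```sh
#!/bin/sh
# Hypothetical post-provision step: rename the elasticsearch data directory
# from the snapshot's cluster name to the newly deployed one.
set -e

DATA_ROOT=/app/data/elasticsearch

if [ "${OLD_CLUSTER_NAME}" != "${NEW_CLUSTER_NAME}" ] && [ -d "${DATA_ROOT}/${OLD_CLUSTER_NAME}" ]; then
  sudo service elasticsearch stop
  sudo mv "${DATA_ROOT}/${OLD_CLUSTER_NAME}" "${DATA_ROOT}/${NEW_CLUSTER_NAME}"
  sudo service elasticsearch start
fi
```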
Good thinking - I was just about to ask about the current state of `ClusterName` in the context of #255, because semantically it already seems to be, or at least be related to, the 'environment' we are talking about. If ES indeed picks this up automatically, couldn't we just use the stack name for this and always rename the data directory?
`ClusterName` is really only significant to elasticsearch. It is used for discovering other peer nodes (which would be an issue if we weren't creating them in separate security groups that are blocked from each other) and for the data directory structure.
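For context, the two stock elasticsearch settings involved (property names as in vanilla elasticsearch; our provisioning may template them differently, and `logsearch` is just an example cluster name):

```yaml
# elasticsearch.yml - stock property names; 'logsearch' is only an example
cluster.name: logsearch             # shared by peers for discovery, and used as the data subdirectory
path.data: /app/data/elasticsearch  # data ends up under /app/data/elasticsearch/logsearch/...
```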
Closed as duplicated by #307.
Currently we're always pushing elasticsearch/redis metrics to CloudWatch using the same metric name. This is convenient because we have a consistent trend across time, which also makes it simpler for external stats like StatusPage. It's the original reason I went this direction.
However, this can cause problems when trying to deploy the stack in a test environment. For example, when @mrdavidlaing is creating a test cluster using the typical steps in the README, it will try to push its stats into the same metrics as our running cluster (I've been manually disabling the responsible cron jobs). I think this stems from us not having a concept of "production" vs "test".
We should figure out how to accomplish this, whether it's creating a separate config directory for testing, allowing JSON configs to be merged, using stack-based names instead of global names, or something else.
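To make the stack-based naming option concrete, a hypothetical sketch of the difference in the cron-driven push (shown with the AWS CLI for illustration; the actual cron job, tool and metric names in the repo may differ):

```sh
#!/bin/sh
# DOCS_COUNT and STACK_NAME are placeholders; names are examples only.

# today: every deployment pushes into the same global metric
aws cloudwatch put-metric-data \
  --namespace "logsearch" \
  --metric-name "ElasticsearchDocsCount" \
  --value "${DOCS_COUNT}"

# stack-based alternative: test deployments no longer interfere with production trends
aws cloudwatch put-metric-data \
  --namespace "logsearch/${STACK_NAME}" \
  --metric-name "ElasticsearchDocsCount" \
  --value "${DOCS_COUNT}"
```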