Eximchain / terraform-aws-quorum-cluster

A tool to launch a quorum cluster in AWS leveraging Hashicorp software.

Mechanism to provide more detail when alarms trigger #38

Open Lsquared13 opened 6 years ago

Lsquared13 commented 6 years ago

Problem

Currently, if we get an alarm notifying us of something like a crashed Quorum process, we have no way to tell basic details such as which instance it came from.

Solution

We need a mechanism that can give us additional information about such things. Our initial goal should be getting the Instance ID, Public DNS, and Region when we get a metric for process crashes, but the design should be extensible to providing other data when metrics are emitted and/or alarms go off. It is acceptable for us to have to retrieve the information as long as we can do so in O(1) time with respect to the number of instances in the network.

I suspect CloudWatch Logs are the best tool for the job, but we'll need to evaluate alternatives.
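As a rough illustration of the goal above, here is a minimal Python sketch of the emitting side: when a metric such as a process crash is reported, the instance also writes one structured log line carrying its Instance ID, Public DNS, and Region. This is a sketch under stated assumptions, not the repo's implementation. The function names, the `QuorumCrashed` metric name, and the JSON field names are hypothetical; the metadata paths are the standard EC2 instance metadata service (IMDSv2) paths.

```python
import json
import time
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def fetch_from_imds(path, timeout=2):
    """Read one value from the instance metadata service using an IMDSv2 token."""
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        IMDS + "/meta-data/" + path,
        headers={"X-aws-ec2-metadata-token": token},
    )
    # Note: "public-hostname" is only served when the instance has a public DNS name.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def build_alarm_event(metric_name, fetch=fetch_from_imds, now=time.time):
    """Return a JSON log line tying a metric to the instance that emitted it.

    `fetch` and `now` are injectable so the function can be exercised off-EC2.
    Field names here are hypothetical, chosen for illustration only.
    """
    event = {
        "metric": metric_name,
        "timestamp": int(now()),
        "instance_id": fetch("instance-id"),
        "public_dns": fetch("public-hostname"),
        "region": fetch("placement/region"),
    }
    return json.dumps(event, sort_keys=True)
```

If a line like this were written wherever a log agent ships files to CloudWatch Logs, looking up the details for one alarm would mean reading that single entry, so the cost would not grow with the number of instances in the network. Adding another field to `event` is how the design would extend to other data.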

Lsquared13 commented 5 years ago

Our CloudWatch alarms can be found here

Lsquared13 commented 5 years ago

@EximChua this is probably the best thing to start working on next when you have bandwidth

ExcChua commented 5 years ago

In order to evaluate and devise a solution for this, I need access to the actual CloudWatch logs to see what data can be extracted, or a sample CloudWatch log for the alarm.

Please provide this access or a sample log.

Lsquared13 commented 5 years ago

I think they may need to be enabled and thus they may not exist yet. I would suggest making changes to a local copy of this repo to enable them and inducing things like crashed processes.

It's also very possible CloudWatch Logs aren't the best way to do this.

To start, the basic thing I think we'd like is a way to identify the instances that cause crashed-process alarms, low-disk-space alarms, or no-peer alarms.
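To make the request above concrete, here is a small Python sketch of the reading side: given one structured log line recorded alongside an alarm, it produces a one-line summary identifying the instance. Everything here is an assumption for illustration. The alarm kind strings, the JSON field names, and the `describe_alarm` function are hypothetical and do not come from this repo.

```python
import json

# Hypothetical identifiers for the three alarm types named above.
ALARM_KINDS = {"crashed_process", "low_disk_space", "no_peers"}


def describe_alarm(log_line):
    """Turn one JSON log line into a human-readable summary of the alarm source."""
    record = json.loads(log_line)
    kind = record["alarm"]
    if kind not in ALARM_KINDS:
        raise ValueError(f"unknown alarm kind: {kind}")
    # Extra fields in the record are ignored by format(), so the log schema can grow.
    return "{alarm} on {instance_id} ({public_dns}) in {region}".format(**record)


sample = json.dumps({
    "alarm": "crashed_process",
    "instance_id": "i-0abc",
    "public_dns": "ec2-example.compute-1.amazonaws.com",
    "region": "us-east-1",
})
print(describe_alarm(sample))
# prints: crashed_process on i-0abc (ec2-example.compute-1.amazonaws.com) in us-east-1
```

Because the summary is derived from the single record attached to the alarm, no scan over the cluster's instances is involved.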

Lsquared13 commented 5 years ago

I wonder if this would be useful. @eximchain137 pointed me at gruntwork.io. Just starting to look into it but let us know if you have opinions.

ExcChua commented 5 years ago

@Lsquared13 It's difficult to say whether gruntwork.io would work or not, because there are insufficient details on its GitHub site and on its website.

All of its public source files contain ONLY 4 lines of boilerplate comments, which say one must be a customer in order to see the private repo.

Lsquared13 commented 5 years ago

Okay, don't get hung up on it. It's just a thing @eximchain137 stumbled across; feel free to ignore it if you don't think it's worth the time. If you'd like to investigate it further, we can look into getting some sort of trial or even signing up for a bit. Your call here, I don't have a strong opinion on that either way.

ExcChua commented 5 years ago

I can't tell whether it's worth the time if I can't see what it's doing. So, can you get me access please?

Lsquared13 commented 5 years ago

I need to discuss it. After giving it another look, it doesn't seem like they have a free trial option or even a subscription without a commitment. I'll get back to you on this, but in the meantime don't count on it. I'd hate to blow a year's subscription on something we don't end up really using.

ExcChua commented 5 years ago

Hi Louis,

This is no longer necessary.

My preliminary solution for providing details when alarms trigger is working now.

Thanks!

On Fri, Dec 21, 2018, 1:51 PM Louis Lamia <notifications@github.com> wrote:

I need to discuss it, after giving it another look it doesn't seem like they have a free trial option or even a subscription without a commitment. I'll get back to you on this but in the meantime don't count on it. I'd hate to blow a year's subscription on something we don't end up really using.
