ghost opened this issue 3 years ago
Hello @chungktran
I understand the issue and it is a legitimate ask.
That said, the current configuration is an intentionally opinionated implementation. That doesn't mean it can't change, but this is a tricky one, and before going further I need to deeply understand the final use case behind it and why the current behavior is not enough.
For now, the severity of an alert follows the notification binding policy. So an alert has an unchanging severity, that is true, but its notification destination can change depending on the team/user.
If we take your example again, the system-common module has a CPU detector using an unchangeable `critical`/`major` severity pair. That can make sense for static infrastructure like on-premise, but it can be undesirable for dynamic environments like cloud IaaS, where CPU is not that important.
Suppose you use different notification services as a standard, like PagerDuty for on-call and Slack for simple notification. You can then bind:

- the `notifications` variable to respect this standard (critical/major to PagerDuty and lower severities to Slack)
- the `cpu_notifications` variable to Slack instead of PagerDuty, to override the standard defined by the previous configuration

The alert will still have the same severity, true, but no on-call is raised, only a non-disruptive Slack message, because it is not important in this case. I understand this does not directly address your issue, but it can serve the same purpose (depending on yours, obviously).
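For illustration, here is a minimal sketch of what that binding could look like when instantiating the module. The module source path, environment, credential IDs, and channel name are placeholders, and the exact notification string format and variable types should be double-checked against the module documentation:

```hcl
module "signalfx_system_common" {
  source      = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"
  environment = "prod"

  # Standard binding: critical/major page the on-call, lower severities go to Slack.
  notifications = {
    critical = ["PagerDuty,credentialId"]
    major    = ["PagerDuty,credentialId"]
    minor    = ["Slack,credentialId,alerts"]
    warning  = ["Slack,credentialId,alerts"]
    info     = ["Slack,credentialId,alerts"]
  }

  # Override for the CPU detector only: same severities, but nobody is paged.
  cpu_notifications = {
    critical = ["Slack,credentialId,alerts"]
    major    = ["Slack,credentialId,alerts"]
  }
}
```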
In addition to the complexity involved in making this dynamic, keeping it static is also a way to standardize a recommended severity (and its notification destination) and to force contributors to think about the most relevant severity when they create new detectors.
Please tell me if this notification binding makes sense for you, and if not, why it cannot work for you. Thanks.
@xp-1000 Thanks for your feedback. I'll see if I can have my requirements met by doing what you suggested. If you don't mind, please keep this request open for now.
@xp-1000 I reviewed the variables and tried to make what you suggested work for us; unfortunately, it doesn't.
The reason for my use case is this: I have a team in OpsGenie that takes all alerts and makes alert routing rules based on Priority. Priority in OpsGenie maps to the `severity` key in SignalFx detectors.
With the current design of the detectors I have to create a different team in OpsGenie for CRIT and MAJOR alerts, because the detectors don't allow the `severity` key to be changed.
With my proposal I could have just one OpsGenie team and make CRIT detectors send `Major` for `severity` instead of being forced to send `Critical`. So, with the ability to change the `severity` key in the detectors, I can ultimately let OpsGenie make all the routing decisions.
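To make the desired end state concrete, here is a purely hypothetical sketch: the `cpu_severity_critical` variable does not exist in the modules today, the module source and OpsGenie notification string are placeholders, and the exact notification string format should be checked against the SignalFx provider documentation:

```hcl
module "signalfx_system_common" {
  source      = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"
  environment = "prod"

  # Every severity goes to the same OpsGenie team; OpsGenie then routes on priority.
  notifications = {
    critical = ["Opsgenie,credentialId,credentialName,teamName,teamId,Team"]
    major    = ["Opsgenie,credentialId,credentialName,teamName,teamId,Team"]
    minor    = ["Opsgenie,credentialId,credentialName,teamName,teamId,Team"]
    warning  = ["Opsgenie,credentialId,credentialName,teamName,teamId,Team"]
    info     = ["Opsgenie,credentialId,credentialName,teamName,teamId,Team"]
  }

  # Hypothetical (requested) variable: downgrade the CPU detector's critical
  # rule to Major so OpsGenie receives a lower priority for this team.
  cpu_severity_critical = "Major"
}
```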
Hello @chungktran
OK, thank you for the feedback!
As I said before, the way we currently create and template our detectors is opinionated and "per severity per rule" oriented, so as you can imagine it will be difficult to make this fully customizable.
For example, consider the variable naming itself, which contains the severity rule name (e.g. `my_detector_threshold_critical`).
If I understood your use case correctly, being able to change only the severity attribute (https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector#severity) should be good enough for you, am I right?
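For context, this is the attribute on the raw `signalfx_detector` resource that would need to be exposed by the modules; the detector below is a simplified, illustrative example rather than what the generated modules actually contain:

```hcl
resource "signalfx_detector" "cpu" {
  name = "CPU utilization"

  program_text = <<-EOF
    signal = data('cpu.utilization').mean(by=['host']).publish('signal')
    detect(when(signal > 90, lasting='15m')).publish('CRIT')
  EOF

  rule {
    detect_label  = "CRIT"
    description   = "CPU utilization is too high"
    # The attribute discussed here; valid values include Critical, Major, Minor, Warning, Info.
    severity      = "Major"
    notifications = ["Slack,credentialId,alerts"]
  }
}
```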
This will be a little weird, because you would still have to use the variables of the original severity for a detector/rule whose severity you changed (e.g. changing a rule's severity from `critical` to `major` would be possible, but if you then want to change the threshold you would still have to update the `my_detector_threshold_critical` variable).
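For example, with a hypothetical per-rule severity variable (nothing below exists in the modules today except the `*_threshold_critical` naming pattern), a module call could look like this:

```hcl
module "signalfx_system_common" {
  source      = "github.com/claranet/terraform-signalfx-detectors//modules/smart-agent_system-common"
  environment = "prod"
  # notifications = { ... }  # omitted for brevity

  # Hypothetical: the CPU detector's "critical" rule now raises a Major alert...
  cpu_severity_critical = "Major"

  # ...but its threshold is still tuned through the variable named after the
  # ORIGINAL severity, which is the slightly confusing part described above.
  cpu_threshold_critical = 95
}
```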
Is this enough for you?
Honestly, making the whole thing fully dynamic, including the variable names, will be hard, and maybe the only solution would be to use Terraform CDK to enjoy all the features of a real language.
Is your feature request related to a problem? Please describe.
Currently, the detectors' program text publishes CRIT, MAJOR, or WARN strings (I do not see INFO or DEBUG), which are in turn used by `detect_label` in the `rule`s to send out notifications. I would like to propose turning those into string variables so that a CRIT can be overridden to be a MAJOR, for example. The reason for this request is that an event that is CRIT for one team is not necessarily a CRIT for another team. A good example of this is the `smart-agent_system-common` detectors.
Describe the solution you'd like
Using the `smart-agent_system-common` module as an example: in this file, https://github.com/claranet/terraform-signalfx-detectors/blob/master/modules/smart-agent_system-common/variables-gen.tf, define extra parameters for each detector, for example `heartbeat_crit_value`, `heartbeat_major_value`, and `heartbeat_warn_value`. The majority of the modules will only need the first two extra parameters. Then, in the module's detector tf file, https://github.com/claranet/terraform-signalfx-detectors/blob/master/modules/smart-agent_system-common/detectors-gen.tf, update the `program_text` of each resource to `.publish('${var.heartbeat_crit_value}')` or `.publish('${var.heartbeat_major_value}')`. With the changes above, a CRIT event can be turned into any other event simply by overriding the variable.
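As a rough sketch of the idea (the variable name is the one suggested above; the detector body is simplified and not a copy of what the generated files actually contain):

```hcl
variable "heartbeat_crit_value" {
  description = "Label published by the heartbeat detector's critical rule (overridable)"
  type        = string
  default     = "CRIT"
}

resource "signalfx_detector" "heartbeat" {
  name = "System heartbeat"

  # The published label comes from the variable instead of a hard-coded 'CRIT'.
  program_text = <<-EOF
    from signalfx.detectors.not_reporting import not_reporting
    signal = data('cpu.utilization').publish('signal')
    not_reporting.detector(stream=signal, resource_identifier=['host'], duration='20m').publish('${var.heartbeat_crit_value}')
  EOF

  rule {
    # detect_label must match the label published by program_text.
    detect_label = var.heartbeat_crit_value
    severity     = "Critical"
    description  = "Heartbeat not received"
  }
}
```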
Describe alternatives you've considered
I have read through the modules and have not found a solution for this request.