StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Develop sort of circuit break native in st2 #5075

Open marceloapps opened 3 years ago

marceloapps commented 3 years ago

Hi everyone, I'm not sure I can explain this correctly, but what I want is some sort of native circuit breaker in StackStorm to prevent repeated payloads from flooding the system. For example: on a server, the service for some application X keeps going down and calling StackStorm for remediation. I want to be able to set a threshold for how many times this is acceptable before StackStorm stops executing anything and sends a notice to the owner that there is a deeper problem that needs attention.
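To make the request concrete, here is a minimal Python sketch of the threshold behaviour described above. It is not StackStorm code: the `CircuitBreaker` class, its `allow` method, and the `(host, service)` key are hypothetical names chosen for illustration, and the sliding time window is an assumption, since the request only mentions a count threshold.

```python
import time
from collections import defaultdict, deque


class CircuitBreaker:
    """Hypothetical helper: allow at most `threshold` runs per key within `window` seconds."""

    def __init__(self, threshold, window, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)  # key -> timestamps of allowed runs

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Forget runs that have fallen out of the time window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.threshold:
            # Tripped: the caller should skip remediation and notify the owner.
            return False
        hits.append(now)
        return True
```

A caller would build a key such as `("server01", "application-x")` from the incoming payload, run the remediation only while `allow(key)` returns `True`, and send the notice otherwise. Keys are tracked independently, so one noisy service does not block remediation for others.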

Splunk has more or less the same thing in its alerts, where you can configure an alert to throttle for a couple of hours if the problem is still the same as in the last execution.
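The throttling variant differs from a count threshold: after one run, repeats of the same problem are suppressed for a fixed period. A rough Python sketch follows, assuming the "same problem" can be reduced to a hashable key. The `Throttle` class and `should_run` method are hypothetical names and do not correspond to any Splunk or StackStorm API.

```python
import time


class Throttle:
    """Hypothetical helper: suppress repeats of a key for `suppress_for` seconds after a run."""

    def __init__(self, suppress_for, clock=time.monotonic):
        self.suppress_for = suppress_for
        self.clock = clock  # injectable for testing
        self._last_run = {}  # key -> time of the last run that was let through

    def should_run(self, key):
        now = self.clock()
        last = self._last_run.get(key)
        if last is not None and now - last < self.suppress_for:
            return False  # same problem seen again inside the suppression period
        self._last_run[key] = now
        return True
```

With `Throttle(suppress_for=2 * 3600)`, the first alert for a key runs and identical alerts in the following two hours are dropped.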

This is something I want to develop myself, but I'd like to discuss it here first. I would also appreciate some guidance on where to code it. I've already forked the st2 repos into my account.

arm4b commented 3 years ago

Are you requesting the same as Policy to rate-limit action executions #3720 or something different? See the discussion in the referenced Issue.

marceloapps commented 3 years ago

Hi @armab, thanks for the reply. I'm not sure I understood correctly, but a policy would apply to everything running in StackStorm that invokes the action, am I right? I don't know if that works for me, as I want to control many servers and applications, so I was thinking about something we could set in the rules.yml file. The parameters would be pretty much the same as in the other thread.

m4dcoder commented 3 years ago

@marceloapps

I think this is an interesting feature. I advise coming back with a proposal covering how users will configure this setting, whether it applies to all actions or only to specific actions, how the system will behave when the condition is met, and where the feature will be implemented.

Ideally, both the action execution and the decision to cancel or delay it (or whatever the impact is) should be traceable by the user. There should be a record for the action execution and then a record of the decision the system made on it.

My recommendation is to look at the design of the st2scheduler first and see if it makes sense to implement the feature there.

I don't think rules will work, because an action execution can be triggered by a rule as well as from the st2 CLI/API. The mechanism needs to be at a common location in the system where action execution is controlled. That's why I recommend starting at the st2scheduler: every action execution has to be scheduled, and we can add logic there to decide whether to cancel or delay the execution.

A policy implementation will also work because policies are evaluated during scheduling. We can also design the policy to be scoped to specific action(s). It doesn't have to be system wide. This may be a good starting place as well.
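The points above (a check at scheduling time, scoped to specific actions, with a traceable decision) could be sketched roughly as follows. This is an illustration only, not the st2scheduler or policy API: `ThresholdPolicy`, `Decision`, `evaluate`, and the `demo.restart_service` action reference are hypothetical, and a simple running count is assumed where a real design might use a time window.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Decision:
    """Record of what was decided for one execution request, kept for traceability."""
    action_ref: str
    key: tuple
    outcome: str  # "schedule", "delay" or "cancel"
    reason: str


@dataclass
class ThresholdPolicy:
    """Hypothetical policy scoped to a single action and keyed on chosen parameters."""
    action_ref: str    # only executions of this action are counted
    attributes: tuple  # parameter names whose values form the key
    threshold: int
    on_exceed: str = "cancel"
    counts: Counter = field(default_factory=Counter)
    log: list = field(default_factory=list)

    def evaluate(self, action_ref, params):
        if action_ref != self.action_ref:
            decision = Decision(action_ref, (), "schedule", "policy not scoped to this action")
        else:
            key = tuple(params.get(name) for name in self.attributes)
            self.counts[key] += 1
            if self.counts[key] > self.threshold:
                reason = f"seen {self.counts[key]} times, threshold is {self.threshold}"
                decision = Decision(action_ref, key, self.on_exceed, reason)
            else:
                decision = Decision(action_ref, key, "schedule", "within threshold")
        # Every request leaves a record, whether it was scheduled or not.
        self.log.append(decision)
        return decision
```

Because `evaluate` is called for every request regardless of whether it came from a rule, the CLI, or the API, the check sits at one common point, and `log` gives the user a record of each decision alongside the execution.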

marceloapps commented 3 years ago

Hello @m4dcoder! Sorry for the delayed response. I've been checking the st2scheduler code and the documentation regarding policies. One question: when I set attributes on a policy, does it mean the action will get canceled/delayed when it tries to execute X times with the same value in the attribute?