StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Additional metrics instrumentation for various services and code paths #4314

Open Kami opened 6 years ago

Kami commented 6 years ago

#4310 made a lot of improvements to the instrumentation and metrics code. This means we now have instrumentation and metrics in place for most of the critical code paths.

This will give us and our users much better insight into the workings of a StackStorm cluster.

Having said that, there are still places for which we don't have any instrumentation and we should improve that in the future:

Elvsy commented 6 years ago

Can you add st2.action.. to the list? Or is there already a counter or a way to infer how many times a specific action succeeded or failed?

Kami commented 6 years ago

We don't track such a metric at the moment.

It's not a bad idea to track it; we just need to be careful in case there are many (many thousands of) unique actions in the system.

None of the metrics systems (Graphite, Prometheus, etc.) were really designed for a large number of unique metrics, so they would break down if a user had many thousands or tens of thousands of actions.

We had a similar issue recently: we tracked some metrics on a per-execution basis, but we dropped them because they put too much load on the metrics backend. Granted, a user will probably never have as many actions as there are executions, but it's still something we need to keep in mind.
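
One possible way to keep per-action counters while bounding the number of unique metric names is to cap how many distinct actions get their own key and fold the rest into a shared overflow bucket. The sketch below is only an illustration of that idea, not how st2 does it: the `BoundedActionCounter` class, the `max_actions` cap, the `_other` bucket and the `st2.action.<action>.<status>` key layout are all hypothetical.

```python
from collections import Counter


class BoundedActionCounter:
    """Hypothetical per-action outcome counter with a cap on unique metric keys."""

    def __init__(self, max_actions=1000, prefix="st2.action"):
        self.max_actions = max_actions  # assumed cap on distinct action names
        self.prefix = prefix            # illustrative key prefix
        self.counts = Counter()
        self._seen_actions = set()

    def _key(self, action_ref, status):
        # Dots separate path segments in backends such as Graphite, so the
        # action reference is flattened into a single segment.
        name = action_ref.replace(".", "_")
        if name not in self._seen_actions:
            if len(self._seen_actions) >= self.max_actions:
                # Cap reached: unseen actions share one overflow bucket.
                name = "_other"
            else:
                self._seen_actions.add(name)
        return f"{self.prefix}.{name}.{status}"

    def record(self, action_ref, status):
        """Increment the counter for this action/status pair and return its key."""
        key = self._key(action_ref, status)
        self.counts[key] += 1
        return key


counter = BoundedActionCounter(max_actions=2)
counter.record("core.local", "succeeded")
counter.record("core.http", "failed")
counter.record("examples.new_action", "failed")  # lands in the overflow bucket
```

With a cap like this, the backend sees at most `max_actions + 1` action names per status, at the cost of losing per-action detail for whatever arrives after the cap is hit.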

alexandrejuma commented 5 years ago

Would it be possible to add some gauges for st2.action.executions.* specifically for intermediate states (i.e., scheduled and running)?

They are useful for graphing the real-time status of ongoing work being handled by st2.
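
To make the request concrete, here is a rough sketch of what such gauges could look like: each gauge goes up when an execution enters a tracked state and down when it leaves. This is not existing st2 code. The `ExecutionStateGauges` class, the `on_status_change` hook and the exact metric names are assumptions for illustration only.

```python
# Intermediate states to expose as gauges (taken from the request above).
TRACKED_STATES = ("scheduled", "running")


class ExecutionStateGauges:
    """Hypothetical in-memory gauges for executions in intermediate states."""

    def __init__(self):
        self.gauges = {state: 0 for state in TRACKED_STATES}
        self._current = {}  # execution id -> tracked state it is currently in

    def on_status_change(self, execution_id, new_status):
        # Leaving a tracked state decrements that state's gauge.
        old_status = self._current.pop(execution_id, None)
        if old_status is not None:
            self.gauges[old_status] -= 1
        # Entering a tracked state increments it; other states are ignored.
        if new_status in TRACKED_STATES:
            self.gauges[new_status] += 1
            self._current[execution_id] = new_status

    def snapshot(self, prefix="st2.action.executions"):
        """Return the current gauge values keyed by an illustrative metric name."""
        return {f"{prefix}.{state}": value for state, value in self.gauges.items()}


gauges = ExecutionStateGauges()
gauges.on_status_change("exec-1", "scheduled")
gauges.on_status_change("exec-2", "scheduled")
gauges.on_status_change("exec-1", "running")
print(gauges.snapshot())
```

Unlike per-action counters, this only ever produces one metric per tracked state, so it would not add to the number of unique metrics on the backend.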