elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Should tasks that are retried be part of the task manager metrics? #180420

Open mikecote opened 2 months ago

mikecote commented 2 months ago

I noticed when opening https://github.com/elastic/kibana/issues/180419 (you can use the same steps to reproduce) that the metrics were not incrementing when the action was attempted a second and third time.

I'm not sure if this was by design or if it's a bug in our system so I opened this issue to discuss.

elasticmachine commented 2 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

kobelb commented 2 months ago

I'm aware of this behavior, but I'm happy to revisit it, as the decision we made might have been wrong. We decided to handle it this way because on Serverless we retry these actions 10 times. As a result, a single incident where a connector action failed was increasing the failure count 10x, throwing off our SLOs. Counting it as a failure only the first time made the success-to-failure ratio more accurate.
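To make the trade-off concrete, here is a minimal sketch comparing the two counting policies. It is not Kibana's task manager code: `TaskRunResult`, `RunCounters`, `CountingPolicy`, and `simulate` are hypothetical names, and the scenario (nine actions that succeed on the first try, plus one action that fails on all 10 attempts) is an assumption chosen for illustration. The sketch also assumes that retries are skipped entirely under the first-attempt-only policy, which may not match how a successful retry is actually treated.

```typescript
// Hypothetical types for illustration only; not the real task manager API.
interface TaskRunResult {
  taskId: string;
  attempt: number; // 1 for the first run, 2 and up for retries
  success: boolean;
}

type CountingPolicy = "first-attempt-only" | "every-attempt";

class RunCounters {
  success = 0;
  failure = 0;
  policy: CountingPolicy;

  constructor(policy: CountingPolicy) {
    this.policy = policy;
  }

  record(result: TaskRunResult): void {
    // Assumption: under "first-attempt-only", retries are ignored entirely.
    if (this.policy === "first-attempt-only" && result.attempt > 1) {
      return;
    }
    if (result.success) {
      this.success += 1;
    } else {
      this.failure += 1;
    }
  }

  // Fraction of counted runs that succeeded; 1 when nothing was counted.
  successRatio(): number {
    const total = this.success + this.failure;
    return total === 0 ? 1 : this.success / total;
  }
}

// Assumed scenario: 9 actions succeed on the first attempt,
// 1 action fails on each of 10 attempts.
function simulate(policy: CountingPolicy): RunCounters {
  const counters = new RunCounters(policy);
  for (let i = 0; i < 9; i++) {
    counters.record({ taskId: `ok-${i}`, attempt: 1, success: true });
  }
  for (let attempt = 1; attempt <= 10; attempt++) {
    counters.record({ taskId: "failing-action", attempt, success: false });
  }
  return counters;
}

const firstOnly = simulate("first-attempt-only");
const everyAttempt = simulate("every-attempt");
console.log(firstOnly.success, firstOnly.failure, firstOnly.successRatio());
console.log(everyAttempt.success, everyAttempt.failure, everyAttempt.successRatio());
```

Under these assumptions, the first-attempt-only policy counts 9 successes and 1 failure (a ratio of 9/10), while counting every attempt yields 9 successes and 10 failures (9/19) for the same single incident. The cost of the first policy is the one raised in this issue: the second and third attempts never show up in the metrics.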