elastic / kibana


Observability should more clearly indicate to the user when an outage in a service is the root cause #183216

Open sorenlouv opened 1 month ago

sorenlouv commented 1 month ago

When running an application consisting of multiple services, we should make it easier for the user to understand whether the root cause of a problem is a specific service that has gone down.

Scenario: When running the Otel-Demo, the "checkout" service is killed on purpose. This causes the failure rate of the frontend service (and other services) to increase because they have a downstream dependency on the checkout service, which in turn causes alerts to be triggered.

Problem: Nowhere in the UI do we show that the checkout service has gone down. The checkout service itself is not emitting any alerts because its failure rate is not increasing (it is no longer receiving traffic, so its failure rate may even look like it is declining). Navigating to the frontend service shows errors and alerts, but these do not clearly indicate that the checkout service is the root cause.
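To make the gap concrete, here is a minimal, hypothetical sketch of the kind of check that is currently missing OOTB: compare each service's recent transaction throughput against a baseline and flag services that have gone silent (and therefore never trip a failure-rate rule). It assumes the Elastic APM data stream layout (traces-apm* indices, service.name, processor.event, @timestamp fields) and a local Elasticsearch instance; the thresholds and window sizes are illustrative only, not a proposed implementation.

```python
# Hypothetical sketch: flag services whose transaction throughput has collapsed,
# even though they emit no failures (the "silent outage" described above).
# Assumes Elastic APM data streams (traces-apm*) with fields such as
# service.name and processor.event; names and thresholds are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster


def throughput_per_service(window: str) -> dict[str, int]:
    """Return transaction counts per service since the given time offset."""
    resp = es.search(
        index="traces-apm*",
        size=0,
        query={
            "bool": {
                "filter": [
                    {"term": {"processor.event": "transaction"}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
        aggs={"services": {"terms": {"field": "service.name", "size": 100}}},
    )
    buckets = resp["aggregations"]["services"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


# Compare a short recent window against a longer baseline: a service that was
# busy in the baseline but is now near-silent is a candidate root cause.
baseline = throughput_per_service("now-1h")
recent = throughput_per_service("now-5m")

for service, base_count in baseline.items():
    if base_count > 100 and recent.get(service, 0) == 0:
        print(f"{service}: traffic dropped to zero -- possible outage / root cause")
```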

Solution

Related: https://github.com/elastic/kibana/pull/183215

elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

emma-raffenne commented 1 month ago

cc @drewpost @roshan-elastic @smith This is a scenario we would like to investigate to provide an AI Assistant based solution. We don't know how much of that would be an RCA workflow, or whether it would also fall within the scope of the ROO initiative. Comments and input are welcome.

roshan-elastic commented 1 month ago

Thanks @emma-raffenne @sorenlouv (cc @chrisdistasio)

I completely agree with this use case. Only a few weeks ago I sketched out something with a customer who wanted to see the sequence in which services failed, so they could tell which ones were the cause and which were symptoms. I know this isn't exactly the same, but I see a relation here.

@drewpost - The way I see this working is that we need a better 'status' indicator that can highlight when a service is having a problem. I think having a status is part of ROO, but I think there is an opportunity for an RCA workflow to guide users to see the impact/dependencies here. Curious to hear your thoughts.

sorenlouv commented 1 month ago

> a customer who wanted to see the sequence in which services failed, so they could tell which ones were the cause and which were symptoms. I know this isn't exactly the same, but I see a relation here.

@roshan-elastic That sounds very similar. At the moment our UIs often show alerts/errors for the dependent services, but not for the service that actually died/crashed. So while we show the symptoms, we don't show the cause. We also don't show anything OOTB, meaning users have to set up the right rules themselves, which could be a significant barrier.

Zooming out, I think we as an Observability org should define multiple signal-agnostic scenarios that focus on common user problems that we can help them troubleshoot and understand OOTB.

Suggestions for scenarios:

In addition to defining these scenarios, we should make it very easy for stakeholders to reproduce them. To reproduce the problem of a single service failure having cascading impact, I simply used the OpenTelemetry demo and killed one of the Docker containers (detailed setup notes here).
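For readers without access to the detailed setup notes linked above, here is a minimal sketch of that reproduction step using the Python Docker SDK (an assumption on my part; `docker kill <name>` on the command line does the same). The container name "checkout" is illustrative, since the OTel demo's Compose project may generate a longer name.

```python
# Hypothetical reproduction helper: kill the checkout container of a locally
# running OpenTelemetry demo to trigger the cascading-failure scenario above.
# Assumes the docker Python SDK (pip install docker) and that the demo's
# checkout container is named "checkout"; check `docker ps` for the real name.
import docker

client = docker.from_env()
container = client.containers.get("checkout")  # assumption: container name
container.kill()
print(f"Killed {container.name}; watch the frontend's failure rate climb.")
```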

drewpost commented 1 month ago

Thanks everyone. There's some good stuff in here. I'm feeding this into the work we did at the offsite last week. Whilst I'm not sure this will ultimately live in the APM UI as we know it today, it will absolutely feed into RCA, particularly the signal-agnostic OOTB scenarios outlined above.

roshan-elastic commented 1 month ago

Thanks @sorenlouv - this is a good suggestion. I think this is what you're saying: these could perhaps be test cases, available in test environments, that we can validate our development against.

I'll think about this more when I have some headspace.