elastic / kibana


Observability should more clearly indicate to the user when an outage in a service is the root cause #183216

Open sorenlouv opened 1 month ago

sorenlouv commented 1 month ago

When running an application consisting of multiple services, we should make it easier for the user to understand whether the root cause of a problem is a specific service that has gone down.

Scenario: When running the Otel-Demo, the "checkout" service is killed on purpose. This causes the failure rate of the frontend service (and other services) to increase because they have a downstream dependency on the checkout service, which in turn causes alerts to be triggered.

Problem: Nowhere in the UI do we show that the checkout service has gone down. The checkout service itself is not emitting any alerts because its failure rate is not increasing (it is no longer receiving traffic, so its failure rate may even look like it is declining). Navigating to the frontend service shows errors and alerts, but these do not clearly indicate that the checkout service is the root cause.
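To make the gap concrete, here is a minimal, hypothetical sketch of the kind of check that is currently missing OOTB: compare each service's recent transaction throughput against a baseline and flag services that have gone silent (and therefore never trip a failure-rate rule). It assumes the Elastic APM data stream layout (traces-apm* indices, service.name, processor.event, @timestamp fields) and a local Elasticsearch instance; the thresholds and window sizes are illustrative only, not a proposed implementation.

```python
# Hypothetical sketch: flag services whose transaction throughput has collapsed,
# even though they emit no failures (the "silent outage" described above).
# Assumes Elastic APM data streams (traces-apm*) with fields such as
# service.name and processor.event; names and thresholds are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster


def throughput_per_service(window: str) -> dict[str, int]:
    """Return transaction counts per service since the given time offset."""
    resp = es.search(
        index="traces-apm*",
        size=0,
        query={
            "bool": {
                "filter": [
                    {"term": {"processor.event": "transaction"}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
        aggs={"services": {"terms": {"field": "service.name", "size": 100}}},
    )
    buckets = resp["aggregations"]["services"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


# Compare a short recent window against a longer baseline: a service that was
# busy in the baseline but is now near-silent is a candidate root cause.
baseline = throughput_per_service("now-1h")
recent = throughput_per_service("now-5m")

for service, base_count in baseline.items():
    if base_count > 100 and recent.get(service, 0) == 0:
        print(f"{service}: traffic dropped to zero -- possible outage / root cause")
```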

Solution

Related: https://github.com/elastic/kibana/pull/183215

elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

emma-raffenne commented 1 month ago

cc @drewpost @roshan-elastic @smith This is a scenario we would like to investigate to provide an AI Assistant based solution. We don't know how much of that would be an RCA workflow, or whether it would also fall within the scope of the ROO initiative. Comments and input are welcome.

roshan-elastic commented 1 month ago

Thanks @emma-raffenne @sorenlouv (cc @chrisdistasio)

I completely agree with this use case. Only a few weeks ago I sketched out something with a customer who wanted to see the sequence in which services failed, so they could tell which ones were the cause and which were symptoms. I know this isn't exactly the same, but I see a relation here.

@drewpost - The way I see this working is that we need a better 'status' indicator that can highlight when a service is having a problem. I think having a status is part of ROO, but I think there is an opportunity for an RCA workflow to guide users to see the impact/dependencies here. Curious to hear your thoughts.

sorenlouv commented 1 month ago

> a customer who wanted to see the sequence in which services failed, so they could tell which ones were the cause and which were symptoms. I know this isn't exactly the same, but I see a relation here.

@roshan-elastic That sounds very similar. At the moment our UIs often show alerts/errors for the dependent services, but not for the service that actually died/crashed. So while we show the symptoms, we don't show the cause. We also don't show anything OOTB, meaning users have to set up the right rules themselves, which could be a significant barrier.

Zooming out, I think we as an Observability org should define multiple signal-agnostic scenarios that focus on common user problems that we can help them troubleshoot and understand OOTB.

Suggestions for scenarios:

In addition to defining these scenarios, we should make it very easy for stakeholders to reproduce them. To reproduce the problem of a single service failure having cascading impact, I simply used the OpenTelemetry demo and killed one of the Docker containers (detailed setup notes here).
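For readers without access to the detailed setup notes linked above, here is a minimal sketch of that reproduction step using the Python Docker SDK (an assumption on my part; `docker kill <name>` on the command line does the same). The container name "checkout" is illustrative, since the OTel demo's Compose project may generate a longer name.

```python
# Hypothetical reproduction helper: kill the checkout container of a locally
# running OpenTelemetry demo to trigger the cascading-failure scenario above.
# Assumes the docker Python SDK (pip install docker) and that the demo's
# checkout container is named "checkout"; check `docker ps` for the real name.
import docker

client = docker.from_env()
container = client.containers.get("checkout")  # assumption: container name
container.kill()
print(f"Killed {container.name}; watch the frontend's failure rate climb.")
```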

drewpost commented 1 month ago

Thanks everyone. There's some good stuff in here. I'm feeding this into the work we did at the offsite last week. Whilst I'm not sure this will ultimately live in the APM UI as we know it today, it will absolutely feed into RCA, particularly the signal-agnostic OOTB scenarios outlined above.

roshan-elastic commented 1 month ago

Thanks @sorenlouv - this is a good suggestion. I think this is what you're saying: these could perhaps be test cases, available in test environments, that we can validate our development against.

I'll think about this more when I have some headspace.