elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[RCA] Create investigation detail page #187286

Open jasonrhodes opened 2 weeks ago

jasonrhodes commented 1 week ago

@mgiota I just synced with @kdelemme and @benakansara and they've got some good context now from the other POC, so they're going to jump in on these investigation UI side tickets. Feel free to continue to be involved as we refine these (asking questions, syncing between the entry point flow and this flow). Thanks all.

benakansara commented 1 week ago

First iteration of Investigation detail page

Main components on Investigation detail page when starting a new investigation:

I'll put details of each component in comments below.

Note: The initial time range would be the same as the time range used for the main chart on the alert details page.

Future iterations

benakansara commented 1 week ago

Rule related charts in merged state

In the merged state, the y-axis is not shown. All y-values are normalized so that different charts can be correlated.
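To illustrate the normalization idea, here is a minimal sketch that assumes min-max scaling of each series to the range [0, 1]. The thread does not say which normalization is intended, so the scaling method and the names `normalize_series` and `merge_charts` are hypothetical, and Python is used only for brevity.

```python
from typing import Dict, List


def normalize_series(values: List[float]) -> List[float]:
    """Min-max scale one series to [0, 1] (assumed normalization method)."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no variation, so draw it at mid-height.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def merge_charts(charts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Normalize every chart independently so the shapes share one y-scale."""
    return {name: normalize_series(vals) for name, vals in charts.items()}


# Made-up sample data: two metrics with very different units.
charts = {
    "latency_ms": [120.0, 180.0, 900.0, 240.0],
    "error_count": [1.0, 2.0, 9.0, 3.0],
}
merged = merge_charts(charts)
```

After scaling, both series peak at the same height in the third bucket, which is what makes the spike visually comparable without a y-axis.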

Screenshot 2024-07-04 at 20 42 44

Custom threshold rule/Metric threshold/Log threshold with single or multiple conditions

APM Latency/Error count/Failed transaction/Anomaly rules

SLO Burn rate rule

benakansara commented 1 week ago

Related events from underlying dataview / index pattern

Compact view in alert details page

Screenshot 2024-07-04 at 22 20 45

Expanded view in investigation detail page

Screenshot 2024-07-04 at 22 20 13

Common for all use cases

1. Log rate ⬆️ / ⬇️ (based on all log documents): To find the log rate, we need two time blocks to compare and an internal threshold that indicates a significant increase or decrease in log rate. My suggestion is that we divide the time range into time windows (based on the rule lookback window?), compare each window with the next to see whether logs increased or decreased, and calculate the rate of change. If the change is significant enough, e.g. 1.5x or 2.0x, this would be an event.

When the rule is not log-based, we can check the *log* index pattern, filtered by source/entities and time range, to find relevant logs.

2. Error rate ⬆️ / ⬇️ (based on documents with log.level: error): Same logic as "Log rate", applied to find error rate events.

There could be other fields that contribute to error rate:

3. Related alerts ([3] annotation in event timeline): Show the number of alerts triggered for the same source/entities in the compact view. In the detail view, show a short reason message for each alert, for example "Latency threshold breached" or simply "Latency increased".

4. SLO burn rate alert: If there are SLO burn rate alerts for the same source/entities, show them as events on the event timeline.
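The window comparison suggested for log rate (item 1) and reused for error rate (item 2) could be sketched as follows. This is only an illustration under stated assumptions: document counts per window are already available, the default threshold of 1.5x is taken from the example in item 1, and the names `RateEvent` and `detect_rate_events` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RateEvent:
    window_index: int  # index of the later window in the compared pair
    direction: str     # "increase" or "decrease"
    ratio: float       # later count divided by earlier count


def detect_rate_events(counts: List[int], threshold: float = 1.5) -> List[RateEvent]:
    """Compare each window's document count with the next one.

    An event is emitted when the count grows by at least `threshold` times
    or shrinks to at most 1/threshold of the previous window.
    """
    events: List[RateEvent] = []
    for i in range(1, len(counts)):
        prev, curr = counts[i - 1], counts[i]
        if prev == 0 or curr == 0:
            # Simplification for this sketch: skip pairs with an empty window,
            # since the ratio is undefined or zero and needs its own rule.
            continue
        ratio = curr / prev
        if ratio >= threshold:
            events.append(RateEvent(i, "increase", ratio))
        elif ratio <= 1 / threshold:
            events.append(RateEvent(i, "decrease", ratio))
    return events


# Made-up counts for five consecutive windows.
events = detect_rate_events([100, 110, 220, 210, 100])
```

For error rate, the same function would be fed counts of documents matching log.level: error instead of all log documents.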

Use case specific events

Divide the time range into blocks and, in each block, check a set of fields (depending on the use case) for events to add to the event timeline.

Use case: Log (Custom threshold / Log threshold) alert on kubernetes.pod.uid

Use case: Log (Error count / Custom threshold) alert on service.name

Use case: Log (Custom threshold / Log threshold) alert on container.id

Use case: Log (Custom threshold / Log threshold) alert on host.name
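As a rough sketch of the block-based approach, the snippet below splits a time range into fixed-size blocks and looks up which fields to check for a given group-by field. The group-by fields come from the use cases above; the field names in `FIELDS_BY_GROUP_BY` are placeholders, not real fields proposed in this thread, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Block = Tuple[datetime, datetime]


def split_into_blocks(start: datetime, end: datetime, block: timedelta) -> List[Block]:
    """Divide [start, end) into consecutive blocks; the last one may be shorter."""
    if block <= timedelta(0):
        raise ValueError("block size must be positive")
    blocks: List[Block] = []
    cursor = start
    while cursor < end:
        block_end = min(cursor + block, end)
        blocks.append((cursor, block_end))
        cursor = block_end
    return blocks


# Placeholder mapping: which fields to inspect per use case (keyed by the
# field the alert is grouped on). The values are illustrative only.
FIELDS_BY_GROUP_BY: Dict[str, List[str]] = {
    "kubernetes.pod.uid": ["placeholder.pod_field"],
    "service.name": ["placeholder.service_field"],
    "container.id": ["placeholder.container_field"],
    "host.name": ["placeholder.host_field"],
}


def fields_to_check(group_by: str) -> List[str]:
    """Return the fields to scan in each block for this use case."""
    return FIELDS_BY_GROUP_BY.get(group_by, [])


start = datetime(2024, 7, 4, 20, 0)
end = datetime(2024, 7, 4, 21, 0)
blocks = split_into_blocks(start, end, timedelta(minutes=15))
```

Each block would then be queried for the fields returned by `fields_to_check`, and any hit becomes a candidate event on the timeline.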

mgiota commented 1 week ago

Great work! I think you covered most of the main parts. Two small details that are missing are the invited members and the escalated integrations (Jira ticket, GitHub actions). These don't have to be part of the "first version" of course, but I would still add them to the list, and when we create the subtickets we can prioritize accordingly.

benakansara commented 1 week ago

Suggested observations

Screenshot 2024-07-05 at 11 50 55

These can be app-specific charts (from infra, APM, synthetics) and ML visualizations.

Observations by alert type, grouped by column (common, source: host, source: k8s pod, source: container):

Log
  • common: Log rate analysis, Log pattern analysis
Infra Metric
  • common: Change point detection
  • source: host: Memory usage, CPU usage, Disk usage, Network traffic
  • source: k8s pod: Memory usage, CPU usage, Network traffic
  • source: container: Memory usage, CPU usage
APM
  • common: Throughput, Time spent by span, Transactions table, Error occurrences, Errors table, Service map

benakansara commented 1 week ago

Add new observation

Screenshot 2024-07-05 at 12 06 52

benakansara commented 1 week ago

Investigation timeline

Add the ability to link runbooks in the rule form so that users can add runbook links when creating rules. On the investigation detail page, the runbook link is shown at the top, under Investigation timeline.

Users can create a new hypothesis and start adding notes/screenshots to it. Multiple hypotheses can be created.

Screenshot 2024-07-05 at 12 13 04

benakansara commented 1 week ago

Dashboards

As per the design, for some of the events, users can go to a relevant dashboard. For this, we can allow users to link dashboards in the rule form. If we detect an event related to an entity (for example, container restart, node failure), we can show all dashboard links that users added while creating the rule. This assumes that users linked dashboards related to the monitored entities.

Alternatively, we can create a section on the investigation detail page that shows the dashboard links users added in the rule form, without attaching them to any particular event in the event timeline.
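The two options could be modeled roughly as below. This is a sketch only: the `Rule`, `DashboardLink`, and `TimelineEvent` shapes are hypothetical and do not reflect Kibana's actual rule schema, and the URL is a made-up example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DashboardLink:
    title: str
    url: str


@dataclass
class Rule:
    name: str
    # Links the user added in the rule form (hypothetical field).
    dashboards: List[DashboardLink] = field(default_factory=list)


@dataclass
class TimelineEvent:
    label: str
    entity_related: bool  # e.g. container restart, node failure


def dashboards_for_event(rule: Rule, event: TimelineEvent) -> List[DashboardLink]:
    """Option 1: attach all linked dashboards to entity-related events."""
    return list(rule.dashboards) if event.entity_related else []


def dashboards_section(rule: Rule) -> List[DashboardLink]:
    """Option 2: show all linked dashboards in one section, unattached to events."""
    return list(rule.dashboards)


rule = Rule(
    name="High error rate",
    dashboards=[DashboardLink("Host overview", "https://kibana.example.com/hosts")],
)
restart = TimelineEvent("Container restart", entity_related=True)
spike = TimelineEvent("Log rate increase", entity_related=False)
```

Option 1 relies on the assumption stated above that linked dashboards are about the monitored entities; option 2 avoids that assumption at the cost of less context per event.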

Screenshot 2024-07-04 at 22 20 13

benakansara commented 1 week ago

> Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.

I have updated this comment to add a "Future iterations" section. I added the points you mentioned plus some other topics.

mgiota commented 1 week ago

@benakansara I think you nailed it! I suggest we add a few more charts for the SLO burn rate rule, for example error budget consumption, historical SLI, and good & bad events; basically what we currently have on the SLO detail page. Unless we think some of these charts don't bring much value to the investigation process.

jasonrhodes commented 1 week ago

This is great, thanks so much, @benakansara !