jasonrhodes opened 2 weeks ago
Main components on the Investigation detail page when starting a new investigation:
I'll put details of each component in comments below.
Note: The initial timerange would be the same as the timerange used for the main chart on the alert details page.
In the merged state, the y-axis is not shown. All y-values are normalized so that different charts can be correlated.
Compact view in alert details page
Expanded view in investigation detail page
1. Log rate ⬆️ / ⬇️ (based on all log documents)

To find the log rate, we need two time blocks to compare and an internal threshold that indicates a significant increase or decrease in log rate. My suggestion is that we divide the timerange into blocks of time windows (based on the rule lookback window?) and compare each time window with the next to find whether logs increased or decreased, and calculate the rate at which they did. If it is significant enough, e.g. 1.5x or 2.0x, this would be an event.
When the rule is not log based, we can check the *log* index pattern, filtered by source/entities and time range, to find relevant logs.
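The window-comparison idea above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the names (`detectRateEvents`, `LogRateEvent`) and the default 1.5x threshold are assumptions; the inputs are per-window document counts as would come from a date histogram sized to the rule's lookback window.

```typescript
interface LogRateEvent {
  windowStart: number; // epoch ms of the window where the change was detected
  direction: "increase" | "decrease";
  ratio: number; // factor by which the count changed vs. the previous window
}

// Compare each time window with the previous one; emit an event when the
// change is "significant enough" (e.g. 1.5x or 2.0x, per the suggestion above).
function detectRateEvents(
  counts: number[],
  windowStartTimes: number[],
  threshold = 1.5
): LogRateEvent[] {
  const events: LogRateEvent[] = [];
  for (let i = 1; i < counts.length; i++) {
    const prev = counts[i - 1];
    if (prev === 0) continue; // no baseline to compare against
    const ratio = counts[i] / prev;
    if (ratio >= threshold) {
      events.push({ windowStart: windowStartTimes[i], direction: "increase", ratio });
    } else if (ratio <= 1 / threshold) {
      events.push({ windowStart: windowStartTimes[i], direction: "decrease", ratio });
    }
  }
  return events;
}
```

A flat window-to-window comparison like this is simple, but it will fire on each step of a gradual ramp; comparing against a longer baseline block is a possible refinement.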
2. Error rate ⬆️ / ⬇️ (based on documents with `log.level: error`)

Same logic as "Log rate" to find error rate events.
There could be other fields that contribute to the error rate, for example `http.response.status_code`.
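A minimal sketch of what "contributes to the error rate" could look like as a document predicate. The status-code cutoff (>= 500) is an assumption on my part; the issue only names the field.

```typescript
interface LogDoc {
  "log.level"?: string;
  "http.response.status_code"?: number;
}

// A document counts toward the error rate if it is an error-level log,
// or (assumed) carries a server-error HTTP status code.
function isErrorDoc(doc: LogDoc): boolean {
  if (doc["log.level"] === "error") return true;
  const status = doc["http.response.status_code"];
  return status !== undefined && status >= 500;
}
```

The same windowed comparison used for the log rate would then run over the counts of documents matching this predicate.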
3. Related alerts ([3] annotation in event timeline)
Show the number of alerts triggered for the same source/entities in the compact view. In the detail view, show a short reason message for each of the alerts, for example "Latency threshold breached" or simply "Latency increased".
4. SLO burn rate alert

If there are SLO burn rate alerts for the same source/entities, show them as events on the event timeline.
Divide the timerange into blocks and check each block for a set of fields to add to the event timeline, depending on the use case.
Use case: Log (Custom threshold / Log threshold) alert on `kubernetes.pod.uid`

- `kubernetes.event.reason`: Unhealthy
- `kubernetes.container.status.restarts` > 5? (container restarts count)
- `kubernetes.container.status.phase`: terminated/waiting
- `kubernetes.container.status.reason`: Error, OOMKilled state

Use case: Log (Error count / Custom threshold) alert on `service.name`

- `service.version` change - need to compare each block to find if there is any version upgrade/downgrade history
- `service.state`: failed (need to confirm what the possible values are)

Use case: Log (Custom threshold / Log threshold) alert on `container.id`

Use case: Log (Custom threshold / Log threshold) alert on `host.name`
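One way to organize the per-block checks above is a map from the alert's source field to the fields worth inspecting in each block. This is a hypothetical shape (the `BlockCheck` type and `checksBySource` name are mine); the field names and conditions come from the use cases listed above.

```typescript
type BlockCheck =
  | { field: string; expect: string[] }  // field value should match one of these
  | { field: string; gt: number }        // numeric value above a threshold
  | { field: string; changed: true };    // compare adjacent blocks for a value change

// Checks keyed by the field the alert is grouped on.
const checksBySource: Record<string, BlockCheck[]> = {
  "kubernetes.pod.uid": [
    { field: "kubernetes.event.reason", expect: ["Unhealthy"] },
    { field: "kubernetes.container.status.restarts", gt: 5 },
    { field: "kubernetes.container.status.phase", expect: ["terminated", "waiting"] },
    { field: "kubernetes.container.status.reason", expect: ["Error", "OOMKilled"] },
  ],
  "service.name": [
    { field: "service.version", changed: true }, // detect upgrades/downgrades
    { field: "service.state", expect: ["failed"] },
  ],
};
```

The `container.id` and `host.name` use cases would get their own entries once their field lists are defined.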
These can be app-specific charts - from infra, APM, synthetics - and ML visualizations.
| Alert type | common | source: host | source: k8s pod | source: container |
|---|---|---|---|---|
| Log | | | | |
| Infra Metric | | | | |
| APM | | | | |
Add the ability to link runbooks in the rule form so that users can add runbook links when creating rules. On the investigation detail page, the runbook link is shown at the top under the investigation timeline.
Users can create a new hypothesis and start adding notes/screenshots to it. Multiple hypotheses can be created.
As per the design, for some of the events users have the possibility to go to a relevant dashboard. For this, we can allow users to link dashboards in the rule form. If we detect an event related to an entity (for example, a container restart or node failure), we can show all dashboard links that users added while creating the rule. This is under the assumption that users linked dashboards related to monitoring those entities.
Alternatively, we can create a section on the investigation detail page that shows the dashboard links users added in the rule form, without attaching them to any particular event in the event timeline.
Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.
I have updated this comment to add a "future iterations" section. I added the points you mentioned plus some other topics.
@benakansara I think you nailed it! I suggest we add a few more charts for SLO burn rate rule, for example error budget consumption, historical SLI, good & bad events, basically what we currently have in the SLO detail page. Unless we think some of these charts don't bring that value in the investigation process.
This is great, thanks so much, @benakansara !
@mgiota I just synced with @kdelemme and @benakansara and they've got some good context now from the other POC, so they're going to jump in on these investigation UI side tickets. Feel free to continue to be involved as we refine these (asking questions, syncing between the entry point flow and this flow). Thanks all.