jasonrhodes opened 2 weeks ago
Main components on the Investigation detail page when starting a new investigation:
I'll put details of each component in comments below.
Note: The initial timerange would be the same as the timerange used for the main chart on the alert details page.
In the merged state, the y-axis is not shown. All y-values are normalized so that different charts can be correlated.
Compact view in alert details page
Expanded view in investigation detail page
1. Log rate ⬆️ / ⬇️ (based on all log documents)

To find the log rate, we need two time blocks to compare and an internal threshold that indicates a significant increase or decrease in log rate. My suggestion is that we divide the timerange into blocks of time windows (based on the rule lookback window?) and compare each time window with the next to find whether logs increased or decreased, and calculate the rate at which they did. If it is significant enough, e.g. 1.5x or 2.0x, this would be an event.
When the rule is not log based, we can check the *log* index pattern, filtered by source/entities and time range, to find relevant logs.
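The window-comparison idea above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the names (`detectRateEvents`, `LogRateEvent`) and the default 1.5x threshold are assumptions; the inputs are per-window document counts as would come from a date histogram sized to the rule's lookback window.

```typescript
interface LogRateEvent {
  windowStart: number; // epoch ms of the window where the change was detected
  direction: "increase" | "decrease";
  ratio: number; // factor by which the count changed vs. the previous window
}

// Compare each time window with the previous one; emit an event when the
// change is "significant enough" (e.g. 1.5x or 2.0x, per the suggestion above).
function detectRateEvents(
  counts: number[],
  windowStartTimes: number[],
  threshold = 1.5
): LogRateEvent[] {
  const events: LogRateEvent[] = [];
  for (let i = 1; i < counts.length; i++) {
    const prev = counts[i - 1];
    if (prev === 0) continue; // no baseline to compare against
    const ratio = counts[i] / prev;
    if (ratio >= threshold) {
      events.push({ windowStart: windowStartTimes[i], direction: "increase", ratio });
    } else if (ratio <= 1 / threshold) {
      events.push({ windowStart: windowStartTimes[i], direction: "decrease", ratio });
    }
  }
  return events;
}
```

A flat window-to-window comparison like this is simple, but it will fire on each step of a gradual ramp; comparing against a longer baseline block is a possible refinement.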
2. Error rate ⬆️ / ⬇️ (based on documents with `log.level: error`)

Same logic as "Log rate" to find error rate events.
There could be other fields that contribute to the error rate, for example `http.response.status_code`.
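A minimal sketch of what "contributes to the error rate" could look like as a document predicate. The status-code cutoff (>= 500) is an assumption on my part; the issue only names the field.

```typescript
interface LogDoc {
  "log.level"?: string;
  "http.response.status_code"?: number;
}

// A document counts toward the error rate if it is an error-level log,
// or (assumed) carries a server-error HTTP status code.
function isErrorDoc(doc: LogDoc): boolean {
  if (doc["log.level"] === "error") return true;
  const status = doc["http.response.status_code"];
  return status !== undefined && status >= 500;
}
```

The same windowed comparison used for the log rate would then run over the counts of documents matching this predicate.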
3. Related alerts ([3] annotation in event timeline)
Show the number of alerts triggered for the same source/entities in the compact view. In the detail view, show a short reason message for each of the alerts, for example "Latency threshold breached" or simply "Latency increased".
4. SLO burn rate alert

If there are SLO burn rate alerts for the same source/entities, show them as events on the event timeline.
Divide the timerange into blocks and check each block for a set of fields to add to the event timeline, depending on the use case.
Use case: Log (Custom threshold / Log threshold) alert on `kubernetes.pod.uid`

- `kubernetes.event.reason`: Unhealthy
- `kubernetes.container.status.restarts` > 5? (container restarts count)
- `kubernetes.container.status.phase`: terminated/waiting
- `kubernetes.container.status.reason`: Error, OOMKilled state

Use case: Log (Error count / Custom threshold) alert on `service.name`

- `service.version` change - need to compare each block to find if there is any version upgrade/downgrade history
- `service.state`: failed (need to confirm what the possible values are)

Use case: Log (Custom threshold / Log threshold) alert on `container.id`

Use case: Log (Custom threshold / Log threshold) alert on `host.name`
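One way to organize the per-block checks above is a map from the alert's source field to the fields worth inspecting in each block. This is a hypothetical shape (the `BlockCheck` type and `checksBySource` name are mine); the field names and conditions come from the use cases listed above.

```typescript
type BlockCheck =
  | { field: string; expect: string[] }  // field value should match one of these
  | { field: string; gt: number }        // numeric value above a threshold
  | { field: string; changed: true };    // compare adjacent blocks for a value change

// Checks keyed by the field the alert is grouped on.
const checksBySource: Record<string, BlockCheck[]> = {
  "kubernetes.pod.uid": [
    { field: "kubernetes.event.reason", expect: ["Unhealthy"] },
    { field: "kubernetes.container.status.restarts", gt: 5 },
    { field: "kubernetes.container.status.phase", expect: ["terminated", "waiting"] },
    { field: "kubernetes.container.status.reason", expect: ["Error", "OOMKilled"] },
  ],
  "service.name": [
    { field: "service.version", changed: true }, // detect upgrades/downgrades
    { field: "service.state", expect: ["failed"] },
  ],
};
```

The `container.id` and `host.name` use cases would get their own entries once their field lists are defined.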
These can be app-specific charts - from infra, APM, synthetics - and ML visualizations.
| Alert type | common | source: host | source: k8s pod | source: container |
|---|---|---|---|---|
| Log | | | | |
| Infra Metric | | | | |
| APM | | | | |
Add the ability to link runbooks in the rule form so that users can add runbook links when creating rules. On the investigation detail page, the runbook link is shown at the top under the investigation timeline.
Users can create a new hypothesis and start adding notes/screenshots to it. Multiple hypotheses can be created.
As per the design, for some of the events users have the possibility to go to a relevant dashboard. For this, we can allow users to link dashboards in the rule form. If we detect an event related to an entity (for example, a container restart or node failure), we can show all dashboard links that users added while creating the rule. This is under the assumption that users linked dashboards related to monitoring those entities.
Alternatively, we can create a section on the investigation detail page that shows the dashboard links users added in the rule form, without attaching them to any particular event in the event timeline.
Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.
I have updated this comment to add a "future iterations" section. I added the points you mentioned plus some other topics.
@benakansara I think you nailed it! I suggest we add a few more charts for SLO burn rate rule, for example error budget consumption, historical SLI, good & bad events, basically what we currently have in the SLO detail page. Unless we think some of these charts don't bring that value in the investigation process.
This is great, thanks so much, @benakansara !
@mgiota I just synced with @kdelemme and @benakansara and they've got some good context now from the other POC, so they're going to jump in on these investigation UI side tickets. Feel free to continue to be involved as we refine these (asking questions, syncing between the entry point flow and this flow). Thanks all.