elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[APM][ML] Anomaly status always shows unknown for some services #194106

Open cauemarcondes opened 1 month ago

cauemarcondes commented 1 month ago

On the Service Inventory page, we always try to show metrics from one of the "main" transaction types (request, page_load or mobile). If a service doesn't have one of these, we use the first transaction type returned and add a new column to the Service Inventory table displaying its name:

Image

✔️ Acceptance criteria

1. Must Have

Must be delivered in this issue in order for the release to be valuable

Name | Description
Service Health will show regardless of whether the anomaly is on a main transaction type or a custom one (e.g. foo) | -
Where the service is not "healthy", the Service Inventory will filter the metrics for that service by the transaction type which has the anomalies | -
The service views (e.g. "Overview") will prioritise the transaction type which has anomalies | This ensures the user always sees the transaction type whose anomalies cause the service to not be 'healthy'

4. Will Not Have (for now)

Explicitly will not be looked at within this issue

Name | Description
- | -

The problem

As seen in the image above, the packetbeat service has an unknown health status. That is because, when fetching ML anomalies, we add a filter that only retrieves anomalies from the main transaction types. As a result, any service displayed on the Service overview page that does NOT have one of the main transaction types always has an unknown ML health status.

The solution

Instead of filtering by transaction type, we must group anomalies by transaction type. That way, we can pick the anomaly from the transaction type displayed on the page.
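
To make the proposal concrete, here is a minimal sketch of the grouping idea. It is not the actual Kibana implementation: `AnomalyEntry`, `groupAnomalies` and `scoreForDisplayedType` are hypothetical names, and the flat input shape is an assumption made for illustration.

```typescript
// Hypothetical shape: one entry per (service, transaction type) pair.
interface AnomalyEntry {
  serviceName: string;
  transactionType: string;
  anomalyScore: number | null; // null when there is no anomaly record
}

type ScoresByType = Map<string, number | null>;

// Group entries by service name, then by transaction type,
// instead of dropping everything that is not a "main" type.
function groupAnomalies(entries: AnomalyEntry[]): Map<string, ScoresByType> {
  const byService = new Map<string, ScoresByType>();
  for (const { serviceName, transactionType, anomalyScore } of entries) {
    const byType: ScoresByType = byService.get(serviceName) ?? new Map();
    byType.set(transactionType, anomalyScore);
    byService.set(serviceName, byType);
  }
  return byService;
}

// Pick the score for the transaction type that the page is displaying.
// Returns null when nothing was recorded for that type.
function scoreForDisplayedType(
  grouped: Map<string, ScoresByType>,
  serviceName: string,
  displayedType: string
): number | null {
  return grouped.get(serviceName)?.get(displayedType) ?? null;
}
```

With this shape, a service that only has a custom transaction type still gets a lookup result, because its anomalies are no longer filtered out before the page asks for them.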

Question

When a service has two transaction types, for example request and worker, we show metrics from the request transaction type by default. But what if the worker transaction type has a major or critical ML anomaly? Should we switch and show metrics from the transaction type with the more severe anomaly?

Like in this example: Image

Service b is showing metrics from the request transaction type (as there's no transaction type column on the table), but it has a critical ML health status coming from the worker transaction type.

ML response:

      "buckets": [
        {
          "key": {
            "serviceName": "a",
            "transactionType": "request",
            "jobId": "apm-production-799e-apm_tx_metrics"
          },
          "doc_count": 685,
          "metrics": {
            "top": [
              {
                "sort": [
                  90.73150283604922
                ],
                "metrics": {
                  "actual": 1000000,
                  "by_field_value": "request",
                  "result_type": "record",
                  "record_score": 90.73150283604922
                }
              }
            ]
          }
        },
        {
          "key": {
            "serviceName": "b",
            "transactionType": "Worker",
            "jobId": "apm-production-799e-apm_tx_metrics"
          },
          "doc_count": 685,
          "metrics": {
            "top": [
              {
                "sort": [
                  67.97050919994875
                ],
                "metrics": {
                  "actual": 1000000,
                  "by_field_value": "Worker",
                  "result_type": "record",
                  "record_score": 67.97050919994875
                }
              }
            ]
          }
        },
        {
          "key": {
            "serviceName": "b",
            "transactionType": "request",
            "jobId": "apm-production-799e-apm_tx_metrics"
          },
          "doc_count": 671,
          "metrics": {
            "top": [
              {
                "sort": [
                  "-Infinity"
                ],
                "metrics": {
                  "actual": 100000,
                  "by_field_value": "request",
                  "result_type": "model_plot",
                  "record_score": null
                }
              }
            ]
          }
        }
      ]
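
If we decide to prioritise the transaction type with the most severe anomaly, the selection over a response like the one above could look roughly like this. This is only a sketch: the `Bucket` type mirrors just the fields visible in the response, `topAnomalyByService` is a hypothetical helper, and treating a null `record_score` as "no anomaly" is an assumption.

```typescript
// Only the fields visible in the response above are modelled here.
interface Bucket {
  key: { serviceName: string; transactionType: string; jobId: string };
  metrics: { top: Array<{ metrics: { record_score: number | null } }> };
}

interface TopAnomaly {
  transactionType: string;
  score: number;
}

// For each service, keep the transaction type with the highest record_score.
function topAnomalyByService(buckets: Bucket[]): Map<string, TopAnomaly> {
  const result = new Map<string, TopAnomaly>();
  for (const bucket of buckets) {
    const score = bucket.metrics.top[0]?.metrics.record_score;
    // Assumption: a missing or null score (e.g. a model_plot-only bucket)
    // means there is no anomaly record for this transaction type.
    if (score === null || score === undefined) continue;
    const { serviceName, transactionType } = bucket.key;
    const current = result.get(serviceName);
    if (current === undefined || score > current.score) {
      result.set(serviceName, { transactionType, score });
    }
  }
  return result;
}
```

Applied to the three buckets above, service a resolves to request and service b resolves to Worker, since the request bucket for b carries a null record_score.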
elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

roshan-elastic commented 1 month ago

Update

Prioritising as 'low' until we can get more information on when this happens (see comment).

Context: Trying to understand whether this is an edge case or a common problem for many use cases/customers.

roshan-elastic commented 1 month ago

After discussion with @jennypavlova and @cauemarcondes, moved this to the top of the refinement column and added acceptance criteria (AC) to the description.

This can be updated during refinement if needed.