elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[APM] ML integration: Dynamic Baselines #18472

Closed elasticmachine closed 6 years ago

elasticmachine commented 6 years ago

Original comment by @makwarth:

Updated March 23, 2018 after call with: @jgowdyelastic @blaklaybul @sqren @formgeist


Status

The ML team has created four ML jobs for APM:

The ML team has also created a known configuration for APM, so it's easy to get started with any of the above jobs. These jobs are super useful in themselves, but this issue focuses on integrating the ML apm-high_response_time job into the APM UI. Later we could have UI integrations for all four jobs.

On our call, we also discussed the possibility of a fifth job to detect anomalous transactions with regard to their span count, or a shift in which span types account for the most time.

First APM UI / ML integration feature: Dynamic Baselines

We want to integrate with ML to provide users with an (opt-in) dynamic baseline on service response time graphs in the APM UI. This will let users tell whether the current performance is as expected or abnormal. To enable this feature, platinum users will simply click a button to enable the ML job for the active APM service. The stretch goal is to have this feature done by 6.4.

Mockup of step 1: screen_shot_2018-03-14_at_15_43_30

Mockup of step 2: screen_shot_2018-03-14_at_15_43_30

High-level todo

Please comment / add on stuff I've forgotten.

TBD

elasticmachine commented 6 years ago

Original comment by @makwarth:

FYI @droberts195 @sophiec20 @stevedodson

elasticmachine commented 6 years ago

Original comment by @droberts195:

Is there an API to delete historic ML results? Do we need it? (Probably not)

We have a results_retention_days setting in the job config - search for it in https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-job-resource.html

So there's no API to request results be immediately deleted (other than deleting the entire job), but you can control how long after creation the automatic cleanup process will delete them.
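As a rough illustration of the setting mentioned above, a job body carrying results_retention_days could be assembled like this. This is only a sketch: the description, detector, bucket span, time field and the 30-day value are made up for the example; only the results_retention_days setting itself comes from the comment.

```python
import json


def job_config_with_retention(description, retention_days):
    """Return a minimal ML job body that sets results_retention_days.

    Everything except results_retention_days is a placeholder chosen
    for illustration, not the actual APM job definition.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "description": description,
        "analysis_config": {
            "bucket_span": "15m",  # placeholder value
            "detectors": [
                {"function": "high_mean", "field_name": "transaction.duration.us"}
            ],
        },
        "data_description": {"time_field": "@timestamp"},  # assumed time field
        # Results older than this many days become eligible for automatic cleanup.
        "results_retention_days": retention_days,
    }


config = job_config_with_retention("APM response time (example)", 30)
print(json.dumps(config, indent=2))
```

With this in place there is still no immediate delete: the value only controls how old results must be before the cleanup process removes them.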

elasticmachine commented 6 years ago

Original comment by @makwarth:

Got it, thanks @droberts195

elasticmachine commented 6 years ago

Original comment by @makwarth:

@jgowdyelastic Hey, just checking in regarding the added feature of attaching a query to the ML endpoint? (To filter by APM service in the APM index)

elasticmachine commented 6 years ago

Original comment by @jgowdyelastic:

@makwarth PR LINK REDACTED was merged this morning. So it is now possible to override the query object when calling our setup endpoint.

Setup module items: POST /api/ml/modules/setup/ (e.g. POST /api/ml/modules/setup/nginx)

Payload:

{
  "prefix": "new_",
  "indexPatternName": "filebeat-*",
  "query": {
    "bool": {
      "filter": [{
          "term": {
            "fileset.module": "nginx"
          }
        },
        {
          "term": {
            "fileset.name": "access"
          }
        }
      ]
    }
  }
}

The prefix specifies optional characters to be prepended to all the job IDs.
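For the APM case, the same payload shape could be built per service. The sketch below only constructs the path and body and does not send anything. The module ID "apm_example", the index pattern "apm-*" and the service field name "context.service.name" are assumptions for illustration, not names confirmed in this thread.

```python
import json


def build_setup_request(module_id, index_pattern, service_field, service_name, prefix=""):
    """Build the path and payload for the ML module setup endpoint.

    The query override restricts the jobs to documents of one service.
    service_field is passed in by the caller because the real APM field
    name is not stated in this thread.
    """
    path = f"/api/ml/modules/setup/{module_id}"
    payload = {
        "prefix": prefix,
        "indexPatternName": index_pattern,
        "query": {
            "bool": {
                "filter": [
                    {"term": {service_field: service_name}},
                ]
            }
        },
    }
    return path, payload


# Hypothetical values: module ID, index pattern and field name are assumptions.
path, payload = build_setup_request(
    module_id="apm_example",
    index_pattern="apm-*",
    service_field="context.service.name",
    service_name="opbeans-node",
    prefix="opbeans-node-",
)
print("POST", path)
print(json.dumps(payload, indent=2))
```

Using the service name as the prefix is one way to keep the job IDs of different services apart; whether that is the right naming scheme is an open design choice.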

elasticmachine commented 6 years ago

Original comment by @makwarth:

@jgowdyelastic Oh, terrific! Thanks /cc @sqren

elasticmachine commented 6 years ago

Original comment by @makwarth:

Update after call with @blaklaybul re the ML/APM job:

cc @roncohen @formgeist

elasticmachine commented 6 years ago

Original comment by @formgeist:

I've updated my design card with some initial screens and a clickable prototype in InVision. This card is referenced there, so perhaps we can close this and keep the discussion going in the design card until we start implementation? LINK REDACTED

elasticmachine commented 6 years ago

Original comment by @stevedodson:

@blaklaybul - based on the datasets we've analysed so far, we should validate whether a 1m bucketspan is effective, and what its limitations are in general.

For example, for the APM data we demoed at Elastic{ON}18, I get these results with a 1m and a 15m bucketspan:

screen shot 2018-04-19 at 09 35 50

Zooming into the anomaly at 1m on the right hand side I get:

screen shot 2018-04-19 at 09 38 45

at 1m and

screen shot 2018-04-19 at 09 39 11

at 15m.

Some comments:

(This also leads into thoughts about baselining response times for different endpoints in addition to this, to help mitigate the general variance.)

Ideally we can experiment to find the optimal configuration on a corpus of diverse real data. Until this is available, it may be better to implement a job based on a longer bucketspan, potentially with interim results?

elasticmachine commented 6 years ago

Original comment by @tveasey:

The best approach we have available to us at the moment for dealing with the effect of event rate variation on the 95th percentile is to use the mean aggregation and pass the summary count field along with the value. This isn't quite right, but we don't have a better option available without a native percentile function in the backend. Hopefully, appropriate partitioning of the data (into classes of similar requests) should increase the homogeneity of requests and mitigate this sort of problem.

Longer bucket lengths should also help with this problem: they are more likely to contain a representative sample of the population. Also, beware that the 95th percentile will be more susceptible to this than some other statistics since the outlying requests only need to constitute 5% of the bucket values to move this statistic. You may want to use longer bucket lengths for this statistic compared to say the median.
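To make the point about bucket size concrete, here is a small self-contained sketch with made-up numbers (100 ms normal requests, one 2000 ms slow request). It uses a simple nearest-rank percentile, which is an illustration choice and not how the ML backend or Elasticsearch computes percentiles.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def make_bucket(n_normal, n_slow, normal_ms=100, slow_ms=2000):
    """Synthetic bucket of response times with a few slow requests."""
    return [normal_ms] * n_normal + [slow_ms] * n_slow


# Same single slow request, seen in a short (low count) and a long bucket.
short_bucket = make_bucket(n_normal=9, n_slow=1)     # 10 requests, 10% slow
long_bucket = make_bucket(n_normal=149, n_slow=1)    # 150 requests, under 1% slow

p95_short = percentile(short_bucket, 95)
p95_long = percentile(long_bucket, 95)
median_short = percentile(short_bucket, 50)

print("p95, short bucket:", p95_short)
print("p95, long bucket:", p95_long)
print("median, short bucket:", median_short)
```

In the short bucket the single slow request makes up more than 5% of the values, so it becomes the 95th percentile, while the median of the same bucket and the 95th percentile of the long bucket stay at the normal value.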

Going forward we could also consider creating a new rule type which excludes low count buckets altogether (this will need to be scale invariant, i.e. expressed as a function of mean bucket rate) if this proves to be important for APM. Any feedback here would be useful.

elasticmachine commented 6 years ago

Original comment by @stevedodson:

i.e.

"analysis_config": {
        "bucket_span": <bucket_span>,
        "summary_count_field_name": "doc_count",
        "detectors": [
          {
            "detector_description": "high 95th percentile of transaction duration",
            "function": "high_mean",
            "field_name": "transaction.duration.us"
          }
        ]
    }
...
"aggregations": {
      "buckets": {
        "date_histogram": {
          "field": EMAIL REDACTED
          "interval": <bucket_span>
        },
        "aggregations": {
          "transaction.duration.us": {
            "percentiles": {
              "field": "transaction.duration.us",
              "percents": [
                95
              ]
            }
          }
        }
      }
    }
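The two fragments above share the same <bucket_span> value, so one way to keep them in sync is to generate both from a single parameter. This is a sketch only: the helper name is hypothetical, and the time field is passed in by the caller because its value is not visible above ("@timestamp" in the usage lines is an assumption).

```python
import json


def build_apm_job_fragments(bucket_span, time_field):
    """Return (analysis_config, aggregations) sharing one bucket span.

    The detector models the pre-aggregated 95th percentile with
    high_mean, and doc_count is passed as the summary count field.
    """
    duration_field = "transaction.duration.us"
    analysis_config = {
        "bucket_span": bucket_span,
        "summary_count_field_name": "doc_count",
        "detectors": [
            {
                "detector_description": "high 95th percentile of transaction duration",
                "function": "high_mean",
                "field_name": duration_field,
            }
        ],
    }
    aggregations = {
        "buckets": {
            "date_histogram": {"field": time_field, "interval": bucket_span},
            "aggregations": {
                duration_field: {
                    "percentiles": {"field": duration_field, "percents": [95]}
                }
            },
        }
    }
    return analysis_config, aggregations


# "@timestamp" is an assumed time field; "15m" is one of the spans discussed.
analysis_config, aggregations = build_apm_job_fragments("15m", "@timestamp")
print(json.dumps({"analysis_config": analysis_config, "aggregations": aggregations}, indent=2))
```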

makwarth commented 6 years ago

Closing this in favour of https://github.com/elastic/kibana/issues/18569 which summarizes the above discussion.