[APM] ML integration: Dynamic Baselines

elasticmachine commented 6 years ago

Original comment by @makwarth:

Updated March 23, 2018 after call with: @jgowdyelastic @blaklaybul @sqren @formgeist

Status

The ML team has created four ML jobs for APM:

apm-high_response_time
apm-unusual_errors
apm-unusual_request_rate
apm-unusual_users

The ML team has also created a known configuration for APM, so it's easy to get started with any of the above jobs. These jobs are super useful on their own, but this issue focuses on integrating the ML apm-high_response_time job into the APM UI. Later we could have UI integrations for all four jobs.

On our call, we also discussed the possibility of a fifth job to detect anomalous transactions based on their span count or on a shift in which span types account for the most time.

First APM UI / ML integration feature: Dynamic Baselines

We want to integrate with ML to provide users with an (opt-in) dynamic baseline on service response time graphs in the APM UI. This will enable users to tell whether the current performance is as expected or abnormal. To enable this feature, platinum users will simply click a button to enable the ML job for the active APM service. The stretch goal is to have this feature done by 6.4.

Mockup of step 1: screen_shot_2018-03-14_at_15_43_30

Mockup of step 2: screen_shot_2018-03-14_at_15_43_30

High-level todo

[x] APM to call ML API to create new job for a given service
- [x] To be able to filter an index by APM service, we need to add an ES query. The ML team to add this functionality
- [x] Rasmus to research what detector to use
[x] Figure out where ML results should be indexed
- Probably in the shared ml-anomaly index (instead of dedicated index)
[x] Results will have an ID, like <ml-job-name>-apm-<apm-service-name>.
- APM service names are unique.
[x] APM users can stop and start ML jobs via API from the APM UI.
- Is there an API to delete historic ML results? Do we need it? (Probably not)
- See LINK REDACTED
[x] APM UI team to figure out if react-viz supports event annotations
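
A minimal sketch of the ID scheme from the todo list above, `<ml-job-name>-apm-<apm-service-name>`. The helper name `apm_job_id` and the normalization step are illustrative assumptions, not code from the APM UI. The normalization is there because ML job IDs only allow lowercase alphanumerics, hyphens, and underscores.

```python
import re


def apm_job_id(ml_job_name: str, service_name: str) -> str:
    """Build an ID of the form <ml-job-name>-apm-<apm-service-name>."""
    # Assumption: lowercase the service name and replace every character
    # outside [a-z0-9_-] with "_" so the result is a valid ML job ID.
    normalized = re.sub(r"[^a-z0-9_-]", "_", service_name.lower())
    return f"{ml_job_name}-apm-{normalized}"


print(apm_job_id("high_response_time", "Opbeans Node"))
# prints: high_response_time-apm-opbeans_node
```

Note that normalizing can map two distinct service names to the same ID (e.g. "Opbeans Node" and "opbeans_node"), so a real implementation would have to handle collisions.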

Please comment or add anything I've forgotten.

TBD

Should we also (automatically or via the new APM flyout) create a Watch when we create the ML job?

elasticmachine commented 6 years ago

Original comment by @makwarth:

FYI @droberts195 @sophiec20 @stevedodson

elasticmachine commented 6 years ago

Original comment by @droberts195:

Is there an API to delete historic ML results? Do we need it? (Probably not)

We have a results_retention_days setting in the job config - search for it in https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-job-resource.html

So there's no API to request results be immediately deleted (other than deleting the entire job), but you can control how long after creation the automatic cleanup process will delete them.

elasticmachine commented 6 years ago

Original comment by @makwarth:

Got it, thanks @droberts195

elasticmachine commented 6 years ago

Original comment by @makwarth:

@jgowdyelastic Hey, just checking in regarding the added feature of attaching a query to the ML endpoint? (To filter by APM service in the APM index)

elasticmachine commented 6 years ago

Original comment by @jgowdyelastic:

@makwarth PR LINK REDACTED was merged this morning. So it is now possible to override the query object when calling our setup endpoint.

Setup module items
POST /api/ml/modules/setup/
e.g. POST /api/ml/modules/setup/nginx
Payload:

{
  "prefix": "new_",
  "indexPatternName": "filebeat-*",
  "query": {
    "bool": {
      "filter": [{
          "term": {
            "fileset.module": "nginx"
          }
        },
        {
          "term": {
            "fileset.name": "access"
          }
        }
      ]
    }
  }
}

The prefix specifies optional characters to be prepended to all the job IDs.
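
For the APM case, the same payload shape could be built per service. This is only a sketch: the function name `build_setup_payload`, the `apm-*` index pattern, and the `context.service.name` field are assumptions for illustration and are not confirmed in this thread. It only builds the request body and does not call the endpoint.

```python
import json


def build_setup_payload(service_name, index_pattern="apm-*", prefix="",
                        service_field="context.service.name"):
    """Build a setup payload whose query filters the index by one APM service.

    Assumptions: the default index pattern and the service field name are
    placeholders; swap in whatever the APM indices actually use.
    """
    return {
        "prefix": prefix,
        "indexPatternName": index_pattern,
        "query": {
            "bool": {
                "filter": [
                    {"term": {service_field: service_name}}
                ]
            }
        },
    }


payload = build_setup_payload("opbeans-node", prefix="opbeans-node-")
print(json.dumps(payload, indent=2))
```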

elasticmachine commented 6 years ago

Original comment by @makwarth:

@jgowdyelastic Oh, terrific! Thanks /cc @sqren

elasticmachine commented 6 years ago

Original comment by @makwarth:

Update after call with @blaklaybul re the ML/APM job:

We'll use the 95th percentile for the ML job.
Bucket span will be set to 1 minute.
ML results contain min/max bounds and timestamp annotations with severity level.
APM to experiment with a simpler visualization (compared to ML's visualization): a line graph based on the average of the bounds, a red chart background on periods with anomalies, and anomaly severity annotations in the hover tooltip. Will share when we have a mockup.
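
A rough sketch of how the chart data for that simpler visualization could be derived. The bucket shape (`timestamp`, `lower`, `upper`, `score`) is a hypothetical simplification, not the actual ML result format, and both function names are made up for illustration.

```python
def baseline_series(buckets):
    """Return (timestamp, midpoint of the bounds) pairs for the line graph."""
    return [(b["timestamp"], (b["lower"] + b["upper"]) / 2) for b in buckets]


def anomaly_periods(buckets, bucket_span_ms, min_score=0.0):
    """Merge consecutive anomalous buckets into (start, end, max_score) periods.

    Each period can be drawn as a red background; max_score can feed the
    severity shown in the hover tooltip.
    """
    periods = []
    for b in sorted(buckets, key=lambda b: b["timestamp"]):
        if b["score"] <= min_score:
            continue
        start = b["timestamp"]
        end = start + bucket_span_ms
        if periods and periods[-1][1] == start:
            # This bucket directly follows the previous anomalous one: extend it.
            prev_start, _, prev_score = periods[-1]
            periods[-1] = (prev_start, end, max(prev_score, b["score"]))
        else:
            periods.append((start, end, b["score"]))
    return periods


buckets = [
    {"timestamp": 0, "lower": 100.0, "upper": 300.0, "score": 0.0},
    {"timestamp": 60000, "lower": 100.0, "upper": 300.0, "score": 42.0},
    {"timestamp": 120000, "lower": 110.0, "upper": 310.0, "score": 80.0},
    {"timestamp": 180000, "lower": 110.0, "upper": 310.0, "score": 0.0},
]
print(anomaly_periods(buckets, bucket_span_ms=60000))
# prints: [(60000, 180000, 80.0)]
```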

cc @roncohen @formgeist

elasticmachine commented 6 years ago

Original comment by @formgeist:

I've updated my design card with some initial screens and a clickable prototype in InVision. This card is referenced, so perhaps we can close this and keep the discussion going in the design card until we start implementation? LINK REDACTED

elasticmachine commented 6 years ago

Original comment by @stevedodson:

@blaklaybul - based on the datasets we've analysed so far, we should validate whether a 1m bucketspan is effective and what its limitations are in general.

For example, for the APM data we demoed at Elastic{ON}18, I get these results with a 1m and a 15m bucketspan:

screen shot 2018-04-19 at 09 35 50

Zooming into the anomaly at 1m on the right hand side I get:

screen shot 2018-04-19 at 09 38 45

at 1m and

screen shot 2018-04-19 at 09 39 11

at 15m.

Some comments:

In general, bucketspan should be set to a similar size to the anomalies we care about. Do we care if the 95th percentile responsetime increases for 1m and then goes back down? Do we care more if the service slows down over a larger period of time (e.g. 5m, 15m, 1h)?
I assume there is generally natural variance in responsetime across endpoints (e.g. some endpoints have higher responsetimes) + there is a natural variance in request rate for endpoints. If the number of requests in a bucket is small, and/or the variance in responsetimes is high, there may be a large number of anomalies that are just reflecting normal behaviour.
We're currently not accounting for request rates (doc_count) in responsetime (which can help with the issue above) - @tveasey what is the best way to do this with the 95th percentile aggregation?
Longer bucketspans can result in longer time to alert (e.g. 15m to alert). Querying interim results can mitigate this so significant deviations can be identified prior to the end of the bucket.

(this also leads into thoughts around baselining responsetimes from different endpoints in addition to this - to help mitigate the general variance)

Ideally we can experiment with the optimal configuration on a corpus of diverse real data. Until this is available, it may be better to implement a job based on a longer bucketspan - potentially with interim results?
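
To make the low-count concern concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the durations are synthetic log-normal samples (not APM data), the rate is fixed at 20 requests per minute, and the percentile uses the nearest-rank method. It compares how much the per-bucket 95th percentile moves around with 1m versus 15m buckets when nothing about the service changes.

```python
import math
import random
import statistics


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def p95_spread(durations, per_bucket):
    """Standard deviation of the per-bucket p95 across full buckets."""
    buckets = [
        durations[i:i + per_bucket]
        for i in range(0, len(durations) - per_bucket + 1, per_bucket)
    ]
    return statistics.pstdev([p95(b) for b in buckets])


rng = random.Random(0)
# Synthetic, stationary "response times": no real anomalies are present.
durations = [rng.lognormvariate(5.0, 1.0) for _ in range(18000)]

spread_1m = p95_spread(durations, per_bucket=20)    # 20 requests per 1m bucket
spread_15m = p95_spread(durations, per_bucket=300)  # 300 requests per 15m bucket
print(f"p95 spread, 1m buckets:  {spread_1m:.1f}")
print(f"p95 spread, 15m buckets: {spread_15m:.1f}")
```

With the short buckets the p95 swings much more from bucket to bucket, even though the underlying distribution never changes, which is the kind of variation that can surface as anomalies reflecting normal behaviour.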

elasticmachine commented 6 years ago

Original comment by @tveasey:

The best approach we have available to us at the moment for dealing with the effect of event rate variation on the 95th percentile is to use the mean aggregation and pass the summary count field along with the value. This isn't quite right, but we don't have a better option available without having a native percentile function in the backend. Hopefully, appropriate partitioning of the data (into classes of similar requests) should increase homogeneity of requests and mitigate this sort of problem.

Longer bucket lengths should also help with this problem: they are more likely to contain a representative sample of the population. Also, beware that the 95th percentile will be more susceptible to this than some other statistics since the outlying requests only need to constitute 5% of the bucket values to move this statistic. You may want to use longer bucket lengths for this statistic compared to say the median.

Going forward we could also consider creating a new rule type which excludes low count buckets altogether (this will need to be scale invariant, i.e. expressed as a function of mean bucket rate) if this proves to be important for APM. Any feedback here would be useful.
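
To illustrate what "scale invariant" could mean here, a client-side sketch of the idea (not an ML rule type, and not an existing API): buckets are dropped when their count falls below a fraction of the mean bucket count, so the threshold follows the service's traffic level. The function name, the `doc_count` bucket shape, and the 10% default are assumptions.

```python
def drop_low_count_buckets(buckets, min_fraction=0.1):
    """Keep buckets whose doc_count is at least min_fraction of the mean count.

    The threshold is relative to the mean bucket rate, so scaling all counts
    by the same factor keeps the same buckets.
    """
    if not buckets:
        return []
    mean_count = sum(b["doc_count"] for b in buckets) / len(buckets)
    threshold = min_fraction * mean_count
    return [b for b in buckets if b["doc_count"] >= threshold]


buckets = [{"doc_count": c} for c in (100, 120, 2, 80)]
kept = drop_low_count_buckets(buckets)
print([b["doc_count"] for b in kept])
# prints: [100, 120, 80]
```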

elasticmachine commented 6 years ago

Original comment by @stevedodson:

i.e.

"analysis_config": {
  "bucket_span": <bucket_span>,
  "summary_count_field_name": "doc_count",
  "detectors": [
    {
      "detector_description": "high 95th percentile of transaction duration",
      "function": "high_mean",
      "field_name": "transaction.duration.us"
    }
  ]
}
...
"aggregations": {
  "buckets": {
    "date_histogram": {
      "field": EMAIL REDACTED
      "interval": <bucket_span>
    },
    "aggregations": {
      "transaction.duration.us": {
        "percentiles": {
          "field": "transaction.duration.us",
          "percents": [
            95
          ]
        }
      }
    }
  }
}
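
Since `<bucket_span>` has to match in both fragments, one way to keep them in sync is to generate both from the same value. A sketch only: the function name is made up, the time field is a required argument because its value is redacted above, and `@timestamp` in the usage line is an assumed placeholder rather than something confirmed in this thread.

```python
import json


def build_job_fragments(bucket_span, time_field,
                        duration_field="transaction.duration.us"):
    """Return (analysis_config, aggregations) sharing one bucket span."""
    analysis_config = {
        "bucket_span": bucket_span,
        "summary_count_field_name": "doc_count",
        "detectors": [
            {
                "detector_description": "high 95th percentile of transaction duration",
                # high_mean over the pre-aggregated p95 value, as discussed above.
                "function": "high_mean",
                "field_name": duration_field,
            }
        ],
    }
    aggregations = {
        "buckets": {
            "date_histogram": {"field": time_field, "interval": bucket_span},
            "aggregations": {
                duration_field: {
                    "percentiles": {"field": duration_field, "percents": [95]}
                }
            },
        }
    }
    return analysis_config, aggregations


# "@timestamp" is an assumed placeholder for the time field.
analysis_config, aggregations = build_job_fragments("15m", "@timestamp")
print(json.dumps({"analysis_config": analysis_config,
                  "aggregations": aggregations}, indent=2))
```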

makwarth commented 6 years ago

Closing this in favour of https://github.com/elastic/kibana/issues/18569 which summarizes the above discussion.
