elastic / kibana

Your window into the Elastic Stack

[Actionable Observability] [SPIKE] Investigate SLO definition #139213

Closed kdelemme closed 2 years ago

kdelemme commented 2 years ago

Epic: https://github.com/elastic/kibana/issues/137323

RFC: https://docs.google.com/document/d/1-9w1WW9HoOCG7I4WAtTFi1Hfnh7BT11dctLVOQs7iwc/edit?usp=sharing

๐Ÿ“ Summary

We want to define how the SLO definition will be stored as a Kibana Saved Object. This SLO definition will later be used to generate a transformer that aggregates the data.

As part of this epic, we want to focus on two types of SLOs: availability and latency.

🧪 Experimentation

Run Kibana and ES locally, then follow the instructions in this repository to start generating APM data: https://github.com/fkanout/elastic-apm-api-alerts-generator

After a while, you'll notice some data under the o11y-app:


Now you need to create the following index mappings and settings that the rollup index will use.

Index mappings & settings

```
PUT _ilm/policy/slo-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

PUT _component_template/slo-data-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date",
          "format": "date_optional_time||epoch_millis"
        },
        "slo": {
          "properties": {
            "id": { "type": "keyword", "ignore_above": 256 },
            "numerator": { "type": "long" },
            "denominator": { "type": "long" },
            "context": {
              "properties": {
                "labels": {
                  "properties": {
                    "groupId": { "type": "keyword" }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "_meta": { "description": "Mappings for SLO data" }
}

PUT _component_template/slo-data-settings
{
  "template": {
    "settings": {
      "index.lifecycle.name": "slo-data-policy"
    }
  },
  "_meta": { "description": "Settings for ILM" }
}

PUT _index_template/slo-data-template
{
  "index_patterns": ["slo-data-*"],
  "composed_of": [
    "slo-data-mappings",
    "slo-data-settings"
  ],
  "priority": 500,
  "_meta": { "description": "Template for SLO rollup data" }
}
```
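The mappings above imply rollup documents of roughly the following shape. This is only a sketch: the `SloRollupDoc` interface and the `isSloRollupDoc` helper are hypothetical names, not part of Kibana, and the `context` field is left loose because the transforms in this issue write extra keys (such as `transaction.name`) that the component template does not list.

```typescript
// Hypothetical shape of one rollup document, mirroring slo-data-mappings.
interface SloRollupDoc {
  // The mapping accepts date_optional_time strings or epoch milliseconds.
  "@timestamp": string | number;
  slo: {
    id: string;
    numerator: number; // count of "good" events in the bucket
    denominator: number; // count of all events in the bucket
    context?: Record<string, unknown>; // e.g. labels.groupId, service.name
  };
}

// Narrow an unknown value (for example a search hit's _source) to SloRollupDoc
// by checking only the fields that the mapping declares explicitly.
function isSloRollupDoc(value: unknown): value is SloRollupDoc {
  if (typeof value !== "object" || value === null) return false;
  const doc = value as Record<string, unknown>;
  const ts = doc["@timestamp"];
  const slo = doc["slo"] as Record<string, unknown> | null | undefined;
  return (
    (typeof ts === "string" || typeof ts === "number") &&
    typeof slo === "object" &&
    slo !== null &&
    typeof slo["id"] === "string" &&
    typeof slo["numerator"] === "number" &&
    typeof slo["denominator"] === "number"
  );
}
```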

We can now start experimenting with aggregation and creating some transformers for the two SLOs:

Availability SLO

💡 This SLO uses APM metrics

This creates buckets of transaction.name (request endpoint) and service.name, with good defined as the number of requests with an HTTP status code in [2xx, 3xx, 4xx], and total defined as the total number of requests.

Search apm-metrics with aggregation

```
POST metrics-apm*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": { "transaction.root": true } },
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "transactions": {
      "composite": {
        "sources": [
          { "transaction.name": { "terms": { "field": "transaction.name" } } },
          { "service.name": { "terms": { "field": "service.name" } } }
        ]
      },
      "aggs": {
        "good": {
          "filter": {
            "bool": {
              "should": [
                { "match": { "transaction.result": "HTTP 2xx" } },
                { "match": { "transaction.result": "HTTP 3xx" } },
                { "match": { "transaction.result": "HTTP 4xx" } }
              ]
            }
          }
        },
        "total": {
          "value_count": { "field": "transaction.duration.histogram" }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": { "good": "good>_count", "total": "total" },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
```

Transformer

```
PUT _transform/apm-transaction-availability-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": { "source": "emit('uuid-slo-availability')" }
      }
    }
  },
  "frequency": "1m",
  "dest": { "index": "slo-data-default" },
  "settings": { "deduce_mappings": false },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": { "terms": { "field": "transaction.name" } },
      "slo.context.service.name": { "terms": { "field": "service.name" } },
      "slo.id": { "terms": { "field": "slo.id" } },
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1m" }
      }
    },
    "aggregations": {
      "slo.numerator": {
        "filter": {
          "bool": {
            "should": [
              { "match": { "transaction.result": "HTTP 2xx" } },
              { "match": { "transaction.result": "HTTP 3xx" } },
              { "match": { "transaction.result": "HTTP 4xx" } }
            ]
          }
        }
      },
      "slo.denominator": {
        "value_count": { "field": "transaction.duration.histogram" }
      }
    }
  }
}

POST _transform/apm-transaction-availability-example/_start

POST _transform/apm-transaction-availability-example/_stop

DELETE _transform/apm-transaction-availability-example

DELETE slo-data-default

POST slo-data-default/_search
{
  "query": {
    "match": { "slo.id": "uuid-slo-availability" }
  }
}
```
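The good filter in the availability search and transformer above is the same list of match clauses written out by hand. A minimal sketch of generating it from a list of accepted status classes could look like this. The function name `buildGoodResultFilter` and its types are hypothetical. The sketch also sets `minimum_should_match` explicitly, which should be equivalent to the default behaviour of a bool query that has only should clauses.

```typescript
// Illustrative helper (not part of Kibana): builds the "good events" filter
// from the transaction.result classes that count as successful.
type MatchClause = { match: { "transaction.result": string } };
type GoodFilter = {
  bool: { should: MatchClause[]; minimum_should_match: number };
};

function buildGoodResultFilter(statusClasses: string[]): GoodFilter {
  return {
    bool: {
      // One match clause per class, e.g. "2xx" -> "HTTP 2xx".
      should: statusClasses.map((cls) => ({
        match: { "transaction.result": `HTTP ${cls}` },
      })),
      // Made explicit here; the queries above rely on the default.
      minimum_should_match: 1,
    },
  };
}

// The availability SLO in this issue treats 2xx, 3xx and 4xx as good.
const availabilityGood = buildGoodResultFilter(["2xx", "3xx", "4xx"]);
```

The resulting object can be used as the value of the `filter` key in either the `good` or the `slo.numerator` aggregation.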

Latency SLO

💡 This SLO uses APM metrics

This creates buckets of transaction.name (request endpoint) with good defined as the number of requests with a latency < 3000ms (written as 3000000 in the queries, because the duration histogram is in microseconds), and total defined as the total number of requests.

Search apm metrics with aggregation

```
POST metrics-apm*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": { "transaction.root": true } },
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "transactions": {
      "composite": {
        "sources": [
          { "transaction.name": { "terms": { "field": "transaction.name" } } }
        ]
      },
      "aggs": {
        "good": {
          "range": {
            "field": "transaction.duration.histogram",
            "ranges": [ { "to": 3000000 } ]
          }
        },
        "total": {
          "value_count": { "field": "transaction.duration.histogram" }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "good": "good['*-3000000.0']>_count",
              "total": "total"
            },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
```

Transformer

```
PUT _transform/apm-transaction-latency-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": { "source": "emit('uuid-slo-latency')" }
      }
    }
  },
  "frequency": "1m",
  "dest": { "index": "slo-data-default" },
  "settings": { "deduce_mappings": false },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": { "terms": { "field": "transaction.name" } },
      "slo.context.service.name": { "terms": { "field": "service.name" } },
      "slo.id": { "terms": { "field": "slo.id" } },
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1m" }
      }
    },
    "aggregations": {
      "_numerator": {
        "range": {
          "field": "transaction.duration.histogram",
          "ranges": [ { "to": 3000000 } ]
        }
      },
      "slo.numerator": {
        "bucket_script": {
          "buckets_path": { "numerator": "_numerator['*-3000000.0']>_count" },
          "script": "params.numerator"
        }
      },
      "slo.denominator": {
        "value_count": { "field": "transaction.duration.histogram" }
      }
    }
  }
}

POST _transform/apm-transaction-latency-example/_start

POST _transform/apm-transaction-latency-example/_stop

DELETE _transform/apm-transaction-latency-example

DELETE slo-data-default

POST slo-data-default/_search
{
  "query": {
    "match": { "slo.id": "uuid-slo-latency" }
  }
}
```

Latency SLO for "o11y-app" service and "GET /slow" transaction

Transformer

```
PUT _transform/apm-transaction-latency-get-slow-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": { "source": "emit('uuid-slo-latency-get-slow')" }
      }
    },
    "query": {
      "bool": {
        "filter": [
          { "match": { "transaction.root": true } },
          { "match": { "service.name": "o11y-app" } },
          { "match": { "transaction.name": "GET /slow" } }
        ]
      }
    }
  },
  "frequency": "1m",
  "dest": { "index": "slo-data-default" },
  "settings": { "deduce_mappings": false },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": { "terms": { "field": "transaction.name" } },
      "slo.context.service.name": { "terms": { "field": "service.name" } },
      "slo.id": { "terms": { "field": "slo.id" } },
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1m" }
      }
    },
    "aggregations": {
      "_numerator": {
        "range": {
          "field": "transaction.duration.histogram",
          "ranges": [ { "to": 3000000 } ]
        }
      },
      "slo.numerator": {
        "bucket_script": {
          "buckets_path": { "numerator": "_numerator['*-3000000.0']>_count" },
          "script": "params.numerator"
        }
      },
      "slo.denominator": {
        "value_count": { "field": "transaction.duration.histogram" }
      }
    }
  }
}

POST _transform/apm-transaction-latency-get-slow-example/_start

POST _transform/apm-transaction-latency-get-slow-example/_stop

DELETE _transform/apm-transaction-latency-get-slow-example

POST slo-data-default/_search
{
  "query": {
    "match": { "slo.id": "uuid-slo-latency-get-slow" }
  }
}
```
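Since the SLO definition is meant to generate the transformer, here is a rough sketch of how the latency transformer above might be built from a few parameters. The `LatencySloParams` interface and `buildLatencyTransform` function are hypothetical and do not come from the RFC. One deliberate difference from the requests above: the sketch gives the range an explicit `key` ("good"), which the range aggregation allows, so that the buckets_path does not depend on the generated `*-3000000.0` key. It also assumes the SLO id contains only letters, digits and dashes, because the id is interpolated into the runtime field script.

```typescript
// Hypothetical input for a latency SLO; field names are illustrative only.
interface LatencySloParams {
  sloId: string;
  serviceName: string;
  transactionName: string;
  thresholdMs: number;
}

// Builds the body for PUT _transform/<id>, following the example above.
function buildLatencyTransform(p: LatencySloParams) {
  // Guard the value that ends up inside the script source.
  if (!/^[A-Za-z0-9-]+$/.test(p.sloId)) {
    throw new Error(`unsupported SLO id: ${p.sloId}`);
  }
  // The requests above express the threshold in microseconds.
  const thresholdUs = p.thresholdMs * 1000;
  return {
    source: {
      index: "metrics-apm*",
      runtime_mappings: {
        "slo.id": {
          type: "keyword",
          script: { source: `emit('${p.sloId}')` },
        },
      },
      query: {
        bool: {
          filter: [
            { match: { "transaction.root": true } },
            { match: { "service.name": p.serviceName } },
            { match: { "transaction.name": p.transactionName } },
          ],
        },
      },
    },
    frequency: "1m",
    dest: { index: "slo-data-default" },
    settings: { deduce_mappings: false },
    sync: { time: { field: "@timestamp", delay: "60s" } },
    pivot: {
      group_by: {
        "slo.context.transaction.name": { terms: { field: "transaction.name" } },
        "slo.context.service.name": { terms: { field: "service.name" } },
        "slo.id": { terms: { field: "slo.id" } },
        "@timestamp": {
          date_histogram: { field: "@timestamp", calendar_interval: "1m" },
        },
      },
      aggregations: {
        _numerator: {
          range: {
            field: "transaction.duration.histogram",
            // Explicit key instead of the generated "*-3000000.0".
            ranges: [{ key: "good", to: thresholdUs }],
          },
        },
        "slo.numerator": {
          bucket_script: {
            buckets_path: { numerator: "_numerator['good']>_count" },
            script: "params.numerator",
          },
        },
        "slo.denominator": {
          value_count: { field: "transaction.duration.histogram" },
        },
      },
    },
  };
}
```

Calling it with `sloId: "uuid-slo-latency-get-slow"`, `serviceName: "o11y-app"`, `transactionName: "GET /slow"` and `thresholdMs: 3000` produces a body with the same source, group_by and denominator as the request above.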


We can then visualize the SLOs with a Lens visualization (this one aggregates the metrics per hour; in a real-life example we might use 1d, 7d, or 30d instead). We could also visualize the SLO per transaction.name, e.g. latency SLO > GET /slow or availability SLO > GET /flaky.
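To make the per-window aggregation concrete, here is a small sketch of how an SLI value could be derived from the rolled-up buckets for any window (1h, 1d, 7d, 30d). The `RollupBucket` interface and `computeSli` function are hypothetical names. The sketch sums numerators and denominators before dividing, rather than averaging per-minute ratios, so that busy minutes weigh more than quiet ones.

```typescript
// Hypothetical view of one rolled-up bucket (slo.numerator / slo.denominator).
interface RollupBucket {
  numerator: number;
  denominator: number;
}

// SLI over a window = sum(good) / sum(total).
// Returns null when the window contains no events at all.
function computeSli(buckets: RollupBucket[]): number | null {
  let good = 0;
  let total = 0;
  for (const bucket of buckets) {
    good += bucket.numerator;
    total += bucket.denominator;
  }
  return total === 0 ? null : good / total;
}
```

For example, two buckets with 9 of 10 and 90 of 90 good requests give 99 / 100, whereas averaging the two per-bucket ratios would give 0.95.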


โ“ Questions

  1. When an SLO is edited, should we remove the transformer as well as the transformed data from the destination index? If we keep the previously rolled-up data, we won't be able to differentiate it from the newly added data.

elasticmachine commented 2 years ago

Pinging @elastic/actionable-observability (Team: Actionable Observability)

simianhacker commented 2 years ago

Defining and Registering a Saved Object in Kibana

Here is the Kibana Developer Guide for Saved Objects: https://docs.elastic.dev/kibana-dev-docs/key-concepts/saved-objects-intro

Here is a complete tutorial on defining a Saved Object and registering it: https://docs.elastic.dev/kibana-dev-docs/tutorials/saved-objects

Here is an example of a Saved Object type from the Infrastructure Monitoring UI


Here is an example of registering the type with the Saved Objects service


simianhacker commented 2 years ago

Here is where the routes for Observability are defined: https://github.com/elastic/kibana/tree/main/x-pack/plugins/observability/server/routes

simianhacker commented 2 years ago

After our discussion with the transform team, I think we should also use this pipeline to create monthly indices. We will need to modify the index_name_prefix to match the current Kibana space (default).

PUT _ingest/pipeline/slo-monthly-index-default
{
  "description": "Monthly date-time index naming for SLO data",
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "slo-data-default-",
        "date_rounding": "M"
      }
    }
  ]
}

We will also need to add the "pipeline": "slo-monthly-index-default" attribute to the transformer's dest property.
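As a sketch of how the space-specific names could be derived, the helpers below build the pipeline body and the transformer's dest for a given Kibana space id. The function names are hypothetical, and the naming scheme (`slo-monthly-index-<space>`, `slo-data-<space>-`) is only extrapolated from the default-space example in this comment.

```typescript
// Builds the id and body for PUT _ingest/pipeline/<id> for one Kibana space.
function buildMonthlyIndexPipeline(spaceId: string) {
  return {
    id: `slo-monthly-index-${spaceId}`,
    body: {
      description: "Monthly date-time index naming for SLO data",
      processors: [
        {
          date_index_name: {
            field: "@timestamp",
            // Rollup documents are routed to slo-data-<space>-<month> indices.
            index_name_prefix: `slo-data-${spaceId}-`,
            date_rounding: "M",
          },
        },
      ],
    },
  };
}

// Builds the transformer's dest property so that it goes through the pipeline.
function buildTransformDest(spaceId: string) {
  return {
    index: `slo-data-${spaceId}`,
    pipeline: buildMonthlyIndexPipeline(spaceId).id,
  };
}
```

For the default space this yields the pipeline id slo-monthly-index-default and the dest index slo-data-default used elsewhere in this issue.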

kdelemme commented 2 years ago

This spike has been completed and implementation has started.