Closed kdelemme closed 2 years ago
Pinging @elastic/actionable-observability (Team: Actionable Observability)
Here is the Kibana Developer Guide for Saved Objects: https://docs.elastic.dev/kibana-dev-docs/key-concepts/saved-objects-intro
Here is a complete tutorial on defining a Saved Object and registering it: https://docs.elastic.dev/kibana-dev-docs/tutorials/saved-objects
Here is an example of a Saved Object type from the Infrastructure Monitoring UI
Here is an example of registering the type with the Saved Objects service
Here is where the routes for Observability are defined: https://github.com/elastic/kibana/tree/main/x-pack/plugins/observability/server/routes
After our discussion with the transform team, I think we should also use the pipeline below to create monthly indices. We will need to change its index_name_prefix to match the current Kibana space (default in this example).
PUT _ingest/pipeline/slo-monthly-index-default
{
  "description": "Monthly date-time index naming for SLO data",
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "slo-data-default-",
        "date_rounding": "M"
      }
    }
  ]
}
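To check where a document would end up before wiring the pipeline into a transform, the simulate API can be used. This is only a sketch: the sample document and its values are made up for illustration and are not taken from real SLO data.

```
POST _ingest/pipeline/slo-monthly-index-default/_simulate
{
  "docs": [
    {
      "_source": {
        "@timestamp": "2022-08-15T10:00:00.000Z",
        "slo": {
          "id": "uuid-slo-availability",
          "numerator": 98,
          "denominator": 100
        }
      }
    }
  ]
}
```

The `_index` in the response should come back as a date math expression rounded to the month. With the processor's default index_name_format this would likely resolve to an index such as slo-data-default-2022-08-01, which still matches the slo-data-* template pattern.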
We will also need to add the "pipeline": "slo-monthly-index-default" attribute to the transformer's dest property.
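As a rough illustration, the dest section of one of the example transforms would then look like the fragment below. It assumes the pipeline has already been created under the name used above, and the rest of the transform body is unchanged.

```
"dest": {
  "index": "slo-data-default",
  "pipeline": "slo-monthly-index-default"
}
```

Since the date_index_name processor rewrites the target index per document, the index given here would mostly act as a fallback name.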
This spike has been completed and implementation started
Epic: https://github.com/elastic/kibana/issues/137323 RFC: https://docs.google.com/document/d/1-9w1WW9HoOCG7I4WAtTFi1Hfnh7BT11dctLVOQs7iwc/edit?usp=sharing
Summary
We want to define how the SLO definition will be stored as a Kibana Saved Object. This SLO definition will be used later to generate a Transformer to aggregate the data.
As part of this epic, we want to focus on two types of SLOs: availability and latency.
🧪 Experimentation
Run Kibana and ES locally, and then follow the instructions in this repository to start generating APM data: https://github.com/fkanout/elastic-apm-api-alerts-generator
After a while, you should see some data under the o11y-app service.
Now you need to create the following index mappings and settings that the rollup index will use.
Index mappings & settings
```
PUT _ilm/policy/slo-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

PUT _component_template/slo-data-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date",
          "format": "date_optional_time||epoch_millis"
        },
        "slo": {
          "properties": {
            "id": { "type": "keyword", "ignore_above": 256 },
            "numerator": { "type": "long" },
            "denominator": { "type": "long" },
            "context": {
              "properties": {
                "labels": {
                  "properties": {
                    "groupId": { "type": "keyword" }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "_meta": { "description": "Mappings for SLO data" }
}

PUT _component_template/slo-data-settings
{
  "template": {
    "settings": {
      "index.lifecycle.name": "slo-data-policy"
    }
  },
  "_meta": { "description": "Settings for ILM" }
}

PUT _index_template/slo-data-template
{
  "index_patterns": ["slo-data-*"],
  "composed_of": ["slo-data-mappings", "slo-data-settings"],
  "priority": 500,
  "_meta": { "description": "Template for SLO rollup data" }
}
```
We can now start experimenting with aggregation and creating some transformers for the two SLOs:
Availability SLO
💡 This SLO uses APM metrics
This creates buckets per transaction.name (request endpoint) and service.name, with good defined as the number of requests with an HTTP status code in [2xx, 3xx, 4xx], and total defined as the total number of requests.
Search apm-metrics with aggregation
```
POST metrics-apm*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": { "transaction.root": true } },
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "transactions": {
      "composite": {
        "sources": [
          { "transaction.name": { "terms": { "field": "transaction.name" } } },
          { "service.name": { "terms": { "field": "service.name" } } }
        ]
      },
      "aggs": {
        "good": {
          "filter": {
            "bool": {
              "should": [
                { "match": { "transaction.result": "HTTP 2xx" } },
                { "match": { "transaction.result": "HTTP 3xx" } },
                { "match": { "transaction.result": "HTTP 4xx" } }
              ]
            }
          }
        },
        "total": {
          "value_count": { "field": "transaction.duration.histogram" }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": { "good": "good>_count", "total": "total" },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
```
Transformer
```
PUT _transform/apm-transaction-availability-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": { "source": "emit('uuid-slo-availability')" }
      }
    }
  },
  "frequency": "1m",
  "dest": { "index": "slo-data-default" },
  "settings": { "deduce_mappings": false },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": { "terms": { "field": "transaction.name" } },
      "slo.context.service.name": { "terms": { "field": "service.name" } },
      "slo.id": { "terms": { "field": "slo.id" } },
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1m" }
      }
    },
    "aggregations": {
      "slo.numerator": {
        "filter": {
          "bool": {
            "should": [
              { "match": { "transaction.result": "HTTP 2xx" } },
              { "match": { "transaction.result": "HTTP 3xx" } },
              { "match": { "transaction.result": "HTTP 4xx" } }
            ]
          }
        }
      },
      "slo.denominator": {
        "value_count": { "field": "transaction.duration.histogram" }
      }
    }
  }
}

POST _transform/apm-transaction-availability-example/_start

POST _transform/apm-transaction-availability-example/_stop

DELETE _transform/apm-transaction-availability-example

DELETE slo-data-default

POST slo-data-default/_search
{
  "query": { "match": { "slo.id": "uuid-slo-availability" } }
}
```
Latency SLO
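💡 This SLO uses APM metrics
This creates buckets per transaction.name (request endpoint) with good defined as the number of requests with a latency below 3000 ms (written as 3000000 in the queries, assuming transaction.duration.histogram is recorded in microseconds), and total defined as the total number of requests.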
Search apm metrics with aggregation
```
POST metrics-apm*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": { "transaction.root": true } },
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "transactions": {
      "composite": {
        "sources": [
          { "transaction.name": { "terms": { "field": "transaction.name" } } }
        ]
      },
      "aggs": {
        "good": {
          "range": {
            "field": "transaction.duration.histogram",
            "ranges": [ { "to": 3000000 } ]
          }
        },
        "total": {
          "value_count": { "field": "transaction.duration.histogram" }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "good": "good['*-3000000.0']>_count",
              "total": "total"
            },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
```
Transformer
```
PUT _transform/apm-transaction-latency-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": { "source": "emit('uuid-slo-latency')" }
      }
    }
  },
  "frequency": "1m",
  "dest": { "index": "slo-data-default" },
  "settings": { "deduce_mappings": false },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": { "terms": { "field": "transaction.name" } },
      "slo.context.service.name": { "terms": { "field": "service.name" } },
      "slo.id": { "terms": { "field": "slo.id" } },
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1m" }
      }
    },
    "aggregations": {
      "_numerator": {
        "range": {
          "field": "transaction.duration.histogram",
          "ranges": [ { "to": 3000000 } ]
        }
      },
      "slo.numerator": {
        "bucket_script": {
          "buckets_path": { "numerator": "_numerator['*-3000000.0']>_count" },
          "script": "params.numerator"
        }
      },
      "slo.denominator": {
        "value_count": { "field": "transaction.duration.histogram" }
      }
    }
  }
}

POST _transform/apm-transaction-latency-example/_start

POST _transform/apm-transaction-latency-example/_stop

DELETE _transform/apm-transaction-latency-example

DELETE slo-data-default

POST slo-data-default/_search
{
  "query": { "match": { "slo.id": "uuid-slo-latency" } }
}
```
Latency SLO for "o11y-app" service and "GET /slow" transaction
Transformer
```
PUT _transform/apm-transaction-latency-get-slow-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": { "source": "emit('uuid-slo-latency-get-slow')" }
      }
    },
    "query": {
      "bool": {
        "filter": [
          { "match": { "transaction.root": true } },
          { "match": { "service.name": "o11y-app" } },
          { "match": { "transaction.name": "GET /slow" } }
        ]
      }
    }
  },
  "frequency": "1m",
  "dest": { "index": "slo-data-default" },
  "settings": { "deduce_mappings": false },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": { "terms": { "field": "transaction.name" } },
      "slo.context.service.name": { "terms": { "field": "service.name" } },
      "slo.id": { "terms": { "field": "slo.id" } },
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1m" }
      }
    },
    "aggregations": {
      "_numerator": {
        "range": {
          "field": "transaction.duration.histogram",
          "ranges": [ { "to": 3000000 } ]
        }
      },
      "slo.numerator": {
        "bucket_script": {
          "buckets_path": { "numerator": "_numerator['*-3000000.0']>_count" },
          "script": "params.numerator"
        }
      },
      "slo.denominator": {
        "value_count": { "field": "transaction.duration.histogram" }
      }
    }
  }
}

POST _transform/apm-transaction-latency-get-slow-example/_start

POST _transform/apm-transaction-latency-get-slow-example/_stop

DELETE _transform/apm-transaction-latency-get-slow-example

POST slo-data-default/_search
{
  "query": { "match": { "slo.id": "uuid-slo-latency-get-slow" } }
}
```
Visualization
We can then visualize the SLOs with Lens (this visualization aggregates the metrics per hour; in a real-life example we might use 1d, 7d, or 30d instead). We could also visualize the SLO per transaction.name, e.g. latency SLO > GET /slow or availability SLO > GET /flaky.
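The per-hour view described above can also be approximated directly against the rollup index, which may help when checking what Lens shows. The query below is a sketch, not the query Lens generates: the aggregation names (per_hour, good, total, sli) and the 24h window are arbitrary choices, and it assumes the GET /slow transform has been running and writing to slo-data-default.

```
POST slo-data-default/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "slo.id": "uuid-slo-latency-get-slow" } },
        { "range": { "@timestamp": { "gte": "now-24h", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" },
      "aggs": {
        "good": { "sum": { "field": "slo.numerator" } },
        "total": { "sum": { "field": "slo.denominator" } },
        "sli": {
          "bucket_script": {
            "buckets_path": { "good": "good", "total": "total" },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
```

Summing the numerator and denominator before dividing gives a request-weighted ratio per hour, rather than an average of the per-minute ratios. Changing fixed_interval to 1d, or widening the range, would give the longer windows mentioned above.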
Questions