elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Proof-of-concept: background task worker utilization autoscaling metric #152945

Closed kobelb closed 1 year ago

kobelb commented 1 year ago

Summary

To determine whether background task worker utilization is a valid autoscaling metric, we should create a proof-of-concept that exposes the background-task worker utilization metric and see how it behaves during various load scenarios.

Calculating background task worker utilization

During each claiming cycle, the background task worker utilization will be calculated with the following formula: (# of workers already busy + # of claimed tasks) / max workers. Each calculated utilization value should be retained for 15 seconds, so that we can take an average of the background task worker utilizations over that window.
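
As a rough illustration of the calculation described above, here is a minimal TypeScript sketch. It is not Kibana's task manager code: the names `workerUtilization`, `UtilizationWindow`, and `RETENTION_MS` are hypothetical, and returning 0 when no samples are retained is an assumption rather than something the issue specifies.

```typescript
// Hypothetical sketch; none of these names come from Kibana's task manager.

// Retention window from the issue description (15 seconds).
const RETENTION_MS = 15000;

interface UtilizationSample {
  timestampMs: number;
  utilization: number; // 0 means idle, 1 means every worker is occupied
}

// (# of workers already busy + # of claimed tasks) / max workers
function workerUtilization(busyWorkers: number, claimedTasks: number, maxWorkers: number): number {
  if (maxWorkers <= 0) {
    throw new Error('maxWorkers must be positive');
  }
  return (busyWorkers + claimedTasks) / maxWorkers;
}

class UtilizationWindow {
  private samples: UtilizationSample[] = [];

  // Record the utilization computed during one claiming cycle.
  record(timestampMs: number, utilization: number): void {
    this.samples.push({ timestampMs, utilization });
    this.prune(timestampMs);
  }

  // Average of the samples retained within the last RETENTION_MS.
  // Assumption: report 0 when no samples are retained.
  average(nowMs: number): number {
    this.prune(nowMs);
    if (this.samples.length === 0) {
      return 0;
    }
    const sum = this.samples.reduce((acc, s) => acc + s.utilization, 0);
    return sum / this.samples.length;
  }

  // Drop samples older than the retention window.
  private prune(nowMs: number): void {
    this.samples = this.samples.filter((s) => nowMs - s.timestampMs <= RETENTION_MS);
  }
}

// Usage: two claiming cycles three seconds apart.
const utilizationWindow = new UtilizationWindow();
utilizationWindow.record(0, workerUtilization(2, 3, 10)); // 5 of 10 workers occupied
utilizationWindow.record(3000, workerUtilization(10, 0, 10)); // all workers occupied
console.log(utilizationWindow.average(3000));
```

Averaging over a short window smooths out the cycle-to-cycle jumps that a single claiming cycle would otherwise show, which matters if the value is going to drive scaling decisions.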

The background-task worker utilization should be returned from the existing internal/task_manager/_background_task_utilization endpoint.

Collecting the background task worker utilization

For the purpose of the proof of concept, the background task worker utilization can be collected in any number of ways; it's up to the individual doing the proof of concept to choose the path of least resistance.

Load scenarios

  1. Baseline background tasks for a Kibana node; don't create any alerting rules or anything explicit
  2. Create a large number of alerting rules that run every 1 second
  3. Create 10 alerting rules that all run every 1 minute

In addition to seeing how these load scenarios influence the background task worker utilization, we should also see how they affect the CPU and memory utilization of Kibana to rule out using CPU and memory as background task autoscaling metrics.
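
To get a feel for how different the rule-based scenarios are, the sketch below compares the rate at which rules ask to run against a simple upper bound on how many tasks one node can claim per second. Everything here is an assumption for illustration: the function names are hypothetical, the rule count of 100 for the second scenario is the figure used later in this thread, the max workers (10) and poll interval (3 seconds) are assumed values that should be checked against the node's actual task manager configuration, and the bound ignores how long each task takes to run.

```typescript
// Illustrative arithmetic only; names and configuration values are assumptions.

interface LoadScenario {
  name: string;
  ruleCount: number;
  intervalSeconds: number;
}

// How many rule executions per second the schedule asks for.
function requestedRunsPerSecond(scenario: LoadScenario): number {
  return scenario.ruleCount / scenario.intervalSeconds;
}

// Simplified upper bound: at most maxWorkers tasks claimed per polling cycle.
function maxClaimsPerSecond(maxWorkers: number, pollIntervalSeconds: number): number {
  return maxWorkers / pollIntervalSeconds;
}

// Values above 1 mean the schedule asks for more runs than the node can claim.
function demandToCapacityRatio(
  scenario: LoadScenario,
  maxWorkers: number,
  pollIntervalSeconds: number
): number {
  return requestedRunsPerSecond(scenario) / maxClaimsPerSecond(maxWorkers, pollIntervalSeconds);
}

// Assumed configuration; verify against the node under test.
const ASSUMED_MAX_WORKERS = 10;
const ASSUMED_POLL_INTERVAL_SECONDS = 3;

const scenarios: LoadScenario[] = [
  { name: '100 rules every 1 second', ruleCount: 100, intervalSeconds: 1 },
  { name: '10 rules every 1 minute', ruleCount: 10, intervalSeconds: 60 },
];

for (const scenario of scenarios) {
  const ratio = demandToCapacityRatio(
    scenario,
    ASSUMED_MAX_WORKERS,
    ASSUMED_POLL_INTERVAL_SECONDS
  );
  console.log(`${scenario.name}: demand is about ${ratio.toFixed(2)}x the assumed claim capacity`);
}
```

Under these assumptions the first scenario requests roughly 30 times what the node can claim and the second about 0.05 times, so one would expect them to sit near opposite ends of the utilization range. The proof of concept is what checks whether the measured metric actually behaves that way.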

elasticmachine commented 1 year ago

Pinging @elastic/response-ops (Team:ResponseOps)

ymao1 commented 1 year ago

Draft PR to expose background task worker utilization metric: https://github.com/elastic/kibana/pull/153600

Created a dev build of metricbeat to collect utilization metric: https://github.com/elastic/beats/compare/main...ymao1:beats:collect-task-manager-load?expand=1

Created dashboard to show utilization metric, along with OS load and heap utilization:

{"attributes":{"fieldAttrs":"{}","fieldFormatMap":"{}","fields":"[]","name":"monitoring","runtimeFieldMap":"{}","sourceFilters":"[]","timeFieldName":"@timestamp","title":".monitoring-kibana-8-mb","typeMeta":"{}"},"coreMigrationVersion":"8.0.0","created_at":"2023-03-23T12:42:56.518Z","id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","migrationVersion":{"index-pattern":"8.0.0"},"references":[],"type":"index-pattern","updated_at":"2023-03-23T12:42:56.518Z","version":"WzE5NCwxXQ=="}
{"attributes":{"fieldAttrs":"{}","fieldFormatMap":"{}","fields":"[]","name":"metricbeat","runtimeFieldMap":"{}","sourceFilters":"[]","timeFieldName":"@timestamp","title":"metricbeat-*","typeMeta":"{}"},"coreMigrationVersion":"8.0.0","created_at":"2023-03-24T12:52:44.127Z","id":"58e058d1-31c3-41ad-a7bd-62af23ea70a5","migrationVersion":{"index-pattern":"8.0.0"},"references":[],"type":"index-pattern","updated_at":"2023-03-24T12:52:44.127Z","version":"WzIyNjQsMl0="}
{"attributes":{"description":"","kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"},"optionsJSON":"{\"useMargins\":true,\"syncColors\":false,\"syncCursor\":true,\"syncTooltips\":false,\"hidePanelTitles\":false}","panelsJSON":"[{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":15,\"i\":\"bfe33546-5df6-4c23-b8c1-92a09240b98f\"},\"panelIndex\":\"bfe33546-5df6-4c23-b8c1-92a09240b98f\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"id\":\"3e32c5d2-e2da-405c-bd2e-735b3f4d2187\",\"name\":\"indexpattern-datasource-layer-41a27f6a-7b78-4089-a5dc-96a8cdefcaf6\",\"type\":\"index-pattern\"}],\"state\":{\"visualization\":{\"legend\":{\"isVisible\":false,\"position\":\"right\",\"showSingleSeries\":false},\"valueLabels\":\"hide\",\"fittingFunction\":\"Zero\",\"curveType\":\"LINEAR\",\"yLeftExtent\":{\"mode\":\"custom\",\"lowerBound\":0,\"upperBound\":100},\"axisTitlesVisibilitySettings\":{\"x\":true,\"yLeft\":true,\"yRight\":true},\"tickLabelsVisibilitySettings\":{\"x\":true,\"yLeft\":true,\"yRight\":true},\"labelsOrientation\":{\"x\":0,\"yLeft\":0,\"yRight\":0},\"gridlinesVisibilitySettings\":{\"x\":true,\"yLeft\":true,\"yRight\":true},\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"41a27f6a-7b78-4089-a5dc-96a8cdefcaf6\",\"accessors\":[\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"42c2786f-e34d-4824-95f3-5b0054905b9e\",\"yConfig\":[{\"forAccessor\":\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\",\"axisMode\":\"left\"}]}]},\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"41a27f6a-7b78-4089-a5dc-96a8cdefcaf6\":{\"columns\":{\"42c2786f-e34d-4824-95f3-5b0054905b9e\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\":{\"label\":\"Worker Load\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"kibana.task_manager_utilization.load\",\"filter\":{\"query\":\"kibana.task_manager_utilization.load: *\",\"language\":\"kuery\"},\"params\":{\"showArrayValues\":false,\"sortField\":\"@timestamp\"},\"customLabel\":true}},\"columnOrder\":[\"42c2786f-e34d-4824-95f3-5b0054905b9e\",\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\"],\"incompleteColumns\":{},\"sampling\":1}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"Task Manager Utilization\"},{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":24,\"y\":0,\"w\":24,\"h\":15,\"i\":\"48d643e5-9a69-4de9-891c-e7b99a3b195c\"},\"panelIndex\":\"48d643e5-9a69-4de9-891c-e7b99a3b195c\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"id\":\"3e32c5d2-e2da-405c-bd2e-735b3f4d2187\",\"name\":\"indexpattern-datasource-layer-c4e83734-4240-496a-9965-d8b77c57a7ae\",\"type\":\"index-pattern\"}],\"state\":{\"visualization\":{\"title\":\"Empty XY chart\",\"legend\":{\"isVisible\":true,\"position\":\"right\",\"isInside\":false,\"verticalAlignment\":\"bottom\",\"horizontalAlignment\":\"right\",\"legendSize\":\"small\",\"shouldTruncate\":false},\"valueLabels\":\"hide\",\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"c4e83734-4240-496a-9965-d8b77c57a7ae\",\"accessors\":[\"c39c1ce0-7311-440e-a22b-987522e24dc5\",\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"d00f1ef5-69c3-45c3-a3e2-3fd1e98d3194\",\"yConfig\":[{\"forAccessor\":\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\",\"color\":\"#6092c0\"},{\"forAccessor\":\"c39c1ce0-7311-440e-a22b-987522e24dc5\",\"color\":\"#000000\"}]}],\"valuesInLegend\":false,\"axisTitlesVisibilitySettings\":{\"x\":true,\"yLeft\":false,\"yRight\":true}},\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"c4e83734-4240-496a-9965-d8b77c57a7ae\":{\"columns\":{\"d00f1ef5-69c3-45c3-a3e2-3fd1e98d3194\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\":{\"label\":\"Heap Used [bytes]\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"kibana.stats.process.memory.heap.used.bytes\",\"filter\":{\"query\":\"kibana.stats.process.memory.heap.used.bytes: *\",\"language\":\"kuery\"},\"params\":{\"sortField\":\"@timestamp\"},\"customLabel\":true},\"c39c1ce0-7311-440e-a22b-987522e24dc5\":{\"label\":\"Heap Limit [bytes]\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"kibana.stats.process.memory.heap.size_limit.bytes\",\"filter\":{\"query\":\"kibana.stats.process.memory.heap.size_limit.bytes: *\",\"language\":\"kuery\"},\"params\":{\"sortField\":\"@timestamp\"},\"customLabel\":true}},\"columnOrder\":[\"d00f1ef5-69c3-45c3-a3e2-3fd1e98d3194\",\"c39c1ce0-7311-440e-a22b-987522e24dc5\",\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\"],\"sampling\":1,\"incompleteColumns\":{}}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"Heap Usage\"},{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":24,\"y\":15,\"w\":24,\"h\":15,\"i\":\"68ac6c82-adb6-4bc7-ae71-08c40ebaa830\"},\"panelIndex\":\"68ac6c82-adb6-4bc7-ae71-08c40ebaa830\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"id\":\"3e32c5d2-e2da-405c-bd2e-735b3f4d2187\",\"name\":\"indexpattern-datasource-layer-37cc6bd5-51af-4f4f-9012-76f195e61f99\",\"type\":\"index-pattern\"}],\"state\":{\"visualization\":{\"title\":\"Empty XY chart\",\"legend\":{\"isVisible\":true,\"position\":\"right\"},\"valueLabels\":\"hide\",\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"37cc6bd5-51af-4f4f-9012-76f195e61f99\",\"accessors\":[\"89b9c64b-ff34-40c3-87f0-f841e80cc810\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"5c9ce36f-8ddb-470d-b632-7dc863f1108f\",\"yConfig\":[{\"forAccessor\":\"89b9c64b-ff34-40c3-87f0-f841e80cc810\",\"color\":\"#d36086\"}]}]},\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"37cc6bd5-51af-4f4f-9012-76f195e61f99\":{\"columns\":{\"5c9ce36f-8ddb-470d-b632-7dc863f1108f\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"89b9c64b-ff34-40c3-87f0-f841e80cc810\":{\"label\":\"Median of kibana.stats.os.load.1m\",\"dataType\":\"number\",\"operationType\":\"median\",\"sourceField\":\"kibana.stats.os.load.1m\",\"isBucketed\":false,\"scale\":\"ratio\",\"params\":{\"emptyAsNull\":true}}},\"columnOrder\":[\"5c9ce36f-8ddb-470d-b632-7dc863f1108f\",\"89b9c64b-ff34-40c3-87f0-f841e80cc810\"],\"sampling\":1,\"incompleteColumns\":{}}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"OS Load [1m]\"},{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":0,\"y\":15,\"w\":24,\"h\":15,\"i\":\"d39e2f66-e969-4fe8-9d1c-087d135c4e2f\"},\"panelIndex\":\"d39e2f66-e969-4fe8-9d1c-087d135c4e2f\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"type\":\"index-pattern\",\"id\":\"58e058d1-31c3-41ad-a7bd-62af23ea70a5\",\"name\":\"indexpattern-datasource-layer-f93a9af4-b08a-4fdb-b885-220b12311caf\"}],\"state\":{\"visualization\":{\"title\":\"Empty XY chart\",\"legend\":{\"isVisible\":false,\"position\":\"right\",\"showSingleSeries\":false},\"valueLabels\":\"hide\",\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"f93a9af4-b08a-4fdb-b885-220b12311caf\",\"accessors\":[\"048d6226-d1c8-4f15-a355-1db0e5da8365\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"7518bed4-2518-4a18-b043-75c5b0b72d50\",\"yConfig\":[{\"forAccessor\":\"048d6226-d1c8-4f15-a355-1db0e5da8365\",\"color\":\"#e7664c\"}]}],\"fittingFunction\":\"Zero\",\"yLeftExtent\":{\"mode\":\"custom\",\"lowerBound\":0,\"upperBound\":0.1}},\"query\":{\"query\":\"process.args : \\\"/Users/ying/Code/kibana/scripts/kibana\\\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"f93a9af4-b08a-4fdb-b885-220b12311caf\":{\"columns\":{\"7518bed4-2518-4a18-b043-75c5b0b72d50\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"048d6226-d1c8-4f15-a355-1db0e5da8365\":{\"label\":\"system.process.cpu.total.pct\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"system.process.cpu.total.pct\",\"filter\":{\"query\":\"system.process.cpu.total.pct: *\",\"language\":\"kuery\"},\"params\":{\"sortField\":\"@timestamp\"},\"customLabel\":true}},\"columnOrder\":[\"7518bed4-2518-4a18-b043-75c5b0b72d50\",\"048d6226-d1c8-4f15-a355-1db0e5da8365\"],\"sampling\":1,\"incompleteColumns\":{}}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"Process CPU Usage [%]\"}]","timeRestore":false,"title":"Task Manager Utilization","version":1},"coreMigrationVersion":"8.0.0","created_at":"2023-03-24T13:06:55.999Z","id":"33de8ed0-c8db-11ed-9d27-217297bfba30","migrationVersion":{"dashboard":"8.7.0"},"references":[{"id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","name":"bfe33546-5df6-4c23-b8c1-92a09240b98f:indexpattern-datasource-layer-41a27f6a-7b78-4089-a5dc-96a8cdefcaf6","type":"index-pattern"},{"id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","name":"48d643e5-9a69-4de9-891c-e7b99a3b195c:indexpattern-datasource-layer-c4e83734-4240-496a-9965-d8b77c57a7ae","type":"index-pattern"},{"id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","name":"68ac6c82-adb6-4bc7-ae71-08c40ebaa830:indexpattern-datasource-layer-37cc6bd5-51af-4f4f-9012-76f195e61f99","type":"index-pattern"},{"id":"58e058d1-31c3-41ad-a7bd-62af23ea70a5","name":"d39e2f66-e969-4fe8-9d1c-087d135c4e2f:indexpattern-datasource-layer-f93a9af4-b08a-4fdb-b885-220b12311caf","type":"index-pattern"}],"type":"dashboard","updated_at":"2023-03-24T13:06:55.999Z","version":"WzI2MTgsMl0="}
{"excludedObjects":[],"excludedObjectsCount":0,"exportedCount":3,"missingRefCount":0,"missingReferences":[]}

Load Scenarios:

  1. Baseline background tasks for a Kibana node; don't create any alerting rules or anything explicit

     [Image: task_manager_util_no_rules_updated]

  2. 100 alerting rules that run every 1 second (index threshold rule with no grouping)

     [Image: task_manager_util_100_rules_1s_updated]

  3. 10 alerting rules that run every 1 minute (index threshold rule with no grouping)

     [Image: task_manager_util_10_rules_1m_updated]

ymao1 commented 1 year ago

Would it be helpful to run additional scenarios where we have rules that take longer to execute or trigger a bunch of actions?

kobelb commented 1 year ago

> Would it be helpful to run additional scenarios where we have rules that take longer to execute or trigger a bunch of actions?

If it's feasible to do, that'd be great!

ymao1 commented 1 year ago

10 alerting rules that run every 1 minute and trigger 10 actions every run.

[Image: task_manager_util_10_rules_1m_10_actions]

ymao1 commented 1 year ago

Closing as the POC is complete and we've moved on to implementation: