Closed kobelb closed 1 year ago
Pinging @elastic/response-ops (Team:ResponseOps)
Draft PR to expose background task worker utilization metric: https://github.com/elastic/kibana/pull/153600
Created a dev build of metricbeat to collect utilization metric: https://github.com/elastic/beats/compare/main...ymao1:beats:collect-task-manager-load?expand=1
Created dashboard to show utilization metric, along with OS load and heap utilization:
{"attributes":{"fieldAttrs":"{}","fieldFormatMap":"{}","fields":"[]","name":"monitoring","runtimeFieldMap":"{}","sourceFilters":"[]","timeFieldName":"@timestamp","title":".monitoring-kibana-8-mb","typeMeta":"{}"},"coreMigrationVersion":"8.0.0","created_at":"2023-03-23T12:42:56.518Z","id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","migrationVersion":{"index-pattern":"8.0.0"},"references":[],"type":"index-pattern","updated_at":"2023-03-23T12:42:56.518Z","version":"WzE5NCwxXQ=="}
{"attributes":{"fieldAttrs":"{}","fieldFormatMap":"{}","fields":"[]","name":"metricbeat","runtimeFieldMap":"{}","sourceFilters":"[]","timeFieldName":"@timestamp","title":"metricbeat-*","typeMeta":"{}"},"coreMigrationVersion":"8.0.0","created_at":"2023-03-24T12:52:44.127Z","id":"58e058d1-31c3-41ad-a7bd-62af23ea70a5","migrationVersion":{"index-pattern":"8.0.0"},"references":[],"type":"index-pattern","updated_at":"2023-03-24T12:52:44.127Z","version":"WzIyNjQsMl0="}
{"attributes":{"description":"","kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"},"optionsJSON":"{\"useMargins\":true,\"syncColors\":false,\"syncCursor\":true,\"syncTooltips\":false,\"hidePanelTitles\":false}","panelsJSON":"[{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":15,\"i\":\"bfe33546-5df6-4c23-b8c1-92a09240b98f\"},\"panelIndex\":\"bfe33546-5df6-4c23-b8c1-92a09240b98f\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"id\":\"3e32c5d2-e2da-405c-bd2e-735b3f4d2187\",\"name\":\"indexpattern-datasource-layer-41a27f6a-7b78-4089-a5dc-96a8cdefcaf6\",\"type\":\"index-pattern\"}],\"state\":{\"visualization\":{\"legend\":{\"isVisible\":false,\"position\":\"right\",\"showSingleSeries\":false},\"valueLabels\":\"hide\",\"fittingFunction\":\"Zero\",\"curveType\":\"LINEAR\",\"yLeftExtent\":{\"mode\":\"custom\",\"lowerBound\":0,\"upperBound\":100},\"axisTitlesVisibilitySettings\":{\"x\":true,\"yLeft\":true,\"yRight\":true},\"tickLabelsVisibilitySettings\":{\"x\":true,\"yLeft\":true,\"yRight\":true},\"labelsOrientation\":{\"x\":0,\"yLeft\":0,\"yRight\":0},\"gridlinesVisibilitySettings\":{\"x\":true,\"yLeft\":true,\"yRight\":true},\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"41a27f6a-7b78-4089-a5dc-96a8cdefcaf6\",\"accessors\":[\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"42c2786f-e34d-4824-95f3-5b0054905b9e\",\"yConfig\":[{\"forAccessor\":\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\",\"axisMode\":\"left\"}]}]},\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"41a27f6a-7b78-4089-a5dc-96a8cdefcaf6\":{\"columns\":{\"42c2786f-e34d-4824-95f3-5b0054905b9e\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histo
gram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\":{\"label\":\"Worker Load\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"kibana.task_manager_utilization.load\",\"filter\":{\"query\":\"kibana.task_manager_utilization.load: *\",\"language\":\"kuery\"},\"params\":{\"showArrayValues\":false,\"sortField\":\"@timestamp\"},\"customLabel\":true}},\"columnOrder\":[\"42c2786f-e34d-4824-95f3-5b0054905b9e\",\"61ad013a-4dbe-42bd-8d62-fd14fd7d4712\"],\"incompleteColumns\":{},\"sampling\":1}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"Task Manager Utilization\"},{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":24,\"y\":0,\"w\":24,\"h\":15,\"i\":\"48d643e5-9a69-4de9-891c-e7b99a3b195c\"},\"panelIndex\":\"48d643e5-9a69-4de9-891c-e7b99a3b195c\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"id\":\"3e32c5d2-e2da-405c-bd2e-735b3f4d2187\",\"name\":\"indexpattern-datasource-layer-c4e83734-4240-496a-9965-d8b77c57a7ae\",\"type\":\"index-pattern\"}],\"state\":{\"visualization\":{\"title\":\"Empty XY 
chart\",\"legend\":{\"isVisible\":true,\"position\":\"right\",\"isInside\":false,\"verticalAlignment\":\"bottom\",\"horizontalAlignment\":\"right\",\"legendSize\":\"small\",\"shouldTruncate\":false},\"valueLabels\":\"hide\",\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"c4e83734-4240-496a-9965-d8b77c57a7ae\",\"accessors\":[\"c39c1ce0-7311-440e-a22b-987522e24dc5\",\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"d00f1ef5-69c3-45c3-a3e2-3fd1e98d3194\",\"yConfig\":[{\"forAccessor\":\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\",\"color\":\"#6092c0\"},{\"forAccessor\":\"c39c1ce0-7311-440e-a22b-987522e24dc5\",\"color\":\"#000000\"}]}],\"valuesInLegend\":false,\"axisTitlesVisibilitySettings\":{\"x\":true,\"yLeft\":false,\"yRight\":true}},\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"c4e83734-4240-496a-9965-d8b77c57a7ae\":{\"columns\":{\"d00f1ef5-69c3-45c3-a3e2-3fd1e98d3194\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\":{\"label\":\"Heap Used [bytes]\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"kibana.stats.process.memory.heap.used.bytes\",\"filter\":{\"query\":\"kibana.stats.process.memory.heap.used.bytes: *\",\"language\":\"kuery\"},\"params\":{\"sortField\":\"@timestamp\"},\"customLabel\":true},\"c39c1ce0-7311-440e-a22b-987522e24dc5\":{\"label\":\"Heap Limit 
[bytes]\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"kibana.stats.process.memory.heap.size_limit.bytes\",\"filter\":{\"query\":\"kibana.stats.process.memory.heap.size_limit.bytes: *\",\"language\":\"kuery\"},\"params\":{\"sortField\":\"@timestamp\"},\"customLabel\":true}},\"columnOrder\":[\"d00f1ef5-69c3-45c3-a3e2-3fd1e98d3194\",\"c39c1ce0-7311-440e-a22b-987522e24dc5\",\"fccd635b-7198-40f1-b5e7-b6e2d33163aa\"],\"sampling\":1,\"incompleteColumns\":{}}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"Heap Usage\"},{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":24,\"y\":15,\"w\":24,\"h\":15,\"i\":\"68ac6c82-adb6-4bc7-ae71-08c40ebaa830\"},\"panelIndex\":\"68ac6c82-adb6-4bc7-ae71-08c40ebaa830\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"id\":\"3e32c5d2-e2da-405c-bd2e-735b3f4d2187\",\"name\":\"indexpattern-datasource-layer-37cc6bd5-51af-4f4f-9012-76f195e61f99\",\"type\":\"index-pattern\"}],\"state\":{\"visualization\":{\"title\":\"Empty XY 
chart\",\"legend\":{\"isVisible\":true,\"position\":\"right\"},\"valueLabels\":\"hide\",\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"37cc6bd5-51af-4f4f-9012-76f195e61f99\",\"accessors\":[\"89b9c64b-ff34-40c3-87f0-f841e80cc810\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"5c9ce36f-8ddb-470d-b632-7dc863f1108f\",\"yConfig\":[{\"forAccessor\":\"89b9c64b-ff34-40c3-87f0-f841e80cc810\",\"color\":\"#d36086\"}]}]},\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"37cc6bd5-51af-4f4f-9012-76f195e61f99\":{\"columns\":{\"5c9ce36f-8ddb-470d-b632-7dc863f1108f\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"89b9c64b-ff34-40c3-87f0-f841e80cc810\":{\"label\":\"Median of kibana.stats.os.load.1m\",\"dataType\":\"number\",\"operationType\":\"median\",\"sourceField\":\"kibana.stats.os.load.1m\",\"isBucketed\":false,\"scale\":\"ratio\",\"params\":{\"emptyAsNull\":true}}},\"columnOrder\":[\"5c9ce36f-8ddb-470d-b632-7dc863f1108f\",\"89b9c64b-ff34-40c3-87f0-f841e80cc810\"],\"sampling\":1,\"incompleteColumns\":{}}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"OS Load 
[1m]\"},{\"version\":\"8.8.0\",\"type\":\"lens\",\"gridData\":{\"x\":0,\"y\":15,\"w\":24,\"h\":15,\"i\":\"d39e2f66-e969-4fe8-9d1c-087d135c4e2f\"},\"panelIndex\":\"d39e2f66-e969-4fe8-9d1c-087d135c4e2f\",\"embeddableConfig\":{\"attributes\":{\"title\":\"\",\"visualizationType\":\"lnsXY\",\"type\":\"lens\",\"references\":[{\"type\":\"index-pattern\",\"id\":\"58e058d1-31c3-41ad-a7bd-62af23ea70a5\",\"name\":\"indexpattern-datasource-layer-f93a9af4-b08a-4fdb-b885-220b12311caf\"}],\"state\":{\"visualization\":{\"title\":\"Empty XY chart\",\"legend\":{\"isVisible\":false,\"position\":\"right\",\"showSingleSeries\":false},\"valueLabels\":\"hide\",\"preferredSeriesType\":\"line\",\"layers\":[{\"layerId\":\"f93a9af4-b08a-4fdb-b885-220b12311caf\",\"accessors\":[\"048d6226-d1c8-4f15-a355-1db0e5da8365\"],\"position\":\"top\",\"seriesType\":\"line\",\"showGridlines\":false,\"layerType\":\"data\",\"xAccessor\":\"7518bed4-2518-4a18-b043-75c5b0b72d50\",\"yConfig\":[{\"forAccessor\":\"048d6226-d1c8-4f15-a355-1db0e5da8365\",\"color\":\"#e7664c\"}]}],\"fittingFunction\":\"Zero\",\"yLeftExtent\":{\"mode\":\"custom\",\"lowerBound\":0,\"upperBound\":0.1}},\"query\":{\"query\":\"process.args : \\\"/Users/ying/Code/kibana/scripts/kibana\\\"\",\"language\":\"kuery\"},\"filters\":[],\"datasourceStates\":{\"formBased\":{\"layers\":{\"f93a9af4-b08a-4fdb-b885-220b12311caf\":{\"columns\":{\"7518bed4-2518-4a18-b043-75c5b0b72d50\":{\"label\":\"@timestamp\",\"dataType\":\"date\",\"operationType\":\"date_histogram\",\"sourceField\":\"@timestamp\",\"isBucketed\":true,\"scale\":\"interval\",\"params\":{\"interval\":\"auto\",\"includeEmptyRows\":true,\"dropPartials\":false}},\"048d6226-d1c8-4f15-a355-1db0e5da8365\":{\"label\":\"system.process.cpu.total.pct\",\"dataType\":\"number\",\"operationType\":\"last_value\",\"isBucketed\":false,\"scale\":\"ratio\",\"sourceField\":\"system.process.cpu.total.pct\",\"filter\":{\"query\":\"system.process.cpu.total.pct: 
*\",\"language\":\"kuery\"},\"params\":{\"sortField\":\"@timestamp\"},\"customLabel\":true}},\"columnOrder\":[\"7518bed4-2518-4a18-b043-75c5b0b72d50\",\"048d6226-d1c8-4f15-a355-1db0e5da8365\"],\"sampling\":1,\"incompleteColumns\":{}}}},\"textBased\":{\"layers\":{}}},\"internalReferences\":[],\"adHocDataViews\":{}}},\"hidePanelTitles\":false,\"enhancements\":{}},\"title\":\"Process CPU Usage [%]\"}]","timeRestore":false,"title":"Task Manager Utilization","version":1},"coreMigrationVersion":"8.0.0","created_at":"2023-03-24T13:06:55.999Z","id":"33de8ed0-c8db-11ed-9d27-217297bfba30","migrationVersion":{"dashboard":"8.7.0"},"references":[{"id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","name":"bfe33546-5df6-4c23-b8c1-92a09240b98f:indexpattern-datasource-layer-41a27f6a-7b78-4089-a5dc-96a8cdefcaf6","type":"index-pattern"},{"id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","name":"48d643e5-9a69-4de9-891c-e7b99a3b195c:indexpattern-datasource-layer-c4e83734-4240-496a-9965-d8b77c57a7ae","type":"index-pattern"},{"id":"3e32c5d2-e2da-405c-bd2e-735b3f4d2187","name":"68ac6c82-adb6-4bc7-ae71-08c40ebaa830:indexpattern-datasource-layer-37cc6bd5-51af-4f4f-9012-76f195e61f99","type":"index-pattern"},{"id":"58e058d1-31c3-41ad-a7bd-62af23ea70a5","name":"d39e2f66-e969-4fe8-9d1c-087d135c4e2f:indexpattern-datasource-layer-f93a9af4-b08a-4fdb-b885-220b12311caf","type":"index-pattern"}],"type":"dashboard","updated_at":"2023-03-24T13:06:55.999Z","version":"WzI2MTgsMl0="}
{"excludedObjects":[],"excludedObjectsCount":0,"exportedCount":3,"missingRefCount":0,"missingReferences":[]}
Baseline background tasks for a Kibana node: don't create any alerting rules or anything else explicit
100 alerting rules that run every 1 second (Index threshold rule with no grouping)
10 alerting rules that run every 1 minute (index threshold rule with no grouping)
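To get a feel for how far apart the scenarios above are, their nominal demand can be compared with a rough ceiling on how many tasks a node can start per second. This is only a sketch under stated assumptions: the helper names are hypothetical, the model assumes that at most max workers tasks are claimed per claiming cycle, and the 10 workers and 3-second claiming interval are illustrative values rather than measured ones. It also ignores Kibana's built-in tasks and how long each rule takes to run.

```typescript
// Hypothetical helpers for back-of-the-envelope sizing; not part of Kibana.
interface LoadScenario {
  name: string;
  ruleCount: number;
  intervalSeconds: number;
}

// Rule executions requested per second if every rule ran exactly on schedule.
function nominalRunsPerSecond(scenario: LoadScenario): number {
  if (scenario.intervalSeconds <= 0) {
    throw new Error("intervalSeconds must be positive");
  }
  return scenario.ruleCount / scenario.intervalSeconds;
}

// Simplified ceiling (an assumption of this sketch): at most `maxWorkers`
// tasks can be claimed in each claiming cycle.
function maxClaimsPerSecond(maxWorkers: number, claimIntervalSeconds: number): number {
  if (maxWorkers <= 0 || claimIntervalSeconds <= 0) {
    throw new Error("maxWorkers and claimIntervalSeconds must be positive");
  }
  return maxWorkers / claimIntervalSeconds;
}

const scenarios: LoadScenario[] = [
  { name: "baseline (no rules)", ruleCount: 0, intervalSeconds: 60 },
  { name: "100 rules every 1 second", ruleCount: 100, intervalSeconds: 1 },
  { name: "10 rules every 1 minute", ruleCount: 10, intervalSeconds: 60 },
];

// Illustrative values only: 10 workers, claiming every 3 seconds.
const ceiling = maxClaimsPerSecond(10, 3);

for (const scenario of scenarios) {
  const demand = nominalRunsPerSecond(scenario);
  const ratio = demand / ceiling;
  console.log(
    `${scenario.name}: demand ${demand.toFixed(2)}/s, ` +
      `ceiling ${ceiling.toFixed(2)}/s, demand/ceiling ${ratio.toFixed(2)}`
  );
}
```

Under this simplified model the 1-second scenario asks for far more runs than the node can start, so its utilization should sit near the maximum, while the 1-minute scenario should leave most workers idle.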
Would it be helpful to run additional scenarios where we have rules that take longer to execute or trigger a bunch of actions?
If it's feasible to do, that'd be great!
10 alerting rules that run every 1 minute and trigger 10 actions every run.
Closing as POC is complete and we've moved onto implementation:
Summary
To determine whether background task worker utilization is a valid autoscaling metric, we should create a proof-of-concept that exposes the background-task worker utilization metric and see how it behaves during various load scenarios.
Calculating background task worker utilization
During each claiming cycle, the background task worker utilization will be calculated by the following formula:
(# of workers already busy + claimed tasks) / max workers
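As a rough illustration of the formula above and the 15-second averaging described in this section, the calculation could be sketched in TypeScript as follows. The names cycleUtilization and UtilizationWindow, the sample shape, and the use of a plain arithmetic mean are assumptions made for this sketch; it is not the actual Task Manager implementation.

```typescript
// Hypothetical names throughout; a sketch, not Kibana's Task Manager code.
interface UtilizationSample {
  timestampMs: number;
  utilization: number; // fraction of max workers in use
}

// Utilization for one claiming cycle:
// (workers already busy + tasks claimed this cycle) / max workers
function cycleUtilization(busyWorkers: number, claimedTasks: number, maxWorkers: number): number {
  if (maxWorkers <= 0) {
    throw new Error("maxWorkers must be positive");
  }
  return (busyWorkers + claimedTasks) / maxWorkers;
}

// Keeps samples for a fixed retention period and averages what remains.
class UtilizationWindow {
  private samples: UtilizationSample[] = [];
  private readonly retentionMs: number;

  constructor(retentionMs: number = 15000) {
    this.retentionMs = retentionMs;
  }

  record(sample: UtilizationSample): void {
    this.samples.push(sample);
    this.evict(sample.timestampMs);
  }

  // Mean of the samples still inside the retention window; 0 if none remain.
  average(nowMs: number): number {
    this.evict(nowMs);
    if (this.samples.length === 0) {
      return 0;
    }
    const total = this.samples.reduce((sum, s) => sum + s.utilization, 0);
    return total / this.samples.length;
  }

  // Drops samples older than the retention period relative to `nowMs`.
  private evict(nowMs: number): void {
    const cutoff = nowMs - this.retentionMs;
    this.samples = this.samples.filter((s) => s.timestampMs >= cutoff);
  }
}

// Example: three claiming cycles, 3 seconds apart, with 10 max workers.
const tracker = new UtilizationWindow();
tracker.record({ timestampMs: 0, utilization: cycleUtilization(3, 2, 10) });
tracker.record({ timestampMs: 3000, utilization: cycleUtilization(5, 5, 10) });
tracker.record({ timestampMs: 6000, utilization: cycleUtilization(0, 2, 10) });
console.log(tracker.average(6000)); // roughly 0.57: mean of 0.5, 1.0, and 0.2
console.log(tracker.average(16000)); // roughly 0.6: the sample from t=0 has expired
```

Averaging over a short window smooths out single cycles that happen to claim very many or very few tasks, which matters if the value is meant to drive scaling decisions.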
The background task worker utilization should be retained for 15 seconds, so that we can take an average of the background task worker utilizations.
The background task worker utilization should be returned from the existing internal/task_manager/_background_task_utilization endpoint.
Collecting the background task worker utilization
For the purpose of the proof of concept, the background task worker utilization can be collected in any number of ways. The following are some options, but it's up to the individual doing the proof of concept to choose the path of least resistance:
Load scenarios
In addition to seeing how these load scenarios influence the background task worker utilization, we should also see how they affect the CPU and memory utilization of Kibana to rule out using CPU and memory as background task autoscaling metrics.