Summary

👋🏼 howdy, team!

I've noticed across a couple of clusters that Kibana can end up in a degraded status due to `capacity_estimation`, where the real source is high `runtime` > `drift`, usually `drift_by_type` for `alerting:*` (a.k.a. expensive rules).

What I really feel is a bug (though it could be labelled a feature request instead) is that even when `drift` `p50` is backed up by ~3 minutes, usually with `load.p50: 100`, `runtime` still reports `status: OK`. Can we put some logic in there to flip this to `warn`/`error` at some point?
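To make the field paths concrete, here is a minimal sketch of pulling the values in question out of the Task Manager health API (`GET /api/task_manager/_health`); the `KIBANA_URL` constant and the lack of auth are assumptions for a local dev setup, not part of the clusters described below:

```ts
// Sketch: read the fields this issue is about from the Task Manager health API.
// Assumes a local, unauthenticated Kibana; adjust the URL/headers for real clusters.
const KIBANA_URL = 'http://localhost:5601';

async function fetchTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`);
  const health = await res.json();

  const runtime = health.stats?.runtime;
  console.log('overall status         :', health.status);
  console.log('runtime.status         :', runtime?.status); // stays "OK" today
  console.log('runtime drift p50 (ms) :', runtime?.value?.drift?.p50);
  console.log('runtime load p50       :', runtime?.value?.load?.p50);
  console.log('capacity_estimation    :', health.stats?.capacity_estimation?.status);
}

fetchTaskManagerHealth().catch(console.error);
```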
Example
I've dealt with this situation with a couple of users; the most egregious cases have been air-gapped, so I can't share those examples. However, here is a low-to-medium example output in full:
[A]
I wrote an automation to root-cause the problematic plugin, so it reports:
My report automation goes on from there, but pivoting to what's applicable for this GitHub issue, the "Evaluate the Runtime" doc section says:
Theory: Kibana is polling as frequently as it should, but that isn't often enough to keep up with the workload
...
For details on achieving higher throughput by adjusting your scaling strategy, see Scaling guidance.
In our example(s), compared to this doc section, the load is actually `p50: 100` and drift is >1 min. In a recent air-gapped example (not represented below), drift was >3 min:
So overall it makes sense that this drift + load cascades into `capacity_estimation` messages, since that's where the docs point. However, for API response interpretation/usability and for diagnostic automations, it doesn't really make sense that `runtime` never flagged `status: warn` (or something more severe), since the root cause of the problem was inside `runtime` and only cascaded into `capacity_estimation`.
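For diagnostic automations, the workaround today is to ignore `runtime.status` and pivot on `drift_by_type` directly to surface the expensive rule types; a rough sketch of that kind of check (response shape assumed, not my actual tool):

```ts
// Workaround sketch: runtime.status stays OK, so rank drift_by_type ourselves
// to surface the task/rule types driving the drift. Percentile shape is assumed
// from the health API response.
interface Percentiles {
  p50: number;
  p90?: number;
  p95?: number;
  p99?: number;
}

function topDriftOffenders(
  driftByType: Record<string, Percentiles>,
  limit = 5
): Array<{ taskType: string; p50DriftMs: number }> {
  return Object.entries(driftByType)
    .map(([taskType, drift]) => ({ taskType, p50DriftMs: drift.p50 }))
    .sort((a, b) => b.p50DriftMs - a.p50DriftMs)
    .slice(0, limit);
}

// e.g. topDriftOffenders(health.stats.runtime.value.drift_by_type)
//   -> [{ taskType: 'alerting:siem.queryRule', p50DriftMs: 185000 }, ...]
```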
Request
I don't know the right literal values, but some logic like:

IF runtime.drift.p50 > 60000 THEN runtime.status = warn
IF runtime.load.p50 == 100 THEN runtime.status = error
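In TypeScript terms, the requested behaviour would look something like the sketch below; this is not the actual Task Manager status calculator, and the thresholds are placeholders pending real guidance:

```ts
// Sketch of the requested behaviour: derive runtime.status from drift/load
// instead of always returning OK. Thresholds are placeholders, not tuned values.
const DRIFT_P50_WARN_MS = 60_000; // assumption: 1 minute of p50 drift
const LOAD_P50_ERROR = 100;       // assumption: workers fully saturated

type HealthStatus = 'OK' | 'warn' | 'error';

function runtimeStatus(driftP50Ms: number, loadP50: number): HealthStatus {
  if (loadP50 >= LOAD_P50_ERROR) return 'error';
  if (driftP50Ms > DRIFT_P50_WARN_MS) return 'warn';
  return 'OK';
}
```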