elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Throttle the frequency of `Task Manager is unhealthy` warning messages #201261

Open mikecote opened 3 days ago

mikecote commented 3 days ago

Description

Currently, the getHealthStatus function logs a health status message every time it is called while Task Manager is unhealthy. Because this function is invoked frequently, these log messages can flood the console for as long as Task Manager remains unhealthy.

Expected Behaviour

The health status warning messages should be throttled to reduce server log noise while retaining meaningful insights. We should introduce a time-based throttle that logs the warning at most once per interval (somewhere between 1 and 5 minutes) and falls back to debug-level logs whenever the warning is throttled.
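
As a rough illustration of the throttle described above, here is a minimal sketch, not the actual Task Manager implementation. The names `MinimalLogger` and `createThrottledWarn` are hypothetical, the logger interface is reduced to the two methods the sketch needs, and the injectable `now` clock is an assumption added only to make the behaviour easy to test.

```typescript
// Minimal logger shape assumed for this sketch; the real Kibana logger has more methods.
interface MinimalLogger {
  warn(message: string): void;
  debug(message: string): void;
}

// Hypothetical helper (not part of Task Manager): returns a function that logs
// at warn level at most once per `intervalMs`, and at debug level otherwise.
function createThrottledWarn(
  logger: MinimalLogger,
  intervalMs: number,
  now: () => number = Date.now // injectable clock, assumed here for testability
): (message: string) => void {
  let lastWarnAt: number | undefined;

  return (message: string): void => {
    const current = now();
    if (lastWarnAt === undefined || current - lastWarnAt >= intervalMs) {
      // First call, or the interval has elapsed: emit the warning.
      lastWarnAt = current;
      logger.warn(message);
    } else {
      // Throttled: keep the message available at debug level.
      logger.debug(message);
    }
  };
}
```

With something like this, getHealthStatus could call a function created once, for example via `createThrottledWarn(logger, 60 * 1000)`, instead of calling `logger.warn` directly. The interval value is a placeholder within the 1-5 minute range suggested above.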

Steps to Reproduce

Apply the following diff, which will show how often the function gets called and logs whenever the conditions are unhealthy.

diff --git a/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts b/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
index acbf1284b21..d3280b0b6c3 100644
--- a/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
+++ b/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
@@ -242,11 +242,11 @@ function getHealthStatus(
     assumedAverageRecurringRequiredThroughputPerMinutePerKibana,
     capacityPerMinutePerKibana,
   } = params;
-  if (assumedRequiredThroughputPerMinutePerKibana < capacityPerMinutePerKibana) {
-    const reason = `Task Manager is healthy, the assumedRequiredThroughputPerMinutePerKibana (${assumedRequiredThroughputPerMinutePerKibana}) < capacityPerMinutePerKibana (${capacityPerMinutePerKibana})`;
-    logger.debug(reason);
-    return { status: HealthStatus.OK, reason };
-  }
+  // if (assumedRequiredThroughputPerMinutePerKibana < capacityPerMinutePerKibana) {
+  //   const reason = `Task Manager is healthy, the assumedRequiredThroughputPerMinutePerKibana (${assumedRequiredThroughputPerMinutePerKibana}) < capacityPerMinutePerKibana (${capacityPerMinutePerKibana})`;
+  //   logger.debug(reason);
+  //   return { status: HealthStatus.OK, reason };
+  // }

   if (assumedAverageRecurringRequiredThroughputPerMinutePerKibana < capacityPerMinutePerKibana) {
     const reason = `Task Manager is unhealthy, the assumedAverageRecurringRequiredThroughputPerMinutePerKibana (${assumedAverageRecurringRequiredThroughputPerMinutePerKibana}) < capacityPerMinutePerKibana (${capacityPerMinutePerKibana})`;

Definition of Done

elasticmachine commented 3 days ago

Pinging @elastic/response-ops (Team:ResponseOps)