Description
Currently, the getHealthStatus function logs a health status message every time it is called while Task Manager is unhealthy. Because this function is invoked frequently, these messages can flood the console for as long as Task Manager stays unhealthy.
Expected Behaviour
Health status warning messages should be throttled to reduce server log noise while still retaining meaningful insight. We should introduce a time-based throttle that logs the warning at most once per interval (somewhere between 1 and 5 minutes) and falls back to a debug-level log whenever the warning is throttled.
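The throttle described above could be sketched roughly as follows. This is only an illustration, not the actual Task Manager implementation: `MinimalLogger` and `createThrottledWarn` are hypothetical names, the logger shape is reduced to the two methods the sketch needs, and the injectable `now` clock is an assumption added to make the behaviour easy to test.

```typescript
// Assumed minimal logger shape for this sketch; the real Kibana Logger has more methods.
interface MinimalLogger {
  warn(message: string): void;
  debug(message: string): void;
}

// Hypothetical helper: returns a function that logs at warn level at most once
// per `intervalMs` and logs at debug level for every call in between.
function createThrottledWarn(
  logger: MinimalLogger,
  intervalMs: number,
  now: () => number = Date.now
): (message: string) => void {
  // Timestamp of the last warn-level log; undefined until the first call.
  let lastWarnAt: number | undefined;

  return (message: string): void => {
    const current = now();
    if (lastWarnAt === undefined || current - lastWarnAt >= intervalMs) {
      // Interval has elapsed (or this is the first call): emit the warning.
      lastWarnAt = current;
      logger.warn(message);
    } else {
      // Still inside the throttle window: keep the message, but at debug level.
      logger.debug(message);
    }
  };
}
```

Under these assumptions, getHealthStatus would call the returned function instead of calling the warn method directly, with `intervalMs` set somewhere in the proposed 1 to 5 minute range (for example `60_000`). The throttled function has to be created once, outside getHealthStatus, so that the timestamp survives between calls.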
Steps to Reproduce
Apply the following diff, which shows how often the function gets called and logs whenever the conditions are unhealthy.
diff --git a/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts b/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
index acbf1284b21..d3280b0b6c3 100644
--- a/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
+++ b/x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
@@ -242,11 +242,11 @@ function getHealthStatus(
assumedAverageRecurringRequiredThroughputPerMinutePerKibana,
capacityPerMinutePerKibana,
} = params;
- if (assumedRequiredThroughputPerMinutePerKibana < capacityPerMinutePerKibana) {
- const reason = `Task Manager is healthy, the assumedRequiredThroughputPerMinutePerKibana (${assumedRequiredThroughputPerMinutePerKibana}) < capacityPerMinutePerKibana (${capacityPerMinutePerKibana})`;
- logger.debug(reason);
- return { status: HealthStatus.OK, reason };
- }
+ // if (assumedRequiredThroughputPerMinutePerKibana < capacityPerMinutePerKibana) {
+ // const reason = `Task Manager is healthy, the assumedRequiredThroughputPerMinutePerKibana (${assumedRequiredThroughputPerMinutePerKibana}) < capacityPerMinutePerKibana (${capacityPerMinutePerKibana})`;
+ // logger.debug(reason);
+ // return { status: HealthStatus.OK, reason };
+ // }
if (assumedAverageRecurringRequiredThroughputPerMinutePerKibana < capacityPerMinutePerKibana) {
const reason = `Task Manager is unhealthy, the assumedAverageRecurringRequiredThroughputPerMinutePerKibana (${assumedAverageRecurringRequiredThroughputPerMinutePerKibana}) < capacityPerMinutePerKibana (${capacityPerMinutePerKibana})`;
Definition of Done