pmuellr opened 2 years ago
I should mention, it's possible that some of the changes made in PR https://github.com/elastic/kibana/pull/109741 will end up improving the situation - for instance, cutting down on how often the message is generated (because we relaxed the conditions considered problematic). But I think we'll need to wait and see.
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I think this is going to become a meta issue - I realized I needed a place to vent my other concerns regarding the task manager perf logging:
- we dump the health record too often, especially if you just have debug logging enabled. I think we probably should NOT dump it if the error came from capacity concerns, and probably not if the reason hasn't changed since the last time it was dumped.
- the health record we're dumping is not useful for diagnostic purposes, as it's stringified JSON (with escaped `"` chars). I think we could add it to the log via a meta object field instead, which would be easier to access. The downside is that it then won't work with log appenders that don't deal directly with JSON (which I assume is the default, and how the logger is set up at dev time).
- the warning message with the doc link (from the latest PR) doesn't get printed on capacity sizing issues, but I think it should
- the warning message with the doc link has a fairly naive throttle - it only prints on a transition from an "OK" to a "not OK" state. I think it should be time-based - even every minute sounds like too much if I had to look through a day's worth of logs. Maybe an hour?
- we need to plumb the "reason" we're setting the Kibana status through along with the actual level; today we're just logging it, so it won't show up in the status UI or report
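The throttling ideas above (time-based, plus logging immediately when the reason changes) could be sketched roughly like this. This is just a sketch with hypothetical names - `ThrottledWarner` and its parameters are not anything that exists in the Kibana codebase:

```typescript
// Sketch: emit a warning at most once per interval, but log immediately
// whenever the underlying "reason" changes. Names are hypothetical.
type Logger = { warn(msg: string): void };

class ThrottledWarner {
  private lastLoggedAt = -Infinity;
  private lastReason: string | undefined;

  constructor(
    private readonly logger: Logger,
    // assumption: one hour, per the "Maybe an hour?" suggestion above
    private readonly intervalMs: number = 60 * 60 * 1000
  ) {}

  // Returns true if the message was actually logged, false if suppressed.
  warn(reason: string, message: string, now: number = Date.now()): boolean {
    const reasonChanged = reason !== this.lastReason;
    const intervalElapsed = now - this.lastLoggedAt >= this.intervalMs;
    if (!reasonChanged && !intervalElapsed) {
      return false; // same reason, too soon - suppress
    }
    this.lastReason = reason;
    this.lastLoggedAt = now;
    this.logger.warn(message);
    return true;
  }
}
```

One design question with this shape is whether a reason change should reset the interval timer (as it does here) or only bypass it once; resetting keeps the logic simple but can let a flapping reason log more often than the interval suggests.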
I was also thinking yesterday it might be useful to make use of the event log. But I think it would have to be conditional, otherwise it's going to get REAL busy.
What would we add? Of that, I'm not sure. It might be a good place to put the health documents, but I think they would have to be a new `object`/`enabled: false` field, or perhaps `flattened` - maybe a different shape that would be better for KQL queries. Task start/end documents might be good. I'm wondering if we could use this to do better estimation of the number of active Kibanas.
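To make the mapping trade-off above concrete, here's a sketch of what the two options could look like as an Elasticsearch mapping fragment (the `kibana.health` field name is hypothetical): `enabled: false` stores the health document in `_source` without indexing any of its fields, while `flattened` keeps leaf values queryable (e.g. by KQL) without adding one mapping entry per field:

```typescript
// Hypothetical mapping fragment for storing health documents in the
// event log index. Only one of the two `health` variants would be used.
const healthFieldMapping = {
  kibana: {
    properties: {
      // Option 1: stored but completely unindexed - cheapest, not queryable.
      health: { type: 'object', enabled: false },
      // Option 2 (alternative): queryable leaf values, single mapping entry:
      // health: { type: 'flattened' },
    },
  },
};
```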
PR https://github.com/elastic/kibana/pull/109741 removes some of the observability of task manager by "hiding" the "potential performance problem" log warning, turning it into a debug message.
The original issue that spawned the PR is here https://github.com/elastic/kibana/issues/109095 and contains references to other issues where this message has appeared and caused undue alarm.
It would be nice to "promote" this message back to a "warn", but I think we need to feel pretty confident that the message is only logged when we really know we have a problem.
Some specific problems we've seen:
- that message was logged every 10 seconds, a number of times. We need to apply some throttling. It's especially bad when there are multiple Kibana instances, since they typically all come to the same "conclusion" and generate the same log message
- the calculation of some of the values used in the health report is suspect. TM guesses how many active Kibana instances there are based on the UUIDs found in the task manager documents, but these can change when Kibana instances are rebooted. We've seen cases where TM guesses ~2x the actual number of instances, which we suspect happens when a cluster is rebooted and each instance gets a new server UUID. Likewise, we think there are cases where it can undercount. We should see if we can find a more precise way of estimating this.
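One possible direction for a more precise estimate, sketched here with hypothetical types and a made-up freshness window (nothing below is actual task manager code): count only the UUIDs that have claimed a task recently, so stale UUIDs left behind by restarted instances don't inflate the count.

```typescript
// Sketch: estimate active Kibana instances from recent task claims.
interface TaskClaim {
  ownerUuid: string;  // server UUID recorded on the task document
  claimedAt: number;  // epoch ms of the most recent claim by that owner
}

function estimateActiveInstances(
  claims: TaskClaim[],
  now: number,
  // assumption: a UUID with no claims in the last minute is stale
  freshnessWindowMs: number = 60_000
): number {
  const fresh = new Set<string>();
  for (const c of claims) {
    if (now - c.claimedAt <= freshnessWindowMs) {
      fresh.add(c.ownerUuid);
    }
  }
  return fresh.size;
}
```

This would trade one failure mode for another: it fixes the ~2x overcount after a rolling restart, but an instance that is alive yet idle (or starved of capacity) for longer than the window would be undercounted, so the window size matters.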