elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Multi instance Task Manager issues after 8.15 #197145

Open tttttx2 opened 3 days ago

tttttx2 commented 3 days ago

Kibana version: 8.15+

Elasticsearch version: 8.15+

Server OS version: Docker on Debian

Browser version: N/A

Browser OS version: N/A

Original install method (e.g. download page, yum, from source, etc.): Docker

Describe the bug: I have seen multiple clusters throwing Task Manager errors (Degraded, even though they are not overloaded at all, and HealthStatus.Error because of expired hot timestamps). Furthermore, they report only a single observed_kibana_instances on the api/task_manager/_health API endpoint, even though Stack Monitoring shows all Kibana instances.
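
For reference, here is a minimal sketch of how the health endpoint can be checked; the host, credentials, and the exact JSON path to observed_kibana_instances below are assumptions and may differ between versions and deployments:

```python
# Minimal sketch (not an official tool): query the Task Manager health API and
# print the fields mentioned above. Host and credentials are placeholders, and
# the JSON path to observed_kibana_instances may differ between versions.
import base64
import json
import urllib.request

KIBANA_URL = "http://localhost:5601"                    # placeholder Kibana address
auth = base64.b64encode(b"elastic:changeme").decode()   # placeholder credentials

req = urllib.request.Request(
    f"{KIBANA_URL}/api/task_manager/_health",
    headers={"Authorization": f"Basic {auth}"},
)
with urllib.request.urlopen(req) as resp:
    health = json.load(resp)

# Overall Task Manager status (OK / warn / error).
print("status:", health.get("status"))

# In recent versions the observed instance count lives under the capacity
# estimation section; verify the exact path against your own payload.
observed = (
    health.get("stats", {})
    .get("capacity_estimation", {})
    .get("value", {})
    .get("observed", {})
)
print("observed_kibana_instances:", observed.get("observed_kibana_instances"))
```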

I guess something is regularly killing my task managers on multiple instances, and somehow they don't appear to 'talk' to each other.

I had not observed this before 8.15, and a cluster on 8.14 is still working fine (with a pretty much identical config).

Steps to reproduce:

  1. Upgrade past 8.15
  2. Run multiple Kibana instances
  3. Set up Kibana for load balancing; the health API still reports only one observed instance

Expected behavior:

Multiple Kibana instances should be shown in the health API, and the task manager should not be regularly degraded (roughly once every 1-2 minutes). A rough way to measure how often it degrades is sketched under additional context below.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:
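
A rough way to quantify how often the status degrades is to sample the health endpoint periodically, along these lines; the host and interval are placeholders, and authentication is omitted for brevity:

```python
# Rough sketch: sample the Task Manager status every few seconds and log every
# change, to see how often it leaves "OK". Host and interval are placeholders;
# add authentication headers as needed for your deployment.
import json
import time
import urllib.request

KIBANA_URL = "http://localhost:5601"   # placeholder Kibana address

last_status = None
for _ in range(120):                   # ~10 minutes at a 5-second interval
    with urllib.request.urlopen(f"{KIBANA_URL}/api/task_manager/_health") as resp:
        status = json.load(resp).get("status")
    if status != last_status:
        print(time.strftime("%H:%M:%S"), "status is now", status)
        last_status = status
    time.sleep(5)
```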

elasticmachine commented 1 day ago

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote commented 1 day ago

Regarding the observed_kibana_instances issue, this will be fixed when 8.16 goes out via https://github.com/elastic/kibana/issues/192568. The issue goes back to 8.8, and you will mainly observe "Task Manager is unhealthy" errors on clusters that run a good volume of background tasks, where the capacity estimation thinks there is only one instance running when there are actually more.
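
To illustrate why the undercount produces false warnings, here is some rough arithmetic; the per-instance throughput is only the commonly documented default, and all numbers are illustrative rather than the actual estimation logic:

```python
# Illustrative arithmetic only; the real capacity estimation in Task Manager is
# more involved. Assumes the commonly documented default throughput of roughly
# 200 tasks per minute per Kibana instance.
PER_INSTANCE_CAPACITY = 200   # tasks per minute (assumed default)
actual_instances = 3          # instances really running
observed_instances = 1        # what the buggy estimation sees
required_throughput = 350     # tasks per minute the cluster actually schedules

assumed_capacity = observed_instances * PER_INSTANCE_CAPACITY   # 200 tasks/min
real_capacity = actual_instances * PER_INSTANCE_CAPACITY        # 600 tasks/min

# The estimation compares the required throughput against the assumed capacity,
# so it warns even though the real capacity is comfortably sufficient.
print("assumed capacity:", assumed_capacity, "-> warns:", required_throughput > assumed_capacity)
print("real capacity:   ", real_capacity, "-> warns:", required_throughput > real_capacity)
```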

I don't believe the fix mentioned above will solve the HealthStatus.Error because of expired hot timestamps issue. That one mainly occurs when the Task Manager health report contains a last_update or stats.runtime.value.polling.last_successful_poll that is older than, I believe, 4 seconds (by default). Usually this is caused by scenarios such as errors returned by Elasticsearch when the Kibana Task Manager is looking for tasks to run, high Kibana CPU / a blocked event loop, etc.
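
As a rough way to check which of those timestamps has gone stale, something like the following can be run against the health endpoint; the field paths and the 4-second threshold mirror the description above, but treat both as assumptions and verify them against your own payload:

```python
# Rough staleness check for the two timestamps mentioned above. The field paths
# and the 4-second threshold follow this comment; treat both as assumptions and
# verify them against your own health payload. Auth is omitted for brevity.
import json
import urllib.request
from datetime import datetime, timezone

KIBANA_URL = "http://localhost:5601"   # placeholder Kibana address

with urllib.request.urlopen(f"{KIBANA_URL}/api/task_manager/_health") as resp:
    health = json.load(resp)

def age_seconds(iso_ts: str) -> float:
    """Age of an ISO-8601 timestamp relative to now, in seconds."""
    ts = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - ts).total_seconds()

last_update = health.get("last_update")
last_poll = (
    health.get("stats", {})
    .get("runtime", {})
    .get("value", {})
    .get("polling", {})
    .get("last_successful_poll")
)

for name, ts in [("last_update", last_update), ("last_successful_poll", last_poll)]:
    if ts:
        age = age_seconds(ts)
        print(f"{name}: {ts} ({age:.1f}s old, stale: {age > 4})")
```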

tttttx2 commented 1 day ago

Thanks @mikecote for your reply. So if I understand this correctly, it's a mostly cosmetic issue that will be fixed soon, but the task manager is actually working fine in the meantime, and I can just ignore it if I don't need proper capacity estimation / health status reporting.

If the expired warnings are unrelated and therefore something I have to investigate further myself, then this issue can be closed again.

Thanks a lot for the help :)

mikecote commented 23 hours ago

So if I understand this correctly, it's a mostly cosmetic issue that will be fixed soon, but the task manager is actually working fine in the meantime, and I can just ignore it if I don't need proper capacity estimation / health status reporting.

That is correct; the calculations are based on the wrong number of observed Kibana instances, so it is producing false warnings.

If the expired warnings are unrelated and therefore something I have to investigate further myself, then this issue can be closed again.

That is my thinking. Look for "failed to poll for work" logs, or other messages coming from the task manager plugin; that should help find the underlying cause. I'll leave the issue open a bit longer just in case they end up being related in your case.
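
For reference, a sketch of such a log scan, assuming the container output has been captured to a file (e.g. via `docker logs <kibana-container> > kibana.log`); the file name and the plugin name substrings matched here are assumptions:

```python
# Sketch for scanning Kibana log output for Task Manager polling problems.
# Plain substring matching is used so it works for both JSON and plain layouts;
# the file path and the matched substrings are assumptions for illustration.
POLL_ERROR = "failed to poll for work"

with open("kibana.log", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        lowered = line.lower()
        # Keep anything that looks like it comes from the Task Manager plugin,
        # plus the specific polling error mentioned above.
        if POLL_ERROR in lowered or "taskmanager" in lowered or "task_manager" in lowered:
            print(f"{lineno}: {line.rstrip()}")
```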