Open thecoop opened 1 year ago
Pinging @elastic/es-core-infra (Team:Core/Infra)
The wording of the message is a little misleading. These are from GET /_cluster/pending_tasks, which shows the master's task queue, but (apart from the single executing=true one at the top) they're not actually doing anything yet.
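To make the distinction concrete, here is a minimal Python sketch that splits a pending-tasks response into the task currently being executed and the ones still waiting in the queue. The payload and its source strings are invented for illustration. Only the field names (tasks, insert_order, priority, source, executing, time_in_queue_millis) are meant to follow the response shape of GET /_cluster/pending_tasks.

```python
def split_pending_tasks(response):
    """Separate the executing task(s) from the ones merely waiting in the queue."""
    tasks = response.get("tasks", [])
    executing = [t for t in tasks if t.get("executing")]
    waiting = [t for t in tasks if not t.get("executing")]
    return executing, waiting


# Hypothetical payload shaped like a GET /_cluster/pending_tasks response.
# The values (including the source strings) are made up for this example.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "NORMAL", "source": "example-task-a",
         "executing": True, "time_in_queue_millis": 9000},
        {"insert_order": 102, "priority": "NORMAL", "source": "example-task-b",
         "executing": False, "time_in_queue_millis": 8900},
        {"insert_order": 103, "priority": "NORMAL", "source": "example-task-c",
         "executing": False, "time_in_queue_millis": 8800},
    ]
}

executing, waiting = split_pending_tasks(sample)
print(f"{len(executing)} executing, {len(waiting)} waiting")
```

Only the entries in the first list are doing any work. The rest are just queued behind it.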
However the other logs already indicate that cluster state processing is desperately slow in this case:
[2023-01-19T08:10:06,724][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [.deprecation-indexing-ilm-policy]
[2023-01-19T08:10:07,216][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [watch-history-ilm-policy-16]
[2023-01-19T08:10:07,400][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [ilm-history-ilm-policy]
[2023-01-19T08:10:08,629][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [slm-history-ilm-policy]
[2023-01-19T08:10:10,134][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [logs]
[2023-01-19T08:10:10,247][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [synthetics]
[2023-01-19T08:10:11,138][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [metrics]
[2023-01-19T08:10:13,841][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [7-days-default]
[2023-01-19T08:10:16,098][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [90-days-default]
[2023-01-19T08:10:18,806][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [180-days-default]
[2023-01-19T08:10:27,187][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [30-days-default]
[2023-01-19T08:10:34,112][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [365-days-default]
[2023-01-19T08:10:39,117][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [.monitoring-8-ilm-policy]
[2023-01-19T08:10:41,591][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [ml-size-based-ilm-policy]
[2023-01-19T08:10:43,899][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [.fleet-actions-results-ilm-policy]
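The delays between these log lines can be measured rather than eyeballed. The following is a small Python sketch, not part of any Elasticsearch tooling, that parses the bracketed timestamp at the start of each line and reports the gap to the previous one. The three input lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Matches the leading "[2023-01-19T08:10:18,806]" timestamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")


def gaps_in_seconds(log_lines):
    """Return the delay in seconds between consecutive timestamped lines."""
    stamps = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match:  # skip lines that do not start with a timestamp
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


lines = [
    "[2023-01-19T08:10:18,806][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [180-days-default]",
    "[2023-01-19T08:10:27,187][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [30-days-default]",
    "[2023-01-19T08:10:34,112][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [365-days-default]",
]

for gap in gaps_in_seconds(lines):
    print(f"{gap:.3f}s")
# prints:
# 8.381s
# 6.925s
```

Run over the whole excerpt, the largest gap is the 8.381s one shown here, between the 180-days-default and 30-days-default policies.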
These tasks are all queued up pretty much at the same time, so there shouldn't be seconds+ between each one. It's not quite slow enough to trigger more detailed warnings, which typically kick in at 10s, but we can make those warnings more sensitive with settings like these (try e.g. 1s):
gateway.slow_write_logging_threshold
cluster.service.slow_master_task_logging_threshold
cluster.service.slow_task_logging_threshold
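As a rough illustration, the Python snippet below renders the three settings with a 1s threshold as flat key: value lines. Where those lines go is an assumption on my part: they are written in the form elasticsearch.yml accepts, but a test cluster may need them passed through its own node configuration mechanism instead.

```python
# The three setting names listed above.
SLOW_LOGGING_SETTINGS = (
    "gateway.slow_write_logging_threshold",
    "cluster.service.slow_master_task_logging_threshold",
    "cluster.service.slow_task_logging_threshold",
)


def render_settings(threshold="1s"):
    """Render each setting as a flat 'key: value' line with the given threshold."""
    return "\n".join(f"{name}: {threshold}" for name in SLOW_LOGGING_SETTINGS)


print(render_settings("1s"))
```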
Actually it'd be even better if we could run these nodes via async-profiler (in wall-clock mode). That'd tell us much more clearly where all the time is going. Is that feasible?
It might be a little difficult. The pending tasks check right now works via an assertBusy, which simply throws an assertion error if any pending tasks exist. We could add in an async profiler call, but it might add overhead for the common case. Using the async profiler only after pending tasks have existed for, say, 10 seconds would require reworking how retries are done there. Doable, just not trivial.
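For what the reworked retry logic might look like, here is a language-neutral sketch in Python. It is not the Elasticsearch assertBusy implementation: wait_until, escalate_after and on_escalate are hypothetical names, and the escalation hook stands in for whatever would start the profiler or dump diagnostics. The idea is simply that the common fast path pays nothing, and the expensive step runs at most once, only after the condition has stayed false past a grace period.

```python
import time


def wait_until(condition, timeout=30.0, escalate_after=10.0, on_escalate=None,
               interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    If the condition is still false after `escalate_after` seconds, call
    `on_escalate` exactly once (hypothetical hook, e.g. to start a profiler).
    Returns True if the condition was met, False on timeout.
    """
    start = clock()
    escalated = False
    while True:
        if condition():
            return True
        elapsed = clock() - start
        if elapsed >= timeout:
            return False
        if on_escalate is not None and not escalated and elapsed >= escalate_after:
            on_escalate()
            escalated = True
        sleep(interval)


# Usage with a fake clock, so the example runs instantly.
now = [0.0]
escalations = []


def fake_sleep(seconds):
    now[0] += seconds


ok = wait_until(lambda: now[0] >= 12.0,
                on_escalate=lambda: escalations.append(now[0]),
                interval=1.0, clock=lambda: now[0], sleep=fake_sleep)
print(ok, escalations)
```

The clock and sleep parameters are injectable only so the behaviour can be checked without real waiting.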
I mean for the whole run of this node, not just the assertBusy() bit at the end. The node was only up for a few minutes and seems to have been egregiously slow for most of that time. The remaining pending tasks are just leftovers from the normal startup process that should have happened within the first few seconds, but because of how slowly it is running they're still being processed at the end of the test.
When there are still-running tasks at the end of a test, the test fails with some not-very-useful information, e.g.:
It would be better if we could log more information, such as what exactly the task is doing, what it was created from, and how long it has been running.
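A sketch of the kind of message this could produce, written in Python for brevity. It assumes the task data looks like an entry from the task management API (GET /_tasks with detailed output), using the fields node, id, action, description, parent_task_id and running_time_in_nanos. The describe_task helper and the sample values are hypothetical, not an existing test-framework function.

```python
def describe_task(task):
    """Build one log line from a task entry.

    Field names are assumed to follow the task management API response.
    Missing fields fall back to placeholders instead of raising.
    """
    running_seconds = task.get("running_time_in_nanos", 0) / 1e9
    return (
        f"task {task.get('node', '?')}:{task.get('id', '?')} "
        f"action={task.get('action', '?')} "
        f"description={task.get('description', '')!r} "
        f"parent={task.get('parent_task_id', 'none')} "
        f"running_for={running_seconds:.1f}s"
    )


# Hypothetical task entry; the values are invented for this example.
example_task = {
    "node": "nodeA",
    "id": 42,
    "action": "example:admin/action",
    "description": "waiting on master",
    "running_time_in_nanos": 12_500_000_000,
}

print(describe_task(example_task))
# prints: task nodeA:42 action=example:admin/action description='waiting on master' parent=none running_for=12.5s
```

Here the action and description cover what the task is doing, parent covers what it was created from, and running_for covers how long it has been running.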