Log more information when there are pending tasks at the end of a test

thecoop commented 1 year ago

When there are still-running tasks at the end of a test, the test fails with some not-very-useful information, eg:

java.lang.AssertionError: there are still running tasks:
{time_in_queue=41.3s, time_in_queue_millis=41307, source=put-lifecycle-.monitoring-8-ilm-policy, executing=true, priority=NORMAL, insert_order=31}
{time_in_queue=41.3s, time_in_queue_millis=41306, source=put-lifecycle-ml-size-based-ilm-policy, executing=false, priority=NORMAL, insert_order=33}
{time_in_queue=41.3s, time_in_queue_millis=41300, source=put-lifecycle-.fleet-actions-results-ilm-policy, executing=false, priority=NORMAL, insert_order=44}
{time_in_queue=41.2s, time_in_queue_millis=41297, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=50}
{time_in_queue=41.2s, time_in_queue_millis=41299, source=create persistent task, executing=false, priority=NORMAL, insert_order=47}
{time_in_queue=41.2s, time_in_queue_millis=41287, source=cluster_reroute(state recovered), executing=false, priority=NORMAL, insert_order=51}
{time_in_queue=34.4s, time_in_queue_millis=34428, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=57}
{time_in_queue=31s, time_in_queue_millis=31040, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=61}
{time_in_queue=32.2s, time_in_queue_millis=32269, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=60}
{time_in_queue=41.2s, time_in_queue_millis=41298, source=health-metadata-update-master-election, executing=false, priority=NORMAL, insert_order=49}
{time_in_queue=32.4s, time_in_queue_millis=32453, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=59}
{time_in_queue=29.5s, time_in_queue_millis=29535, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=62}
{time_in_queue=25.8s, time_in_queue_millis=25827, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=65}
{time_in_queue=32.9s, time_in_queue_millis=32949, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=58}
{time_in_queue=23.5s, time_in_queue_millis=23571, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=66}
{time_in_queue=12.4s, time_in_queue_millis=12481, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=68}
{time_in_queue=29.4s, time_in_queue_millis=29421, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=63}
{time_in_queue=20.8s, time_in_queue_millis=20862, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=67}
{time_in_queue=551ms, time_in_queue_millis=551, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=70}
{time_in_queue=28.5s, time_in_queue_millis=28530, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=64}
{time_in_queue=5.5s, time_in_queue_millis=5556, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=69}

It would be better if we could log more information, such as what exactly the task is doing, what it was created from, and how long it has been running for

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

DaveCTurner commented 1 year ago

The wording of the message is a little misleading. These are from GET /_cluster/pending_tasks which shows the master's task queue, but (apart from the single executing=true one at the top) they're not actually doing anything yet.

However the other logs already indicate that cluster state processing is desperately slow in this case:

[2023-01-19T08:10:06,724][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [.deprecation-indexing-ilm-policy]  
[2023-01-19T08:10:07,216][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [watch-history-ilm-policy-16]   
[2023-01-19T08:10:07,400][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [ilm-history-ilm-policy]    
[2023-01-19T08:10:08,629][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [slm-history-ilm-policy]    
[2023-01-19T08:10:10,134][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [logs]  
[2023-01-19T08:10:10,247][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [synthetics]    
[2023-01-19T08:10:11,138][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [metrics]   
[2023-01-19T08:10:13,841][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [7-days-default]    
[2023-01-19T08:10:16,098][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [90-days-default]   
[2023-01-19T08:10:18,806][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [180-days-default]  
[2023-01-19T08:10:27,187][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [30-days-default]   
[2023-01-19T08:10:34,112][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [365-days-default]  
[2023-01-19T08:10:39,117][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [.monitoring-8-ilm-policy]  
[2023-01-19T08:10:41,591][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [ml-size-based-ilm-policy]  
[2023-01-19T08:10:43,899][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [LinuxDarwinHostname] adding index lifecycle policy [.fleet-actions-results-ilm-policy]

These tasks are all queued up pretty much at the same time, so there shouldn't be seconds+ between each one. It's not quite slow enough to trigger more detailed warnings, which typically kick in at 10s, but we can make those warnings more sensitive with settings like these (try e.g. 1s):

gateway.slow_write_logging_threshold
cluster.service.slow_master_task_logging_threshold
cluster.service.slow_task_logging_threshold

DaveCTurner commented 1 year ago

Actually it'd be even better if we could run these nodes via async-profiler (in wall-clock mode). That'd tell us much more clearly where all the time is going. Is that feasible?

rjernst commented 1 year ago

It might be a little difficult. The pending tasks check right now works via an assertBusy, which simply throws an assertion if any pending tasks exist. We could add in an async profiler call, but it might add overhead for the common case. Only using the async profiler say after pending tasks have existed for 10 seconds would require reworking how retries are done there. Doable, just not trivial.

DaveCTurner commented 1 year ago

I mean for the whole run of this node, not just the assertBusy() bit at the end. The node was only up for a few minutes and seems to have been egregiously slow for most of that time. The remaining pending tasks are just leftovers from the normal startup process that should have happened within the first few seconds, but because of how slowly it is running they're still being processed at the end of the test.

elastic / elasticsearch

Log more information when there are pending tasks at the end of a test #93222