elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

WatcherService thread stops running if querying .watches index takes more than 30 seconds #115157

Open masseyke opened 1 week ago

masseyke commented 1 week ago

Problem Description

I triggered this one artificially, and it probably doesn't happen much in practice. I had put a breakpoint in TickerScheduleTriggerEngine::start to look at a completely unrelated problem. I paused the code there for more than 30 seconds, and then let it run again. I saw this error in the log, and Watcher was no longer running. It looks like the watcher service died and did not automatically restart.

[2024-10-18T13:04:18,603][ERROR][o.e.x.w.WatcherService   ] [runTask-0] error reloading watcher
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:68)
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:171)
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:165)
        at org.elasticsearch.xpack.watcher.WatcherService.loadWatches(WatcherService.java:337)
        at org.elasticsearch.xpack.watcher.WatcherService.reloadInner(WatcherService.java:268)
        at org.elasticsearch.xpack.watcher.WatcherService.lambda$reload$1(WatcherService.java:224)
        at org.elasticsearch.xpack.watcher.WatcherService$1.doRun(WatcherService.java:450)
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:250)
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:74)
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:66)
        ... 11 more

I'm not sure what the best fix would be. We could restart the thread on failure. Or we could just not use that timeout -- I'm not sure why it's there.
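
For illustration only, the "restart on failure" option could look roughly like the sketch below. It uses plain java.util.concurrent futures rather than the real WatcherService or PlainActionFuture code, and every name in it (ReloadRetrySketch, loadWithRetry, the simulated loader, the 200 ms timeout) is hypothetical. It only shows the general shape: wait with a timeout, and on timeout try the load again instead of giving up for good.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; none of these names exist in Elasticsearch.
final class ReloadRetrySketch {

    /**
     * Runs the loader on the executor and waits up to timeoutMillis per attempt.
     * On timeout the attempt is cancelled and the load is retried, up to
     * maxAttempts times. The last TimeoutException is rethrown if all attempts fail.
     */
    static <T> T loadWithRetry(ExecutorService executor, Callable<T> loader,
                               long timeoutMillis, int maxAttempts)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = executor.submit(loader);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the slow attempt before retrying
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();
        AtomicInteger calls = new AtomicInteger();
        // Simulated loader: the first call is slow (like the paused debugger),
        // later calls return immediately.
        Callable<List<String>> loader = () -> {
            if (calls.incrementAndGet() == 1) {
                Thread.sleep(2_000);
            }
            return List.of("watch-1", "watch-2");
        };
        try {
            List<String> watches = loadWithRetry(executor, loader, 200, 3);
            System.out.println("loaded " + watches.size() + " watches after "
                    + calls.get() + " attempts");
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Whether a bounded retry like this is better than dropping the timeout depends on why the timeout was added in the first place, which is still the open question here.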

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-data-management (Team:Data Management)