I triggered this one artificially, and it probably doesn't happen much in practice. I had put a breakpoint in TickerScheduleTriggerEngine::start to look at a completely unrelated problem, paused the code there for more than 30 seconds, and then let it run again. After that I saw the error below in the log, and Watcher was no longer running. It looks like the watcher service died and did not restart automatically.
[2024-10-18T13:04:18,603][ERROR][o.e.x.w.WatcherService ] [runTask-0] error reloading watcher
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:68)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:171)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:165)
at org.elasticsearch.xpack.watcher.WatcherService.loadWatches(WatcherService.java:337)
at org.elasticsearch.xpack.watcher.WatcherService.reloadInner(WatcherService.java:268)
at org.elasticsearch.xpack.watcher.WatcherService.lambda$reload$1(WatcherService.java:224)
at org.elasticsearch.xpack.watcher.WatcherService$1.doRun(WatcherService.java:450)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:250)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:74)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:66)
... 11 more
I'm not sure what the best fix would be. We could restart the thread on failure, or we could just not use that timeout -- I'm not sure why it's there.
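To make the first option a bit more concrete, here is a minimal, standalone sketch of retrying a timed wait instead of giving up after the first timeout. It is not Watcher code and uses no Elasticsearch classes: `ReloadRetrySketch`, `loadWithRetry`, and the attempt limit are all hypothetical names chosen for illustration, and a plain `java.util.concurrent.Future` stands in for the future that `WatcherService.loadWatches` waits on. Whether a retry like this is safe in the real reload path is an open question.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ReloadRetrySketch {

    // Hypothetical helper (assumption, not part of Watcher): waits for a load
    // with a timeout and starts a fresh attempt when the wait times out,
    // rather than letting the first timeout end the whole task.
    static <T> T loadWithRetry(Supplier<Future<T>> startLoad, long timeoutMillis, int maxAttempts)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = startLoad.get();
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Abandon the stalled attempt and remember why it failed.
                future.cancel(true);
                last = e;
            }
        }
        // Every attempt timed out; surface the last timeout to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // The first attempt never completes (standing in for the thread paused
        // at the breakpoint); later attempts complete immediately.
        Supplier<Future<String>> startLoad = () -> {
            if (attempts.incrementAndGet() == 1) {
                return new CompletableFuture<String>();
            }
            return CompletableFuture.completedFuture("watches loaded");
        };
        String result = loadWithRetry(startLoad, 50, 3);
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```

In this sketch the first wait times out and the second attempt succeeds, so the caller never sees the timeout. The second option from above (dropping the timeout) would correspond to calling `future.get()` with no arguments, which trades the dead service for a wait that can block indefinitely.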