Open henningandersen opened 4 weeks ago
Pinging @elastic/ml-core (Team:ML)
Inference runner does not appear to wait for itself; it only appears to fork to another thread in TrainedModelProvider, which I am assuming is the transport thread?
Same as TrainedModel... in DeploymentManager
Doesn't appear to be an immediate risk, but we can still remove the use of PlainActionFuture
You may be right about it having a similar underlying cause, but `InferenceRunner` waits on a future on one `ml_utility` thread, and that future is then completed on another `ml_utility` thread. I do not have the details, but if you convert the `UnsafePlainActionFuture` to a `PlainActionFuture` and run CI, you should see it (you may need to run it a few times). If you were unfortunate enough to run out of `ml_utility` threads because all of them were blocked on such a future, it would be a deadlock.
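To make the hazard concrete, here is a minimal sketch using only `java.util.concurrent`, not the Elasticsearch classes. `PoolStarvationSketch` and `waiterStarves` are hypothetical names, and `CompletableFuture` stands in for `PlainActionFuture`. A task blocks on a future whose completer is queued on the same fixed-size pool. A timeout is used so the demo terminates; a blocking wait without a timeout would hang instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolStarvationSketch {

    /**
     * Submits a task that blocks on a future which can only be completed by a
     * second task queued on the same pool. Returns true if the waiter timed
     * out, i.e. no thread was left to run the completer.
     */
    static boolean waiterStarves(int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            CompletableFuture<String> result = new CompletableFuture<>();
            Future<Boolean> waiter = pool.submit(() -> {
                // The completer is queued on the same pool as the waiter.
                pool.execute(() -> result.complete("loaded"));
                try {
                    // Blocking wait on a pool thread (timeout only for the demo).
                    result.get(500, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true;
                }
            });
            return waiter.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool of 1 starves: " + waiterStarves(1)); // true
        System.out.println("pool of 2 starves: " + waiterStarves(2)); // false
    }
}
```

With one thread the waiter occupies the whole pool, so the completer never runs. With two threads it works, which is why the problem only shows up when the pool happens to be full.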
I do have some details for `TrainedModelAssignmentNodeService`:
java.lang.AssertionError: cannot complete future on thread Thread[#114,elasticsearch[yamlRestTest-0][ml_utility][T#2],5,main] with waiter on thread Thread[#88,elasticsearch[yamlRestTest-0][ml_utility][T#1],5,main], could deadlock if pool was full
at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:278)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:96)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:157)
at org.elasticsearch.ml@8.15.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentNodeService.loadQueuedModels(TrainedModelAssignmentNodeService.java:212)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:223)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
---
at org.elasticsearch.action.support.PlainActionFuture.assertCompleteAllowed(PlainActionFuture.java:416) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.PlainActionFuture.set(PlainActionFuture.java:137) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.support.PlainActionFuture.onResponse(PlainActionFuture.java:37) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$3(DeploymentManager.java:181) ~[?:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:248) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$startAndLoad$2(DeploymentManager.java:546) ~[?:?]
at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$loadModel$12(DeploymentManager.java:753) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1570) ~[?:?]
Elasticsearch Version
8.15
Installed Plugins
No response
Java Version
bundled
OS Version
Linux
Problem Description
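In #108934 we added assertions to ensure we do not complete a future on the same executor that waits for it, since this can lead to deadlocks. Two ML usages were identified that need to be fixed; going by the discussion on this issue, they are in `InferenceRunner` and `TrainedModelAssignmentNodeService`.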
Ideally those would be converted to asynchronous waits instead.
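As a rough illustration of what an asynchronous wait could look like, here is a sketch in plain `java.util.concurrent` rather than the Elasticsearch listener API. `AsyncWaitSketch`, `loadModelAsync` and `loadThenReport` are hypothetical names, and the model load is a stand-in. The follow-up work is chained as a callback, so no pool thread is parked while it waits for the result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncWaitSketch {

    /** Hypothetical stand-in for a model load that finishes on the given pool. */
    static CompletableFuture<String> loadModelAsync(ExecutorService pool, String modelId) {
        return CompletableFuture.supplyAsync(() -> "loaded " + modelId, pool);
    }

    /**
     * Starts the load from a pool thread and chains the follow-up work as a
     * callback instead of blocking that thread on get().
     */
    static CompletableFuture<String> loadThenReport(ExecutorService pool, String modelId) {
        return CompletableFuture
            .supplyAsync(() -> modelId, pool)
            .thenCompose(id -> loadModelAsync(pool, id))
            .thenApply(status -> status + " without blocking");
    }

    public static void main(String[] args) throws Exception {
        // A single thread is enough because no task ever blocks on another.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            // Only the caller outside the pool blocks here, and with a timeout.
            String outcome = loadThenReport(pool, "model-a").get(5, TimeUnit.SECONDS);
            System.out.println(outcome); // loaded model-a without blocking
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In the Elasticsearch code base the equivalent change would presumably pass an `ActionListener` through instead of calling `actionGet()` on a future, but the exact shape depends on the callers.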
Steps to Reproduce
NA
Logs (if relevant)
No response