elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Machine learning: avoid thread pool deadlocks #109134

Open henningandersen opened 4 weeks ago

henningandersen commented 4 weeks ago

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

Linux

Problem Description

In #108934 we added assertions to ensure we do not complete a future on the same executor that waits for it, since this can lead to deadlocks. Two ML usages were identified that need to be fixed:

Ideally those would be converted to asynchronous waits instead.
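As an illustration of what an asynchronous wait buys here, the following is a minimal sketch using only `java.util.concurrent`, not the Elasticsearch `PlainActionFuture` or `ActionListener` types. The class `AsyncWaitSketch`, the `loadModel` helper, and the model id are all hypothetical names chosen for the example. The blocking variant parks the calling thread until the result arrives; the asynchronous variant registers a continuation and returns, so it still makes progress when called from the only thread of the pool that has to produce the result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class AsyncWaitSketch {

    /** Hypothetical stand-in for a slow step such as loading a model; runs on the given pool. */
    static CompletableFuture<String> loadModel(String modelId, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "loaded:" + modelId, pool);
    }

    /** Blocking style: the calling thread is parked until the load completes. */
    static String startBlocking(String modelId, ExecutorService pool) throws Exception {
        return loadModel(modelId, pool).get() + ":started";
    }

    /** Asynchronous style: the continuation is registered and the calling thread returns at once. */
    static CompletableFuture<String> startAsync(String modelId, ExecutorService pool) {
        return loadModel(modelId, pool).thenApply(loaded -> loaded + ":started");
    }

    /**
     * Calls startAsync from the only thread of a single-thread pool.
     * Calling startBlocking from that same thread would never return,
     * because the load task would sit in the queue behind its own waiter.
     */
    static String runOnSingleThreadPool(String modelId) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            return CompletableFuture
                .supplyAsync(() -> startAsync(modelId, pool), pool) // runs on the pool's only thread
                .thenCompose(f -> f)                                // flatten the nested future
                .get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnSingleThreadPool("my-model")); // prints "loaded:my-model:started"
    }
}
```

The same idea would apply to listener-based code: pass the follow-up work along as a callback instead of calling a blocking get on a pool thread.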

Steps to Reproduce

NA

Logs (if relevant)

No response

### Tasks
- [x] Convert InferenceRunner to Async
- [ ] Convert TrainedModelBlahBlah to Async

elasticsearchmachine commented 4 weeks ago

Pinging @elastic/ml-core (Team:ML)

prwhelan commented 3 weeks ago

InferenceRunner does not appear to wait for itself; it only appears to fork to another thread in TrainedModelProvider, which I am assuming is the transport thread?

The same applies to TrainedModel... in DeploymentManager.

This doesn't appear to be an immediate risk, but we can still remove the use of PlainActionFuture.

henningandersen commented 3 weeks ago

You may be right about it having a similar underlying cause, but InferenceRunner waits on a future on an ml_utility thread, and that future is then completed on another ml_utility thread. I do not have the details, but if you convert the UnsafePlainActionFuture to a PlainActionFuture and run CI, you should see it (you may need to run it a few times). If you were unfortunate enough to have every ml_utility thread blocked on such a future, no thread would be left to complete any of them, and that would be a deadlock.
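The exhaustion scenario can be reproduced outside Elasticsearch with a small fixed pool. This is only a sketch under stated assumptions: `PoolExhaustionSketch` and `finishedWaiters` are hypothetical names, a plain `Executors.newFixedThreadPool` stands in for the ml_utility pool, and a latch is added purely to make the interleaving deterministic. Each waiter hands a follow-up job to the same pool and then blocks until that job completes its future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PoolExhaustionSketch {

    /** Returns how many waiters finished within one second on a pool of the given size. */
    static int finishedWaiters(int poolSize, int waiters) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Waiters proceed only once as many of them as can run at once hold a thread.
        CountDownLatch occupied = new CountDownLatch(Math.min(poolSize, waiters));
        List<CompletableFuture<Void>> results = new ArrayList<>();
        try {
            for (int i = 0; i < waiters; i++) {
                CompletableFuture<Void> result = new CompletableFuture<>();
                results.add(result);
                pool.execute(() -> {
                    try {
                        occupied.countDown();
                        occupied.await();
                        CompletableFuture<Void> inner = new CompletableFuture<>();
                        pool.execute(() -> inner.complete(null)); // completer needs a free pool thread
                        inner.get();                              // waiter blocks its own pool thread
                        result.complete(null);
                    } catch (Exception e) {
                        result.completeExceptionally(e);
                    }
                });
            }
            try {
                CompletableFuture.allOf(results.toArray(new CompletableFuture[0])).get(1, TimeUnit.SECONDS);
            } catch (Exception e) {
                // A timeout here means at least one waiter is still blocked.
            }
            return (int) results.stream().filter(f -> f.isDone() && !f.isCompletedExceptionally()).count();
        } finally {
            pool.shutdownNow(); // interrupts blocked waiters so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println(finishedWaiters(2, 1)); // prints 1: a spare thread runs the completer
        System.out.println(finishedWaiters(2, 2)); // prints 0: every thread is a blocked waiter
    }
}
```

With one waiter on two threads, the spare thread completes the future. With two waiters on two threads, both completers sit in the queue forever, which is the situation the assertion is meant to flag before it happens in production.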

I do have some details for TrainedModelAssignmentNodeService:

java.lang.AssertionError: cannot complete future on thread Thread[#114,elasticsearch[yamlRestTest-0][ml_utility][T#2],5,main] with waiter on thread Thread[#88,elasticsearch[yamlRestTest-0][ml_utility][T#1],5,main], could deadlock if pool was full
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:278)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:96)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:157)
        at org.elasticsearch.ml@8.15.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentNodeService.loadQueuedModels(TrainedModelAssignmentNodeService.java:212)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:223)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1570)
---
        at org.elasticsearch.action.support.PlainActionFuture.assertCompleteAllowed(PlainActionFuture.java:416) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.support.PlainActionFuture.set(PlainActionFuture.java:137) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.support.PlainActionFuture.onResponse(PlainActionFuture.java:37) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$3(DeploymentManager.java:181) ~[?:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:248) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$startAndLoad$2(DeploymentManager.java:546) ~[?:?]
        at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$loadModel$12(DeploymentManager.java:753) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1570) ~[?:?]
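
For reading assertion messages like the one above, the pool can be picked out of the thread name. The sketch below is not the check that Elasticsearch performs; it is a hypothetical helper (`ThreadPoolNameSketch`, `poolOf`, `samePool`) that assumes thread names follow the `elasticsearch[node][pool][T#n]` shape seen in this trace, and it reports whether the waiting and completing threads belong to the same pool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadPoolNameSketch {

    // Assumed shape, taken from the trace: elasticsearch[yamlRestTest-0][ml_utility][T#2]
    private static final Pattern NAME =
        Pattern.compile("^elasticsearch\\[([^\\]]+)\\]\\[([^\\]]+)\\]\\[T#(\\d+)\\]$");

    /** Returns the pool segment of the thread name, or null if the name has another shape. */
    static String poolOf(String threadName) {
        Matcher m = NAME.matcher(threadName);
        return m.matches() ? m.group(2) : null;
    }

    /** True when both names parse and refer to the same pool. */
    static boolean samePool(String waiterThread, String completerThread) {
        String waiterPool = poolOf(waiterThread);
        return waiterPool != null && waiterPool.equals(poolOf(completerThread));
    }

    public static void main(String[] args) {
        String completer = "elasticsearch[yamlRestTest-0][ml_utility][T#2]";
        String waiter = "elasticsearch[yamlRestTest-0][ml_utility][T#1]";
        System.out.println(poolOf(completer));           // prints "ml_utility"
        System.out.println(samePool(waiter, completer)); // prints "true"
    }
}
```

In the trace above both threads are ml_utility threads (T#1 waiting in loadQueuedModels, T#2 completing from DeploymentManager), which is why the assertion fires.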