bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.98k stars 4.03k forks source link

deadlock in worker resource management #22578

Open benjaminp opened 3 months ago

benjaminp commented 3 months ago

I was running a build with 463f80979127b39bb41c9afde5b5863914a7609d, and it hung. The stacktraces reveal what looks like a hang checking out a worker:

"skyframe-evaluator-execution-6" #8044 [81650] daemon prio=5 os_prio=0 cpu=54.16ms elapsed=273.90s tid=0x00007fa40800e470 nid=81650 waiting for monitor entry  [0x00007fa4600fc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.google.devtools.build.lib.actions.ResourceManager.release(ResourceManager.java:537)
        - waiting to lock <0x00000002904115a0> (a com.google.devtools.build.lib.actions.ResourceManager)
        at com.google.devtools.build.lib.actions.ResourceManager.releaseResources(ResourceManager.java:485)
        at com.google.devtools.build.lib.actions.ResourceManager$ResourceHandle.close(ResourceManager.java:114)
        at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:114)
        at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:471)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
        at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:251)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1169)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:567)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:900)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:190)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:468)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:426)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(java.base@21.0.3/ForkJoinTask.java:1423)
        at java.util.concurrent.ForkJoinTask.doExec(java.base@21.0.3/ForkJoinTask.java:387)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@21.0.3/ForkJoinPool.java:1312)
        at java.util.concurrent.ForkJoinPool.scan(java.base@21.0.3/ForkJoinPool.java:1843)
        at java.util.concurrent.ForkJoinPool.runWorker(java.base@21.0.3/ForkJoinPool.java:1808)
        at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21.0.3/ForkJoinWorkerThread.java:188)

"skyframe-evaluator-execution-3" #8041 [81644] daemon prio=5 os_prio=0 cpu=40.43ms elapsed=273.91s tid=0x00007fa380025360 nid=81644 waiting on condition  [0x00007fa4626fc000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@21.0.3/Native Method)
        - parking to wait for  <0x00000002a20010d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(java.base@21.0.3/LockSupport.java:371)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@21.0.3/AbstractQueuedSynchronizer.java:519)
        at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@21.0.3/ForkJoinPool.java:3740)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21.0.3/ForkJoinPool.java:3723)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21.0.3/AbstractQueuedSynchronizer.java:1707)
        at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:594)
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:355)
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:270)
        at com.google.devtools.build.lib.worker.SimpleWorkerPool.borrowObject(SimpleWorkerPool.java:87)
        at com.google.devtools.build.lib.worker.WorkerPoolImplLegacy.borrowObject(WorkerPoolImplLegacy.java:236)
        at com.google.devtools.build.lib.actions.ResourceManager.incrementResources(ResourceManager.java:442)
        - locked <0x00000002904115a0> (a com.google.devtools.build.lib.actions.ResourceManager)
        at com.google.devtools.build.lib.actions.ResourceManager.acquire(ResourceManager.java:508)
        - locked <0x00000002904115a0> (a com.google.devtools.build.lib.actions.ResourceManager)
        at com.google.devtools.build.lib.actions.ResourceManager.acquireResources(ResourceManager.java:385)
        at com.google.devtools.build.lib.worker.WorkerSpawnRunner.execInWorker(WorkerSpawnRunner.java:413)
        at com.google.devtools.build.lib.worker.WorkerSpawnRunner.exec(WorkerSpawnRunner.java:210)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
        at com.google.devtools.build.lib.rules.java.JavaCompileAction.execute(JavaCompileAction.java:413)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1169)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:567)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:900)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:190)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:468)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:426)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(java.base@21.0.3/ForkJoinTask.java:1423)
        at java.util.concurrent.ForkJoinTask.doExec(java.base@21.0.3/ForkJoinTask.java:387)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@21.0.3/ForkJoinPool.java:1312)
        at java.util.concurrent.ForkJoinPool.scan(java.base@21.0.3/ForkJoinPool.java:1843)
        at java.util.concurrent.ForkJoinPool.runWorker(java.base@21.0.3/ForkJoinPool.java:1808)
        at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21.0.3/ForkJoinWorkerThread.java:188)
meisterT commented 3 months ago

cc @wilwell @zhengwei143

wilwell commented 3 months ago

It looks like that one thread is waiting for the idle worker in borrowObject and meanwhile another can't return the worker in synchronized release .

By code logic it should not happen, because we should check that we have enough idle workers before grabbing the new one.

I saw that recenlty @zhengwei143 rewrote this code with new method hasAvailableQuota in cl/627316722 (commit 5742e69c6f8d5705c645b94ca8fe745370ef6c92). By the first view it doesn't seem incorrect with race conditions, but the code in ResourceManager and Workers is tricky. Could it be that this change caused the deadlock?

@benjaminp how often does it reproduce?

wilwell commented 3 months ago

This deadlock could be resolved with async return of the worker to the pool, but probably there is an error in the code logic, which triggers this deadlock

Probably we need to add the assert if we are trying to borrow new worker, but there is no free one.

benjaminp commented 3 months ago

One thing I noticed is that the build that first the this deadlock printed the UI message about starting a new Javac multiplex worker. So, I'm wondering if the problem is all in the worker creation code being stuck when a worker is (re)created.

wilwell commented 3 months ago

@benjaminp could you please attach info log if you catch this behaviour again?

I suspect that functionality could have bugs in with multiplex workers