nitins17 opened this issue 9 months ago
cc: @vam-google, @learning-to-play, @meteorcloudy
Can you capture the stack traces with `jstack` when Bazel gets stuck?
> Can you capture the stack traces with `jstack` when Bazel gets stuck?

Is there a way to pass this to Bazel so that it prints the stack trace after it gets stuck? We don't have interactive access to our CI VMs, so it would be tricky to capture the stack trace if I have to run `jstack` after Bazel gets stuck.
Is something like

```bash
BAZEL_PID=$(bazel info server_pid); (while /bin/true; do sleep 60; jstack $BAZEL_PID; done) & <everything else>
```

an option? It's not nice, but it's simple.
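As a rough sketch of how that loop could be adapted for a non-interactive CI job (the dump directory, the 60-second interval, and `//some:target` below are placeholders, not something from this thread), the dumps can be written to timestamped files that the CI then archives as artifacts:

```bash
#!/usr/bin/env bash
# Sketch: periodically dump the Bazel server's Java stacks so a hung build can
# be diagnosed after the fact. DUMP_DIR, the 60s interval, and //some:target
# are placeholders.
set -euo pipefail

DUMP_DIR="${DUMP_DIR:-/tmp/bazel-jstack-dumps}"
mkdir -p "$DUMP_DIR"

BAZEL_PID=$(bazel info server_pid)

# Background loop: one timestamped dump per minute; ignore jstack failures
# (e.g. after the server exits) so the loop keeps running.
(
  while true; do
    sleep 60
    jstack "$BAZEL_PID" > "$DUMP_DIR/jstack-$(date +%Y%m%d-%H%M%S).txt" || true
  done
) &
DUMPER_PID=$!
trap 'kill "$DUMPER_PID"' EXIT

# Replace with the real build invocation.
bazel build //some:target
```

The CI job can then upload `$DUMP_DIR` as a build artifact regardless of whether the build finished.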
https://blog.bazel.build/2023/10/06/bwob-in-bazel-7.html

Have you also tried `--remote_download_toplevel`?
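For reference, the flag just goes on the build command line, for example (the target label is a placeholder):

```bash
# Build-without-the-bytes mode that only downloads the outputs of top-level
# targets; //some:target is a placeholder label.
bazel build --remote_download_toplevel //some:target
```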
> Is something like `BAZEL_PID=$(bazel info server_pid); (while /bin/true; do sleep 60; jstack $BAZEL_PID; done) & <everything else>` an option? It's not nice, but it's simple.

Thanks, let me give it a try.
> Have you also tried `--remote_download_toplevel`?

Yep, unfortunately, the build still gets stuck.
> Is something like `BAZEL_PID=$(bazel info server_pid); (while /bin/true; do sleep 60; jstack $BAZEL_PID; done) & <everything else>` an option? It's not nice, but it's simple.

It worked, thanks! Here's the stack trace when Bazel gets stuck.
It's not clear from the stack trace what's going wrong. All `remote-executor-*` threads are waiting for tasks, as are all `skyframe-evaluator-*` threads except one:
"skyframe-evaluator-110": running
at [java.util.concurrent.ConcurrentHashMap.get(java.base@11.0.6/Unknown Source)](https://spotify.github.io/threaddump-analyzer/#java.util.concurrent.ConcurrentHashMap.get(java.base%4011.0.6%2FUnknown%20Source))
at com.google.devtools.build.skyframe.InMemoryGraphImpl$$Lambda$261/0x00000008002ca840.get(Unknown Source)
at com.google.devtools.build.skyframe.SkyFunctionEnvironment.batchPrefetch(SkyFunctionEnvironment.java:222)
at com.google.devtools.build.skyframe.SkyFunctionEnvironment.<init>(SkyFunctionEnvironment.java:203)
at com.google.devtools.build.skyframe.SkyFunctionEnvironment.create(SkyFunctionEnvironment.java:154)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:535)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:365)
at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinPool.scan(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.6/Unknown Source)
Looking at the code, `InMemoryGraphImpl$$Lambda$261/0x00000008002ca840.get(Unknown Source)` shouldn't block. It might just be a red herring.
Can you capture more stack traces so that we know whether this particular thread can make progress?
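For example, since JDK 11 thread dumps include a `cpu=` counter in each thread header, comparing that counter across the periodic dumps would show whether the thread keeps accumulating CPU time (i.e. is busy rather than blocked). A sketch, reusing the placeholder dump directory from the earlier loop:

```bash
# Print the cumulative CPU time of the suspicious thread in each saved dump;
# a steadily growing value means the thread is making (slow) progress.
# /tmp/bazel-jstack-dumps is the placeholder directory from the earlier sketch.
for f in /tmp/bazel-jstack-dumps/jstack-*.txt; do
  printf '%s ' "$f"
  grep '"skyframe-evaluator-110"' "$f" | grep -o 'cpu=[0-9.]*ms' || echo 'thread not found'
done
```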
Sure, saved some more stack traces here. The entire log was too big to save, so I had to trim it to the last ~25%.
Thanks! With more stack traces, it's clear that Bazel is indeed stuck at:
"skyframe-evaluator-110" #713 daemon prio=5 os_prio=31 cpu=12505629.49ms elapsed=40986.47s tid=0x00007f80eb1bd800 nid=0x3bf07 runnable [0x0000700020168000]
java.lang.Thread.State: RUNNABLE
at java.util.concurrent.ConcurrentHashMap.get(java.base@11.0.6/Unknown Source)
at com.google.devtools.build.skyframe.InMemoryGraphImpl$$Lambda$261/0x00000008002ca840.get(Unknown Source)
at com.google.devtools.build.skyframe.SkyFunctionEnvironment.batchPrefetch(SkyFunctionEnvironment.java:222)
at com.google.devtools.build.skyframe.SkyFunctionEnvironment.<init>(SkyFunctionEnvironment.java:203)
at com.google.devtools.build.skyframe.SkyFunctionEnvironment.create(SkyFunctionEnvironment.java:154)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:535)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:365)
at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinPool.scan(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.6/Unknown Source)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.6/Unknown Source)
I don't think we expect Bazel to be literally stuck due to a thread having that stack trace, since it's a CPU-bound operation. Maybe what's going on here is that there's a node with a ton (millions?) of direct deps, and `#batchPrefetch` is iterating through them sequentially. We still wouldn't expect this to take ~seconds though... Maybe what's really happening is that the node has lots of Skyframe restarts, each of which is dominated by `#batchPrefetch`, and the thread dumps are simply showing that. Would be really interesting to hack up Bazel to print out some information about the specific (single?) node in question.
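One rough way to probe the restart hypothesis from the periodic dumps alone (a sketch, again using the placeholder dump directory from the earlier loop) is to count how many threads are inside `batchPrefetch` in each dump; a count that stays at exactly one thread over a long stretch would point at a single pathological node rather than a general slowdown:

```bash
# For every saved dump, count how many thread stacks contain batchPrefetch.
# /tmp/bazel-jstack-dumps is the placeholder directory from the earlier sketch.
for f in /tmp/bazel-jstack-dumps/jstack-*.txt; do
  n=$(grep -c 'SkyFunctionEnvironment.batchPrefetch' "$f")
  echo "$f: $n thread(s) in batchPrefetch"
done
```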
@yuyue730 You've been coincidentally looking at a `#batchPrefetch` issue internally. Can you please take a look at this too?
Hey @yuyue730, thanks for looking into this. Did you already have any findings?

> Would be really interesting to hack up Bazel to print out some information about the specific (single?) node in question.

If you have a patch, I can help build a binary that allows the TF team to test and collect more useful information for debugging.
A similar hanging issue is happening in the Bazel downstream pipeline, on both macOS and Windows: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3697#018e0c5c-9fcd-462e-a55d-9f9256bf25d0
After investigating and trying to reproduce this issue, I feel that what @coeuvre said in https://github.com/bazelbuild/bazel/issues/21134#issuecomment-1928907979 ("it's clear that Bazel is indeed stuck at `SkyFunctionEnvironment#batchPrefetch`") might not be accurate.

I used a Google virtual Linux desktop to reproduce this issue, which is much more powerful than the Mac mentioned in https://github.com/bazelbuild/bazel/issues/21134#issue-2106837526. If the build were stuck in `SkyFunctionEnvironment#batchPrefetch`, which is a CPU-bound operation, we would expect that to show up in a CPU profile.

@nitins17 provided a TensorFlow branch and a script to build the TensorFlow target. I used YourKit to capture a CPU profile while Bazel was building it. The build succeeded within ~1018 seconds, and the CPU profile did not show any performance hotspot around `SkyFunctionEnvironment`, which argues against the theory that `SkyFunctionEnvironment#batchPrefetch` gets stuck.
The image below shows the only `SkyFunctionEnvironment` call captured by YourKit.

See https://gist.github.com/yuyue730/99090975370101da67df345fda40c1c2#file-gistfile1-txt for the command executed and the traces output by Bazel.
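For a comparison on the CI Macs themselves, where attaching YourKit may not be practical, Bazel's built-in JSON trace profile might also show where the wall time goes; a minimal sketch (the output path and target label are placeholders):

```bash
# Write a gzipped JSON trace profile of the build, then print Bazel's summary.
# /tmp/bazel-profile.gz and //some:target are placeholders.
bazel build --profile=/tmp/bazel-profile.gz //some:target
bazel analyze-profile /tmp/bazel-profile.gz
# The profile can also be loaded into chrome://tracing for a timeline view.
```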
Reading more carefully into the log provided by @nitins17, if we focus on thread "skyframe-evaluator-110", it is clear that `SkyFunctionEnvironment#batchPrefetch` never runs for a very long time. Surveying other threads shows similar evidence.
I am more inclined to think that the Mac machine running the CI is simply underpowered, and that is why the build takes forever. The specs of the Mac ("Proc Cores is 2", "Proc Logical CPUs is 4", and 16 GB of RAM) are much weaker than those of the Linux machine on which I tried to reproduce the issue (24 vCPUs and 96 GB of RAM).
Description of the bug:
Hello,
We (ML DevInfra team at Google) have been trying to set up a build for TensorFlow that can cross-compile for macOS x86 on remote Linux x86 VMs using RBE. The build currently runs into issues if we set `--jobs` to be > 100. It seems to time out at the final 2 actions. It is hard to tell what is happening from the description of the action ("checking cached actions"), and running with `-s` does not seem to give any additional details. Can anyone on the Bazel side help us understand why it times out at the final 2 actions and if/how it can be fixed?

Which category does this issue belong to?
Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Please note that we have only run into this issue in our CI build, and running with our RBE setup would require additional permissions, so this might be hard to reproduce, but here are the commands to run nonetheless.
Which operating system are you running Bazel on?
macOS Ventura
What is the output of `bazel info release`?

release 6.5.0
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response
What's the output of `git remote get-url origin; git rev-parse HEAD`?

No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
We have a cross-compile build running for Linux Arm64 as well, where we have seen this action appear in the log, but that build does not get stuck indefinitely. The Mac machines we have in our CI are not powerful (for instance, they have 16 GB of RAM compared to 192 GB on our Linux Arm64 VMs), so our current suspicion is that the host Mac machine might be too weak to support remote building on > 100 machines.
Specs of the Mac machines in our CI:

| Spec | Value |
| --- | --- |
| OS | macOS 13.6 22G120 |
| Proc Type | Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz |
| Proc Desc. | Intel x86_64 Family 6 Model 85 Stepping 4 |
| Proc Cores | 2 |
| Proc Logical CPUs | 4 |
| Total Memory (GB) | 16.00 |