bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Build very slow when cross-compiling for macOS using RBE, possibly because the Mac machine is underpowered #21134

Open nitins17 opened 7 months ago

nitins17 commented 7 months ago

Description of the bug:

Hello,

We (the ML DevInfra team at Google) have been trying to set up a build for TensorFlow that cross-compiles for macOS x86 on remote Linux x86 VMs using RBE. The build currently runs into issues if we set --jobs to > 100: it seems to time out at the final 2 actions. It is hard to tell what is happening from the description of the action ("checking cached actions"), and running with -s does not seem to give any additional details. Can anyone on the Bazel side help us understand why it times out at the final 2 actions and if/how it can be fixed?

[45,720 / 45,736] [Sched] Linking tensorflow/python/_pywrap_tfcompile.so [for tool]; 7s ... (4 actions, 0 running)
[45,734 / 45,736] checking cached actions
[45,734 / 45,736] checking cached actions
[45,734 / 45,736] checking cached actions
... (the same "checking cached actions" line repeats until the 43200-second timeout below)

ERROR: Aborting VM command due to timeout of 43200 seconds

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Please note that we have only run into this issue in our CI build, and running with our RBE setup would require additional permissions, so this might be hard to reproduce. Here are the commands to run nonetheless:

1. git clone https://github.com/tensorflow/tensorflow.git && cd tensorflow
# Replace "test:" cross-compile configs with "build:"
2. sed -i '' 's/test:macos_x86_pycpp_test_filters/build:macos_x86_pycpp_test_filters/g' .bazelrc && sed -i '' 's/test:macos_x86_pycpp_test/build:macos_x86_pycpp_test/g' .bazelrc  && sed -i '' 's/test:cross_compile_macos_x86_pycpp_test/build:cross_compile_macos_x86_pycpp_test/g' .bazelrc
3. bazel build --repo_env=TF_PYTHON_VERSION=3.11 --noremote_accept_cached --noenable_runfiles --jobs=300 --config rbe_cross_compile_macos_x86 --config=cross_compile_macos_x86_pycpp_test
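
For convenience, a minimal sketch that consolidates the steps above into one script (it assumes BSD/macOS sed, hence the empty argument to -i; the config names and Python version are taken verbatim from the command above):

    #!/usr/bin/env bash
    set -euo pipefail

    # Step 1: fetch the TensorFlow sources.
    git clone https://github.com/tensorflow/tensorflow.git
    cd tensorflow

    # Step 2: rewrite the "test:" cross-compile configs to "build:" in .bazelrc
    # (BSD/macOS sed syntax; GNU sed would use plain -i).
    sed -i '' 's/test:macos_x86_pycpp_test_filters/build:macos_x86_pycpp_test_filters/g' .bazelrc
    sed -i '' 's/test:macos_x86_pycpp_test/build:macos_x86_pycpp_test/g' .bazelrc
    sed -i '' 's/test:cross_compile_macos_x86_pycpp_test/build:cross_compile_macos_x86_pycpp_test/g' .bazelrc

    # Step 3: run the cross-compile build with a high --jobs value.
    bazel build --repo_env=TF_PYTHON_VERSION=3.11 --noremote_accept_cached --noenable_runfiles \
      --jobs=300 --config rbe_cross_compile_macos_x86 --config=cross_compile_macos_x86_pycpp_test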

Which operating system are you running Bazel on?

macOS Ventura

What is the output of bazel info release?

release 6.5.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

We also have a cross-compile build for Linux Arm64 where we have seen this action appear in the log, but that build does not get stuck indefinitely. The Mac machines in our CI are not powerful (for instance, they have 16 GB RAM compared to 192 GB on our Linux Arm64 VMs), so our current suspicion is that the host Mac machine might be too weak to drive remote builds on > 100 machines.

Specs of the Mac machines in our CI:

OS | macOS 13.6 22G120
Proc Type | Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Proc Desc. | Intel x86_64 Family 6 Model 85 Stepping 4
Proc Cores | 2
Proc Logical CPUs | 4
Total Memory (GB) | 16.00

nitins17 commented 7 months ago

cc: @vam-google, @learning-to-play, @meteorcloudy

coeuvre commented 7 months ago

Can you capture the stack traces with jstack when Bazel gets stuck?

nitins17 commented 7 months ago

Can you capture the stack traces with jstack when Bazel gets stuck?

Is there a way to pass this to Bazel so that it prints the stack trace after it gets stuck? We don't have interactive access to our CI VMs so it would be tricky to capture the stack trace if I have to run jstack after Bazel gets stuck.

lberki commented 7 months ago

Is something like BAZEL_PID=$(bazel info server_pid); (while /bin/true; do sleep 60; jstack $BAZEL_PID; done) & <everything else> an option? It's not nice, but it's simple.
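
Spelled out as a script (the log path, the 60-second interval, and the placeholder build invocation are illustrative), the suggestion amounts to something like:

    #!/usr/bin/env bash
    # Dump the Bazel server's Java threads every 60 seconds in the background
    # while the build runs, so the CI job needs no interactive access.
    BAZEL_PID=$(bazel info server_pid)

    while true; do
      sleep 60
      jstack "$BAZEL_PID" >> /tmp/bazel-jstack.log 2>&1
    done &
    DUMPER_PID=$!

    # <everything else>: the real build invocation goes here, for example:
    # bazel build --jobs=300 --config rbe_cross_compile_macos_x86 --config=cross_compile_macos_x86_pycpp_test

    # Stop the dumper once the build finishes or times out.
    kill "$DUMPER_PID"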

meteorcloudy commented 7 months ago

https://blog.bazel.build/2023/10/06/bwob-in-bazel-7.html

Have you also tried --remote_download_toplevel?
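
For reference, a hedged sketch of how the flag could be wired in (the target placeholder is illustrative); it can go on the command line or in the project's .bazelrc:

    # On the command line:
    bazel build --remote_download_toplevel --config=rbe_cross_compile_macos_x86 <target>

    # Or as a line in .bazelrc:
    build --remote_download_toplevel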

nitins17 commented 7 months ago

Is something like BAZEL_PID=$(bazel info server_pid); (while /bin/true; do sleep 60; jstack $BAZEL_PID; done) & <everything else> an option? It's not nice, but it's simple.

Thanks, let me give it a try.

Have you also tried --remote_download_toplevel?

Yep, unfortunately, the build still gets stuck.

nitins17 commented 7 months ago

Is something like BAZEL_PID=$(bazel info server_pid); (while /bin/true; do sleep 60; jstack $BAZEL_PID; done) & <everything else> an option? It's not nice, but it's simple.

It worked, thanks! Here's the stack trace when Bazel gets stuck.

coeuvre commented 7 months ago

It's not clear from the stack trace what's going wrong. All remote-executor-* threads are waiting for tasks, as are all skyframe-evaluator-* threads except one:

"skyframe-evaluator-110": running
    at java.util.concurrent.ConcurrentHashMap.get(java.base@11.0.6/Unknown Source)
    at com.google.devtools.build.skyframe.InMemoryGraphImpl$$Lambda$261/0x00000008002ca840.get(Unknown Source)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.batchPrefetch(SkyFunctionEnvironment.java:222)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.<init>(SkyFunctionEnvironment.java:203)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.create(SkyFunctionEnvironment.java:154)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:535)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:365)
    at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinPool.scan(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.6/Unknown Source)

Looking at the code, InMemoryGraphImpl$$Lambda$261/0x00000008002ca840.get(Unknown Source) shouldn't block. It might just be a red herring.

Can you capture more stack traces so that we know whether this particular thread can make progress?
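
One hedged way to do that (the thread name is taken from the dump above; the interval and log path are illustrative) is to append a timestamped snapshot of just the suspect thread on every iteration, so consecutive snapshots can be compared:

    # Capture only the suspect thread from each periodic dump, with a timestamp.
    BAZEL_PID=$(bazel info server_pid)
    while true; do
      {
        date
        jstack "$BAZEL_PID" | grep -A 20 '"skyframe-evaluator-110"'
      } >> /tmp/suspect-thread.log 2>&1
      sleep 30
    done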

nitins17 commented 7 months ago

Sure, I saved some more stack traces here. The entire log was too big to save, so I had to trim it to the last ~25%.

coeuvre commented 7 months ago

Thanks! With more stack traces, it's clear that Bazel is indeed stuck at:

"skyframe-evaluator-110" #713 daemon prio=5 os_prio=31 cpu=12505629.49ms elapsed=40986.47s tid=0x00007f80eb1bd800 nid=0x3bf07 runnable  [0x0000700020168000]
   java.lang.Thread.State: RUNNABLE
    at java.util.concurrent.ConcurrentHashMap.get(java.base@11.0.6/Unknown Source)
    at com.google.devtools.build.skyframe.InMemoryGraphImpl$$Lambda$261/0x00000008002ca840.get(Unknown Source)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.batchPrefetch(SkyFunctionEnvironment.java:222)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.<init>(SkyFunctionEnvironment.java:203)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.create(SkyFunctionEnvironment.java:154)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:535)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:365)
    at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinPool.scan(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.6/Unknown Source)
    at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.6/Unknown Source)

haxorz commented 7 months ago

I don't think we expect Bazel to be literally stuck due to a thread having that stack trace, since it's a CPU-bound operation. Maybe what's going on here is that there's a node with a ton (millions?) of direct deps, and #batchPrefetch is iterating through them sequentially. We still wouldn't expect this to take ~seconds, though... Maybe what's really happening is that the node has lots of Skyframe restarts, each of which is dominated by #batchPrefetch, and the thread dumps are simply showing that. It would be really interesting to hack up Bazel to print out some information about the specific (single?) node in question.

@yuyue730 You've been coincidentally looking at a #batchPrefetch issue internally. Can you please take a look at this too?

meteorcloudy commented 6 months ago

Hey @yuyue730, thanks for looking into this. Have you had any findings yet?

Would be really interesting to hack up Bazel to print out some information about the specific (single?) node in question.

If you have a patch, I can help build a binary that allows TF team to test and collect more useful information for debugging.

meteorcloudy commented 6 months ago

A similar hanging issue is happening in the Bazel downstream pipeline, both on macOS and Windows: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3697#018e0c5c-9fcd-462e-a55d-9f9256bf25d0

yuyue730 commented 6 months ago

After investigating and trying to reproduce this issue, I feel that what @coeuvre said in https://github.com/bazelbuild/bazel/issues/21134#issuecomment-1928907979, namely that it's clear that Bazel is indeed stuck at SkyFunctionEnvironment#batchPrefetch, might not be accurate.

I used a Google virtual Linux desktop to reproduce this issue; it is much more powerful than the Mac mentioned in https://github.com/bazelbuild/bazel/issues/21134#issue-2106837526. If the build gets stuck at SkyFunctionEnvironment#batchPrefetch, and that is a CPU-bound operation, we would expect it to show up in a CPU profile.

@nitins17 provided a TensorFlow branch and a script to build the TensorFlow target. I used YourKit to capture a CPU profile while Bazel was building it. The build succeeded within ~1018 seconds. The CPU profile did not show any performance hotspot around SkyFunctionEnvironment, which argues against SkyFunctionEnvironment#batchPrefetch getting stuck.

The screenshot below (attached in the original comment) shows the only SkyFunctionEnvironment call captured by YourKit.

See https://gist.github.com/yuyue730/99090975370101da67df345fda40c1c2#file-gistfile1-txt for the command executed and the traces output by Bazel.
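
For anyone trying to reproduce this kind of measurement, one way (a sketch, not necessarily how it was done here) to attach a JVM profiling agent to the Bazel server is via the --host_jvm_args startup option; the agent path and target placeholder below are illustrative and depend on where the profiler is installed:

    # Restart the Bazel server so the new JVM startup flag takes effect.
    bazel shutdown
    # Start the server with the profiling agent attached, then run the build.
    bazel --host_jvm_args=-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so \
      build --config=rbe_cross_compile_macos_x86 <target>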


Reading more carefully into the log provided by @nitins17, if we focus on the thread "skyframe-evaluator-110", it is clear that SkyFunctionEnvironment#batchPrefetch never executes for a very long time. Surveying other threads shows similar evidence.


I am more inclined to think that the underpowered Mac machine running the CI is the reason the build takes forever. The specs of the Mac ("Proc Cores is 2", "Proc Logical CPUs is 4" and 16 GB RAM) are much weaker than those of the Linux machine on which I tried to reproduce the issue (24 vCPUs and 96 GB of RAM).