dkashyn-sfdc opened 8 months ago
scratch_45.txt: here is the thread dump of the Bazel server when this is happening.
I recommend not using --experimental_remote_execution_keepalive.
Let us try removing this!
Should it time out even with keepalive, though?
Still happening even without keepalive:
cat .bazelrc |grep keep
#common:remote --experimental_remote_execution_keepalive
[65,553 / 65,563] .... 1966s remote
bazel_server_dump.txt
Another dump with keepalive off.
FYI @werkt: this is my colleague, and we're using Buildfarm.
From both stack dumps, it seems like Bazel is waiting for the remote execution to finish. Did this happen with Bazel 6? Do the logs from the remote server tell you anything?
We are only starting to experiment with RBE so we don't have pre-7.0 baseline, unfortunately.
Regarding the remote server doing something: I've checked the BuildFarm workers and there were no active tasks there.
Another suspicious stack frame is:
"remote-executor-15" #3372 [22233] prio=5 os_prio=0 cpu=238.42ms elapsed=12851.39s allocated=100633K defined_classes=0 tid=0x00007fb6ac44d050 nid=22233 in Object.wait() [0x00007fb7562ac000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait0(java.base@21.0.1/Native Method)
- waiting on <no object reference available>
at java.lang.Object.wait(java.base@21.0.1/Object.java:366)
at java.lang.Object.wait(java.base@21.0.1/Object.java:339)
at com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver.onNext(GrpcServerImpl.java:175)
- locked <0x000000042f67b158> (a com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver)
at com.google.devtools.build.lib.server.GrpcServerImpl$RpcOutputStream.write(GrpcServerImpl.java:253)
at com.google.devtools.build.lib.util.io.DelegatingOutErr$DelegatingOutputStream.write(DelegatingOutErr.java:89)
at java.io.BufferedOutputStream.flushBuffer(java.base@21.0.1/BufferedOutputStream.java:125)
at java.io.BufferedOutputStream.implFlush(java.base@21.0.1/BufferedOutputStream.java:252)
at java.io.BufferedOutputStream.flush(java.base@21.0.1/BufferedOutputStream.java:240)
at com.google.devtools.build.lib.runtime.UiEventHandler$FullyBufferedOutputStream.flush(UiEventHandler.java:146)
at com.google.devtools.build.lib.util.io.AnsiTerminal.flush(AnsiTerminal.java:192)
at com.google.devtools.build.lib.runtime.UiEventHandler.handleLocked(UiEventHandler.java:373)
- locked <0x000000042f67b2f0> (a com.google.devtools.build.lib.runtime.UiEventHandler)
at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:443)
at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:471)
at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:127)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.reportUploadError(ByteStreamBuildEventArtifactUploader.java:342)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$uploadLocalFiles$5(ByteStreamBuildEventArtifactUploader.java:371)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda/0x00007fb75c9f5568.apply(Unknown Source)
at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext$ResumeMainSingleObserver.onError(SingleResumeNext.java:73)
at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.1/ThreadPoolExecutor.java:1144)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.1/ThreadPoolExecutor.java:642)
at java.lang.Thread.runWith(java.base@21.0.1/Thread.java:1596)
at java.lang.Thread.run(java.base@21.0.1/Thread.java:1583)
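The top frames show the flow-control pattern that can park a thread indefinitely: a writer waits on a monitor until the stream is ready for more data. A minimal toy model of that pattern (our code and names, not Bazel's actual GrpcServerImpl):

```java
// Toy model of the BlockingStreamObserver pattern in the trace above:
// send() blocks in Object.wait() until the transport signals readiness.
// If that signal never arrives, every thread printing through this path
// parks forever, which is what the dump shows.
final class BlockingSender {
  private final Object lock = new Object();
  private boolean ready = false;
  volatile String lastSent;

  void send(String msg) throws InterruptedException {
    synchronized (lock) {
      while (!ready) {
        lock.wait(); // corresponds to the Object.wait() frames in the dump
      }
      lastSent = msg;
      ready = false; // block again until the peer drains this message
    }
  }

  // Called by the transport when it can accept more data.
  void onReady() {
    synchronized (lock) {
      ready = true;
      lock.notifyAll();
    }
  }
}
```

If onReady() is never invoked (for example, the client stops draining the stream), send() waits forever, matching the WAITING state above.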
It seems like the Bazel server has trouble sending a message to the client. Can you share the list of flags your build uses? And maybe try disabling BEP/BES?
BEP is async, and this is confirmed to work properly in an async way starting from 7.0:
build --bes_backend=grpcs://...
build --bes_upload_mode=fully_async
build --build_event_text_file_upload_mode=fully_async
build --build_event_binary_file_upload_mode=fully_async
build --build_event_json_file_upload_mode=fully_async
Here is the full list of remote-related configs
common:remote_sfw_uswest2 --remote_executor=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --remote_cache=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --config=remote
# remote defaults.
common:remote --define=EXECUTOR=remote
common:remote --experimental_remote_cache_compression # https://github.com/buildbuddy-io/buildbuddy/blob/master/.bazelrc#L32
common:remote --jobs=300
common:remote --remote_timeout=3600
common:remote --build_metadata=RBE=True
common:remote --platforms=@rbe_default//config:platform
common:remote --host_platform=@rbe_default//config:platform
common:remote --extra_execution_platforms=@rbe_default//config:platform
common:remote --crosstool_top=@rbe_default//cc:toolchain
common:remote --extra_toolchains=@rbe_default//config:cc-toolchain
common:remote --disk_cache=
common:remote --cpu=k8 --host_cpu=k8 # force tools to use intel x86_64 CPU (in case you are trying on an Apple Mac)
#common:remote --experimental_remote_execution_keepalive
From the stack trace above, Bazel encountered errors when uploading blobs to BES/BEP. The errors were not printed out because the Bazel server got stuck sending the message to the client.
I suggested trying to disable BES/BEP because that tells us whether the upload error is the reason Bazel is stuck.
I'll try to repro it without BES/BEP in the next couple of days. The issue should still be fixed even if it's caused by BEP/BES ;) We are specifically asking for async uploads.
What I remember from server logs earlier: there were no traces of any uploads in other Bazel server logs.
❯ cat .bazelrc | grep -e bes -e event
#build --bes_backend=grpcs://...
#build --bes_upload_mode=fully_async
#build --build_event_text_file_upload_mode=fully_async
#build --build_event_binary_file_upload_mode=fully_async
#build --build_event_json_file_upload_mode=fully_async
I see
This time the RBE server might not be fully healthy, but even in this case I would expect Bazel to fail after a reasonable timeout.
> I'll try to repro it without BES/BEP in the next couple of days. The issue should still be fixed even if it's caused by BEP/BES ;) We are specifically asking for async uploads.
Right, we are still investigating what the root cause is, and I am trying to remove as much variance as possible.
From the new stack trace:
"skyframe-evaluator-execution-150" #12989 [603421] daemon prio=5 os_prio=0 cpu=3342.37ms elapsed=1629.24s allocated=1920M defined_classes=0 tid=0x00007f7c5c24ee30 nid=603421 waiting on condition [0x00007f7b13432000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21.0.1/Native Method)
- parking to wait for <0x000000056776f8d0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21.0.1/LockSupport.java:269)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21.0.1/AbstractQueuedSynchronizer.java:756)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@21.0.1/AbstractQueuedSynchronizer.java:1126)
at java.util.concurrent.CountDownLatch.await(java.base@21.0.1/CountDownLatch.java:276)
at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingAwait(BlockingMultiObserver.java:134)
at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1497)
at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:117)
at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1451)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:272)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda/0x00007f7d0c88b4e8.call(Unknown Source)
at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:265)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
...
it seems like Bazel is stuck uploading inputs to the remote server. Since you set --remote_timeout=3600, Bazel will only time out after 1 hour. Can you try a lower value to check whether Bazel times out?
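For instance, a temporary override along these lines (the value is only an example, reusing the `remote` config name from the .bazelrc quoted above):

```
# Fail fast instead of hanging for an hour on a stuck remote call.
common:remote --remote_timeout=120
```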
It seems related to --experimental_throttle_remote_action_building. Can you try with --noexperimental_throttle_remote_action_building?
As you can see here, https://github.com/bazelbuild/bazel/issues/21626#issuecomment-1994457946, we don't have this enabled. I can still turn it off if required, though.
Ok... I see it is enabled by default. Let me turn that off
@Option(
name = "experimental_throttle_remote_action_building",
defaultValue = "true",
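For context, the flag gates how many actions may build their remote action inputs concurrently. A rough sketch of the idea, with invented names (this is not Bazel's actual implementation):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Toy sketch of what --experimental_throttle_remote_action_building gates
// (our names, not Bazel's classes): with throttling on, only a limited
// number of actions may build their remote action (merkle tree + Action
// proto) at once; the rest park on the semaphore.
final class ActionBuildThrottle {
  private final boolean enabled;
  private final Semaphore permits;

  ActionBuildThrottle(boolean enabled, int maxConcurrent) {
    this.enabled = enabled;
    this.permits = new Semaphore(maxConcurrent);
  }

  <T> T run(Callable<T> buildRemoteAction) throws Exception {
    if (!enabled) {
      return buildRemoteAction.call(); // --noexperimental_...: no gate at all
    }
    permits.acquire();
    try {
      return buildRemoteAction.call();
    } finally {
      permits.release();
    }
  }
}
```

With the gate enabled, a thread dump of a busy build is expected to show many executor threads parked on the semaphore; that alone does not prove a hang.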
What should we do with the BES/BEP and keepalive flags that were turned off earlier?
Let's keep them off for now.
This flag is 8.x only, based on https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4
I got
ERROR: --noexperimental_throttle_remote_action_building :: Unrecognized option: --noexperimental_throttle_remote_action_building
for
cat .bazelversion
7.0.1
So I'd say this is definitely not this property.
> So I'd say this is definitely not this property.
Although the flag was not available for 7.0.1, the feature is still there and on.
The flag is available in Bazel 7.1.0. Can you use that version instead and try?
It might take a couple of weeks for us to upgrade to 7.1. We are definitely going to do this either way, yet the number of features we currently have disabled is somewhat worrying...
Do you really think this "throttling" is a prime suspect here?
From the stack trace, it's highly likely. Would you mind patching https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4 on top of 7.0.1 to give it a try?
Ok, let us focus on testing this particular condition next week then.
We were able to use 7.1 and the timeout is still there.
I have another run stuck in
[65,743 / 65,759] ...; 3777s remote, disk-cache, remote-cache
with
#common:remote --experimental_remote_execution_keepalive
common:remote --noexperimental_throttle_remote_action_building
rbe_stuck_1k_seconds.txt rbe_stuck_2k_seconds.txt rbe_stuck_3and6k_seconds.txt rbe_stuck_3k_seconds.txt
Disabling the async task cache fixes this problem after incurring up to 100 writes per action on builds. It hangs every time with the task cache enabled and completes every time with it disabled.
diff --git b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
index cd91a72b50..ce52000954 100644
--- b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
+++ a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
@@ -291,6 +291,8 @@ public final class AsyncTaskCache<KeyT, ValueT> {
Action onAlreadyRunning,
Action onAlreadyFinished,
boolean force) {
+ return task;
+ /*
return Single.create(
emitter -> {
synchronized (lock) {
@@ -342,6 +344,7 @@ public final class AsyncTaskCache<KeyT, ValueT> {
});
}
});
+ */
}
/**
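For readers following along: the cache being bypassed here deduplicates in-flight tasks, so concurrent requests for the same key (e.g. the same blob digest) share one upload. A toy sketch of that contract, not Bazel's RxJava-based implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Toy sketch of the deduplication AsyncTaskCache provides (our code, not
// Bazel's): all concurrent callers asking for the same key share a single
// in-flight future.
final class TaskCache<K, V> {
  private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight =
      new ConcurrentHashMap<>();

  CompletableFuture<V> execute(K key, Supplier<CompletableFuture<V>> task) {
    CompletableFuture<V> fresh = new CompletableFuture<>();
    CompletableFuture<V> existing = inFlight.putIfAbsent(key, fresh);
    if (existing != null) {
      return existing; // someone else is already running this task
    }
    task.get().whenComplete((v, e) -> {
      inFlight.remove(key, fresh); // allow later re-runs of the same key
      if (e != null) {
        fresh.completeExceptionally(e);
      } else {
        fresh.complete(v);
      }
    });
    return fresh;
  }
}
```

A missed completion signal in such a cache strands every caller waiting on the shared task, which would look exactly like many threads blocked in ensureInputsPresent.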
@coeuvre could you please take a look whether new stacktraces and info from @werkt is helpful with identifying the issue?
Thanks for the stacktraces and pointers to the potential root cause. I will try to reproduce it locally now and hopefully come up with a fix.
> Disabling the async task cache fixes this problem after incurring up to 100 writes per action on builds. Hangs every time with the task cache enabled, completes every time with it disabled.
Failed to reproduce it locally so far with the described setup. @werkt Is it possible to share your repro?
https://github.com/werkt/bazel/tree/async-task-cache Includes the short circuit return from the cache patch posted here
I am confused. A custom Bazel built from https://github.com/werkt/bazel/tree/async-task-cache can indeed get the build stuck; however, that is without the async task cache. When I enable the async task cache by removing the short circuit, the same build succeeds. It's like the inverse of https://github.com/bazelbuild/bazel/issues/21626#issuecomment-2001670837.
@werkt Can you elaborate?
tl;dr: run it more than once, and run it against buildfarm. You need to see all the content for actions already in the CAS to exhibit this.
The point of the changes to async-task-cache (aside from the short circuit), are to make the client attempt to upload numerous blobs that are already uploaded, and to try to upload many blobs across actions that are the same (to have high consumer counts for the AsyncTaskCache).
If your remote implementation test case does not terminate duplicate blob uploads with committedSize responses on first observation, you may not get the behavior I'm seeing. When I run this against buildfarm, which informs writers of a blob already existing with an early response and completion, I see hangs without the short circuit (sometimes only for the remote_timeout), and with the short circuit it completes quickly on all actions.
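To illustrate the race being described (purely a toy model; names and structure are ours, not Buildfarm's or Bazel's API): a blob write can complete either because the client finished streaming, or because the server answers early with the blob's full committedSize. A client that only waits on its own send path can hang when the early response arrives first:

```java
import java.util.concurrent.CompletableFuture;

// Toy model of a deduplicated blob write: completion can come from the
// client finishing its upload, or from the server short-circuiting with
// the blob's full committedSize because the blob is already in the CAS.
final class BlobWrite {
  final long blobSize;
  final CompletableFuture<Long> done = new CompletableFuture<>();

  BlobWrite(long blobSize) { this.blobSize = blobSize; }

  // Server path: early completion for an already-present blob.
  void onServerCommitted(long committedSize) {
    if (committedSize == blobSize) {
      done.complete(committedSize);
    }
  }

  // Client path: we streamed every byte ourselves.
  void onClientFinished() {
    done.complete(blobSize);
  }
}
```

If the uploader ignores the server path, done never completes on early responses, and callers blocking on it hang until --remote_timeout expires.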
@coeuvre were you able to identify the issue? How else can we help here?
I was able to reproduce a hang but the stack trace is different. I am not sure whether they share the same root cause.
In the meantime, can you check whether your build hangs with --remote_download_all? (If it doesn't, then I know the root cause.)
Something has changed in our RBE server config, so this doesn't happen anymore :(
Yet the fact that it was so consistent before means that some more resilience on the client side wouldn't hurt.
I think I am seeing the same issue as reported here, on 7.1.1.
2024-03-26 21:28:30 CDT (02:28:30) [755,647 / 761,751] 16720 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2857s ... (200 actions, 0 running)
2024-03-26 21:29:58 CDT (02:29:58) [756,389 / 762,222] 16768 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2945s ... (200 actions, 2 running)
2024-03-26 21:31:04 CDT (02:31:04) [764,176 / 769,524] 17127 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3011s ... (200 actions, 1 running)
2024-03-26 21:32:04 CDT (02:32:04) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3071s ... (200 actions, 0 running)
2024-03-26 21:33:35 CDT (02:33:35) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3161s ... (200 actions, 0 running)
Some relevant flags
--jobs 200
--remote_download_minimal
--experimental_throttle_remote_action_building # retained default value
--bes_backend=grpc://127.0.0.1:5555
--bes_upload_mode=FULLY_ASYNC
--remote_timeout=3600
As per last suggestion, I will try with --remote_download_all
@joeljeske Your build is building the merkle trees of 199 actions, slowly it seems, but it is not obviously stuck. 64 threads (presumably your number of cores, throttled by the remote action building semaphore) are making progress on this task.
You do have one evaluator, skyframe-evaluator-40, waiting for a subprocess, but I don't think your build is shot; there is just a lot of work going on. Unless you're out of heap and bouncing against the GC ceiling, this will proceed, and it definitely doesn't look like the hangs in ensureInputsPresent that indicate a lack of signaling to upload procedures.
Understood! Thank you for the analysis. I'm curious, then, why I don't observe the same behavior on Bazel 6.4.0. I can consistently see slow build speeds on 7.1.1 that were not present before the upgrade.
Do you have any ideas?
In the interest of not hijacking this issue, @joeljeske, can you file a separate one for the performance regression between 6.4.0 and 7.1.1?
Can you patch https://github.com/bazelbuild/bazel/commit/eda0fe4399b85d3bbc39fda2701287e014ecb919 and check whether this issue is still reproducible?
@coeuvre will this be merged into 7.2? Patching is hard, and we haven't even fully moved to 7.1 yet... I'd rather try to repro this with 7.0 like before, and then switch to the head of 7.2 to see whether the issue is there.
My CAS state has changed, and evictions and cleanup have not been happening recently, which is why I'm unable to repro currently. Yet we are verifying a CAS-related issue with BuildFarm, and I hope to get to the point where the CAS is mutating again so it triggers the issue for us.
Yes, the fix will be merged into 7.2.
No pressure! I will keep this issue open (but with a lower priority) for the time being. Feel free to close it once you can verify the fix.
@coeuvre could you please tell us where we can get a custom, Bazelisk-ready version that contains the fix? Rolling releases seem to have only 7.0 and 8.0 pre-releases, and I don't see any 7.1 or 7.2 nightly builds here: https://bazel.build/release/rolling
Tried 8.0.0-pre.20240404.3 and still see
[106,946 / 106,964] some target here....; 5713s remote, remote-cache, local
This seems not ok. It doesn't look like https://github.com/bazelbuild/bazel/pull/21941/files is merged to master, though... so this is probably expected.
@iancha1992 @coeuvre could you please suggest which bazelisk builds can be used to test this change?
A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!
This issue has recurred under Bazel 7.2. I have a stuck Java process awaiting countdown latches for all running ensureInputsPresent calls.
FTR: I am seeing this issue, like some folks above, on Bazel 6.1.0 very occasionally.
The symptoms are the same: the actions block on the semaphore in RemoteExecutionService, because there are N (N = the number of processors on my system) threads locked in RemoteExecutionCache.ensureInputsPresent uploads on blockingAwait.
Description of the bug:
A Bazel build is stuck in its current state; when this issue is not observed, the same run takes less than a minute, if not less than 10s.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
There is no easy way to reproduce it, and a noticeable percentage of our RBE runs get stuck forever without Bazel finishing properly.
Here is the thread dump of the Bazel server; there is no activity on the RBE-side workers at the same time.
Which operating system are you running Bazel on?
5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
What is the output of
bazel info release
?
release 7.0.1
If
bazel info release
returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of
git remote get-url origin; git rev-parse HEAD
?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
We do have the following in
.bazelrc