bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.26k stars 4.08k forks source link

Bazel stuck during remote execution #21626

Open dkashyn-sfdc opened 8 months ago

dkashyn-sfdc commented 8 months ago

Description of the bug:

Bazel build current state

[65,456 / 65,465] PackageZip ...rcejar.jar; 20129s remote

when there is no such issue observed this run takes less than a minute if not less than 10s.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

There is no easy way to reproduce it and some noticeable % of our RBE runs are stuck forever without Bazel finishing properly.

Here is the thread dump of Bazel server and there is no activity on RBE side workers at the same time

Which operating system are you running Bazel on?

5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

What is the output of bazel info release?

release 7.0.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

We do have the following in .bazelrc

common:remote --jobs=300
common:remote --remote_timeout=3600
dkashyn-sfdc commented 8 months ago

scratch_45.txt here is the thread dump of Bazel server when this is happening

benjaminp commented 8 months ago

I recommend not using --experimental_remote_execution_keepalive.

dkashyn-sfdc commented 8 months ago

Let us try out to remove this!

dkashyn-sfdc commented 8 months ago

Should it TO even with keep alive though?

dkashyn-sfdc commented 8 months ago

Still happening even without keep alive

 cat .bazelrc |grep keep
#common:remote --experimental_remote_execution_keepalive
[65,553 / 65,563] .... 1966s remote
dkashyn-sfdc commented 8 months ago

bazel_server_dump.txt another dump with keep_alive off

jasonschroeder-sfdc commented 8 months ago

FYI @werkt this is my colleague and we're using Buildfarm

coeuvre commented 8 months ago

From both stack dumps, it seems like bazel is waiting for the remote execution to be done. Did it happen for Bazel 6? Do the logs from remote server tell anything?

dkashyn-sfdc commented 8 months ago

We are only starting to experiment with RBE so we don't have pre-7.0 baseline, unfortunately.

Regarding remote server doing something: I've checked BuildFarm workers and there were no active tasks there.

coeuvre commented 8 months ago

Another suspicious stack frame is:


"remote-executor-15" #3372 [22233] prio=5 os_prio=0 cpu=238.42ms elapsed=12851.39s allocated=100633K defined_classes=0 tid=0x00007fb6ac44d050 nid=22233 in Object.wait()  [0x00007fb7562ac000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait0(java.base@21.0.1/Native Method)
    - waiting on <no object reference available>
    at java.lang.Object.wait(java.base@21.0.1/Object.java:366)
    at java.lang.Object.wait(java.base@21.0.1/Object.java:339)
    at com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver.onNext(GrpcServerImpl.java:175)
    - locked <0x000000042f67b158> (a com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver)
    at com.google.devtools.build.lib.server.GrpcServerImpl$RpcOutputStream.write(GrpcServerImpl.java:253)
    at com.google.devtools.build.lib.util.io.DelegatingOutErr$DelegatingOutputStream.write(DelegatingOutErr.java:89)
    at java.io.BufferedOutputStream.flushBuffer(java.base@21.0.1/BufferedOutputStream.java:125)
    at java.io.BufferedOutputStream.implFlush(java.base@21.0.1/BufferedOutputStream.java:252)
    at java.io.BufferedOutputStream.flush(java.base@21.0.1/BufferedOutputStream.java:240)
    at com.google.devtools.build.lib.runtime.UiEventHandler$FullyBufferedOutputStream.flush(UiEventHandler.java:146)
    at com.google.devtools.build.lib.util.io.AnsiTerminal.flush(AnsiTerminal.java:192)
    at com.google.devtools.build.lib.runtime.UiEventHandler.handleLocked(UiEventHandler.java:373)
    - locked <0x000000042f67b2f0> (a com.google.devtools.build.lib.runtime.UiEventHandler)
    at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:443)
    at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:471)
    at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:127)
    at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.reportUploadError(ByteStreamBuildEventArtifactUploader.java:342)
    at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$uploadLocalFiles$5(ByteStreamBuildEventArtifactUploader.java:371)
    at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda/0x00007fb75c9f5568.apply(Unknown Source)
    at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext$ResumeMainSingleObserver.onError(SingleResumeNext.java:73)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
    at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
    at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.1/ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.1/ThreadPoolExecutor.java:642)
    at java.lang.Thread.runWith(java.base@21.0.1/Thread.java:1596)
    at java.lang.Thread.run(java.base@21.0.1/Thread.java:1583)

It seems like the Bazel server has trouble sending message to the client. Can you share the list of flags your build uses? And maybe try disabling BEP/BES?

dkashyn-sfdc commented 8 months ago

BEP is async and this is confirmed to be working properly in an async way starting from 7.0

build --bes_backend=grpcs://...
build --bes_upload_mode=fully_async
build --build_event_text_file_upload_mode=fully_async
build --build_event_binary_file_upload_mode=fully_async
build --build_event_json_file_upload_mode=fully_async

Here is the full list of remote-related configs

common:remote_sfw_uswest2 --remote_executor=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --remote_cache=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --config=remote

# remote defaults.
common:remote --define=EXECUTOR=remote
common:remote --experimental_remote_cache_compression  # https://github.com/buildbuddy-io/buildbuddy/blob/master/.bazelrc#L32
common:remote --jobs=300
common:remote --remote_timeout=3600
common:remote --build_metadata=RBE=True
common:remote --platforms=@rbe_default//config:platform
common:remote --host_platform=@rbe_default//config:platform
common:remote --extra_execution_platforms=@rbe_default//config:platform
common:remote --crosstool_top=@rbe_default//cc:toolchain
common:remote --extra_toolchains=@rbe_default//config:cc-toolchain
common:remote --disk_cache=
common:remote --cpu=k8 --host_cpu=k8 # force tools to use intel x86_64 CPU (in case you are trying on an Apple Mac)
#common:remote --experimental_remote_execution_keepalive
coeuvre commented 8 months ago

From the stack trace above, Bazel encountered errors when uploading blobs to BES/BEP. The errors were not printed out because Bazel server stuck at sending the message to client.

I suggested to try disabling BES/BEP because by doing so we know whether this (the upload error) is the reason why Bazel stuck.

dkashyn-sfdc commented 8 months ago

I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.

What I remember from server logs earlier: there were no traces of any uploads in other Bazel server logs.

dkashyn-sfdc commented 8 months ago
❯ cat .bazelrc | grep -e bes -e event
#build --bes_backend=grpcs://...
#build --bes_upload_mode=fully_async
#build --build_event_text_file_upload_mode=fully_async
#build --build_event_binary_file_upload_mode=fully_async
#build --build_event_json_file_upload_mode=fully_async

I see

image

rbe_another_dump.txt

This time RBE server might be not really healthy yet in this case I would expect Bazel to fail after a reasonable timeout.

coeuvre commented 8 months ago

I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.

Right, we are still investigate what is the root cause and I am trying to remove the variances as much as possible.

From the new stack trace:

skyframe-evaluator-execution-150" #12989 [603421] daemon prio=5 os_prio=0 cpu=3342.37ms elapsed=1629.24s allocated=1920M defined_classes=0 tid=0x00007f7c5c24ee30 nid=603421 waiting on condition  [0x00007f7b13432000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@21.0.1/Native Method)
    - parking to wait for  <0x000000056776f8d0> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21.0.1/LockSupport.java:269)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21.0.1/AbstractQueuedSynchronizer.java:756)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@21.0.1/AbstractQueuedSynchronizer.java:1126)
    at java.util.concurrent.CountDownLatch.await(java.base@21.0.1/CountDownLatch.java:276)
    at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingAwait(BlockingMultiObserver.java:134)
    at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1497)
    at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:117)
    at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1451)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:272)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda/0x00007f7d0c88b4e8.call(Unknown Source)
    at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
    at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
    at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:265)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
        ...

it seems like Bazel stuck at uploading inputs to remote server. Since you set --remote_timeout=3600, Bazel will only time out after 1 hour. Can you try with a lower value to check whether Bazel can time out?

dkashyn-sfdc commented 8 months ago
image

rbe_yet_another_dump.txt

coeuvre commented 8 months ago

It seems relate to --experimental_throttle_remote_action_building. Can you try with --noexperimental_throttle_remote_action_building?

dkashyn-sfdc commented 8 months ago

As you can see here https://github.com/bazelbuild/bazel/issues/21626#issuecomment-1994457946 we don't have this enabled. I still can turn it off if required though.

dkashyn-sfdc commented 8 months ago

Ok... I see it is enabled by default. Let me turn that off

  @Option(
      name = "experimental_throttle_remote_action_building",
      defaultValue = "true",
dkashyn-sfdc commented 8 months ago

What to do with BES/BEP and keep_alive that were turned off earlier ?

coeuvre commented 8 months ago

Let's keep them off for now.

dkashyn-sfdc commented 8 months ago

This flag is 8.x only, based on https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4

I got

ERROR: --noexperimental_throttle_remote_action_building :: Unrecognized option: --noexperimental_throttle_remote_action_building

for

cat .bazelversion 
7.0.1

So I'd say this is definitely not this property.

coeuvre commented 8 months ago

So I'd say this is definitely not this property.

Although the flag was not available for 7.0.1, the feature is still there and on.

The flag is available in Bazel 7.1.0. Can you use that version instead and try?

dkashyn-sfdc commented 8 months ago

It might take a couple of weeks for us to upgrade to 7.1 . We definitely going to do this in either case yet amount of features we currently disabled is someway worrying...

Do you really think this "throttling" is a prime suspect here?

coeuvre commented 8 months ago

From the stack trace, it's highly likely. Would you mind patching https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4 on top of 7.0.1 to give it a try?

dkashyn-sfdc commented 8 months ago

Ok, let us focus on testing this particular condition next week then.

dkashyn-sfdc commented 8 months ago

We were able to use 7.1 and TO is still there

I have another run stuck in

[65,743 / 65,759] ...; 3777s remote, disk-cache, remote-cache

with

#common:remote --experimental_remote_execution_keepalive
common:remote --noexperimental_throttle_remote_action_building

rbe_stuck_1k_seconds.txt rbe_stuck_2k_seconds.txt rbe_stuck_3and6k_seconds.txt rbe_stuck_3k_seconds.txt

werkt commented 8 months ago

Disabling the async task cache fixes this problem after inspiring up to 100 writes per action on builds. Hangs every time with the task cache enabled, completes every time with it disabled.

diff --git b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
index cd91a72b50..ce52000954 100644
--- b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
+++ a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
@@ -291,6 +291,8 @@ public final class AsyncTaskCache<KeyT, ValueT> {
       Action onAlreadyRunning,
       Action onAlreadyFinished,
       boolean force) {
+    return task;
+    /*
     return Single.create(
         emitter -> {
           synchronized (lock) {
@@ -342,6 +344,7 @@ public final class AsyncTaskCache<KeyT, ValueT> {
                 });
           }
         });
+        */
   }

   /**
dkashyn-sfdc commented 8 months ago

@coeuvre could you please take a look whether new stacktraces and info from @werkt is helpful with identifying the issue?

coeuvre commented 8 months ago

Thanks for the stacktraces and pointers to the potential root cause. I will try to reproduce it locally now and hopefully come up with a fix.

coeuvre commented 8 months ago

Disabling the async task cache fixes this problem after inspiring up to 100 writes per action on builds. Hangs every time with the task cache enabled, completes every time with it disabled.

Failed to reproduce it locally so far with the described setup. @werkt Is it possible to share your repro?

werkt commented 8 months ago

https://github.com/werkt/bazel/tree/async-task-cache Includes the short circuit return from the cache patch posted here

coeuvre commented 8 months ago

I am confused. A custom Bazel built from https://github.com/werkt/bazel/tree/async-task-cache can indeed stuck the build, however, that is without the async task cache. When I enable the async task cache by removing the short circuit, the same build succeeded. It's like the inverse of https://github.com/bazelbuild/bazel/issues/21626#issuecomment-2001670837.

@werkt Can you elaborate more?

werkt commented 8 months ago

tl;dr: run it more than once, and run it against buildfarm. You need to see all the content for actions already in the CAS to exhibit this.

The point of the changes to async-task-cache (aside from the short circuit), are to make the client attempt to upload numerous blobs that are already uploaded, and to try to upload many blobs across actions that are the same (to have high consumer counts for the AsyncTaskCache).

If your remote implementation test case does not terminate duplicate blob uploads with committedSize responses on first observation, you may not get the behavior I'm seeing. When I run this against buildfarm, which informs writers of a blob already existing with an early response and completion, I see hangs without the short circuit (sometimes only for the remote_timeout), and with the short circuit it completes quickly on all actions.

dkashyn-sfdc commented 8 months ago

@coeuvre were you able to identify the issue? How else we can help here?

coeuvre commented 8 months ago

I was able to reproduce a hang but the stack trace is different. I am not sure whether they share the same root cause.

In the mean time, can you check whether your build hangs with --remote_download_all? (If it doesn't, then I know the root cause)

dkashyn-sfdc commented 7 months ago

Something has changed in our RBE server config for this not to happen anymore :(

Yet the fact that it was so consistent before means that some more resilience on client side won't hurt.

joeljeske commented 7 months ago

I think I am seeing the same issue as reported here, on 7.1.1.

2024-03-26 21:28:30 CDT (02:28:30) [755,647 / 761,751] 16720 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2857s ... (200 actions, 0 running)
2024-03-26 21:29:58 CDT (02:29:58) [756,389 / 762,222] 16768 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2945s ... (200 actions, 2 running)
2024-03-26 21:31:04 CDT (02:31:04) [764,176 / 769,524] 17127 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3011s ... (200 actions, 1 running)
2024-03-26 21:32:04 CDT (02:32:04) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3071s ... (200 actions, 0 running)
2024-03-26 21:33:35 CDT (02:33:35) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3161s ... (200 actions, 0 running)

Some relevant flags

--jobs 200
--remote_download_minimal
--experimental_throttle_remote_action_building  # retained default value
--bes_backend=grpc://127.0.0.1:5555
--bes_upload_mode=FULLY_ASYNC
--remote_timeout=3600

threads.txt

As per last suggestion, I will try with --remote_download_all

werkt commented 7 months ago

@joeljeske Your build is endeavoring to build the merkle trees of 199 actions, slowly, it seems, but not obviously stuck. 64 (presumably your # of cores, throttled to the remote action building semaphore) are proceeding in this task.

You do have one evaluator skyframe-evaluator-40 waiting for a subprocess, but I don't think your build is shot, just a lot of work going on. Unless you're out of heap and bouncing against the GC ceiling, this will proceed, and definitely doesn't look like the hangs in ensureInputsPresent that indicate a lack of signaling to upload procedures.

joeljeske commented 7 months ago

Understood! Thank you for the analysis. I’m curious then, why I don’t I see observe the same behavior on Bazel 6.4.0. I can consistently see slow build speeds on 7.1.1 that are not present before the upgrade.

Do you have any ideas?

werkt commented 7 months ago

In the interest of not hijacking this issue, can you file a separate one @joeljeske against a performance regression between 6.4.0. and 7.1.1?

coeuvre commented 7 months ago

Can you patch https://github.com/bazelbuild/bazel/commit/eda0fe4399b85d3bbc39fda2701287e014ecb919 and check whether this issue is still reproducible?

dkashyn-sfdc commented 7 months ago

@coeuvre will this be merged to 7.2? Patching is hard and we not even fully moved to 7.1 yet... I'd rather try to repro this with 7.0 like before and then switch to "head" of 7.2 to see whether the issue is there.

My CAS state is changing and evictions and cleanup not happening recently. This is why I'm unable to repro currently. Yet we are verifying some CAS-related issue with BuildFarm and I hope to get to the point when CAS is mutating again to trigger the issue for us.

coeuvre commented 7 months ago

Yes, the fix will be merged into 7.2.

No pressure! I will keep this issue open (but with a lower priority) for the time being. Feel free to close it once you can verify the fix.

dkashyn-sfdc commented 7 months ago

@coeuvre could you please tell where we can get the custom Bazelisk-ready version that contains the fix? Rolling releases seems having only 7.0 and 8.0 pre-releases and I don't see any 7.1 and 7.2 nightly builds here https://bazel.build/release/rolling

dkashyn-sfdc commented 7 months ago

Tried 8.0.0-pre.20240404.3 and still see

[106,946 / 106,964] some target here....; 5713s remote, remote-cache, local

this seems not ok. It doesn't look that https://github.com/bazelbuild/bazel/pull/21941/files is merged to master though... So probably this is expected.

@iancha1992 @coeuvre could you please suggest which bazelisk builds can be used to test this change?

iancha1992 commented 6 months ago

A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!

werkt commented 4 months ago

This issue has recurred under bazel 7.2. I have a stuck java process awaiting countdown latches for all ensureInputsPresent running.

stuck-7.2.0.jstack.txt

alexofortune commented 1 month ago

FTR - I am seeing this issue like some folks above on Bazel 6.1.0 very occassionally.

The symptoms are the same - The actions block on the semaphore in RemoteExecutionService, because there's N ( N = count of procs on my system ) that are locked on RemoteExecutionCache.ensureInputsPresent uploads blockingAwait.