bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.96k stars 4.02k forks source link

Builds without the Bytes causes Bazel OOM #18145

Closed alexofortune closed 1 week ago

alexofortune commented 1 year ago

Description of the bug:

We're building code around a fairly large repo and we noticed Bazel server consistently OOMing when --remote_download_toplevel is used.

A heap dump was made ( unfortunately i don't think I can share it :( ), which points to four threads, all named "cli-update-thread" local variables taking roughly 80% of the memory. The variables stack ( from thread gc root inwards ):

UIEventHandler.stateTracker.executionProgressReceiver.eventBus.subscribers.table[0] ( of type InMemoryMemoizingEvaluator )
evaluator.graph.nodeMap -> contains many, many, many entries.

( Each of the threads seems to have their own InMemoryMemoizingEvaluator )

It should be noted that this happens very consistently using BWOB and does not happen basically at all using non-BWOB builds.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Make a large build with BWOB running over a rather large repository. It won't crash every time, but it will crash eventually.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 6.1.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

Nothing

Any other information, logs, or outputs that you want to share?

Bazel before OOM threw a lot of logs like this:

230419 11:46:00.012:I 19 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=99 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 1]" ]
230419 11:46:02.591:I 19 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 5546637224
tjgq commented 1 year ago

Can you share the full list of flags you're building with? (Feel free to redact out sensitive data like URLs, etc)

One thing that could help debug this would be to capture a heap histogram (jmap -histo) for the "with bytes" and "without bytes" scenarios and diff them to see where the extra allocations are coming from. I think this should be ok to share, since it would only contain the names of classes in the Bazel codebase, and nothing specific to your project?

alexofortune commented 1 year ago

Hi @tjgq , yes, sure. Here are the flags, I redacted target names and URLs but I suppose that should be OK:

build --auto_cpu_environment_group=[redacted]
build --crosstool_top=[redacted]
build --platforms [redacted]
build --host_platform [redacted]
build --noincompatible_remove_legacy_whole_archive
build --sandbox_base=/dev/shm
build --experimental_strict_action_env
build --jobs 32
build --local_test_jobs 2
build --noremote_allow_symlink_upload
build --noremote_verify_downloads
build --experimental_remote_merkle_tree_cache=true
build --experimental_remote_merkle_tree_cache_size=10000
build --disk_cache=[redacted]
build --test_tmpdir=[redacted]
build --incompatible_remote_results_ignore_disk
build --remote_download_toplevel
build --remote_upload_local_results=false
build --remote_cache=[redacted, remote-cache-grpc-upstream url]
build --test_tmpdir=/mnt/cache/bazel/test_tmpdir

EDIT:

Here are the histograms.

Without BWOB:

  1:       5046563      443718888  [B (java.base@11.0.6)
   2:       4748176      284030144  [Ljava.lang.Object; (java.base@11.0.6)
   3:       3990750       95778000  java.lang.String (java.base@11.0.6)
   4:       2692717       86166944  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.6)
   5:       2078356       83134240  com.google.common.collect.MapMakerInternalMap$WeakKeyDummyValueEntry
   6:       1839395       73575800  com.google.devtools.build.skyframe.InMemoryNodeEntry
   7:       2407414       38518624  com.google.devtools.build.lib.vfs.PathFragment$RelativePathFragment
   8:       1479944       35518656  com.google.devtools.build.lib.collect.nestedset.NestedSet
   9:       1143955       27454920  java.util.ArrayList (java.base@11.0.6)
  10:           169       24973880  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@11.0.6)
  11:        575593       23023720  java.util.LinkedHashMap$Entry (java.base@11.0.6)
  12:        542849       21713960  com.google.devtools.build.lib.actions.FileContentsProxy
  13:        880121       21122904  com.google.devtools.build.lib.vfs.RootedPath
  14:        593551       18993632  com.google.devtools.build.lib.actions.Artifact$DerivedArtifact
  15:       1091842       17469472  com.google.common.collect.RegularImmutableList
  16:        527896       16892672  com.google.devtools.build.lib.actions.FileArtifactValue$RegularFileArtifactValue
  17:        274841       15391096  com.google.devtools.build.lib.packages.Rule
  18:        268540       15038240  java.util.LinkedHashMap (java.base@11.0.6)
[...]
Total      45904234     1794130736

And for BWOB ( a bazel server that stopped responding, sometimes it crashes sometimes it stays in some "zombie state" ):

 1:      12763072      990879536  [B (java.base@11.0.6)
   2:      16448840      895386536  [Ljava.lang.Object; (java.base@11.0.6)
   3:       7860660      314426400  com.google.devtools.build.skyframe.InMemoryNodeEntry
   4:       9688347      310027104  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.6)
   5:      10308844      247412256  java.lang.String (java.base@11.0.6)
   6:       3173465      126938600  java.util.LinkedHashMap$Entry (java.base@11.0.6)
   7:       4789560      114949440  java.util.ArrayList (java.base@11.0.6)
   8:       4254406      102105744  com.google.devtools.build.lib.collect.nestedset.NestedSet
   9:       6061622       96985952  com.google.devtools.build.lib.vfs.PathFragment$RelativePathFragment
  10:       1707246       95605776  com.google.devtools.build.lib.packages.Rule
  11:           488       91158512  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@11.0.6)
  12:       2223180       88927200  com.google.common.collect.MapMakerInternalMap$WeakKeyDummyValueEntry
  13:       1386372       77636832  java.util.LinkedHashMap (java.base@11.0.6)
  14:       1441043       69969088  [Ljava.util.HashMap$Node; (java.base@11.0.6)
  15:       2151420       68845440  java.util.HashMap$Node (java.base@11.0.6)
  16:       1238280       59437440  com.google.common.collect.SingletonImmutableBiMap
  17:       3643862       58301792  com.google.common.collect.RegularImmutableList
  18:       1720587       55058784  com.google.devtools.build.lib.packages.CallStack$Node
[...]
Total     143838308     5324807032

Difference seems quite dramatic, it's more than 3x.

EDIT:

To make things clearer for debugging, here are the flags that were added to the BWOB box comparing to Non-BWOB box ( they're a subset of the flags posted above ):

build --disk_cache=[redacted]
build --test_tmpdir=[redacted]
build --incompatible_remote_results_ignore_disk
build --remote_download_toplevel
build --remote_upload_local_results=false
build --remote_cache=[redacted, remote-cache-grpc-upstream url]
build --test_tmpdir=/mnt/cache/bazel/test_tmpdir
alexofortune commented 1 year ago

The plot thickens.

I took three heap dumps:

Heap Dump from BWOB server after build vs Heap dump taken off a resting Bazel server post non-BWOB build is like 3x in difference. And what's more - dominator tree in Memory Analyzer ( shows biggest objects and their connections to GC roots ) looks basically the same in the resting bazel server post BWOB build and the one that crashed with OOM during BWOB build. The tree looks completely different on non-BWOB resting bazel heapdump ( no mention of "cli-update-thread" at all, the majority is a single memoizing evaluator that is pointed to by SequencedSkyframeExecutor which is rooted in main Thread )

For me, that looks like a memory leak that eventually leads to memory pressure that bazel server is unable to handle, but if there's any other ideas you have, lemme know!

coeuvre commented 1 year ago

cli-update-threads, created by UiEventHandler, are used to handle UI related events.

For BwoB specifically, it might be https://github.com/bazelbuild/bazel/pull/17604 where Bazel sends download progress to the UI. It keeps some state for each active download.

In your case, I guess there are many active downloads which causes OOM. Can you try with a patched Bazel that comments out function downloadProgress to see whether the OOM still happens?

alexofortune commented 1 year ago

@coeuvre I tracked this even further, you're pointing to the right thing, but I still believe Bazel is defective here - In fact, I believe there is at least one , and might be two bugs at play here. Brace yourself, there's a lot of text :)

Bug #1:

I looked more into the heap dumps - it seems UIStateTracker that is owned by UIEventHandler does have a non-zero-size "runningDownloads" deque. Fortunately, this class also hosts a "downloads" TreeMap which points to the latest event for the given URL. And the instances of this event are of instance type DownloadProgress and it matches perfectly to the fact that these are dispatched to be observed only if download is done with Priority.LOW. And guess what - this is done by ToplevelArtifactsDownloader , which is created only when we use BWOB!

See: https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java?L975 ( Toplevel downloader created only if we do not download all outputs ) https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteActionInputFetcher.java?L99 ( Only Priority.LOW downloads are dispatching events eventually listened to by the UIEventHandler )

Now, the interesting bit: Remember the tree i mentioned above? The last progress message is empty. And there's only one moment when progress message can be empty - when a download is started. See this code:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteCache.java?L286

Quick read will tell you that it only returns an empty progress message if includeBytes = false and finished = false. And this combination is called only in started(), here:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteCache.java?L273

The logical conclusion is - we got the "started" event but never got to listen on finished() event which would remove entries from UI tracking downloads and ultimately allow shutting down UI thread.

Now, where all this machinery is called - see here:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteCache.java?L374

It calls started(), and then attaches a listener that calls finished() when blob is downloaded. My guess is that either this listener does not fire, or out.close(); throws an IOException and finished() never runs. I believe in my case it's "listener never fires" situation, as i didn't see the exception logged in my log ( but this probably should be fixed regardless ).

Closing thought - we should look into why potentially this listener wouldn't run. On my end, I'll compile Bazel tomorrow with some logging and see if it indeed does not run ( or if it runs but out.close() throws an exception ).

Bug nr #2:

This is in relation to how UIEventHandler eventually stops the updating thread. UIEventThread has stopUpdateThread() which will stop the thread. For this to happen, several conditions have to happen - during various events, stateTracker.hasActivities() is called ( either directly or via a helper method checkActivities() on the class ) to re-check for these conditions when some of this state is changed..

HasActivities looks like this:

  synchronized boolean hasActivities() {
    return !(buildCompleted()
        && bepOpenTransports.isEmpty()
        && activeActionUploads.get() == 0
        && activeActionDownloads.get() == 0 // this variable is dead and always zero btw 
        && runningDownloads.isEmpty());
  }

Now, when those things change, events are being listened on and checking is re-done in UIEventHandler to ensure that in case of a new state, we can stop the UI update thread.

For example:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L583 ( BuildCompletedEvent )

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L760 ( ActionUploadFinishedEvent )

However, for the event that updates the downloads:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L672

We don't seem to call checkActivities() or stateTracker.hasActivities() anywhere in this method, so, even if we dropped downloads back to zero, it seems that if these events came last and none of above wuould come after it, we'd never re-check the state and never shut down the UI thread.

This also matches my observations - one of the leaked cli-update-threads indeed had an empty runningDownloads deque, yet it was still leaking, so my guess is this is what happened.

On my end, I'll compile Bazel tomorrow with the downloadProgress commented out ( just like you said ) , which should help with the leaking. I'll report tomorrow whether it works.

alexofortune commented 1 year ago

Hi there,

Sorry, busy days & didn't have time to work on this. I'd like to report that patching out the event handler in UI tracker indeed helped with the memory leak - we don't see our bazel server crashing anymore.

coeuvre commented 1 year ago

Thanks for confirming! I will work on the fix soon.

jacobmou commented 1 year ago

@coeuvre Do we have a formal fix for the issue so far or still just comment out https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L672? i was trying to upgrade to bazel 6.1.1 and would hope to leverage builds without the bytes.

coeuvre commented 1 year ago

I believe this is fixed in Bazel HEAD due to other changes. Let me try to work out a dedicated fix for 6.3.0.

alexofortune commented 1 year ago

Thanks a tons for working on this @coeuvre! Once Bazel 6.3.0 drops, I'll report on whether the issue indeed has been mitigated in our setup.

joeljeske commented 1 year ago

@alexofortune @coeuvre, I am still experiencing occasional Bazel OOMs on incremental RBE BwoB builds using 6.3.0rc1. I also see many large "cli-update-threads" in dump created during the OOM, with accumulative retained size accounting for ~40% of my total heap.

I think this issue should be reopened. Happy to provide more information from my dumps, but unable to share them directly.

coeuvre commented 1 year ago

Can you try whether it can be reproduced with Bazel@HEAD?

tjgq commented 1 week ago

Please reopen with more information if the issue persists in 7.x.