buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0

Very high memory usage on v2.3.3 - is this configurable? #529

Open SrodriguezO opened 2 years ago

SrodriguezO commented 2 years ago

We've been experiencing severe memory issues with the cache since upgrading to v2.3.3 (from v1.1.0). They were largely asymptomatic for most of January and February, but began causing frequent cache OOMs after we upgraded to Bazel 5 at the beginning of last week. The memory footprint was already significantly higher before the Bazel 5 upgrade, however.

Prior to the bazel-remote cache upgrade (which took place 01/01/2022), memory usage was minimal. Following the upgrade, the cache process regularly uses up all the memory on the host (~92g), resulting in the OOM killer killing the cache.

[screenshot: usable_mem]

We noticed that the used file handles count markedly dropped following the cache upgrade as well, which leads us to believe some actions that previously relied on heavy disk usage now occur in-memory.

[screenshot: remote_file_handles]

--

A very large chunk of the memory usage occurs during cache startup. For example, following a crash at 2:10pm, the cache was holding 70g of memory by 2:29pm, which is when the cache finally started serving requests. You can see the memory usage trend for that OOM/restart (and two prior ones) on this screenshot:

[screenshot: remote_mem_used_2]

The cache logs show

<~21:10:00 process starts - logs are truncated, so the exact timestamp is missing, but our service wrapper simply launches this docker container>
…
… <tons of "Removing incomplete file" logs>
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Sorting cache files by atime.
2022/03/07 21:26:26 Building LRU index.
2022/03/07 21:29:41 Finished loading disk cache files.
2022/03/07 21:29:41 Loaded 54823473 existing disk cache items.
2022/03/07 21:29:41 Mangling non-empty instance names with AC keys: disabled
2022/03/07 21:29:41 gRPC AC dependency checks: enabled
2022/03/07 21:29:41 experimental gRPC remote asset API: disabled
2022/03/07 21:29:41 Starting gRPC server on address :8081
2022/03/07 21:29:41 Starting HTTP server on address :8080
2022/03/07 21:29:41 HTTP AC validation: enabled
2022/03/07 21:29:41 Starting HTTP server for profiling on address :8082
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
…

Most of the memory surge occurs during the "Removing incomplete file" steps, and a second surge occurs as the LRU index is built.

Attempted mitigations: We tried restricting the memory allowance for the Docker container via Docker's -m flag, in the hope of at least keeping the process from OOMing, but this did not suffice: the service became unresponsive.
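For reference, a hard cap of that kind would look roughly like the sketch below; the limit, mount path, ports and --max_size value are illustrative, and a cap by itself does not reduce how much memory the process actually wants, which is consistent with the service becoming unresponsive.

# hard-cap the container's memory (values are illustrative)
docker run -d \
  -m 64g --memory-swap 64g \
  -v /bazel-remote:/data \
  -p 8080:8080 -p 8081:8081 \
  buchgr/bazel-remote-cache:v2.3.3 \
  --max_size=500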

Given that the memory issues became much worse following the Bazel 5 upgrade, we tweaked these Bazel flags:

Even if these help (we'll find out as tomorrow's workday picks up), we'll still be very close to running out of memory (as we were through February, before the Bazel 5 upgrade).

Is there some way to configure how much memory the bazel-remote process utilizes?

ulrfa commented 2 years ago

Interesting!

Is your bazel-remote configured with storage_mode zstd or uncompressed?

It seems you access the cache via gRPC and not HTTP. Can you confirm?

SrodriguezO commented 2 years ago

We're using zstd, and that's correct: we access the cache via gRPC.

mostynb commented 2 years ago

Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: https://github.com/bazelbuild/bazel/commit/8ebd70b0c97c8bd584647f219be8dd52217cb5cf
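For anyone following along, enabling that on the client side is roughly the following; the cache endpoint is a placeholder, and the flag requires Bazel 5.0 or newer.

# request zstd-compressed blobs from the remote cache (endpoint is a placeholder)
bazel build //... \
  --remote_cache=grpc://bazel-cache.example.com:8081 \
  --experimental_remote_cache_compression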

SrodriguezO commented 2 years ago

> Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: https://github.com/bazelbuild/bazel/commit/8ebd70b0c97c8bd584647f219be8dd52217cb5cf

We are not currently using that. That does seem valuable though, and we'll definitely explore it.

I don't suspect that would decrease the memory footprint on bazel-remote though, right? Is there any way to cap memory usage atm?

--

Sidenote: The bazel flag tweaks we tried yesterday evening helped, but they were insufficient. We experienced another OOM today (and were close to the wire a few other times).

We're currently trying to horizontally scale the cache (based on this comment) as further mitigation.

mostynb commented 2 years ago

bazel-remote should use less memory if it's using zstd-compressed storage and the clients are downloading zstd-compressed data (bazel-remote can then serve compressed data straight from disk instead of compressing it on each request).

Another experiment you could try is running bazel-remote with the uncompressed storage mode. That would take zstd compression/decompression out of the picture; if you still see OOMs then we would know to focus elsewhere. Saving some pprof data while memory usage is high might also be helpful.
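A rough sketch of both experiments, assuming the container publishes the profiling port and that the :8082 listener from the startup logs exposes the standard net/http/pprof endpoints; paths, ports and sizes are placeholders.

# 1) temporarily run with uncompressed storage to take zstd out of the picture
docker run -v /bazel-remote:/data -p 8080:8080 -p 8081:8081 -p 8082:8082 \
  buchgr/bazel-remote-cache:v2.3.3 --storage_mode uncompressed --max_size=500

# 2) capture a heap profile while memory usage is high
go tool pprof -top http://localhost:8082/debug/pprof/heap
# or save the raw profile for later analysis
curl -o heap.pb.gz http://localhost:8082/debug/pprof/heap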

IIRC bazel-remote 1.1.0 was built with go 1.14.2, and go 1.16 switched from using MADV_FREE to MADV_DONTNEED, which might be related.

SrodriguezO commented 2 years ago

Good idea on the pprofs. I'm a bit confused, though: the in-use memory profiles don't seem to account for even half of the memory the service is using.

[screenshots: bazel-remote_mem-top, bazel-remote_pprof-heap_inuse-space, bazel-remote_pprof-heap_inuse_cumulative-sort]

The cumulative memory in use according to the pprof is ~20GB, but the service was using ~55GB at that time.

It seems the vast majority of the memory usage reported by the pprof is around file loading, at least during this snapshot.

The alloc memory profiles might shed some light on memory usage spikes that weren't happening when I took the profile. If I'm interpreting this correctly, a large chunk of the memory allocated during writes went to zstd encoding, so we might indeed get some benefit from --experimental_remote_cache_compression once Bazel 5.1 goes out.

[screenshot: bazel-remote_pprof-heap_alloc_cumulative-sort]
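For context on reading these, the in-use and allocation views come from the same heap endpoint and are selected with pprof's sample index; the host and port here are assumptions based on the profiling address in the startup logs.

# live heap: memory currently held by the process
go tool pprof -sample_index=inuse_space http://localhost:8082/debug/pprof/heap
# cumulative allocations since startup, including memory that has since been freed
go tool pprof -sample_index=alloc_space http://localhost:8082/debug/pprof/heap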

Other large chunks seem to come from disk cache writes and gRPC responses. Tightening --remote_max_connections will hopefully help there.
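For illustration, that tightening is just a lower value for the flag on the client side; the number here is arbitrary.

bazel build //... --remote_max_connections=25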

Some questions:

Also, thank you for your prompt replies, I really appreciate that you're helping me work through this :)

-Sergio

mostynb commented 2 years ago

There are some notes on the GODEBUG environment variable here; it's a comma-separated list of settings: https://pkg.go.dev/runtime?utm_source=godoc#hdr-Environment_Variables

One of the settings is madvdontneed=0 to use MADV_FREE (the old setting) instead of MADV_DONTNEED. You can read a little about what they mean here: https://man7.org/linux/man-pages/man2/madvise.2.html

It might also be worth setting gctrace=1 to get some GC stats in your logs.

You can also try playing with the GOGC environment variable to trigger GC more often (also described in the pkg.go.dev link above).
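A sketch of feeding those runtime knobs into the existing container setup; the values are illustrative, not recommendations.

# pass Go runtime settings into the container as environment variables
docker run \
  -e GODEBUG=madvdontneed=0,gctrace=1 \
  -e GOGC=50 \
  -v /bazel-remote:/data \
  buchgr/bazel-remote-cache:v2.3.3 --max_size=500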

Re the discrepancy between the memory profile's view of memory usage and the system's: there are so many different ways to count memory usage that I think the first step is to understand what each tool is measuring. Is that a screenshot from top? Is it running inside docker, or outside?

SrodriguezO commented 2 years ago

The screenshot was indeed from top, running outside the container. The container is just docker run <flags> buchgr/bazel-remote-cache:v2.3.3

Thanks for those links :)

tobbe76 commented 2 years ago

Had the same problem: once the disk cache size passed 1 TB, it would OOM on a server with 64 GB of memory. Setting GOGC=20 solved the problem.

mostynb commented 2 years ago

v2.3.9 has a new --zstd_implementation cgo mode, which might reduce memory usage. Please let me know if it helps.
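A minimal sketch of trying the new mode after upgrading; the mount path and size are placeholders.

docker run -v /bazel-remote:/data buchgr/bazel-remote-cache:v2.3.9 \
  --storage_mode zstd \
  --zstd_implementation cgo \
  --max_size=500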

liam-baker-sm commented 1 year ago

Hello, I can reproduce unusually high memory usage under a very specific configuration.

The test is performed with a large build (~40 GB of artefacts), from a single client on the same LAN. The server is for local office use and has an HTTP proxy backend defined, pointing to the main CI cache.

liam-baker-sm commented 1 year ago

Turning off compression (--experimental_remote_cache_compression) only and running with --remote_download_toplevel, the server's memory use peaks at 5.1 GB. I suspect, based on the bazel output, that this is the result of "queuing up" fetches and multiplexing them in parallel over the gRPC channel (currently there are 300 concurrent fetches in progress over 5 gRPC connections).

liam-baker-sm commented 1 year ago

The bazel-remote version is 2.4.3 on all servers.

mostynb commented 1 year ago

@liam-baker-sm: Thanks for the report.

Which storage mode is bazel-remote using in this scenario? In the ideal setup, with bazel-remote storing zstd-compressed blobs and bazel requesting zstd blobs, the blobs can be streamed directly from the filesystem without recompression.
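Roughly, the matched pairing described above looks like the following; endpoints, paths and sizes are placeholders.

# server: store blobs zstd-compressed
docker run -v /bazel-remote:/data buchgr/bazel-remote-cache:v2.4.3 \
  --storage_mode zstd --max_size=500

# client: request zstd blobs so they can be streamed from disk without recompression
bazel build //... \
  --remote_cache=grpc://bazel-cache.example.com:8081 \
  --experimental_remote_cache_compression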

tobbe76 commented 1 year ago

We are now using GOMEMLIMIT, available in newer versions of Go. This solves the problem of a "transient spike in the live heap size": https://tip.golang.org/doc/gc-guide

mostynb commented 1 year ago

> We are now using GOMEMLIMIT, available in newer versions of Go. This solves the problem of a "transient spike in the live heap size": https://tip.golang.org/doc/gc-guide

I added a similar suggestion to the systemd configuration example recently: https://github.com/buchgr/bazel-remote/commit/2bcc2f59e111f71b4de4d84013f8e93a1b981872
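A sketch of applying a soft memory limit in a container-based deployment like the one in this thread; the 80GiB value is illustrative, and in a systemd unit the same variable would go on an Environment= line as in the linked commit.

# let the Go runtime target a soft heap limit below the host's physical memory
docker run \
  -e GOMEMLIMIT=80GiB \
  -v /bazel-remote:/data \
  buchgr/bazel-remote-cache:v2.4.3 --max_size=500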

liam-baker-sm commented 1 year ago

@mostynb The bazel-remote instance I ran the test against uses uncompressed storage due to https://github.com/buchgr/bazel-remote/issues/524. The instance points to another bazel-remote using --http_proxy.url.
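For completeness, a rough sketch of that proxying setup; URLs, paths and sizes are placeholders.

# office instance: uncompressed storage, backed by the main CI cache over HTTP
docker run -v /office-cache:/data buchgr/bazel-remote-cache:v2.4.3 \
  --storage_mode uncompressed \
  --http_proxy.url http://ci-cache.example.com:8080 \
  --max_size=200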