Interesting!
Is your bazel-remote configured with storage_mode zstd or uncompressed?
It seems you access the cache via gRPC and not HTTP. Can you confirm?
We're using zstd, and that's correct - we access the cache via gRPC
Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: https://github.com/bazelbuild/bazel/commit/8ebd70b0c97c8bd584647f219be8dd52217cb5cf
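For reference, a minimal .bazelrc sketch of that setup (the cache address is a placeholder, not a real endpoint):

```
# Requires Bazel 5.0+; point --remote_cache at your own bazel-remote gRPC endpoint.
build --remote_cache=grpcs://cache.example.com:9092
# Ask Bazel to upload/download zstd-compressed blobs.
build --experimental_remote_cache_compression
```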
We are not currently using that. That does seem valuable though, and we'll definitely explore it.
I don't suspect that would decrease the memory footprint on bazel-remote though, right? Is there any way to cap memory usage atm?
--
Sidenote: The bazel flag tweaks we tried yesterday evening helped, but they were insufficient. We experienced another OOM today (and were close to the wire a few other times).
We're currently trying to horizontally scale the cache (based on this comment) as further mitigation.
bazel-remote should use less memory if it's using zstd compressed storage and the clients are downloading zstd-compressed data (bazel-remote can just write compressed data from disk instead of compressing it on each request).
Another experiment you could try is to run bazel-remote with the uncompressed storage mode. That would exclude zstd compression/decompression from the setup. If you still see OOMs then we would know to focus elsewhere. Saving some pprof data while memory usage is high might also be helpful.
IIRC bazel-remote 1.1.0 was built with go 1.14.2, and go 1.16 switched from using MADV_FREE to MADV_DONTNEED, which might be related.
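A rough sketch of how such pprof data could be captured, assuming bazel-remote's profiling endpoint is enabled via its --profile_host/--profile_port options (check --help for your version; the flag names and port numbers here are assumptions):

```
# Assumption: bazel-remote was started with e.g. --profile_host 0.0.0.0 --profile_port 8081
# Print the top heap consumers while memory usage is high:
go tool pprof -top http://cache-host:8081/debug/pprof/heap

# Or save the raw profile for later inspection in the pprof web UI:
curl -o heap.pprof http://cache-host:8081/debug/pprof/heap
go tool pprof -http=:8082 heap.pprof
```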
Good idea on the pprofs. I'm a bit confused though. The in-use memory profiles don't seem to account for even half of the memory the service is using:
The cumulative memory in use according to the pprof is ~20GB, but the service was using ~55GB at that time.
It seems the vast majority of the memory usage reported by the pprof is around file loading. At least during this snapshot.
The alloc memory profiles might shed some light on memory usage spikes that might not have been happening when I took the profile. If I'm interpreting this correctly, a large chunk of memory usage during writes was during zstd encoding, so we might indeed get some benefit from --experimental_remote_cache_compression once Bazel 5.1 goes out.
Other large chunks seem to be during disk cache writes and grpc responses. Tightening --remote_max_connections hopefully helps there.
Some questions:
Also, thank you for your prompt replies, I really appreciate that you're helping me work through this :)
-Sergio
There are some notes on the GODEBUG environment variable here; it's a comma-separated list of settings: https://pkg.go.dev/runtime?utm_source=godoc#hdr-Environment_Variables

One of the settings is madvdontneed=0, which uses MADV_FREE (the old setting) instead of MADV_DONTNEED. You can read a little about what they mean here: https://man7.org/linux/man-pages/man2/madvise.2.html

It might also be worth setting gctrace=1 to get some GC stats in your logs.

You can also try playing with the GOGC environment variable to trigger GC more often (also described in the pkg.go link above).
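As a concrete (hypothetical) example of wiring those settings into the existing docker invocation:

```
# madvdontneed=0 restores the old MADV_FREE behaviour, gctrace=1 logs GC stats to stderr,
# and GOGC=50 triggers collections more often than the default of 100.
docker run \
  -e GODEBUG=madvdontneed=0,gctrace=1 \
  -e GOGC=50 \
  <other flags> buchgr/bazel-remote-cache:v2.3.3
```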
Re the discrepancy between the memory profile's view of memory usage and the system's: there are so many different ways to count memory usage that I think the first step is to try to understand what each tool is measuring. Is that a screenshot from top? Is it running inside docker, or outside?
The screenshot was indeed from top, running outside the container. The container is just docker run <flags> buchgr/bazel-remote-cache:v2.3.3.
Thanks for those links :)
Had the same problem: when the disk size passed 1TB it would OOM on a server with 64GB of memory. Setting GOGC=20 solved the problem.
v2.3.9 has a new --zstd_implementation cgo mode, which might reduce memory usage. Please let me know if it helps.
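For anyone trying that, a hedged sketch of the invocation (verify the flag spelling against your version's --help):

```
# Sketch: run v2.3.9 with the cgo zstd implementation instead of the default Go one.
docker run <other flags> buchgr/bazel-remote-cache:v2.3.9 --zstd_implementation cgo
```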
Hello, I can reproduce unusually high memory usage under a very specific configuration:
- --experimental_remote_cache_compression
- --remote_download_toplevel

With this combination, memory use on the cache server reaches 10GB. Removing --remote_download_toplevel, memory use on the server does not exceed 3GB. The test is performed for a large build (~40GB of artefacts), with a single client on the same LAN. The server is for local office use and has an HTTP proxy backend defined, pointing to the main CI cache.
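In .bazelrc terms, the combination that reproduces the high memory usage looks roughly like this (the cache address is a placeholder):

```
# Reproduces the ~10GB case on the cache server:
build --remote_cache=grpc://office-cache.example.com:9092
build --experimental_remote_cache_compression
build --remote_download_toplevel
# Dropping --remote_download_toplevel keeps the server below ~3GB in the same test.
```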
Turning off compression (--experimental_remote_cache_compression) only and running with --remote_download_toplevel, the server memory use peaks at 5.1GB. I suspect, based on the bazel output, this is the result of "queueing up" fetches and multiplexing them in parallel over the grpc channel (currently there are 300 concurrent fetches in progress over 5 grpc connections).
Bazel remote version is 2.4.3 on all servers.
@liam-baker-sm: Thanks for the report.
Which storage mode is bazel-remote using in this scenario? In the ideal setup, with bazel-remote storing zstd compressed blobs, and bazel requesting zstd blobs, they should be able to be streamed directly from the filesystem without recompression.
We are now using GOMEMLIMIT, which is available in newer versions of Go. This solves the problem of a "transient spike in the live heap size": https://tip.golang.org/doc/gc-guide
I added a similar suggestion to the systemd configuration example recently: https://github.com/buchgr/bazel-remote/commit/2bcc2f59e111f71b4de4d84013f8e93a1b981872
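A rough sketch of how GOMEMLIMIT can be passed in (the 48GiB value is only illustrative, and this assumes the binary was built with Go 1.19+, which added GOMEMLIMIT; the systemd snippet is a sketch, not a copy of the linked commit):

```
# docker:
docker run -e GOMEMLIMIT=48GiB <other flags> buchgr/bazel-remote-cache:v2.4.3

# or in a systemd unit:
[Service]
Environment=GOMEMLIMIT=48GiB
```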
@mostynb The bazel-remote instance I ran the test against is using uncompressed storage, due to https://github.com/buchgr/bazel-remote/issues/524. The instance points to another bazel-remote using --http_proxy.url.
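That kind of setup would look roughly like this (addresses are placeholders; flag names should be checked against the bazel-remote README):

```
# Local office cache with uncompressed storage, proxying misses to the main CI cache:
docker run <other flags> buchgr/bazel-remote-cache:v2.4.3 \
  --storage_mode uncompressed \
  --http_proxy.url https://main-ci-cache.example.com:8080
```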
We're experiencing severe memory issues w/ the cache since upgrading to v2.3.3 (from v1.1.0). These were asymptomatic for most of January and February, but started causing frequent cache OOMs following our upgrade to Bazel 5 at the beginning of last week. The memory footprint was already significantly higher prior to the Bazel 5 upgrade, however.
Prior to the bazel-remote cache upgrade (which took place 01/01/2022), memory usage was minimal. Following the upgrade, the cache process regularly uses up all the memory on the host (~92g), resulting in the OOM killer killing the cache.
We noticed that the used file handles count markedly dropped following the cache upgrade as well, which leads us to believe some actions that previously relied on heavy disk usage now occur in-memory.
--
A very large chunk of the memory usage occurs during cache startup. For example, following a crash at 2:10pm, the cache was holding 70g of memory by 2:29pm, which is when the cache finally started serving requests. You can see the memory usage trend for that OOM/restart (and two prior ones) on this screenshot:
The cache logs show
Most of the memory surge occurs during the "Removing incomplete file" steps, and a second surge occurs as the LRU index is built.
Attempted Mitigations: We attempted restricting the memory allowance for the Docker container via the -m docker flag in hopes of at least keeping the process from OOMing, but this did not suffice - the service became unresponsive.

Given that the memory issues became much worse following the Bazel 5 upgrade, we tweaked these Bazel flags:
- the --experimental_remote_cache_async flag
- --remote_max_connections=10 (we previously had it set to 0, which means no limit, but this didn't affect grpc connections prior to Bazel 5)

Even if these help (we'll find out as tomorrow's workday picks up), we'll still be very close to running out of memory (as we were through February, before the Bazel 5 upgrade).
Is there some way to configure how much memory the bazel-remote process utilizes?
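For context, the container memory cap mentioned above would be along these lines (the limit value is illustrative; in this case the process became unresponsive instead of being OOM-killed, so it was not a fix on its own):

```
# Cap the container's memory with docker's -m flag:
docker run -m 80g <other flags> buchgr/bazel-remote-cache:v2.3.3
```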