TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0

Memory store causes memory fragmentation #621

Closed allada closed 7 months ago

allada commented 9 months ago

Here are some stats on real-world usage of NativeLink:

[image: distribution of CAS object sizes]

Since Rust is not a garbage-collected language, it cannot move live objects around to compact memory, so we need to deal with memory fragmentation ourselves. Normally programs don't need to think much about this, but in this case we do.

Details on how memory works:

Linux allocates memory in pages, often 4 KiB. The allocator packs memory for you by slotting the pointers returned from malloc into a pool of already-allocated pages. This allows code to request (let's say) 100 bytes of data, and those 100 bytes may share a page with other allocations.

The downside is that small allocations that live for a long time can end up keeping an entire page (4 KiB) allocated by the kernel. Allocators rarely keep packing new allocations into a page that was full at some point; instead, once a page has been full, the allocator generally only releases it back to the kernel when nothing in the page is allocated anymore, rather than continuing to slot new allocations into it.

The problem

Looking at the graph above, you can see that roughly 50% of all CAS objects are ~128 bytes or less. This means the vast majority of items are tiny and each uses only a little memory. Since the memory store frequently evicts items that have not been used recently, it is very common for an item to be put into the store and evicted shortly after because the object is rarely used.

So, if a 4 KiB page is allocated and 32 x 128-byte objects are loaded into it, and an hour later 31 of those items are evicted while the 1 remaining item is used very frequently, we are now using 4 KiB of memory to hold 128 bytes of data. Multiply this out and you can see how inefficient this becomes.
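Here is a minimal, Linux-only sketch (not NativeLink code) of that scenario: allocate a large number of 128-byte buffers, drop 31 out of every 32, and compare resident memory against the bytes that are actually still live. The exact numbers depend on the system allocator, but with glibc malloc the resident size will usually stay far above the surviving payload.

```rust
use std::fs;

// Read this process's resident set size. /proc/self/statm reports sizes in
// pages; the second field is resident pages. Assumes the common 4 KiB page size.
fn resident_bytes() -> usize {
    let statm = fs::read_to_string("/proc/self/statm").unwrap();
    let resident_pages: usize = statm
        .split_whitespace()
        .nth(1)
        .unwrap()
        .parse()
        .unwrap();
    resident_pages * 4096
}

fn main() {
    // Fill memory with one million 128-byte allocations (~128 MiB of payload).
    let mut items: Vec<Vec<u8>> = (0..1_000_000).map(|_| vec![0u8; 128]).collect();
    let full = resident_bytes();

    // "Evict" 31 out of every 32 items, the way a long-running LRU store might.
    let survivors: Vec<Vec<u8>> = items
        .drain(..)
        .enumerate()
        .filter(|(i, _)| i % 32 == 0)
        .map(|(_, v)| v)
        .collect();
    let after = resident_bytes();

    println!("resident after fill:  {} MiB", full / (1 << 20));
    println!("payload still live:   {} MiB", survivors.len() * 128 / (1 << 20));
    println!("resident after evict: {} MiB", after / (1 << 20));
}
```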

Possible solutions

  1. A very simple solution is to have a timer or trigger that runs every once in a while, copies some of the data from the top of the LRU map (least-recently-used map) to a new place in memory, and then frees the old pointer (see the sketch after this list). This is somewhat wasteful, but it is extremely simple and should have a negligible impact on performance.
  2. We could try to find a smarter allocator that handles this case better. This would be ideal if there is an allocator we can ask how compact our memory is, or how packed the page behind a given pointer is.
  3. Similar to 1, we could store references to the data sorted by age of insertion in addition to age of access (LRU), then periodically make full copies of data that has lived in the LRU for a long time.
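As a rough illustration of option 1, here is a minimal sketch. `TinyLruStore` and its fields are invented for the example (they are not NativeLink's real store types); the only part that matters is the clone-and-replace in the loop, which asks the allocator for a fresh allocation in a currently-active page and then frees the old one that may be pinning a mostly-empty page.

```rust
use std::collections::{HashMap, VecDeque};

struct TinyLruStore {
    /// digest -> stored blob (hypothetical layout for illustration only)
    data: HashMap<String, Vec<u8>>,
    /// keys ordered from least recently used (front) to most recently used (back)
    lru_order: VecDeque<String>,
}

impl TinyLruStore {
    /// Called every once in a while (e.g. from a timer) to re-home the `n`
    /// least recently used blobs.
    fn compact_oldest(&mut self, n: usize) {
        for key in self.lru_order.iter().take(n) {
            if let Some(blob) = self.data.get_mut(key) {
                // Clone into a brand-new allocation; the allocator will place
                // it in a page it is actively filling.
                let fresh = blob.clone();
                // Assigning drops the old Vec, freeing the stale allocation
                // that may be the last thing keeping an old page alive.
                *blob = fresh;
            }
        }
    }
}
```

In practice this would hang off whatever timer or eviction pass the store already runs, and because it only touches the values it does not change each entry's position in the LRU order.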

Some things to keep in mind

Lots of items are very small. If 50% of objects are 128 bytes or less and a sha256/blake3 digest is 40 bytes, then just copying the hash plus a pointer (8 bytes on 64-bit) adds roughly 48 bytes, about +37% overhead for that lower 50%.

MarcusSorealheis commented 9 months ago

This one seems overly complex. The design burden and maintenance overhead of the memory store feel high. When users hit this bottleneck today, it might be more efficient and safer to use a caching mechanism from a database.

allada commented 9 months ago

The first proposed solution is only ~30 lines of code. It is extremely simple and has nearly no downside.

MarcusSorealheis commented 9 months ago

SGTM.

allada commented 7 months ago

Merged with #289