facebook / folly

An open-source C++ library developed and used at Facebook.
https://groups.google.com/forum/?fromgroups#!forum/facebook-folly
Apache License 2.0

Simulating calloc for std::vector (UninitializedMemoryHacks.h simulates malloc) #1508

Open alexander-nadel opened 3 years ago

alexander-nadel commented 3 years ago

Hi,

UninitializedMemoryHacks.h is a great solution when one wants to take advantage of all the functionality of an std::vector while getting malloc-style uninitialized growth instead of the zero-initialization that resize performs.

My question: is there a similar solution for calloc?

calloc is known to use some OS tricks to guarantee that the allocated memory is 0-initialized, but without actually accessing it.

Ideally, I'd like to be able to use something like the following line for initializing a vector with 0's:

folly::initWith0s(v, sz);

Thanks, Alex.
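For concreteness, here is a minimal sketch of the semantics being asked for. Note that initWith0s is a hypothetical name from the comment above, not an existing folly API, and this sketch uses a plain calloc-backed buffer rather than hooking into std::vector internals the way UninitializedMemoryHacks.h does:

```cpp
#include <cstdlib>
#include <memory>
#include <new>

// Deleter so that memory obtained from calloc is released with free.
struct FreeDeleter {
  void operator()(void* p) const { std::free(p); }
};

// Returns n zero-initialized elements of a trivial type T. calloc guarantees
// the memory reads as zero, typically by mapping copy-on-write zero pages
// rather than writing to them up front.
template <typename T>
std::unique_ptr<T[], FreeDeleter> callocBuffer(std::size_t n) {
  T* p = static_cast<T*>(std::calloc(n, sizeof(T)));
  if (p == nullptr) {
    throw std::bad_alloc();
  }
  return std::unique_ptr<T[], FreeDeleter>(p);
}
```

Unlike resize on a real vector, such a buffer has no size/capacity bookkeeping, which is why a vector-compatible version would have to reach into the container the way UninitializedMemoryHacks.h does.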

nbronson commented 3 years ago

This is possible, but would only be profitable in a very narrow set of circumstances. calloc itself can be faster than malloc + memset (used inside string resize) for multi-page allocations, but when you try to write to those pages later the operating system must take a page fault. If you are actually going to access that memory the overall time is almost always faster with malloc + memset, sometimes much faster.

alexander-nadel commented 3 years ago

I am surprised. I understand that if you actually access the memory after calloc, there will be a page fault, so I would understand if you said calloc and malloc + memset were on par. But why is the overall time with calloc expected to be slower than malloc + memset?

nbronson commented 3 years ago

I was trying to mentally average across workloads, so the statement is too strong. The important idea is that memset is an ideal way to fault in pages: it accesses them sequentially, so the pattern is relatively easy for hardware or software prefetchers to figure out; it results in a small number of large extents in the kernel's virtual memory data structures; and it's common so it is well optimized.

The best-case for calloc would be that your workload accesses only a contiguous prefix of the allocated memory, and that it touches it in a predictable way. In that case you'll basically get the perf of malloc+memset(prefix_size). The worst-case for calloc is that you eventually touch all of the pages, but you do it in a random order.
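The best case described above can be sketched in plain C++ (allocWithZeroedPrefix is a made-up name for illustration, not a folly function):

```cpp
#include <cstdlib>
#include <cstring>

// If only the first `used` bytes of the allocation will ever be touched,
// zeroing just that prefix pays the page-fault cost once, up front, and
// sequentially -- instead of leaving calloc's zero pages to be faulted in
// later, possibly in a random order.
void* allocWithZeroedPrefix(std::size_t capacity, std::size_t used) {
  void* p = std::malloc(capacity);
  if (p != nullptr) {
    std::memset(p, 0, used);  // fault in and zero only the prefix in use
  }
  return p;
}
```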

If you try to benchmark this kind of stuff, you should be aware that GCC will transparently rewrite malloc + memset into calloc in one of its optimization passes. (See https://godbolt.org/z/c6Kcqv) You'll need to put an optimization barrier (like folly::makeUnpredictable) in between them to block the optimization.
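A hedged sketch of the barrier idea, using a GCC-style inline asm statement as a stand-in for folly::makeUnpredictable (which requires folly headers):

```cpp
#include <cstdlib>
#include <cstring>

// Tells the compiler the pointed-to memory may be read or written elsewhere,
// so the following memset cannot be folded into the allocation. This is the
// same role folly::makeUnpredictable plays in folly benchmarks.
inline void optimizationBarrier(void* p) {
  __asm__ __volatile__("" : : "r"(p) : "memory");
}

void* allocZeroed(std::size_t n) {
  void* p = std::malloc(n);
  if (p == nullptr) {
    return nullptr;
  }
  optimizationBarrier(p);  // without this, GCC may emit calloc(1, n) instead
  std::memset(p, 0, n);    // faults the pages in sequentially
  return p;
}
```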

When FB updated to the first GCC version with this optimization, one of the internal systems exhibited a huge slowdown. It had a large hash table that used malloc + memset for its main array. After this got rewritten to calloc, the hash table's random access pattern caused the pages to be faulted in a random order, which caused a huge amount of churn inside the kernel.

alexander-nadel commented 3 years ago

Thanks for your detailed reply!

A follow-up question: what if I have a big randomly accessed buffer (like FB's hash table you mentioned) that doesn't need to be initialized, so the natural way to allocate it is simply malloc? Is it still worth running memset after the malloc (even though initialization is not required) in order to optimize page faults?

nbronson commented 3 years ago

I've never tried these techniques myself, but it seems other people have experimented with the best way to pre-fault pages when you don't care about zeros: https://stackoverflow.com/questions/56411164/can-i-ask-the-kernel-to-populate-fault-in-a-range-of-anonymous-pages . Note that the speedup they report is over well-behaved faulting (a linear scan of accesses). The speedup over a random faulting order would be much higher.
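One of the approaches discussed in that thread can be sketched as follows. This is a Linux-specific illustration (MAP_POPULATE is not portable), not something folly provides:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Asks the kernel to pre-fault an anonymous mapping up front, so that later
// random accesses do not each take a minor page fault. The memory also reads
// as zero, like calloc, but the fault cost is paid eagerly at mmap time.
void* mapPrefaulted(std::size_t bytes) {
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}
```

Whether the eager fault cost pays off depends, as above, on how much of the buffer the workload actually touches.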