Use a custom allocator for potentially massive performance improvement

boqwxp commented 4 years ago

Background

I noticed that, for one of my example BV exists-forall problems, time reported that yices spent a full third of its wall-clock time in the kernel. When I ran strace -T I saw that after an initial phase with a normal assortment of syscalls, during the solving phase, system calls were exclusively brk(), presumably called by libc malloc(), and in turn by the safe_malloc()s I see sprinkled throughout the code base.

Experiments

I decided to try an alternative malloc() implementation, because a third of all time in the kernel allocating memory is rather excessive.

My experiments all ran on a Debian 10 VM, which should use the GNU malloc() implementation in libc by default. Here's the time info with the default GNU implementation:

user@host:\~/github/yosys-examples/example1$ #GNU malloc: user@host:\~/github/yosys-examples/example1$ time $(../../yices2/build/x86_64-pc-linux-gnu-release/dist/bin/yices-smt2 example.smt2 &>/dev/null)

real 6m17.133s user 4m5.808s sys 2m8.390s

With the default allocator, Yices used ~365MB-405MB, with spikes over 400MB.

When I tried linking with jemalloc, which is the one used by FreeBSD libc, I got the following time info:

user@host:\~/github/yosys-examples/example1$ #JEMalloc: user@host:\~/github/yosys-examples/example1$ time $(../../yices2/build/x86_64-pc-linux-gnu-release/dist/bin/yices-smt2 example.smt2 &>/dev/null)

real 5m7.416s user 3m20.906s sys 1m44.346s

In this case, memory usage ranged generally between 325-395MB, with spikes up to 450MB.

And when I tried linking with tcmalloc, included in gperftools, I got the following time info:

user@host:\~/github/yosys-examples/example1$ #TCMalloc: user@host:\~/github/yosys-examples/example1$ time $(../../yices2/build/x86_64-pc-linux-gnu-release/dist/bin/yices-smt2 example.smt2 &>/dev/null)

real 3m11.852s user 3m8.928s sys 0m0.549s

In this case, Yices used ~365MB of memory with very little change and only slight growth. So it uses less memory at peak and is almost 100% faster.

Next steps

I know different platforms have different malloc() implementations, and some are more efficient for this application than others. I also know that this is only one example exercising only one part of the solver that apparently does a lot of context creation and destruction. But I see safe_malloc() sprinkled liberally throughout the code base, and I strongly suspect a serious benchmark suite will show significant performance improvement across the board.

There are also potential licensing issues.

But I think at least for some platforms (especially those using GNU libc) the default malloc() implementation should be changed if at all possible. This is not a measly 3% speedup, it's a very juicy fruit dangling low enough to practically just reach up and pick.

BrunoDutertre commented 4 years ago

Thanks. We have experimented with tcmalloc in the past. I didn't know the difference was that large.

boqwxp commented 4 years ago

Indeed, I was very surprised to see such a large difference. If I still had access to a beefy machine I would love to see what it does for some of my larger problems that have previously taken tens of GB of RAM and 24+h on a Xeon E5-2698v4. If the performance improvement persists with bigger problem sizes and more difficult problems, it's a really huge boost in scalability for model-driven program synthesis techniques in hardware design.

SRI-CSL / yices2