ddio / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License

tcmalloc new/delete is slower than Linux new/delete #188

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Compile the attached file alloc_time.cc twice: once without tcmalloc and
once linked against it. The relevant commands are:
    g++ -O2 alloc_time.cc -o alloc_time
    g++ -O2 alloc_time.cc -o alloc_time_tcmalloc -ltcmalloc

2. Run the above two executables 'alloc_time' and 'alloc_time_tcmalloc'.
3. The executables print the average time to allocate/de-allocate an 'int'.

What is the expected output? What do you see instead?

I expected better times with tcmalloc than without. Unfortunately, I found
the reverse: tcmalloc is about 2.7 times slower to allocate and about 5 times
slower to de-allocate.

Here are the outputs of my runs:

$ ./alloc_time
Time taken per allocation: 75 nsecs
Time taken per de-allocation: 23 nsecs

$ ./alloc_time_tcmalloc
Time taken per allocation: 203 nsecs
Time taken per de-allocation: 121 nsecs

What version of the product are you using? On what operating system?

I'm using google perftools version 1.4 on Ubuntu 9.10. My machine has Intel 
Xeon quad-core x86_64 CPUs of 2 GHz each.

Please provide any additional information below.

Original issue reported on code.google.com by mohit.a...@gmail.com on 14 Nov 2009 at 7:20

GoogleCodeExporter commented 9 years ago
I also tried linking against -ltcmalloc_minimal. Here are those results:

$ ./alloc_time_tcmalloc_minimal 
Time taken per allocation: 183 nsecs
Time taken per de-allocation: 122 nsecs

So the allocation time is better by about 20 nsecs, but still considerably
worse than the allocation time without tcmalloc. The de-allocation time is
unchanged.

Original comment by mohit.a...@gmail.com on 14 Nov 2009 at 7:30

GoogleCodeExporter commented 9 years ago
Every malloc implementation is going to have situations where it's better or
worse than another.  Artificial benchmarks like this might happen upon one
such situation or another, but that's not very meaningful.  I much prefer to
measure relative performance in real applications.  The main use for
benchmarks like this is to provide a starting point to look at the
implementation, to see if there's a possibility for improvement.

I'm going to close this bug as "won't fix", but it would be great if you
wanted to look into this more deeply, to understand why there are timing
differences in this (simple) case.  That might turn up ways to tune tcmalloc,
or a bug to fix (a previous benchmark like this one turned up a bug in our
implementation of realloc).

Also, note that tcmalloc stands for 'thread-caching malloc'.  It will perform
best, relative to other mallocs, in threaded applications.

Original comment by csilv...@gmail.com on 14 Nov 2009 at 7:58

GoogleCodeExporter commented 9 years ago

I made this artificial benchmark only to demonstrate the problem for this bug
report. The reason I wrote it at all is that I was seeing poor results in a
production application. I obviously cannot disclose the code for that in a
bug report.

Original comment by mohit.a...@gmail.com on 14 Nov 2009 at 8:03

GoogleCodeExporter commented 9 years ago
Aha, that's a different story!

Check out

http://groups.google.com/group/google-perftools/browse_thread/thread/87d79c8df8e22b6d/7b5e97c5b92b4997?lnk=gst&q=slow#7b5e97c5b92b4997

Does changing the constants as suggested in that thread speed things up for
you?

Original comment by csilv...@gmail.com on 14 Nov 2009 at 9:35

GoogleCodeExporter commented 9 years ago
Absolutely - I see a huge improvement using the suggested changes to common.h
in the link given.

Here are the new times with the freshly built tcmalloc library:

$ ./alloc_time_tcmalloc
Time taken per allocation: 43 nsecs
Time taken per de-allocation: 17 nsecs

This makes tcmalloc faster than glibc - which is what I expected.

Could the next release of google perftools do this automatically, please?

Original comment by mohit.a...@gmail.com on 14 Nov 2009 at 11:01

GoogleCodeExporter commented 9 years ago
Unfortunately, no promises: it improves speed on your machine, but slows it
down on others.  We're working on trying to figure out what constants work
better in what situations, but it's tricky.  I hope we'll be able to come up
with a good solution for everyone in time for the next release.

Original comment by csilv...@gmail.com on 14 Nov 2009 at 11:22

GoogleCodeExporter commented 9 years ago

Actually - on digging some more, I realized that the performance degradation
I was seeing earlier in tcmalloc was because my google perftools package was
built without the -O2 flag. This is because I'd set CXXFLAGS, CPPFLAGS and
CFLAGS explicitly before running 'configure' - I was expecting that perftools
would add -O2 on top of those, but it didn't.

When I tried the patch, I just did a vanilla build - so I got -O2 by default.

It turns out the patch to common.h does little on my machine. It's really the
-O2 that matters.

Please close this bug as invalid.

Original comment by mohit.a...@gmail.com on 14 Nov 2009 at 11:56

GoogleCodeExporter commented 9 years ago
Good to know -- thanks for looking into this.

Original comment by csilv...@gmail.com on 15 Nov 2009 at 12:07