tcmalloc_sample_parameter is not zero by default, leading to terrible performance

Automatically exported from code.google.com/p/gperftools

tcmalloc_sample_parameter is not zero by default, leading to terrible performance #247

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago

In our codebase, we see a significant amount of runtime being wasted in 
DoSampledAllocation. We're just using tcmalloc as an allocator, with no heap 
profiling or heap checking or any of that other fancy stuff.

Forcing tcmalloc_sample_parameter to zero gives us a 3.5% overall speedup 
when running sequential code, and a 1.5x speedup when running parallel code 
across 4 cores.

We're using tcmalloc 0.8 (yes, that's very old, but the sampled allocation 
stuff still seems to work the same way in current releases).

Original issue reported on code.google.com by meta...@gmail.com on 2 Jun 2010 at 5:30
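
One way to reproduce a comparison like the one reported above is a small allocation-heavy timing harness. The sketch below is not the reporter's code; `TimeAllocations` and its parameters are hypothetical names chosen for illustration. It only times malloc/free pairs across several threads, so any difference it shows depends on which allocator the binary is linked against.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: every thread performs `iterations` malloc/free pairs
// of `size` bytes. Returns the wall-clock seconds until all threads finish.
double TimeAllocations(int num_threads, std::size_t iterations,
                       std::size_t size) {
  assert(num_threads > 0);
  auto worker = [iterations, size]() {
    for (std::size_t i = 0; i < iterations; ++i) {
      void* p = std::malloc(size);
      // Write through a volatile pointer to discourage the compiler from
      // optimizing the allocation away.
      if (p != nullptr) *static_cast<volatile char*>(p) = 1;
      std::free(p);
    }
  };

  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

A plausible use is to link the program with -ltcmalloc, call `TimeAllocations(4, 10000000, 64)` from `main`, and run it once with the default sampling setting and once with sampling forced to zero. Recent gperftools releases read the value from the `TCMALLOC_SAMPLE_PARAMETER` environment variable; whether a release as old as 0.8 honors it is an assumption to verify.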

GoogleCodeExporter commented 9 years ago

FWIW, Chromium encountered this issue too. I had to change the default to 0 in
our branch of tcmalloc.

Original comment by willchan@chromium.org on 2 Jun 2010 at 5:38

GoogleCodeExporter commented 9 years ago

If you're not doing any fancy stuff, consider using -ltcmalloc_minimal. It's
smaller, and -- if I coded it right -- doesn't do any sampling in tcmalloc.

I'll discuss with the team here whether to default the sample parameter to 0
even for non-minimal libtcmalloc.

Original comment by csilv...@gmail.com on 2 Jun 2010 at 7:00

Changed state: Accepted
Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

Also, a 1.5x speedup is very surprising to us. The sampling code has been made
more efficient since tcmalloc 0.8. It's true we still sample by default, but
you may find the overhead in tcmalloc 1.5 is significantly lower. If you're
able to try it out, I'd be interested to hear what you find.

Original comment by csilv...@gmail.com on 2 Jun 2010 at 8:08

GoogleCodeExporter commented 9 years ago

We currently use -ltcmalloc_minimal and still see this issue. We'll try 1.5 and
let you know what we see. The 1.5x speedup seemed to be due to a lock being
held during DoSampledAllocation which the other threads all spun waiting for.

Original comment by meta...@gmail.com on 2 Jun 2010 at 9:04

GoogleCodeExporter commented 9 years ago

Yes, I can believe the change to no longer sample in tcmalloc_minimal came
after v0.8. Definitely try 1.5.

} The 1.5x speedup seemed to be due to a lock being held during
} DoSampledAllocation which the other threads all spun waiting for.

All allocations require a lock, sampled or not, but it's quite likely the code
has been rewritten since 0.8 so less of the sampling work is done while holding
the lock.

Original comment by csilv...@gmail.com on 2 Jun 2010 at 9:10
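
The point about doing less sampling work while holding the lock can be illustrated with a generic sketch. This is not tcmalloc's code: `SampleLog` and `ExpensiveSampleWork` are hypothetical names, and the string-building loop merely stands in for whatever costly bookkeeping a sampled allocation performs. The two methods store the same data and differ only in how long the mutex is held.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Stand-in for expensive per-sample work (for example, building a record
// that describes where the allocation came from).
std::string ExpensiveSampleWork(std::size_t size) {
  std::string record;
  for (std::size_t i = 0; i < 1000; ++i) {
    record += static_cast<char>('a' + (size + i) % 26);
  }
  return record;
}

class SampleLog {
 public:
  // Variant A: the expensive work runs while the lock is held, so every
  // other thread that needs the lock waits for the whole computation.
  void RecordUnderLock(std::size_t size) {
    std::lock_guard<std::mutex> guard(mu_);
    records_.push_back(ExpensiveSampleWork(size));
  }

  // Variant B: the expensive work runs first; the lock covers only the
  // short append to the shared vector.
  void RecordOutsideLock(std::size_t size) {
    std::string record = ExpensiveSampleWork(size);
    assert(!record.empty());
    std::lock_guard<std::mutex> guard(mu_);
    records_.push_back(std::move(record));
  }

  // Number of records stored so far.
  std::size_t size() {
    std::lock_guard<std::mutex> guard(mu_);
    return records_.size();
  }

 private:
  std::mutex mu_;
  std::vector<std::string> records_;
};
```

With several threads calling `RecordUnderLock`, the expensive work is serialized; with `RecordOutsideLock`, only the append is. How closely this mirrors the change between 0.8 and 1.5 is not something the thread confirms.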

GoogleCodeExporter commented 9 years ago

My findings are as follows:

Firstly, switching from 0.8 to 1.5 has given us something like a 9% speedup
overall, so thanks for that! And as you describe, -ltcmalloc_minimal does
indeed appear to do no sampling.

Using -ltcmalloc, the runtime difference between running with
tcmalloc_sample_parameter set to 0 and to the (apparently doubled since 0.8)
default of 512K now seems to be less than 1% on my single-threaded run. On my
4-threaded run, the degradation is more than 15% -- a significant improvement,
but still pretty bad.

Original comment by meta...@gmail.com on 3 Jun 2010 at 4:13
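
To make the meaning of the parameter concrete, here is a minimal sketch of byte-interval sampling. It assumes the 512K default is a byte count (524288) and that 0 disables sampling, as the thread implies. `FixedIntervalSampler` is a hypothetical class, and the fixed interval is a deliberate simplification rather than tcmalloc's actual sampling logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-thread sampler. `period_bytes` plays the role of
// tcmalloc_sample_parameter; a value of 0 disables sampling entirely.
class FixedIntervalSampler {
 public:
  explicit FixedIntervalSampler(std::int64_t period_bytes)
      : period_(period_bytes), bytes_until_sample_(period_bytes) {
    assert(period_bytes >= 0);
  }

  // Returns true if an allocation of `size` bytes should be sampled.
  bool ShouldSample(std::size_t size) {
    if (period_ == 0) return false;  // Sampling disabled: one cheap branch.
    bytes_until_sample_ -= static_cast<std::int64_t>(size);
    if (bytes_until_sample_ > 0) return false;
    bytes_until_sample_ = period_;  // Start counting toward the next sample.
    return true;
  }

 private:
  std::int64_t period_;
  std::int64_t bytes_until_sample_;
};
```

Under these assumptions, a thread allocating 64-byte objects with a 524288-byte period takes the slow sampled path once every 8192 allocations, while a period of 0 never takes it. That is consistent with the small single-threaded overhead reported above; the multi-threaded cost would come from whatever shared work the sampled path performs.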

GoogleCodeExporter commented 9 years ago

OK, we've reached consensus to make the default 0 for the next release.

Original comment by csilv...@gmail.com on 3 Jun 2010 at 8:12

Changed state: Started

GoogleCodeExporter commented 9 years ago

This is changed in perftools 1.6, just released.

Original comment by csilv...@gmail.com on 5 Aug 2010 at 8:52

Changed state: Fixed