Gwinel / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License

SpinLockDelay causes slow performance #660

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. OS: CentOS 4.3
2. Kernel:
   2.6.32_1-14-0-0 #1 SMP Mon Mar 31 10:42:09 CST 2014 x86_64 x86_64 x86_64 GNU/Linux
3. libc:
   GNU C Library stable release version 2.3.4
   Compiled by GNU CC version 3.4.4 20050721 (Red Hat 3.4.4-2).
   Compiled on a Linux 2.4.20 system on 2006-03-08.
4. tcmalloc version: gperftools-2.1

I'm running a performance test. With tcmalloc, performance gets slower and 
slower over time; without tcmalloc, performance is good and stable.

gstack shows many threads (48) blocked in SpinLockDelay.

A typical stack looks like this:
Thread 51 (Thread 1895635296 (LWP 27838)):
#0  0x00000000008205b6 in base::internal::SpinLockDelay ()
#1  0x000000000082043b in SpinLock::SlowLock ()
#2  0x0000000000817630 in tcmalloc::CentralFreeList::InsertRange ()
#3  0x000000000081f7a1 in tcmalloc::ThreadCache::ReleaseToCentralCache ()
#4  0x000000000081f9ac in tcmalloc::ThreadCache::Scavenge ()
#5  0x00000000008348ce in tc_delete ()
#6  0x000000302d390562 in std::string::_Rep::_M_destroy ()
#7  0x000000302d3907a8 in std::string::reserve ()
#8  0x000000302d39102c in std::string::append () from /usr/lib64/libstdc++.so.6
#9  0x00000000005a95c2 in Json::FastWriter::writeValue ()
#10 0x00000000005a981d in Json::FastWriter::writeValue ()
#11 0x00000000005a981d in Json::FastWriter::writeValue ()
#12 0x00000000005a981d in Json::FastWriter::writeValue ()
#13 0x00000000005a981d in Json::FastWriter::writeValue ()
#14 0x00000000005a939b in Json::FastWriter::write ()
.....

Could this be because my libc version is too old?

Please help!
Thanks a lot!

Original issue reported on code.google.com by zlxde...@gmail.com on 4 Dec 2014 at 8:54

GoogleCodeExporter commented 9 years ago
I think part of the reason might be that tcmalloc disables libstdc++'s fancy 
"cached" (pooling) allocator, and you seem to be a heavy STL user.

tcmalloc has this code (in malloc_extension.cc):

#ifdef __GLIBC__
  // GNU libc++ versions 3.3 and 3.4 obey the environment variables
  // GLIBCPP_FORCE_NEW and GLIBCXX_FORCE_NEW respectively.  Setting
  // one of these variables forces the STL default allocator to call
  // new() or delete() for each allocation or deletion.  Otherwise
  // the STL allocator tries to avoid the high cost of doing
  // allocations by pooling memory internally.  However, tcmalloc
  // does allocations really fast, especially for the types of small
  // items one sees in STL, so it's better off just using us.
  // TODO: control whether we do this via an environment variable?
  setenv("GLIBCPP_FORCE_NEW", "1", false /* no overwrite*/);
  setenv("GLIBCXX_FORCE_NEW", "1", false /* no overwrite*/);

  // Now we need to make the setenv 'stick', which it may not do since
  // the env is flakey before main() is called.  But luckily stl only
  // looks at this env var the first time it tries to do an alloc, and
  // caches what it finds.  So we just cause an stl alloc here.
  string dummy("I need to be allocated");
  dummy += "!";         // so the definition of dummy isn't optimized out
#endif  /* __GLIBC__ */

So you might want to try the following experiments:

* try setting GLIBCXX_FORCE_NEW=1 when running under glibc malloc and see if 
it makes your program slower

* try removing this code from tcmalloc and see if it "improves" your performance
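
To compare the two experiments outside the full application, a small 
multi-threaded workload can help. The following is only a sketch, not the 
reporter's code: build_strings and run_workload are hypothetical names, the 
append loop is a rough stand-in for what Json::FastWriter does in the stack 
trace, and it assumes a C++11 compiler (the GCC 3.4 toolchain from the report 
would need pthreads instead of std::thread).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the JSON writer: grow a string by repeated
// append so that std::string allocates and frees many small buffers.
std::size_t build_strings(int iterations, int pieces) {
  assert(iterations >= 0 && pieces >= 0);
  std::size_t total = 0;
  for (int i = 0; i < iterations; ++i) {
    std::string out;
    for (int j = 0; j < pieces; ++j) {
      out.append("\"key\":\"value\",");  // 14 characters per piece
    }
    total += out.size();
  }
  return total;
}

struct RunResult {
  std::size_t bytes_built;  // sum of final string lengths over all threads
  double seconds;           // wall-clock time for the whole run
};

// Run the same workload on `threads` threads and time it.
RunResult run_workload(int threads, int iterations, int pieces) {
  assert(threads >= 0);
  std::vector<std::size_t> totals(static_cast<std::size_t>(threads), 0);
  std::vector<std::thread> workers;
  auto start = std::chrono::steady_clock::now();
  for (int t = 0; t < threads; ++t) {
    // Each thread writes only to its own slot, so no locking is needed here.
    workers.emplace_back([&totals, t, iterations, pieces] {
      totals[static_cast<std::size_t>(t)] = build_strings(iterations, pieces);
    });
  }
  for (auto& w : workers) w.join();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  std::size_t sum = 0;
  for (std::size_t v : totals) sum += v;
  return RunResult{sum, elapsed.count()};
}
```

Calling run_workload(48, ...) from a small main() and printing the seconds 
field, once per configuration (glibc malloc with and without 
GLIBCXX_FORCE_NEW=1, then tcmalloc), gives comparable numbers. The thread 
count of 48 is taken from the report; the iteration counts are up to you.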

In addition to that, we should figure out how to make tcmalloc faster for your 
use case. Playing with TCMALLOC_TRANSFER_NUM_OBJ or 
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES might be helpful.

Original comment by alkondratenko on 4 Dec 2014 at 4:00

GoogleCodeExporter commented 9 years ago
Let me clarify. I don't think there's anything particularly broken with the 
lock implementation (it's far from perfect, but it should be ok).

I think your problem is that some lock(s) are taken too frequently and/or held 
for too long.
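
To illustrate the "taken too frequently" part, here is a toy model. It is not 
tcmalloc's code: CentralList, LocalCache and locks_taken are hypothetical 
names, and the model only counts how often a shared lock is acquired when a 
local cache hands freed objects back in batches. Larger batches mean fewer 
acquisitions, which is roughly the kind of trade-off a knob such as 
TCMALLOC_TRANSFER_NUM_OBJ appears to tune.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Toy stand-in for a shared central free list guarded by one lock.
class CentralList {
 public:
  // Move a whole batch into the central list under a single lock acquisition.
  void insert_range(std::vector<int>& batch) {
    std::lock_guard<std::mutex> guard(mu_);
    ++lock_acquisitions_;
    items_.insert(items_.end(), batch.begin(), batch.end());
    batch.clear();
  }
  // The accessors are unlocked; this model only reads them after all work
  // is done, from a single thread.
  std::size_t lock_acquisitions() const { return lock_acquisitions_; }
  std::size_t size() const { return items_.size(); }

 private:
  std::mutex mu_;
  std::size_t lock_acquisitions_ = 0;
  std::vector<int> items_;
};

// Toy per-thread cache: frees are collected locally and handed to the
// central list only once `batch_size` objects have accumulated.
class LocalCache {
 public:
  LocalCache(CentralList& central, std::size_t batch_size)
      : central_(central), batch_size_(batch_size) {
    assert(batch_size > 0);
  }
  void release(int object) {
    pending_.push_back(object);
    if (pending_.size() >= batch_size_) central_.insert_range(pending_);
  }
  // Hand over whatever is left, if anything.
  void flush() {
    if (!pending_.empty()) central_.insert_range(pending_);
  }

 private:
  CentralList& central_;
  std::size_t batch_size_;
  std::vector<int> pending_;
};

// Free `count` objects through a cache with the given batch size and
// report how many times the central lock was taken.
std::size_t locks_taken(std::size_t count, std::size_t batch_size) {
  CentralList central;
  LocalCache cache(central, batch_size);
  for (std::size_t i = 0; i < count; ++i) cache.release(static_cast<int>(i));
  cache.flush();
  assert(central.size() == count);
  return central.lock_acquisitions();
}
```

In this model, freeing 1000 objects takes the lock 1000 times with a batch 
size of 1 and 10 times with a batch size of 100. With 48 threads doing this 
concurrently, the first case leaves far more opportunities to collide on the 
lock.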

Part of investigating this case is figuring out why your "stock" malloc is 
faster. I suspect it's due to libstdc++'s fancy allocators (note that newer 
gcc versions do not enable them by default).

Original comment by alkondratenko on 5 Dec 2014 at 8:28