Gwinel / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License

Like a teeter-totter: runs slowly because ThreadCache::FetchFromCentralCache and ThreadCache::ReleaseToCentralCache are called frequently #675

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create many sub-threads until unclaimed_cache_space_ < 0. Every thread calls
   free(malloc(1)) and then sleeps for a long time, except the first sub-thread.
2. The first sub-thread allocates many times and then frees, so that max_length_ reaches 8192.
  state1:
  ThreadCache::size_ = 8, max_size_ = 64K
  ThreadCache::freelist[1]: length_ = 1, max_length_ = 8192, lowater_ = 1
3. alloc(8) ->
  state2:
  ThreadCache::size_ = 0, max_size_ = 64K
  ThreadCache::freelist[1]: length_ = 0, max_length_ = 8192, lowater_ = 0
4. alloc(8) ->
  state3:
  ThreadCache::size_ = 64K - 8, max_size_ = 64K
  ThreadCache::freelist[1]: length_ = 8191, max_length_ = 8192, lowater_ = 0
  calls:
  ThreadCache::FetchFromCentralCache
  CentralFreeList::RemoveRange (takes the CentralFreeList spin lock)
5. free ->
  state4:
  ThreadCache::size_ = 64K, max_size_ = 64K
  ThreadCache::freelist[1]: length_ = 8192, max_length_ = 8192, lowater_ = 8192
  calls:
  ThreadCache::Scavenge
  ThreadCache::IncreaseCacheLimit (takes the heap spin lock)
6. free -> state1
  calls:
  ThreadCache::ListTooLong
  ThreadCache::ReleaseToCentralCache
  ThreadCache::PopRange
  CentralFreeList::InsertRange (takes the CentralFreeList spin lock)
Then the cycle repeats: 3, 4, 5, 6, 3, 4, ... (a standalone sketch of this pattern follows this list).
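
For reference, here is a minimal standalone sketch of the pattern that drives this cycle. It is my own reduction of the CycleAllocFree loop from the attached test code (the PingPong name is mine), and it only becomes slow after the thread-cache state from steps 1-2 has been reached:

#include <stdlib.h>

// Two small allocations followed by two frees per iteration. Following the
// trace above: the first malloc empties the free list (step 3), the second
// triggers FetchFromCentralCache (step 4), the first free pushes size_ up to
// max_size_ and triggers Scavenge (step 5), and the second free overflows
// max_length_ and triggers ReleaseToCentralCache (step 6).
static void PingPong(int iterations, size_t size) {
  for (int i = 0; i < iterations; i++) {
    void* a = malloc(size);
    void* b = malloc(size);
    free(b);
    free(a);
  }
}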

What is the expected output? What do you see instead?
Each malloc and free in this cycle goes through many extra functions and takes
spin locks, which wastes time. Once this state is entered, the program runs
slowly even though the CPU cores are busy.

What version of the product are you using? On what operating system?
ver:gperftools-2.4
os:RHEL 6.5

Please provide any additional information below.
In this scenario, both the low-water-mark (lowater_) policy and the unclaimed_cache_space policy are ineffective. A proposed change to ThreadCache::FetchFromCentralCache:

void* ThreadCache::FetchFromCentralCache(size_t cl, size_t byte_size) {
  FreeList* list = &list_[cl];
  ASSERT(list->empty());
  const int batch_size = Static::sizemap()->num_objects_to_move(cl);

  const int num_to_move = min<int>(list->max_length(), batch_size);
  void *start, *end;
  int fetch_count = Static::central_cache()[cl].RemoveRange(
      &start, &end, num_to_move);

  ASSERT((start == NULL) == (fetch_count == 0));
  if (--fetch_count >= 0) {
    size_ += byte_size * fetch_count;
    list->PushRange(fetch_count, SLL_Next(start), end);
+   list->clear_lowwatermark();  // proposed: reset the low-water mark after a bulk fetch
  }
  // (rest of the function unchanged)
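
For context, here is a toy model of the low-water-mark bookkeeping that this change targets. It is my own simplified sketch, not gperftools source (the real code lives in FreeList and ThreadCache::Scavenge in thread_cache.h/.cc, and its policy details may differ):

// Toy model only: lowater_ records the minimum list length seen since the
// last scavenge, and a scavenge releases roughly half of that minimum.
struct ToyFreeList {
  int length_;   // objects currently cached
  int lowater_;  // minimum length_ since the last scavenge

  ToyFreeList() : length_(0), lowater_(0) {}

  void pop()  { --length_; if (length_ < lowater_) lowater_ = length_; }
  void push() { ++length_; }
  void clear_lowwatermark() { lowater_ = length_; }

  // Scavenge-like step: release objects that were never needed since the
  // last scavenge, then reset the low-water mark.
  int scavenge() {
    const int drop = lowater_ / 2;
    length_ -= drop;
    clear_lowwatermark();
    return drop;
  }
};

In the reported cycle, the bulk fetch of step 4 refills the list while lowater_ is still 0, so the Scavenge of step 5 releases nothing and the very next free overflows max_length_ again. As I read the proposed change, resetting the low-water mark right after the fetch lets that Scavenge actually shrink the list and break the ping-pong.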

Test code:

#include "config_for_unittests.h"
#include <stdio.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>    // for sleep()
#endif
#include "base/logging.h"
#include <gperftools/malloc_extension.h>
#include "tests/testutil.h"   // for RunThread()
#include <sys/time.h>

static void AllocOnce(void);

static void AllocStuffId(int nId);

static void TestVgop(int nId);

static void FastOrSlow(int nId);

static void CycleAllocFree(int nCount, int nSize);

int main(int argc, char** argv) 
{
    AllocOnce();

    FastOrSlow(0);
    int nThCount = 600;
    RunManyThreadsWithId(AllocStuffId, nThCount, 1 << 20);
    return 0;
}

// Thread id 0 sleeps two minutes (letting the other threads claim their
// thread-cache space), then runs the timed test; every other thread just
// parks for five minutes with its thread cache alive.
static void AllocStuffId(int nId)
{
    AllocOnce();

    if (0 != nId)
    {
        sleep(5 * 60);
    }
    else
    {
        sleep(2 * 60);
        TestVgop(nId + 1);
    }
}

// Churn the 8-byte size class so its free-list max_length_ grows to the cap,
// then hold nPackCount objects while the timed loop runs.
static void TestVgop(int nId)
{
    int nCount = (1 + 8192) / 2 * 8192;
    int nSize = 6;
    void** objects = new void*[nCount];
    int i = 0;
    for (i = 0; i < nCount; i++) {
        objects[i] = malloc(nSize);
    }

    for (i = 0; i < nCount; i++) {
        free(objects[i]);
    }
    int nPackCount = 8039;
    for (i = 0; i < nPackCount; i++) {
        objects[i] = malloc(nSize);
    }
    FastOrSlow(nId);
    for (i = 0; i < nPackCount; i++) {
        free(objects[i]);
    }

    delete[] objects;
}

// Time nCount paired alloc/free iterations and print the elapsed time in ms.
void FastOrSlow(int nId)
{
    timeval starttime, endtime;
    gettimeofday(&starttime, 0);
    int nCount = 30 * 64 * 1024;
    int nSize = 6;
    CycleAllocFree(nCount, nSize);

    gettimeofday(&endtime, 0);
    double timeuse = 1000000 * (endtime.tv_sec - starttime.tv_sec) + endtime.tv_usec - starttime.tv_usec;
    timeuse /= 1000;
    printf("task id %d\ttime:%d ms\n",nId, int(timeuse));
}

// Two allocations followed by two frees per iteration: the 3-4-5-6 cycle
// from the reproduction steps once the thread cache is in the bad state.
void CycleAllocFree(int nCount, int nSize)
{
    for (int i = 0; i < nCount; i++)
    {
        void *p1 = malloc(nSize);
        void *p2 = malloc(nSize);
        free(p2);
        free(p1);
    }
}

// Touch the allocator once so the calling thread gets its own thread cache.
void AllocOnce(void)
{
    free(malloc(1));
}

Original issue reported on code.google.com by fanzheny...@163.com on 26 Feb 2015 at 6:28

GoogleCodeExporter commented 9 years ago
I will need more details on what exactly the problem is, on the proposed fix,
and on what exactly your test program is proving.

Original comment by alkondratenko on 1 Mar 2015 at 1:27

GoogleCodeExporter commented 9 years ago
New proposed fix:
common.h:132
- static const int kMaxDynamicFreeListLength = 8192;
+ static const int kMaxDynamicFreeListLength = 8192 * 2;

After running for a very long time with many threads (>500), tcmalloc may enter
a state where the thread cache fetches from the central cache and releases back
to the central cache very frequently.
Why?
For the smallest free list (object size 8), max_length_ is 8192 and its batch
size is also 8192, so it is easy to construct cases where far fewer than 8192
malloc/free calls are enough to enter this bad state.
Guess:
On a 32-bit platform the smallest free list (object size 8) also has a
max_length_ of 8192, but its batch size is only 4096, so the bad state cannot
appear there. My platform, however, is 64-bit.
So the complete fix might look like:
#ifdef _LP64
static const int kMaxDynamicFreeListLength = 8192 * 2;
#else
static const int kMaxDynamicFreeListLength = 8192;
#endif
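
To make the arithmetic explicit, here is a small illustration of why doubling kMaxDynamicFreeListLength matters. It is my own sketch; the batch size is the value quoted in this report for the 8-byte class on 64-bit, not re-derived from the sizemap:

#include <algorithm>
#include <stdio.h>

int main() {
  const int batch_size = 8192;              // batch size reported for the 8-byte class (64-bit)
  const int caps[2] = { 8192, 8192 * 2 };   // current vs. proposed kMaxDynamicFreeListLength
  for (int i = 0; i < 2; i++) {
    const int max_len = caps[i];
    const int num_to_move = std::min(max_len, batch_size);
    // FetchFromCentralCache asks for min(max_length_, batch_size) objects.
    // With the current cap one fetch refills the list right up to max_length_,
    // so the next frees immediately overflow it again; with the doubled cap a
    // single fetch only half-fills a fully grown list.
    printf("max_length_ = %5d -> one fetch brings in %d of %d slots\n",
           max_len, num_to_move, max_len);
  }
  return 0;
}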

Original comment by fanzheny...@163.com on 19 Mar 2015 at 9:16

GoogleCodeExporter commented 9 years ago
Sorry, but I'm still not able to see what the problem is.

Can you please elaborate more on that part, rather than on the proposed solution?

Original comment by alkondratenko on 22 Mar 2015 at 6:23