ddio / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

HeapProfilerDump deadlocks on OS X #207

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Using https://code.launchpad.net/~kay-roepke/mysql-proxy/tcmalloc-heap-dump 
(rev 939)

2. Start mysql-proxy with default parameters, and the following environment:
HEAP_PROFILE_MMAP=true HEAPPROFILE=proxy.hprof G_SLICE=always-malloc 
./bin/mysql-proxy

3. send SIGUSR1 to the process once

What is the expected output? What do you see instead?

Instead of just dumping a heap profile the following shows on stderr:

Dumping heap profile to proxy.hprof.0001.heap (user request)
Check failed: regions_ != NULL: 

mysql-proxy is deadlocked, it will not return.
stacktrace below (mysql-proxy is single threaded in this example):
(gdb) bt
#0  0x9432646e in __semwait_signal ()
#1  0x943262ef in nanosleep$UNIX2003 ()
#2  0x000ebb65 in SpinLock::SlowLock ()
#3  0x000e32be in HeapProfilerDump ()
#4  0x94347dbc in __cxa_finalize ()
#5  0x94347cb0 in exit ()
#6  0x000e613b in MemoryRegionMap::BeginRegionLocked ()
#7  0x000e2bd2 in AddRemoveMMapDataLocked ()
#8  0x000e2cb5 in DoGetHeapProfileLocked ()
#9  0x000e318a in DumpProfileLocked ()
#10 0x000e3318 in HeapProfilerDump ()
#11 0x00038644 in chassis_event_thread_loop (event_thread=0x40bfc0) at 
../../../mysql-proxy/src/chassis-event-
thread.c:368
#12 0x00037ed4 in chassis_mainloop (_chas=0x4075c0) at 
../../../mysql-proxy/src/chassis-mainloop.c:360
#13 0x00003ee3 in main_cmdline (argc=1, argv=0xbffff380) at 
../../../mysql-proxy/src/chassis.c:947
#14 0x000040e1 in main (argc=1, argv=0xbffff380) at 
../../../mysql-proxy/src/chassis.c:1115

What version of the product are you using? On what operating system?

google-perftools 1.4
Mac OS X 10.5.8
i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
32 bit process

Please provide any additional information below.
Without HEAP_PROFILE_MMAP this works, but no memory other than tcmalloc 
internal is described in the dump.
All relevant unit tests of tcmalloc pass (except of course the heap-checker 
ones).

Original issue reported on code.google.com by kroe...@gmail.com on 8 Jan 2010 at 4:52

GoogleCodeExporter commented 9 years ago
It looks like the deadlock is because of the assertion failure -- the program 
is 
crashing in the middle of a dump, and at exit it tries to dump again, so no 
surprise 
there's deadlock.  The question is why is the check failing that regions_ != 
NULL.

It's true that HEAP_PROFILE_MMAP is not the most well-tested part of perftools. 
 (We 
don't use it at all within Google.)  I suspect, though, that that's not what is 
causing you to not see any memory in the heap-profile dump.  Unless your 
application 
never uses malloc or new, and only uses mmap to allocate memory?

So while this is a likely a bug somewherewith HEAP_PROFILE_MMAP, the more 
useful bug 
to you, I expect, is why you're not seeing any other memory in the dump.  Do 
you see 
anything in the dump -- hex addresses maybe?

If you can run without HEAP_PROFILE_MMAP, and attach the resulting profile 
file, I'll 
take a look.  If it's possible for you to attach a minimal program that 
illustrates 
this problem, that would be great as well.

Original comment by csilv...@gmail.com on 8 Jan 2010 at 5:43

GoogleCodeExporter commented 9 years ago
Ok, I have not immediate nor anticipated need for HEAP_PROFILE_MMAP, this was 
simply trying to find a workaround :)
The app is strictly using system malloc, sometimes via wrappers, but that 
shouldn't matter.

I've just tried to reproduce the missing memory with a simplistic C program, 
but of course there it works as expected. :(
I'll try again with a shared library, as that's how most of our code is loaded.

Below is the pprof output of the attached file.
It was created by starting mysql-proxy as originally described, but without 
HEAP_PROFILE_MMAP, then using the mysql client app to 
connect, execute 'SELECT 1', disconnect. This most certainly involved several 
malloc/free calls.
This test is still single threaded.
The dump was also created by sending USR1 to the process (all that does is to 
set a flag that causes HeapProfilerDump() to be 
executed the next time through the runloop).
There is indeed a crash at program exit, which only happens with tcmalloc's 
free, not system free. I'll investigate, but it is unrelated to 
this problem (might well be a genuine bug in our code).

classdump:mysql-proxy kroepke$ ~/local/tcmalloc/bin/pprof --text 
libexec/mysql-proxy empty.hprof.0001.heap 
Total: 0.0 MB
     0.0  81.8%  81.8%      0.0  81.8% std::basic_string::_Rep::_S_create
     0.0  18.2% 100.0%      0.0 100.0% __static_initialization_and_destruction_0
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe01037
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe01872
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe04102
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe07bcf
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe0e775
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe0e7e3
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe0e8c9
     0.0   0.0% 100.0%      0.0 100.0% 0x000000008fe12f36
     0.0   0.0% 100.0%      0.0 100.0% __mh_execute_header
     0.0   0.0% 100.0%      0.0  81.8% std::basic_string::_S_construct
     0.0   0.0% 100.0%      0.0  81.8% std::basic_string::basic_string

Original comment by kroe...@gmail.com on 8 Jan 2010 at 6:06

Attachments:

GoogleCodeExporter commented 9 years ago
Note that the heap profiler only shows memory *currently allocated* at the time 
of the 
dump.  So if SELECT 1 allocated memory and then freed it all, that wouldn't 
show up in 
the heap profile.

} I'll try again with a shared library, as that's how most of our code is 
loaded.

One issue is that tcmalloc has to be linked in last; stuff linked in after it 
won't be 
seen by tcmalloc.  Is that possibly at issue here?

Original comment by csilv...@gmail.com on 8 Jan 2010 at 6:15

GoogleCodeExporter commented 9 years ago
I should've phrased that differently, yes: There is be a memory trace after the 
test, so I should've been able to see it.

Since we dlopen plugins that do most of the allocation, I've recreated that 
test case with a single function that allocates memory, see the attached files.
That memory is all hidden behind __mh_execute_header, which I guess means 
tcmalloc won't help me.
I'll see if changing the linking order helps, though I suspect the dlopen makes 
that point moot, unless I find a way to do it with some dlyd preloading and 
dlsym magic for finding the relevant function pointers. Although that's 
extremely likely to be a pain.

What's interesting is that in the mysql-proxy case, the __mh_execute_header 
portion is indeed 0.0 MB where it has to be way more than that. I can see 
nothing fundamentally different to what the test case does (in terms of linking 
and dlopen order in time).

For completeness' sake, the compiler/linker steps:
$ gcc -fPIC -O0 -g -I/Users/kroepke/local/tcmalloc/include -c leak.c
$ gcc -dynamiclib -Wl,-undefined -Wl,dynamic_lookup -o leak.bundle -lc leak.o
$ gcc -o ltc-dyn leak-main.c -O0 -ggdb -Wl,-undefined,dynamic_lookup \
  -ltcmalloc -I/Users/kroepke/local/tcmalloc/include -L/Users/kroepke/local/tcmalloc/lib -I.
$ HEAP_PROFILE=dyn ./ltc-dyn

leads to:
classdump:tmp kroepke$ ~/local/tcmalloc/bin/pprof --text ltc dyn.0002.heap 
Total: 3.4 MB
     3.4 100.0% 100.0%      3.4 100.0% _main
     0.0   0.0% 100.0%      0.0   0.0% std::basic_string::_Rep::_S_create
     0.0   0.0% 100.0%      0.0   0.0% __static_initialization_and_destruction_0
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe01037
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe01872
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe04102
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe07bcf
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe0e775
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe0e7e3
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe0e8c9
     0.0   0.0% 100.0%      0.0   0.0% 0x000000008fe12f36
     0.0   0.0% 100.0%      3.4 100.0% __mh_execute_header
     0.0   0.0% 100.0%      3.4 100.0% dyld_stub_binding_helper
     0.0   0.0% 100.0%      0.0   0.0% std::basic_string::_S_construct
     0.0   0.0% 100.0%      0.0   0.0% std::basic_string::basic_string

Original comment by kroe...@gmail.com on 8 Jan 2010 at 7:08

Attachments:

GoogleCodeExporter commented 9 years ago
Yeah, I'm afraid I don't know much about what happens with perftools in the 
face of 
dlopen.  It's also something that likely differs significantly from OS to OS.  
If you 
want to poke around, that would be great!  You could maybe take your test 
programs 
and add MallocHooks for 'new' that print some information out, to see what 
tcmalloc 
is seeing.

If tcmalloc is seeing all these allocations, then the next step is to look at 
the 
heap-profiler output (the raw file) -- it should be pretty easy to understand.  
This 
will show if the problem is with GetStackTrace -- it's not going beyond the 
dlopen 
boundary, perhaps -- or with pprof, or with something else.

Original comment by csilv...@gmail.com on 8 Jan 2010 at 7:20

GoogleCodeExporter commented 9 years ago
Will do, thanks for your help so far!

Original comment by kroe...@gmail.com on 8 Jan 2010 at 7:25

GoogleCodeExporter commented 9 years ago
Any more word on this?  I'm sorry I can't help much remotely, but I'm 
interested to 
hear anything you've found out so far.

Original comment by csilv...@gmail.com on 10 Mar 2010 at 6:55

GoogleCodeExporter commented 9 years ago
Hi, I am facing similar problem like this, can anybody help? I am linking the 
tcmalloc static library with my project and generating the dynamic library. The 
dynamic library calls static function HeapLeakChecker::NoGlobalLeaks. I use 
java.exe to load/execute my dynamic library and the environment variable 
HEAPCHECK is set to normal. When the program executes the function 
NoGlobalLeaks (before calling this I allocated memory using tc_malloc and it is 
not freed until this point), it hangs after displaying the message “Check 
failed: regions_ != NULL:”.  I just did the same thing with sample exe 
application but it works fine.

Original comment by shivang...@gmail.com on 9 Sep 2010 at 9:10

GoogleCodeExporter commented 9 years ago
Sounds like it may be similar -- weird things happening due to dynamic loading. 
 Or it may be totally different.  You may have to go through the code and 
figure out where regions_ is being set to a non-NULL value, and figure out why 
that's not happening for you.

Original comment by csilv...@gmail.com on 13 Sep 2010 at 8:00

GoogleCodeExporter commented 9 years ago
It's been almost a year now with no word, so I'm closing this CannotReproduce.  
If someone can come up with a simple test-case that illustrates this, or debug 
it themselves, I'm happy to look into it again.

Original comment by csilv...@gmail.com on 31 Aug 2011 at 10:23