patch for heap profiling on windows

GoogleCodeExporter commented 9 years ago

This is a patch file that takes steps towards finishing the 
windows port of the heap profiler.  It doesn't address CPU profiling 
or heap checking.  It has been tested against some fairly large profilees 
on windows.  Some of the changes in the patch are fixes to issues that 
would affect tcmalloc even in the absence of profiling.

It's possible I may have lost some changes when creating the patch (we 
have some other modifications I am not sharing at this time). 

Please ask questions.

Changes: 
- Use the HANDLE API to dump the profile instead of file descriptors. 

- Fix a couple bugs where there were dependences on the global 
constructor order, which is undefined 

- Make sure the pad member of the Central_Freelist is not of length 0 

- Do not unpatch the windows functions, ever.  This is dangerous since 
you would be depending on global destructor order. 

- Do not report large allocations using printf (or at all for now) 
since printf isn't so safe to call in some processes 

- If we are asked to free something allocated by the CRT heap, then do 
a nop.  This is necessary because a bunch of stuff gets allocated 
before our hooks are in place (global c++ constructors, parts of 
environ, other things?) 

- Hook _msize and _expand.  Otherwise, very subtle bugs can occur. 
Even in programs that don't use these, since the linker/compiler 
introduces calls to _msize for global c++ constructors occurring in 
dlls. 

- Make realloc work if it is given something that was originally 
allocated by the CRT heap. 

- Make pprof forgiving of carriage returns.  Editing pprof is the sum 
total of my life experience using perl, so I've probably done 
something stupid. 

- Added stack walker using RtlCaptureStackBackTrace, which doesn't 
work so well with FPO (I turned this walker on by default) 

- Wrote ad-hoc stack walker, which sometimes works with FPO, but 
sometimes gets a snapshot from an earlier time due to left over stack junk.  
It is possible that this one is faster than the previous one if you turn 
off the heuristics -- I didn't test that. 

- Leak all spinlocks since you cannot count on the global destructor 
order. 

- Hook all instances of the CRT in the process, not just an arbitrary 
one 

- Hook LoadLibraryExW in order to detect dynamic loads of the CRT 

- WART: Using FreeLibrary to free the CRT is dangerous if you later 
LoadLibrary it again.  So we aren't perfect yet. 

- Fix some compile errors with VC8 

- Wrote nm and addr2line programs that use PDB symbols.  Did not add 
these to any build scripts.  You must download Debugging Tools For 
Windows and copy dbghelp.dll and symsrv.dll into the directory of 
nm.exe and addr2line.exe for them to function.  There exist old 
versions of these dlls that do not function. 

- Modified pprof to fall back on these pdb versions of nm and 
addr2line if they are siblings of pprof.  Processes linking in a 
mixture of gcc compiled and cl compiled dlls seem to get the union of 
the symbols.

Original issue reported on code.google.com by dvi...@gmail.com on 16 Oct 2008 at 11:27

Attachments:

google-perftools-0.99.2.msvc8_i386_hprof.patch

GoogleCodeExporter commented 9 years ago

Ok, a new problem: now that we patch everything, there's nothing left to serve 
as a
dependency between libtcmalloc_minimal.dll and the main executable.  That is, 
if you
have a problem like
   main() { malloc(1); }
and link it will libtcmalloc_minimal.dll, the linker will decide libtcmalloc 
doesn't
add anything, and just ignore it at link time.  The end result is malloc won't 
get
patched.

The reason this didn't happen before, I think, is that libtcmalloc_minimal 
overrode
operator new, which mscvrt uses (I guess?), and that yielded the necessary 
dependency.

There are various ways to work around this (including the /INCLUDE linker 
flag), but
they're all pretty ugly.  The way I'd like to work around it is to continue to
override operator new/new[]/delete/delete[], rather than patching them.  The 
plan
here is that I'd just override them in the main module, while still patching 
them
inside dll's.

Does this seem like a workable solution?  Does it still address the issues that
caused you to want to patch new/delete in the first place?

Original comment by csilv...@gmail.com on 15 Nov 2008 at 11:19

GoogleCodeExporter commented 9 years ago

I believe that change can be dangerous if the user manages to make use of 
new/delete 
before things are patched (they can have a mismatch between new/delete over dll 
boundaries, since the dll's delete won't be able to call tcmalloc's new yet).  
I'm 
pretty sure the user is not supposed to call new/delete from things like 
DLLMain, but 
it works in practice so people do it.

I've been using a .obj file in my binary that simply forces a reference to 
tcmalloc 
to do this.  It seems pretty safe -- the user already has to modify their link 
line, 
and this approach will work with any compiler (e.g., gcc).

Original comment by dvi...@gmail.com on 16 Nov 2008 at 3:56

GoogleCodeExporter commented 9 years ago

I took a look at the new patch_functions.cc.  It looks good except for a couple 
comments that I found a little misleading.

"This only matters when we're linking statically, and there's no DLLs."

Dynamically loaded instances of libc are not mutually exclusive with respect to 
a a 
statically linked libc (e.g., if the program statically links with libc and 
also uses 
advapi32.dll).  I think the code is doing the right thing.

Original comment by dvi...@gmail.com on 16 Nov 2008 at 5:02

GoogleCodeExporter commented 9 years ago

I've changed the comment -- thanks for looking it over.

} It seems pretty safe -- the user already has to modify their link line, 
} and this approach will work with any compiler (e.g., gcc).

I was hoping to make this as easy as possible for clients, and so was hoping to 
avoid
adding a second, non-obvious link argument.  Can you explain a bit more about 
why the
change could be dangerous?  You said:

} they can have a mismatch between new/delete over dll 
} boundaries, since the dll's delete won't be able to call tcmalloc's new yet

What do you mean by "over dll boundaries"?  Why won't the dll's delete be able 
to
call tcmalloc's delete yet?  (I assume you meant "tcmalloc's delete" and not
"tcmalloc's new".)  I think I'm missing what the problem scenario is.

Original comment by csilv...@gmail.com on 16 Nov 2008 at 10:23

GoogleCodeExporter commented 9 years ago

Basically, a user would need to invoke new from the executable before tcmalloc 
has an 
opportunity to patch things.  They would need to proceed to pass a pointer to 
the 
new'd object to a DLL that did not explicitly link against tcmalloc's lib file. 
 This 
DLL would need to invoke delete on the pointer.  Example:

// foo.exe
// call this from a global constructor
void irunreallyearly(void){
  kill(new char);
}

// foo.dll
void __declspec(dllexport) kill(char*p){
  delete p;
}

irunreallyearly would need to be invoked by a constructor that runs before 
tcmalloc 
gets to patch things.  I cannot find documentation regarding the order in which 
global constructors run in Visual C, and we don't need to rely on undefined 
behavior 
if we do not define new.  The only thing I've come across with respect to 
global 
constructor order are some forum posts claiming it isn't defined between 
different 
compilation units.

Another possible problem here is that tcmalloc may just crash if new is invoked 
before tcmalloc's constructors are invoked.  This isn't so good either, and the 
possibility can be avoided by not defining new.

Original comment by dvi...@gmail.com on 25 Nov 2008 at 2:18

GoogleCodeExporter commented 9 years ago

OK.  I will make it so you have to use /INCLUDE when linking in tcmalloc.  
Hopefully
folks won't mind too much!

Original comment by csilv...@gmail.com on 25 Nov 2008 at 2:56

GoogleCodeExporter commented 9 years ago

OK, we're getting closer!  I have sign-off here on the final version of the
patch_functions patch, so we're only left with a few small things in tcmalloc 
and
friends.  Mostly they have to do with code that runs before patching, or with 
unpatching.

My next question is about the following, in heap_profiler.cc:
    // On windows, new will not have been called yet (since we haven't
    // patched stuff) so call InitialMallocHook_New here to prevent an
    // assertion failure on the following lines.
    InitialMallocHook_New(NULL,0);

What is the exact problem here that this code is solving?  Why haven't we 
patched
stuff yet?  Don't we patch like the very first thing we do?

I think this is related to the problem you addressed elsewhere in the patch, 
where
the heap-profiler initialization might run before tcmalloc's.  I've fixed that, 
so I
think this code is no longer needed.  Can you tell me how to reproduce the 
problem
that prompted this change, so I can verify that problem is fixed in the current 
codebase?

Original comment by csilv...@gmail.com on 1 Dec 2008 at 9:15

GoogleCodeExporter commented 9 years ago

I'm not sure what it is about (some of?) our binaries that causes the 
compilation 
units to initialize in a different order.  Any chance I could get a patch for 
the 
initialization order fix and test it here?  It sounds like you've probably 
taken care 
of it.

Original comment by dvi...@gmail.com on 3 Dec 2008 at 8:48

GoogleCodeExporter commented 9 years ago

Hmm, I'm trying to remember what we did -- it was a while ago, and for a 
different
reason.

The key was to put this before the REGISTER_MODULE_INITIALIZER in 
heap-profiler.cc:
static const TCMallocGuard tcmalloc_initializer;

For this to work, TCMallocGuard has to be moved from tcmalloc.cc into a .h 
file.  I
just moved the declaration by creating a new .h file that looked like this:
class TCMallocGuard {
 public:
  TCMallocGuard();
  ~TCMallocGuard();
};

and declaring that .h file at the top of heap_profiler.cc.  I had to mess with 
a few
things in tcmalloc.cc, but that's the gist of it.  Is that enough to test your
change?  Sorry I can't come up with a real patch very easily; this was combined 
with
some other things that are difficult to tease out (and wouldn't apply cleanly).

Original comment by csilv...@gmail.com on 3 Dec 2008 at 10:36

GoogleCodeExporter commented 9 years ago

OK, assuming this heap-profiler issue is ok, there's only one remaining 
question I
have, and then the patch will be fully applied!  That question has to do with 
this
change to port.h:

   ~SpinLock() {
-    perftools_pthread_once(&initialize_token_, InitializeMutex);
-    DeleteCriticalSection(&mutex_);
+    /* dvitek: Because global destructor order is undefined, this
+     * causes serious problems (crashes after main finishes) if
+     * someone tries to use the lock after this destructor fires. 
+     */
+    // perftools_pthread_once(&initialize_token_, InitializeMutex);
+    // DeleteCriticalSection(&mutex_);
   }

Can you talk more to what crashes you were seeing here?  Based on the comment, I
would expect to see this happen only if a SpinLock were defined in one class 
and then
used in another, which I don't think happens.

Perhaps the problem is that a spinlock is deleted in the main thread, while 
other
threads are still running?  That could cause the crash you mention.  Anyway, 
more
details about what prompted this will make it easier to understand the right 
fix.

Original comment by csilv...@gmail.com on 6 Dec 2008 at 1:57

GoogleCodeExporter commented 9 years ago

Never mind about that last comment -- I just hand-verified that all SpinLocks in
perftools live the entire length of the program, so leaking memory in the 
destructor
is perfectly fine.  So I think, modulo the heap-profiler issue, we're good to 
go!

Original comment by csilv...@gmail.com on 6 Dec 2008 at 2:06

GoogleCodeExporter commented 9 years ago

Perftools 1.0rc1 is out, with (almost) all the content in this patch applied!  
Thanks
so much for your contribution.  Let me know how the release works for you.

Original comment by csilv...@gmail.com on 13 Dec 2008 at 1:46

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I'm working to get pprof running on cygwin, and this line in pprof is giving me 
trouble:
    if ($l =~
/^($h)-($h)\s+..x.\s+($h)\s+\S+:\S+\s+\d+\s+(\S+\.(so|dll|dylib|exe)(\.\d+)*\w*)
/i) {

(It decides which lines from the /proc/self/maps output to pay attention to.)

By design, this line is supposed to match libraries.  However, the 'exe' 
extension is
causing it to match the main executable, which is leading to trouble.

pprof catches the main executable later in the function:
  # Append special entry for the main program. ...

Do you remember why you added "|exe" to the regexp in your patch
http://google-perftools.googlecode.com/issues/attachment?aid=753171857806843247&
name=google-perftools-0.99.2.msvc8_i386_hprof.patch?
 I know it was a long time ago.  If we can't figure out the reason why, I'm going to
remove it for the next release (perftools 1.2).

Original comment by csilv...@gmail.com on 13 Mar 2009 at 7:17

GoogleCodeExporter commented 9 years ago

I don't remember off the top of my head.  I'll see if anything bad happens when 
I 
remove it.

Original comment by dvi...@gmail.com on 16 Mar 2009 at 4:47

GoogleCodeExporter commented 9 years ago

seems to work okay without it.  At this point, my best guess is that it was 
added at 
some point when the later code for dealing with the main executable wasn't 
working 
yet.

Original comment by dvi...@gmail.com on 17 Mar 2009 at 3:45

GoogleCodeExporter commented 9 years ago

Great!  I'll take it out then.

Original comment by csilv...@gmail.com on 17 Mar 2009 at 5:47

Sgenmi / gperftools

patch for heap profiling on windows #83