CreateToolhelp32Snapshot fails spuriously in an MT application

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

I have a multithreaded applicaion that calls LoadLibrary from multiple 
threads. Sometimes the library fails to load with the following error on 
the console:

Check failed: sidestep::SIDESTEP_SUCCESS == 
PreamblePatcher::Patch(windows_fn_[i], perftools_fn_[i], &origstub_fn_[i])

What is the expected output? What do you see instead?

The expected behavior is consistent loading of the libraries, with the same 
result as if without tcmalloc.

What version of the product are you using? On what operating system?

Windows XP, MSVC 7.1, STLPort 5.1.4, perftools 1.4 + patch from issue 199 
and minor modifications for STLPort.

Please provide any additional information below.

The problem is related to issue 199: origstub_fn_[i] is dirty when calling 
PreamblePatcher::RawPatch from LibcInfoWithPatchFunctions<T>::Patch. 
However, the proposed patch from that issue does not help in my case 
because after cleaning origstub_fn_[i] and patching the functions, the 
resulting functions become double-patched. This leads to an infinite loop 
when calling, e.g. operator delete - it calls 
LibcInfoWithPatchFunctions<0>::Perftools_delete, which calls 
LibcInfoWithPatchFunctions<1>::Perftools_delete, which calls 
LibcInfoWithPatchFunctions<0>::Perftools_delete and so on.

The root of the problem is in the PatchAllModules function, which is called 
from LoadLibrary. It appears that CreateToolhelp32Snapshot can fail 
occasionally, which leads to assuming the process has no modules at all and 
marking all module_libcs as invalid. The next call to LoadLibrary, for 
which CreateToolhelp32Snapshot succeeds, will double-patch the modules and 
lead to the described behavior.

I'm not sure what causes CreateToolhelp32Snapshot to fail (in my case 
GetLastError returns 8, Not enough storage is available to process this 
command, but I clearly have enough memory). MSDN mentions that it may 
spuriously return ERROR_BAD_LENGTH (24). Probably, when multiple threads 
call to LoadLibrary, CreateToolhelp32Snapshot fails to synchronize with the 
library manager properly, even though the call is for the current process.

One other thing I noticed. The patch_all_modules_lock in PatchAllModules is 
locked after traversing the modules snapshot. But module_libcs is used 
during the traverse, which may lead to a race condition. However, moving 
patch_all_modules_lock before it does not help my situation.

Original issue reported on code.google.com by andrey.s...@gmail.com on 24 Dec 2009 at 7:16

GoogleCodeExporter commented 9 years ago

} but since B was added to the nopatching_set this time C doesn't get patched.

You're right.  The more I look into it, the more fragile this whole patching
optimization is.  I'm worried other bugs may be lurking in there as well.

I wonder there's a more targeted, but safer, optimization we could do that would
handle the bad situations that arise in practice, even though they don't handle 
every
situation.  Do you have a user-case you're optimizing for, where you can say 
what
modules the program is loading and freeing, and what other modules that brings 
in,
and if any of those modules have a refcount of 1 or more (so windows isn't 
really
doing any work) before you get to the problematic load/free calls?

In the meantime, thanks for the commented version of the patch.  I'll take a 
look
through it and try to puzzle through all the possible situations it could have 
to
deal with.

Original comment by csilv...@gmail.com on 17 Jan 2010 at 3:53

GoogleCodeExporter commented 9 years ago

Our program isn't doing anything complicated as far as modules go.  The 
executable 
is linked against two or three dozen libraries that get loaded before the first 
PatchAllModules call, then the rest of the libraries are dynamically loaded at 
program startup time, it's about 80 in total.  It is multithreaded and I think 
some 
of the libraries may be loaded in parallel.  Once the program is up and running 
there's just this one OPENGL32 LoadLibrary/FreeLibrary happening multiple times 
per 
frame.  Since the program was linked against OPENGL32 at the start it would 
never 
be unloaded.  At program shutdown time some or all of the dynamically loaded 
libraries are unloaded and the program exits.

There isn't any crazy constant loading and unloading libraries from different 
threads all the time.  With all the libraries that has to be loaded on startup 
the 
last released tcmalloc was doubling the application startup time and the 
OPENGL32 
was seriously hurting frame time.  The version I sent Friday/Saturday makes a 
huge 
difference to our program, but our program's not one to find the corner cases 
of 
getting the patching right.

Original comment by da...@fries.net on 17 Jan 2010 at 10:29

GoogleCodeExporter commented 9 years ago

Right, I don't want your example to test the corner cases, I want your example 
to
understand a common use case.  I'd like to start by optimizing for your case, 
rather
than the general case.  I'm hoping that will be easier.

I don't understand why my patch, without your changes, is so slow for you.  If I
understand your situation right, when the OPENGL LoadLibrary() happens within a
frame, then PatchAllFunctions should return false, since it didn't need to patch
anything (everything was already patched, due to the OPENGL load at program 
start). 
So you should be seeing barely any overhead at all.  Why were you, in fact, 
seeing
lots of overhead?

Original comment by csilv...@gmail.com on 18 Jan 2010 at 12:00

GoogleCodeExporter commented 9 years ago

patching_map was never allocated, so the FreeLibrary always ran 
PatchAllModules. 
patching_map was never allocated because there was never any library passed to
LoadLibrary that required patching, so it never called the allocate.  Once I
allocated it, the performance was much improved.  What's the reasoning for 
allocating
them dynamically instead having a filescope static object?

my version of patch_functions.cc
LoadLibrary PatchAllModules  0.146, total  0.146
LoadLibrary PatchAllModules  0.060, total  0.205
LoadLibrary PatchAllModules  0.153, total  0.358
LoadLibrary PatchAllModules  0.240, total  0.598
FreeLibrary PatchAllModules  0.058, total  0.058
LoadLibrary PatchAllModules  0.177, total  0.775
FreeLibrary PatchAllModules  0.057, total  0.115
LoadLibrary PatchAllModules  0.170, total  0.944
LoadLibrary PatchAllModules  0.337, total  1.281
LoadLibrary PatchAllModules  0.347, total  1.628
LoadLibrary PatchAllModules  0.171, total  1.799
LoadLibrary PatchAllModules  0.275, total  2.074
LoadLibrary PatchAllModules  0.177, total  2.251
LoadLibrary PatchAllModules  0.177, total  2.428
LoadLibrary PatchAllModules  0.178, total  2.606
LoadLibrary PatchAllModules  0.178, total  2.785
LoadLibrary PatchAllModules  0.191, total  2.976
LoadLibrary PatchAllModules  0.227, total  3.202
LoadLibrary PatchAllModules  0.199, total  3.402
LoadLibrary PatchAllModules  0.188, total  3.590
FreeLibrary PatchAllModules  0.083, total  0.198
FreeLibrary PatchAllModules  0.072, total  0.269
FreeLibrary PatchAllModules  0.072, total  0.341
FreeLibrary PatchAllModules  0.066, total  0.408
FreeLibrary PatchAllModules  0.065, total  0.473
FreeLibrary PatchAllModules  0.063, total  0.536
FreeLibrary PatchAllModules  0.065, total  0.600
FreeLibrary PatchAllModules  0.067, total  0.667
FreeLibrary PatchAllModules  0.065, total  0.732
FreeLibrary PatchAllModules  0.059, total  0.791
FreeLibrary PatchAllModules  0.275, total  1.066
        final total LoadLibrary 3.590 ms
        final total FreeLibrary 1.066 ms

Fixed the patching_map failure and 6.23 to 6.65 second start time
LoadLibrary PatchAllModules  1.882, total  1.882 
LoadLibrary PatchAllModules  1.783, total  3.666 
LoadLibrary PatchAllModules  1.803, total  5.469
LoadLibrary PatchAllModules  1.808, total  7.277
LoadLibrary PatchAllModules  1.736, total  9.013
LoadLibrary PatchAllModules  1.800, total 10.813
LoadLibrary PatchAllModules  1.754, total 12.567
FreeLibrary PatchAllModules  1.732, total  1.732
LoadLibrary PatchAllModules  2.007, total 14.574
LoadLibrary PatchAllModules  1.897, total 16.470
LoadLibrary PatchAllModules  2.012, total 18.483
LoadLibrary PatchAllModules  2.002, total 20.484
LoadLibrary PatchAllModules  2.168, total 22.652
LoadLibrary PatchAllModules  2.110, total 24.761
LoadLibrary PatchAllModules  2.062, total 26.823
LoadLibrary PatchAllModules  2.097, total 28.920
LoadLibrary PatchAllModules  2.302, total 31.222
LoadLibrary PatchAllModules  2.774, total 33.996
LoadLibrary PatchAllModules  2.513, total 36.509
LoadLibrary PatchAllModules  2.611, total 39.119
LoadLibrary PatchAllModules  2.672, total 41.792
LoadLibrary PatchAllModules  2.774, total 44.566
LoadLibrary PatchAllModules  2.844, total 47.410
LoadLibrary PatchAllModules  2.775, total 50.185
LoadLibrary PatchAllModules  2.853, total 53.038
LoadLibrary PatchAllModules  2.959, total 55.997
LoadLibrary PatchAllModules  3.116, total 59.113
LoadLibrary PatchAllModules  3.067, total 62.180
LoadLibrary PatchAllModules  3.044, total 65.224
LoadLibrary PatchAllModules  3.204, total 68.428
LoadLibrary PatchAllModules  3.243, total 71.671
        final total LoadLibrary 71.671 ms
        final total FreeLibrary 1.732 ms

Original comment by da...@fries.net on 18 Jan 2010 at 10:16

GoogleCodeExporter commented 9 years ago

Now that I think about it, the one FreeLibrary call in your version of the 
library
was only because I put the new allocation in the else part of the FreeLibrary
if (patching_map) {
}... else{
  patching_map = new std::map<HMODULE, int>;
}

The problem I see is patching_map only has a list of libraries that required 
patching
when Perftools_LoadLibraryExW was called after LoadLibrary was patched.  If a 
library
was loaded before tcmalloc that required patching PatchAllModules would patch 
it, but
FreeLibrary wouldn't have it in the list of libraries that required unpatching 
when
unloaded.

Original comment by da...@fries.net on 18 Jan 2010 at 10:39

GoogleCodeExporter commented 9 years ago

Issue 206 has been merged into this issue.

Original comment by csilv...@gmail.com on 18 Jan 2010 at 11:03

GoogleCodeExporter commented 9 years ago

} patching_map was never allocated, so the FreeLibrary always ran 
PatchAllModules. 

Hmm, this seems like a straightforward bug.  How does it work, instead of 
applying 
your patch, to just take the version from comment 38, and then in 
Perftools_FreeLibrary(), insert the following right after you acquire the lock:
    if (!patching_map)
      return rv;

} What's the reasoning for allocating them dynamically instead having a 
filescope
} static object?

They were static originally, but this crashed tcmalloc_minimal_test-static.  I 
don't 
know what the particular problem was with static linking, but I do know that 
filescope static objects with constructors are tricky to get right, so I'm 
happy to 
just punt on that issue.  I don't feel strongly though, as long as it passes 
the 
tests.

} The problem I see is patching_map only has a list of libraries that required
} patching when Perftools_LoadLibraryExW was called after LoadLibrary was 
patched.

Right -- it looks like there are two issues still open for this bug report: one 
is 
the timing is too slow for you, and another is a correctness report.  Let's 
deal with 
the timing first.  I believe my patch should be fast, and if not I want to 
figure out 
why (I understand your patch fixes this problem, but I want to understand why 
mine 
doesn't, before deciding whether to apply yours.)  This bug you pointed out in 
FreeLibrary may be the cause.

Once we've got the timing licked, let's look into this correctness issue.  We 
can 
then figure out the minimal change needed to get that fixed too.

Original comment by csilv...@gmail.com on 18 Jan 2010 at 11:08

GoogleCodeExporter commented 9 years ago

The second timing run was comment 38, only with collecting timing information 
and 
allocating patch_map.  Replacing in FreeLibrary as you suggest
if (patching_map) {
with
if (!patching_map) return rv;
Is only going to save one PatchAllModules call.  That's 1.732 ms, I can rerun 
the 
above timing, but it isn't going to hardly change it.  The comment 38 version 
along 
with a patch_map fix is much faster than the original, mine's still about 15 
times 
faster in the PatchAllModules routine.  Compare the PatchAllModules execution 
times, yours starts at 1.882 and increases until it gets to 3.243, because 
everytime there are more modules to look through to see if there are any 
symbols as 
it always looks through the existing unpatched ones and it doesn't matter if 
modules are loading or unloading.  Mine has a constant amount of time when 
something is unloaded (except the last unload, who knows why) because it just 
has 
to remove entries from the data structures and look at no modules, when a 
module is 
loaded it only has to look for symbols in that one library.

Original comment by da...@fries.net on 19 Jan 2010 at 1:40

GoogleCodeExporter commented 9 years ago

We are talking past each other.  It must be very frustrating to you.  Let's 
start 
over; I'll try to do things very slowly but surely, bringing up only one point 
at a 
time in my comments.

} Compare the PatchAllModules execution times

But my claim is that PatchAllModules shouldn't be being called at all -- all 
the 
times you're calling LoadLibraryExW and FreeLibrary in your inner loop (per 
frame), 
it's basically a noop, isn't it?  Because the relevant libraries were loaded at 
program startup.  So all the LoadLibrary and FreeLibrary are doing inside 
windows is 
incrementing and decrementing a refcount.  Is that accurate?

If not, please explain what the *windows* code (not the perftools code) is 
doing in 
the LoadLibraryExW and FreeLibrary calls within your inner loop (the stuff that 
happens per-frame).  In particular: what modules is it loading into the 
executable 
that weren't loaded before, on each of the per-frame LoadLibrary calls?  What 
modules 
is it "taking out" of the executable, on each of the per-frame FreeLibrary 
calls?

Original comment by csilv...@gmail.com on 19 Jan 2010 at 2:04

GoogleCodeExporter commented 9 years ago

Sorry, my timing run wasn't clear.  The timings I posted above was for an 
entire 
application run.  As it stands neither version calls PatchAllModules in our 
frame.

Original comment by da...@fries.net on 19 Jan 2010 at 2:17

GoogleCodeExporter commented 9 years ago

Ok, now I'm more confused than ever.  If neither version is calling 
PatchAllModules in 
our frame, then why is your version so much faster than mine?  You talk about 
how much 
you've sped up PatchAllModules, but that shouldn't matter if it's never being 
called.  
So what code is running in my version of the file, but not yours, that is 
causing 
slowness (cutting and pasting these lines of code into the bug report would be 
helpful).

Original comment by csilv...@gmail.com on 19 Jan 2010 at 2:38

GoogleCodeExporter commented 9 years ago

The time it takes the application to draw a frame once it is up and running is 
only 
part of the original problem with the last released tcmalloc.  Neither version 
is 
calling PatchAllModules in the application frame, hence the comment 55 timings 
do 
not address what overhead there is in the frame with either version.  I could 
make 
some time measurements if that would be of interest.  What I was timing was how 
long PatchAllModules took to execute, which routine it was called from and a 
running sum.  PatchAllModules is being called at application startup and again 
at 
application termination.  Are you seeing why my PatchAllModules was so much 
faster 
than the other version?

Original comment by da...@fries.net on 19 Jan 2010 at 2:48

GoogleCodeExporter commented 9 years ago

No, I still haven't tried to understand why your PatchAllModules is faster.  It 
looks 
like I still don't even understand what problem you're trying to solve.  
Originally, 
I thought the report was that each frame was taking X time with tcmalloc, but 
only 
half X time with the windows allocator.  Is that not true?  Or perhaps it was 
true, 
but one of my patches fixed that particular problem?

What problem are you seeing now, that still needs to be solved (based on the 
current 
svn-trunk, which is with my patch committed but not yours on top of it)?  Is it 
only 
with startup and tear-down time?  And by "startup", does that refer to just 
when the 
program actually starts up, or are you also considering the time spent while 
loading 
whatever modules you load up after the program starts but before you start 
doing 
"real work"?

Original comment by csilv...@gmail.com on 19 Jan 2010 at 2:55

GoogleCodeExporter commented 9 years ago

The issue 206 did only start out as seeing a big increase in frame time, and as 
we 
dug into it, we realized in addition the increase in startup time.  Yes the 
scope 
changed, but so did our observations of tcmalloc and the issues involved.

Original two problems, startup time doubled from 6 seconds Microsoft allocator 
to 
11 seconds tcmalloc.  I measured startup time from executing the application to 
when the last completed loading data message printed.  That's pretty much when 
the 
application is available to the user to interact with.  The other problem was 
frame 
time, one number can quote from my earlier comment 2 of issue 206 was GL 
SwapBuffers taking 0.17ms for Microsoft allocator and 4.17ms for tcmalloc.

As far as what's in subversion, I don't think anyone will notice the startup 
time 
difference now.  I don't think the additional tcmalloc overhead in the 
frametime is 
a problem either, though in this case I would need to benchmark the subversion 
one 
to my version and see which is faster.

As far as timing goes, I think it is down to what is tcmalloc's goals?  Is it 
to be 
as fast as it can be?  In which case having a PatchAllModules that is a 
constant 
time overhead over one that linearly increases the amount of time as modules 
loads 
should be pretty obvious.

The other question is either more correct and safe in the corner cases?  It has 
already been covered that the nopatching_set might not be safe in all cases, 
but 
both versions are using it.  If the patching_map has been fixed in the 
subversion 
as discussed, PatchAllModules will not be called in FreeLibrary on modules 
loaded 
before LoadLibrary has been patched as they will not be added to patching_map.  
My 
version handles that case so far as I know.

My version is winning on faster and safer, but feel free to poke holes in it 
and 
make it better.

As far as our application Ryan has found that turning on the Microsoft 
allocator 
low fragmentation heap greatly decreases the nondeterminism that was really 
hurting 
it earlier.  We have an all night run of that version running against tcmalloc 
in 
progress right now.  We ran both over the weekend and the Microsoft allocator 
version crashed at some point in time while tcmalloc was still running, though 
slow 
because of a problem in our code and the Microsoft allocator most likely ran 
out of 
memory for probably the same issue.

Original comment by da...@fries.net on 19 Jan 2010 at 4:00

GoogleCodeExporter commented 9 years ago

} As far as timing goes, I think it is down to what is tcmalloc's goals?

Stop.  Please take a step back here.  You are trying to push solutions, and I 
am 
still trying to understand what the problems are.  I know you've identified, in 
your 
own head, what the problems are and what the solutions are, but that doesn't 
help me 
any.  I need to understand what the problems are as well.  I'm sure your patch 
solves 
some problems that are important to you, but I don't know what they are.  I 
can't 
evaluate this patch until I understand what it's trying to do.  "Speed things 
up" is 
not a specific enough answer. I need to understand what parts of your execution 
environment are causing you trouble right now.  Then I will have the necessary 
context to understand why the current code is causing those problems, and what 
the 
solutions might be.

So please, tell me, without any other side-information to confuse the issue and 
without referencing any other parts of the discussions we've had until now, how 
the 
perftools at svn-head compares to windows malloc (1) at program start up, and 
(2) at 
frame time.  Let's go from there.

Original comment by csilv...@gmail.com on 19 Jan 2010 at 4:20

GoogleCodeExporter commented 9 years ago

OK, I had a chance to look at your patch, and if I understand it right, the 
major 
difference is using the refcount information to avoid calling PatchAllModules 
when a 
library is loaded multiple times.  That makes sense to me.  In your patch, you 
divide 
that refcount-management between LoadLibraryExW and PatchAllModules, but while 
it 
gives a bit extra accuracy, I don't think it will help runtime at all (because 
the 
extra accuracy will almost never trigger).

So I've tried implementing the same idea; the new file is attached here.  
Instead of 
reporing the comparison at program startup and at frametime against svn-head, 
try 
comparing it against this new file.

This also suggests a third thing you could measure (instrument): how many times 
PatchAllModules is actually being called in your application.  My goal is to 
get it 
to the point it's called less than 100 times.  Then it doesn't matter how 
efficient 
it is, since it's called so rarely.

Original comment by csilv...@gmail.com on 19 Jan 2010 at 6:01

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

Yes part of the patch is to avoid calling PatchAllModules, the other part is in 
PatchAllModules to only check new modules for malloc symbols 
libc_info.PopulateWindowsFn.  That's the mostly constant time execution of 
PatchAllModules I mentioned earlier.

As far as how many times PatchAllModules is listed, two separate complete 
program 
runs are listed in comment 55, that is when PatchAllModules is called from 
LoadLibrary or FreeLibrary.  I'll see about the latest version.

Original comment by da...@fries.net on 19 Jan 2010 at 6:11

GoogleCodeExporter commented 9 years ago

Comment 55 just gives times.  I'm interested in counts.

Original comment by csilv...@gmail.com on 19 Jan 2010 at 6:17

GoogleCodeExporter commented 9 years ago

comment 55 gives one line per PatchAllModules call put each in a file,
cat mine.txt |grep LoadLibrary |wc -l
18
cat mine.txt |grep FreeLibrary |wc -l
13
wc -l mine.txt 
31 mine.txt
Called 18 times from LoadLibrary, 13 from FreeLibrary, 31 times total

cat patching_map_fix.txt |grep LoadLibrary |wc -l
30
cat patching_map_fix.txt |grep FreeLibrary |wc -l
1
wc -l patching_map_fix.txt 
31 patching_map_fix.txt
Called 30 times from LoadLibrary, 1 from FreeLibrary, 31 times total

Original comment by da...@fries.net on 19 Jan 2010 at 6:23

GoogleCodeExporter commented 9 years ago

Ok, so what's the status now with respect to the three questions I've asked:

(1) how the perftools with my latest patch compares to windows malloc at 
program 
start up
(2) how the perftools with my latest patch compares to windows malloc at frame 
time
(3) how many times PatchAllFunctions() is called with my latest patch vs your 
patch

If I understood your last comment correctly, for (3) it's the same number of 
times 
(though each call to PatchAllFunctions is faster in your case).  What about (1) 
and 
(2)?

Original comment by csilv...@gmail.com on 19 Jan 2010 at 6:37

GoogleCodeExporter commented 9 years ago

1. The time difference doesn't show up on a hand held stopwatch.
2. Microsoft 4.088ms, comment 67 tcmalloc 4.042 ms, average of 30 frames for 
each,
but both jitter around.
3. PatchAllFunctions calls, comment 67 tcmalloc 41 times, my patch 65 times

Original comment by da...@fries.net on 19 Jan 2010 at 5:51

GoogleCodeExporter commented 9 years ago

I believe the comment-67 patch had a bug: I had an == and != reversed in 
FreeLibrary.  
The line:
        if (--it->second == 0) {
should actually be
        if (--it->second > 0) {
This means it probably wasn't patching as correctly as it ought to.  I don't 
know how 
much of a difference it will make in your timing runs, but it may.  I'm sorry I 
didn't catch that earlier. :-(  It's my fault; I really need better testing for 
this 
part of the codebase.

But let's assume the numbers you report hold up:
} 1. The time difference doesn't show up on a hand held stopwatch.
} 2. Microsoft 4.088ms, comment 67 tcmalloc 4.042 ms
} 3. PatchAllFunctions calls, comment 67 tcmalloc 41 times, my patch 65 times

It looks like this addresses all the issues that you had originally, is that 
right?  
My understanding is you had two problems originally that you wanted solved:
1) perftools was much slower than without at startup time
2) perftools was slower than without when making frames

With the latest patch, both of those issues are addressed, no?  Is there any 
other 
existing issue you have, that the patch doesn't address?

Original comment by csilv...@gmail.com on 19 Jan 2010 at 7:20

GoogleCodeExporter commented 9 years ago

if (--it->second > 0) {
Can't be right in FreeLibrary, it's now always calling PatchAllModules.

Original comment by da...@fries.net on 19 Jan 2010 at 9:15

GoogleCodeExporter commented 9 years ago

Ugh, I found another, really stupid bug. :-(  In LoadLibraryExW, it says:
    if (nopatching_set && nopatching_set->count(szFullPath) > 0)
      nopatching_set2->insert(rv);
      return rv;  // case (1)

There need to be braces around this.  Right now it's always returning rv, and 
never 
trying to patch!  No wonder you don't see any timing differences.

Does this fix make things behave better?  If not, I'll attach an instrumented 
version 
of the file, which does printfs so we can better understand what's going on.

Original comment by csilv...@gmail.com on 19 Jan 2010 at 9:28

GoogleCodeExporter commented 9 years ago

The comment 67 patch came out with PatchAllModules called 41 times for a total 
time
spent in PatchAllModules of 144.694 milliseconds.  Adding in the comment 73 and 
75
changes it now calls PatchAllModules 47 times for a total time spent in
PatchAllModules of 164.971 milliseconds and an average of 3.51 milliseconds per
PatchAllModules call, so technically yes it made a negative time impact.  I'll 
assume
it is a correctness tradeoff, so no problem there.

To answer your question the startup time and frame time overhead are both 
sufficient
for our application.  You can consider these issues closed, though I will point 
out
in the same conditions my version calls PatchAllModules 65 times for a total 
time
spent in PatchAllModules of 15.612 milliseconds and an average of 0.24 
milliseconds
per PatchAllModules call, or about ten times less time total then the above 
version.
 You have access to the tcmalloc repository, you can pick the faster or slower
algorithm, your choice.

Original comment by da...@fries.net on 19 Jan 2010 at 10:37

GoogleCodeExporter commented 9 years ago

Thank you for the data.  It sounds like the complexity that your version adds 
is 
worthwhile, then.  I can't take the patch as you have it, because it calls 
windows 
functions while holding the spinlock, and that can lead to deadlocks (though 
obviously it doesn't for your application) -- I have comments about that where 
the 
lock is defined in the source file.

However, I've taken your ideas and attempted to come up with a patch that 
incorporates them but is not subject to the lock inversion problem.  I've 
attached it 
below.  Try it out and see how it works for you.  I can't guarantee it's 
bug-free, 
but maybe you can use it as a starting point to get to something that is as 
good as 
your patch.

I'm hoping to release a new perftools tomorrow (before I go on vacation), so 
there's 
a bit of a time crunch, unfortunately.  But if we don't get this change in for 
this 
release, that's ok, we'll get it in the next one.

Let me know how this works, and/or if you have any changes to suggest to it.

Original comment by csilv...@gmail.com on 20 Jan 2010 at 3:36

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

Looks like in PatchAllModules
patched_modules->insert(libc_info.hmodule());
needs to be moved outside of the if statement in the last for loop, otherwise 
it's 
only the ones that were patched instead of all currently loaded modules.  I see 
in 
LoadLibrary it adds any new libraries that were in the nopatching_set.  I see 
FreeLibrary is erasing from patch_set when the reference goes to zero.  What 
I'm 
seeing is all the modules that were loaded before LoadLibrary was patched, or 
modules that a module was linked to (and don't get LoadLibrary called on) 
aren't in 
the list and will be searched everytime.  Now just moving it outside the if 
statement will not scrub the patched_modules against that are unloaded at the 
same 
time as another because they were brought in by that module.

The other issue is in our case OPENGL32 is already loaded.  The first 
LoadLibrary 
will leave a reference count of 1 in load_map, fail the nopatching_set, invoke 
PatchAllModules, add to the nopatching_set, and return.  FreeLibrary will then 
decrement the reference count to 0, erase from load_map and patch_set and call 
PatchAllModules.  After that, the reference count will be bouncing between 0 
and 1, 
and while it will be skipping the PatchAllModules because of the nopatching_set 
and 
noop_patch_set, it is going to be searching the nopatching_set and being 
inserted 
and removed from patch_set and noop_patch_set.

In my version the first FreeLibrary call would also see a 0 reference count and 
call PatchAllModules, but that would see the module still in memory and set the 
reference count to 1, and from there on it would be going between 1 and 2.  At 
least that's how it's supposed to work.

By the way, I liked your swap erasing of hModules to get the current set of 
added 
modules, that qualifies as low cost in my book.

I'll plan on getting some numbers first thing in the morning.  As I expect you 
are 
needing to do other things to get ready for vacation (I always do), I would 
suggest 
shipping with this or the previous version, whichever is faster, and leave 
looking 
at the patch until when you get back.

Original comment by da...@fries.net on 20 Jan 2010 at 4:45

GoogleCodeExporter commented 9 years ago

} The first LoadLibrary will leave a reference count of 1 in load_map, [...]

Thanks, that's a very nice explanation of what you're seeing.  That helps me
understand the control flow now.

} call PatchAllModules, but that would see the module still in memory and set 
the
} reference count to 1, and from there on it would be going between 1 and 2. At
} least that's how it's supposed to work.

I saw that, but I'm not sure it's safe.  The problem is that once you start
incrementing refcounts in PatchAllModules for the dependent libraries (that is, 
the
libraries brought in by whatever you passed to LoadLibrary), you have to run
PatchAllModules again to decrement the refcounts.  Otherwise, your refcount 
gets too
high.  I worked through a complicated case where we think someone has a 
refcount of
1, when it's actually been unloaded, unless we call PatchAllModules more often 
in
FreeLibrary.  I think the set lookups will end up being cheaper overall.

I hope so anyway!  See what your timings show.  And thanks for finding the bug 
with
the if.

Original comment by csilv...@gmail.com on 20 Jan 2010 at 6:54

GoogleCodeExporter commented 9 years ago

nopatching_set is never allocated, for OPENGL32 this is fatal.  In 
PatchAllModules,
2nd for loop, every time a g_module_libcs[i] is not found in the
currently_loaded_modules set_is_valid is set to false and made_changes is set to
true, which causes PatchAllModules to return true and so forth.  
g_module_libc[i]
will never be found if it's already invalid.

I don't see any tests that are loading dynamic libraries.  Time for a windows
specific test that starts up, dynamically loads a few common windows dll's that 
would
not otherwise be loaded into the program, loop for specific number of times on
LoadLibrary, FreeLibrary for a library that would have been loaded with the 
program
such as USER32 or GDI32, and then unload the previously dynamically loaded 
libraries.
 It also needs to time each LoadLibrary and FreeLibrary.  Yes it specifically mirrors
our use case, but it will let you test your own changes.

Original comment by da...@fries.net on 20 Jan 2010 at 3:29

GoogleCodeExporter commented 9 years ago

Ok, here's a new version of the file, which fixes the two bugs you've pointed 
out.

} I don't see any tests that are loading dynamic libraries.

That's right.  It would be great if you'd like to write one to contribute in!  
I 
don't know windows programming at all really, and don't feel qualified to write 
anything.  In particular, I'd love a test where we load dll which brings in 
other 
ones as dependencies, and then we load some of the dependencies explicitly too, 
and 
have different load/free ordering.  I feel that will test corner cases we don't 
really see right now.

Original comment by csilv...@gmail.com on 20 Jan 2010 at 3:59

GoogleCodeExporter commented 9 years ago

For my next trick, I will actually attach the new file...

Original comment by csilv...@gmail.com on 20 Jan 2010 at 4:05

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

01-20-2010 comment 81 update
PatchAllModules called 73,  sum 154.851 ms 

I reworked my patch to avoid calling GetModuleInfo with the spinlock held.
PatchAllModules called 65,  sum 14.219 ms

Original comment by da...@fries.net on 20 Jan 2010 at 6:29

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

I'm still a bit nervous about the complexity of this patch.  In particular, I'm 
worried that it's dangerous to be incrementing the refcount in PatchAllModules, 
when 
there's no visibility into when it should be decremented.  But I can't think of 
any 
situations where it will cause trouble, so I'm willing to go with it.

There are a few minor-ish reasons I'd prefer to use my patch as the basis 
rather than 
yours: as part of the patch, I was able to clean up of ModuleEntryCopy and 
LibcModules and I'd like to keep that, and also I have the noop_patch_set 
optimization that I think will help in some cases (though maybe not yours).  So 
I've 
updated my patch to have all the functionality that's in your patch (as far as 
i can 
tell).  I've attached it here.  Do you mind giving it a try and seeing if it 
performs 
the same as your patch?

If we do end up going with your patch, we'll need to make a few small changes.  
For 
instance, I notice that you set loaded_modules early in PatchAllModules, but 
then 
actually do the patching later.  I'm concerned that's not safe: if two threads 
try to 
run this at the same time, it's possible one will think the module has already 
been 
handled  when it hasn't been yet.  It's easy enough to move the increment and 
the 
actual patching to the be under the same lock critical-section.

While you're running the numbers, I'll try to think a bit more through the 
locking 
issues, and try to convince myself this change is safe.  If I can't think of 
any 
problems, I'll try to get one of the two versions of this patch committed today.

Original comment by csilv...@gmail.com on 20 Jan 2010 at 8:07

Added labels: ButIstillamnotconvincedyourpatchissafe--itmakesmereallyuncomfortablethatyoure

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

I chickened out -- these patches just aren't well enough tested yet.  I 
released 
without any of the speedup patches.  We'll keep working on it so we can get it 
right 
for the next release.

Original comment by csilv...@gmail.com on 20 Jan 2010 at 11:07

Removed labels: ButIstillamnotconvincedyourpatchissafe--itmakesmereallyuncomfortablethatyoure

GoogleCodeExporter commented 9 years ago

comment 84 version timing
PatchAllModules called 74,  sum 96.115 ms

I came up with the version I was trying to write earlier, but I didn't see any
way to pull it off, that is making use of the Windows reference count in       
helping to decide if PatchAllModules is required.  GetModuleHandleEx can find   
out
if a module is loaded and get the handle as well as increase the reference  
count, if
it isn't call LoadLibrary and PatchAllModules, but if it is already
loaded no need to call PatchAllModules.  The problem is in FreeLibrary, I
tried getting the path before calling FreeLibrary, then check after to see if   
it
was still loaded, but that's expensive.  I just realized tonight that the   
GetModuleHandleEx feature of passing an address and finding out what module
owns it can be used in the FreeLibrary routine.  Just give the module base
address, and if it is still loaded it will return the same module handle, if
not it will be something else.

Calling GetModuleHandleEx with an address is averaging around .003
milliseconds.  This version even beat my old version's time spent in 
PatchAllModules.
PatchAllModules called 65,  sum 11.662 ms
Note need
#define _WIN32_WINNT 0x0501
in config.h for WindowsXP or greater to use GetModuleHandleEx.

Original comment by da...@fries.net on 21 Jan 2010 at 12:37

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

> I released without any of the speedup patches.

Just to let you know that the front page says it was released in 2009.

Original comment by andrey.s...@gmail.com on 21 Jan 2010 at 6:56

GoogleCodeExporter commented 9 years ago

Thanks!  That, at least, was an easy bug to fix. :-)

Original comment by csilv...@gmail.com on 21 Jan 2010 at 9:54

GoogleCodeExporter commented 9 years ago

btw, thank you for the latest patch.  I'll take a look in a few weeks, when I'm 
back
from vacation.

Original comment by csilv...@gmail.com on 22 Jan 2010 at 11:16

GoogleCodeExporter commented 9 years ago

I looked over your patch and it looks very nice and clean.  I had to modify it a
little bit (the double-checked locking you were doing wasn't safe, and isn't 
really
necessary anyway).

I had to make one big change: I took out the nopatch_set.  I just don't see how 
to
make it safe without being even more expensive.  The problem is the following
scenario: let's suppose the we load module A, which depends on B and C.  B was 
loaded
before A, and contains a libc routine that we patched.  When we load A, we'll 
add it
to nopatch_set, because we didn't do any work (we didn't patch B because it was
already patched).  Now let's suppose we unload A and unload B -- so now 
everything is
unloaded -- and then reload A.  In this case, we'll need to patch B, but we 
won't,
because A is in nopatch_set.

The only way around this that I see is to store refcounts in last_loaded 
(obtained
via GetModuleHandleExW, I guess), not just a set of modules.  Then we can 
return true
from PatchAllModules if we don't do any patching/unpatching, *and* no refcount 
of a
patched module changes.  That seems like a lot of overhead.

So for now, I'm just taking out the nopatching_set variable.  If it turns out 
to be a
big perf improvement to add it back in, for certain workloads, we can try to 
figure
out a way to do it safely.

Then I got to thinking: why are we keeping our own refcount when windows is 
doing it
for us?  How much are we really gaining?  I don't know.  So I've put all the 
load_map
stuff in an #ifdef, so you can do timing tests both with and without our own 
map.  To
use, find the line '#define USE_LOAD_MAP' and comment or uncomment it.

Let me know if this version of the patch still gives the good numbers you saw 
with
your own patch (and what difference it makes for USE_LOAD_MAP or without).  
Then we
can figure out where to go from there.

(Note, my windows dev machine died, so I haven't been able to compile or test 
this
patch, so it may have some minor syntax problems.  It's very similar to your 
patch,
though, just with stuff taken out, so I don't expect too many difficulties.)

Original comment by csilv...@gmail.com on 3 Feb 2010 at 10:31

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

Why not simply compare snapshots provided by PSAPI and only process the 
difference 
between ones before and after LoadLibrary/FreeLibrary? This would allow to keep 
the 
full picture of which module does need patching and which does not, without 
dependencies headache.

Original comment by andrey.s...@gmail.com on 4 Feb 2010 at 3:11

GoogleCodeExporter commented 9 years ago

andrey.semashev, EnumProcessModules is taking .038 ms to .051 ms, which would 
need to
be run twice, before and after, then the list of modules must be compared to 
see what
changed.  My last version (which comment 90 builds on), uses 
GetModuleInformation to
determine if the module was actually loaded or unloaded (and which can be free 
if the
module was loaded on LoadLibrary) and if it was call PatchAllModules.  The
PatchAllModules uses the last PatchAllModules result to get the list of modules 
to
check.  I think what's in is the safest method.

There's just the question of how much does keeping a reference count help? 
Unfortunately the program I'm using has OPENGL32 loaded already, and from then 
on
it's LoadLibrary, FreeLibrary, so this program will not show any benefit, 
because it
doesn't use it.  I'm okay with using it or not.

csilvers,
// TODO(csilvers): for this to be reliable, we need to also take
// into account if we *would* have patched any modules had they not
// already been loaded.  (That is, made_changes should ignore
// g_last_loaded.)
The return value is ignored in this patch anyway.

> the double-checked locking you were doing wasn't safe, and isn't
> really necessary anyway
> 
> if (!load_map) {
>   // Allocate with the lock held, but we didn't have the lock when we
>   // checked (to avoid the cost of getting it all the time), so check again.
>   SpinLockHolder h(&patch_all_modules_lock);
>   if(!load_map) {
>     nopatching_set = new std::set<std::string>;
>     load_map = new std::map<HMODULE, int>;
>   }
> }

How's that not safe?  It puts the allocations all in one place for
that function avoiding other bugs, like forgetting to duplicate the 
allocation in every place required in the function.

Now this is not safe.  Note the NULL derefernce in the else case if it
hasn't already been allocated. 
  if (::GetModuleHandleExW(0, lpFileName, &rv)) {
...
    if (!g_load_map)  g_load_map = new std::map<HMODULE, int>;
    ++(*g_load_map)[rv];
...
  } else {
...
    (*g_load_map)[rv] = 1; 
...
  }

This patch,
PatchAllModules called 82,  0.050 ms, sum 13.976 ms

Original comment by da...@fries.net on 4 Feb 2010 at 5:11

GoogleCodeExporter commented 9 years ago

> andrey.semashev, EnumProcessModules is taking .038 ms to .051 ms, which would 
need
> to be run twice, before and after, then the list of modules must be compared 
to see 
> what changed.

It's once. One may cache the snapshot from the last call. And the comparison 
may be 
optimized, in case of LoadLibrary for sure.
However, I don't have numbers to argue. I just think that this solution might 
be 
simpler and more reliable, whereas not very different in terms of performance.

Original comment by andrey.s...@gmail.com on 4 Feb 2010 at 6:02

GoogleCodeExporter commented 9 years ago

} I just think that this solution might be simpler and more reliable

Sorry, what is "this solution"?

} There's just the question of how much does keeping a reference count help?

OK, let's keep it out for simplicity.

I think the patch in comment 90 is pretty simple, all told -- and even simpler 
once
we get rid of refcounting.  I am able to look at it and say it's "obviously" 
correct
(which means it has lots of bugs probably, only subtle ones :-) ), so I'm happy 
to go
with it.  The only question is performance.

Can both of you say the performance is good on this patch?  Can you compare 
timing
runs for this patch with your own favorite patch (if different)?  I've attached 
a new
version of patch_functions.cc below, with refcounting removed, just to make 
life easier.

} > the double-checked locking you were doing wasn't safe
} How's that not safe?

It's a moot point now, since we got rid of this data structure (for now), but if
you're curious, here's a brief explanation:
   http://www.drdobbs.com/184405726

Original comment by csilv...@gmail.com on 4 Feb 2010 at 7:47

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

I tested it, looks good.

Original comment by da...@fries.net on 4 Feb 2010 at 10:58

GoogleCodeExporter commented 9 years ago

Great!  I'll check it into svn so it can get wider exposure.

Original comment by csilv...@gmail.com on 11 Feb 2010 at 1:26

GoogleCodeExporter commented 9 years ago

This should be fixed in perftools 1.6, just released.

Original comment by csilv...@gmail.com on 5 Aug 2010 at 8:46

Changed state: Fixed

ddio / gperftools

CreateToolhelp32Snapshot fails spuriously in an MT application #200