Closed GoogleCodeExporter closed 9 years ago
} but since B was added to the nopatching_set this time C doesn't get patched.
You're right. The more I look into it, the more fragile this whole patching
optimization is. I'm worried other bugs may be lurking in there as well.
I wonder if there's a more targeted, but safer, optimization we could do that
would handle the bad situations that arise in practice, even if it doesn't
handle every situation. Do you have a use case you're optimizing for, where you
can say what modules the program is loading and freeing, what other modules
that brings in, and whether any of those modules have a refcount of 1 or more
(so windows isn't really doing any work) before you get to the problematic
load/free calls?
In the meantime, thanks for the commented version of the patch. I'll take a
look
through it and try to puzzle through all the possible situations it could have
to
deal with.
Original comment by csilv...@gmail.com
on 17 Jan 2010 at 3:53
Our program isn't doing anything complicated as far as modules go. The
executable
is linked against two or three dozen libraries that get loaded before the first
PatchAllModules call, then the rest of the libraries are dynamically loaded at
program startup time, it's about 80 in total. It is multithreaded and I think
some
of the libraries may be loaded in parallel. Once the program is up and running
there's just this one OPENGL32 LoadLibrary/FreeLibrary happening multiple times
per
frame. Since the program was linked against OPENGL32 at the start it would
never
be unloaded. At program shutdown time some or all of the dynamically loaded
libraries are unloaded and the program exits.
There isn't any crazy constant loading and unloading of libraries from
different threads all the time. With all the libraries that have to be loaded
on startup, the last released tcmalloc was doubling the application startup
time, and the OPENGL32 LoadLibrary/FreeLibrary was seriously hurting frame
time. The version I sent Friday/Saturday makes a huge difference to our
program, but our program's not one to find the corner cases of getting the
patching right.
Original comment by da...@fries.net
on 17 Jan 2010 at 10:29
Right, I don't want your example to test the corner cases, I want your example
to
understand a common use case. I'd like to start by optimizing for your case,
rather
than the general case. I'm hoping that will be easier.
I don't understand why my patch, without your changes, is so slow for you. If I
understand your situation right, when the OPENGL LoadLibrary() happens within a
frame, then PatchAllFunctions should return false, since it didn't need to patch
anything (everything was already patched, due to the OPENGL load at program
start).
So you should be seeing barely any overhead at all. Why were you, in fact,
seeing
lots of overhead?
Original comment by csilv...@gmail.com
on 18 Jan 2010 at 12:00
patching_map was never allocated, so the FreeLibrary always ran
PatchAllModules. patching_map was never allocated because there was never any
library passed to LoadLibrary that required patching, so it never reached the
allocation. Once I allocated it, the performance was much improved. What's the
reasoning for allocating them dynamically instead of having a filescope static
object?
my version of patch_functions.cc
LoadLibrary PatchAllModules 0.146, total 0.146
LoadLibrary PatchAllModules 0.060, total 0.205
LoadLibrary PatchAllModules 0.153, total 0.358
LoadLibrary PatchAllModules 0.240, total 0.598
FreeLibrary PatchAllModules 0.058, total 0.058
LoadLibrary PatchAllModules 0.177, total 0.775
FreeLibrary PatchAllModules 0.057, total 0.115
LoadLibrary PatchAllModules 0.170, total 0.944
LoadLibrary PatchAllModules 0.337, total 1.281
LoadLibrary PatchAllModules 0.347, total 1.628
LoadLibrary PatchAllModules 0.171, total 1.799
LoadLibrary PatchAllModules 0.275, total 2.074
LoadLibrary PatchAllModules 0.177, total 2.251
LoadLibrary PatchAllModules 0.177, total 2.428
LoadLibrary PatchAllModules 0.178, total 2.606
LoadLibrary PatchAllModules 0.178, total 2.785
LoadLibrary PatchAllModules 0.191, total 2.976
LoadLibrary PatchAllModules 0.227, total 3.202
LoadLibrary PatchAllModules 0.199, total 3.402
LoadLibrary PatchAllModules 0.188, total 3.590
FreeLibrary PatchAllModules 0.083, total 0.198
FreeLibrary PatchAllModules 0.072, total 0.269
FreeLibrary PatchAllModules 0.072, total 0.341
FreeLibrary PatchAllModules 0.066, total 0.408
FreeLibrary PatchAllModules 0.065, total 0.473
FreeLibrary PatchAllModules 0.063, total 0.536
FreeLibrary PatchAllModules 0.065, total 0.600
FreeLibrary PatchAllModules 0.067, total 0.667
FreeLibrary PatchAllModules 0.065, total 0.732
FreeLibrary PatchAllModules 0.059, total 0.791
FreeLibrary PatchAllModules 0.275, total 1.066
final total LoadLibrary 3.590 ms
final total FreeLibrary 1.066 ms
Fixed the patching_map failure and 6.23 to 6.65 second start time
LoadLibrary PatchAllModules 1.882, total 1.882
LoadLibrary PatchAllModules 1.783, total 3.666
LoadLibrary PatchAllModules 1.803, total 5.469
LoadLibrary PatchAllModules 1.808, total 7.277
LoadLibrary PatchAllModules 1.736, total 9.013
LoadLibrary PatchAllModules 1.800, total 10.813
LoadLibrary PatchAllModules 1.754, total 12.567
FreeLibrary PatchAllModules 1.732, total 1.732
LoadLibrary PatchAllModules 2.007, total 14.574
LoadLibrary PatchAllModules 1.897, total 16.470
LoadLibrary PatchAllModules 2.012, total 18.483
LoadLibrary PatchAllModules 2.002, total 20.484
LoadLibrary PatchAllModules 2.168, total 22.652
LoadLibrary PatchAllModules 2.110, total 24.761
LoadLibrary PatchAllModules 2.062, total 26.823
LoadLibrary PatchAllModules 2.097, total 28.920
LoadLibrary PatchAllModules 2.302, total 31.222
LoadLibrary PatchAllModules 2.774, total 33.996
LoadLibrary PatchAllModules 2.513, total 36.509
LoadLibrary PatchAllModules 2.611, total 39.119
LoadLibrary PatchAllModules 2.672, total 41.792
LoadLibrary PatchAllModules 2.774, total 44.566
LoadLibrary PatchAllModules 2.844, total 47.410
LoadLibrary PatchAllModules 2.775, total 50.185
LoadLibrary PatchAllModules 2.853, total 53.038
LoadLibrary PatchAllModules 2.959, total 55.997
LoadLibrary PatchAllModules 3.116, total 59.113
LoadLibrary PatchAllModules 3.067, total 62.180
LoadLibrary PatchAllModules 3.044, total 65.224
LoadLibrary PatchAllModules 3.204, total 68.428
LoadLibrary PatchAllModules 3.243, total 71.671
final total LoadLibrary 71.671 ms
final total FreeLibrary 1.732 ms
Original comment by da...@fries.net
on 18 Jan 2010 at 10:16
Now that I think about it, the one FreeLibrary call in your version of the
library
was only because I put the new allocation in the else part of the FreeLibrary
if (patching_map) {
}... else{
patching_map = new std::map<HMODULE, int>;
}
The problem I see is patching_map only has a list of libraries that required
patching when Perftools_LoadLibraryExW was called after LoadLibrary was
patched. If a library that required patching was loaded before tcmalloc,
PatchAllModules would patch it, but FreeLibrary wouldn't have it in the list of
libraries that required unpatching when unloaded.
Original comment by da...@fries.net
on 18 Jan 2010 at 10:39
Issue 206 has been merged into this issue.
Original comment by csilv...@gmail.com
on 18 Jan 2010 at 11:03
} patching_map was never allocated, so the FreeLibrary always ran
PatchAllModules.
Hmm, this seems like a straightforward bug. How does it work, instead of
applying
your patch, to just take the version from comment 38, and then in
Perftools_FreeLibrary(), insert the following right after you acquire the lock:
if (!patching_map)
return rv;
} What's the reasoning for allocating them dynamically instead of having a
} filescope static object?
They were static originally, but this crashed tcmalloc_minimal_test-static. I
don't
know what the particular problem was with static linking, but I do know that
filescope static objects with constructors are tricky to get right, so I'm
happy to
just punt on that issue. I don't feel strongly though, as long as it passes
the
tests.
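One common way to sidestep file-scope constructor ordering is to build the object on first use. The following is only a sketch under assumptions: `PatchingMap()` is a hypothetical accessor and `Handle` a stand-in for `HMODULE`; it is not what patch_functions.cc does, and inside an allocator hook the `new` call itself would need care.

```cpp
#include <map>

using Handle = int;  // hypothetical stand-in for HMODULE

// Construct-on-first-use: the map is created the first time the accessor is
// called, so nothing depends on file-scope constructor ordering. It is
// deliberately never deleted, which keeps it usable during process shutdown.
std::map<Handle, int>& PatchingMap() {
  static std::map<Handle, int>* const map = new std::map<Handle, int>;
  return *map;
}
```

Since C++11 the initialization of a function-local static is thread-safe; compilers predating C++11 did not guarantee that, so older toolchains would need an explicit lock around the first call.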
} The problem I see is patching_map only has a list of libraries that required
} patching when Perftools_LoadLibraryExW was called after LoadLibrary was
patched.
Right -- it looks like there are two issues still open for this bug report: one
is
the timing is too slow for you, and another is a correctness report. Let's
deal with
the timing first. I believe my patch should be fast, and if not I want to
figure out
why (I understand your patch fixes this problem, but I want to understand why
mine
doesn't, before deciding whether to apply yours.) This bug you pointed out in
FreeLibrary may be the cause.
Once we've got the timing licked, let's look into this correctness issue. We
can
then figure out the minimal change needed to get that fixed too.
Original comment by csilv...@gmail.com
on 18 Jan 2010 at 11:08
The second timing run was comment 38, with only the timing collection and the
patching_map allocation added. Replacing in FreeLibrary, as you suggest,
if (patching_map) {
with
if (!patching_map) return rv;
is only going to save one PatchAllModules call. That's 1.732 ms. I can rerun
the above timing, but it is hardly going to change it. The comment 38 version
along with a patching_map fix is much faster than the original, but mine is
still about 15 times faster in the PatchAllModules routine. Compare the
PatchAllModules execution times: yours starts at 1.882 and increases until it
gets to 3.243, because every time there are more modules to look through for
symbols; it always looks through the existing unpatched ones, and it doesn't
matter whether modules are loading or unloading. Mine takes a constant amount
of time when something is unloaded (except the last unload, who knows why)
because it just has to remove entries from the data structures and look at no
modules; when a module is loaded it only has to look for symbols in that one
library.
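The difference between rescanning everything and looking only at new modules can be sketched as follows. This is an illustrative model, not code from either patch: `IncrementalScanner`, `Scan`, and `Handle` are hypothetical names, and "inspecting" a module stands in for the symbol lookup.

```cpp
#include <cstddef>
#include <set>
#include <vector>

using Handle = int;  // hypothetical stand-in for HMODULE

// Remembers which modules were present at the previous scan.
struct IncrementalScanner {
  std::set<Handle> seen;

  // Returns how many modules had to be inspected for this snapshot.
  std::size_t Scan(const std::vector<Handle>& loaded) {
    std::set<Handle> current(loaded.begin(), loaded.end());
    std::size_t inspected = 0;
    for (Handle h : current) {
      if (seen.count(h) == 0) ++inspected;  // only new modules cost a lookup
    }
    seen.swap(current);  // also forgets modules that are no longer loaded
    return inspected;
  }
};
```

With this shape, loading one more library costs one inspection regardless of how many modules are already resident, and an unload costs none.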
Original comment by da...@fries.net
on 19 Jan 2010 at 1:40
We are talking past each other. It must be very frustrating to you. Let's
start
over; I'll try to do things very slowly but surely, bringing up only one point
at a
time in my comments.
} Compare the PatchAllModules execution times
But my claim is that PatchAllModules shouldn't be being called at all -- all
the
times you're calling LoadLibraryExW and FreeLibrary in your inner loop (per
frame),
it's basically a noop, isn't it? Because the relevant libraries were loaded at
program startup. So all the LoadLibrary and FreeLibrary are doing inside
windows is
incrementing and decrementing a refcount. Is that accurate?
If not, please explain what the *windows* code (not the perftools code) is
doing in
the LoadLibraryExW and FreeLibrary calls within your inner loop (the stuff that
happens per-frame). In particular: what modules is it loading into the
executable
that weren't loaded before, on each of the per-frame LoadLibrary calls? What
modules
is it "taking out" of the executable, on each of the per-frame FreeLibrary
calls?
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 2:04
Sorry, my timing run wasn't clear. The timings I posted above were for an
entire application run. As it stands neither version calls PatchAllModules in
our frame.
Original comment by da...@fries.net
on 19 Jan 2010 at 2:17
Ok, now I'm more confused than ever. If neither version is calling
PatchAllModules in
our frame, then why is your version so much faster than mine? You talk about
how much
you've sped up PatchAllModules, but that shouldn't matter if it's never being
called.
So what code is running in my version of the file, but not yours, that is
causing
slowness (cutting and pasting these lines of code into the bug report would be
helpful).
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 2:38
The time it takes the application to draw a frame once it is up and running is
only
part of the original problem with the last released tcmalloc. Neither version
is
calling PatchAllModules in the application frame, hence the comment 55 timings
do
not address what overhead there is in the frame with either version. I could
make
some time measurements if that would be of interest. What I was timing was how
long PatchAllModules took to execute, which routine it was called from and a
running sum. PatchAllModules is being called at application startup and again
at
application termination. Are you seeing why my PatchAllModules was so much
faster
than the other version?
Original comment by da...@fries.net
on 19 Jan 2010 at 2:48
No, I still haven't tried to understand why your PatchAllModules is faster. It
looks
like I still don't even understand what problem you're trying to solve.
Originally,
I thought the report was that each frame was taking X time with tcmalloc, but
only
half X time with the windows allocator. Is that not true? Or perhaps it was
true,
but one of my patches fixed that particular problem?
What problem are you seeing now, that still needs to be solved (based on the
current
svn-trunk, which is with my patch committed but not yours on top of it)? Is it
only
with startup and tear-down time? And by "startup", does that refer to just
when the
program actually starts up, or are you also considering the time spent while
loading
whatever modules you load up after the program starts but before you start
doing
"real work"?
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 2:55
Issue 206 did start out only as a big increase in frame time; as we dug into
it, we also noticed the increase in startup time. Yes, the scope changed, but
so did our observations of tcmalloc and the issues involved.
The original two problems: startup time doubled, from 6 seconds with the
Microsoft allocator to 11 seconds with tcmalloc. I measured startup time from
executing the application to when the last completed loading data message
printed. That's pretty much when the application is available to the user to
interact with. The other problem was frame time; one number I can quote from
my earlier comment 2 of issue 206 is GL SwapBuffers taking 0.17ms for the
Microsoft allocator and 4.17ms for tcmalloc.
As far as what's in subversion, I don't think anyone will notice the startup
time
difference now. I don't think the additional tcmalloc overhead in the
frametime is
a problem either, though in this case I would need to benchmark the subversion
one
to my version and see which is faster.
As far as timing goes, I think it is down to what is tcmalloc's goals? Is it
to be as fast as it can be? In which case the choice between a PatchAllModules
with a constant time overhead and one whose time increases linearly as modules
load should be pretty obvious.
The other question is whether either is more correct and safe in the corner
cases. It has already been covered that the nopatching_set might not be safe
in all cases, but both versions are using it. If the patching_map has been
fixed in subversion as discussed, PatchAllModules will not be called in
FreeLibrary on modules loaded before LoadLibrary has been patched, as they
will not be added to patching_map. My version handles that case so far as I
know.
My version is winning on faster and safer, but feel free to poke holes in it
and
make it better.
As far as our application goes, Ryan has found that turning on the Microsoft
allocator's low fragmentation heap greatly decreases the nondeterminism that
was really hurting it earlier. We have an all-night run of that version
running against tcmalloc in progress right now. We ran both over the weekend,
and the Microsoft allocator version crashed at some point while tcmalloc was
still running, though slowly, because of a problem in our code; the Microsoft
allocator most likely ran out of memory, probably because of the same issue.
Original comment by da...@fries.net
on 19 Jan 2010 at 4:00
} As far as timing goes, I think it is down to what is tcmalloc's goals?
Stop. Please take a step back here. You are trying to push solutions, and I
am
still trying to understand what the problems are. I know you've identified, in
your
own head, what the problems are and what the solutions are, but that doesn't
help me
any. I need to understand what the problems are as well. I'm sure your patch
solves
some problems that are important to you, but I don't know what they are. I
can't
evaluate this patch until I understand what it's trying to do. "Speed things
up" is
not a specific enough answer. I need to understand what parts of your execution
environment are causing you trouble right now. Then I will have the necessary
context to understand why the current code is causing those problems, and what
the
solutions might be.
So please, tell me, without any other side-information to confuse the issue and
without referencing any other parts of the discussions we've had until now, how
the
perftools at svn-head compares to windows malloc (1) at program start up, and
(2) at
frame time. Let's go from there.
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 4:20
OK, I had a chance to look at your patch, and if I understand it right, the
major
difference is using the refcount information to avoid calling PatchAllModules
when a
library is loaded multiple times. That makes sense to me. In your patch, you
divide
that refcount-management between LoadLibraryExW and PatchAllModules, but while
it
gives a bit extra accuracy, I don't think it will help runtime at all (because
the
extra accuracy will almost never trigger).
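The refcount idea can be sketched as a small tracker that reports only the transitions that matter. This is a hedged illustration, not the code from either patch: `LoadCounts`, `OnLoad`, `OnFree`, and `Handle` are hypothetical names, and treating an untracked module conservatively is an assumption.

```cpp
#include <map>

using Handle = int;  // hypothetical stand-in for HMODULE

// Reports when a load or free changes whether a module is held at all.
class LoadCounts {
 public:
  // True only for the first tracked load of `h` (count 0 -> 1).
  bool OnLoad(Handle h) { return ++counts_[h] == 1; }

  // True when the last tracked load of `h` is released (count 1 -> 0).
  bool OnFree(Handle h) {
    auto it = counts_.find(h);
    if (it == counts_.end()) return true;  // untracked: assume work is needed
    if (--it->second > 0) return false;    // still held by an earlier load
    counts_.erase(it);
    return true;
  }

 private:
  std::map<Handle, int> counts_;
};
```

A caller would run the expensive patching pass only when `OnLoad` or `OnFree` returns true, so a library that is loaded and freed repeatedly while already held triggers no extra passes.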
So I've tried implementing the same idea; the new file is attached here.
Instead of reporting the comparison at program startup and at frametime
against svn-head, try comparing it against this new file.
This also suggests a third thing you could measure (instrument): how many times
PatchAllModules is actually being called in your application. My goal is to
get it
to the point it's called less than 100 times. Then it doesn't matter how
efficient
it is, since it's called so rarely.
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 6:01
Attachments:
Yes part of the patch is to avoid calling PatchAllModules, the other part is in
PatchAllModules to only check new modules for malloc symbols
libc_info.PopulateWindowsFn. That's the mostly constant time execution of
PatchAllModules I mentioned earlier.
As far as how many times PatchAllModules is listed, two separate complete
program
runs are listed in comment 55, that is when PatchAllModules is called from
LoadLibrary or FreeLibrary. I'll see about the latest version.
Original comment by da...@fries.net
on 19 Jan 2010 at 6:11
Comment 55 just gives times. I'm interested in counts.
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 6:17
Comment 55 gives one line per PatchAllModules call. I put each run in a file:
cat mine.txt |grep LoadLibrary |wc -l
18
cat mine.txt |grep FreeLibrary |wc -l
13
wc -l mine.txt
31 mine.txt
Called 18 times from LoadLibrary, 13 from FreeLibrary, 31 times total
cat patching_map_fix.txt |grep LoadLibrary |wc -l
30
cat patching_map_fix.txt |grep FreeLibrary |wc -l
1
wc -l patching_map_fix.txt
31 patching_map_fix.txt
Called 30 times from LoadLibrary, 1 from FreeLibrary, 31 times total
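The same counts, plus the summed time per caller, can be pulled from the log in one pass. A minimal sketch, assuming the line format shown in comment 55 ("<caller> PatchAllModules <ms>, total <ms>"); `Tally` and `TallyLog` are hypothetical names.

```cpp
#include <map>
#include <sstream>
#include <string>

struct Tally {
  int calls = 0;
  double total_ms = 0.0;
};

// Parses lines shaped like "LoadLibrary PatchAllModules 0.146, total 0.146"
// and accumulates the call count and summed per-call time by caller name.
std::map<std::string, Tally> TallyLog(const std::string& log) {
  std::map<std::string, Tally> out;
  std::istringstream lines(log);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string caller, routine;
    double ms = 0.0;
    if (!(fields >> caller >> routine >> ms)) continue;  // not a timing line
    if (routine != "PatchAllModules") continue;
    Tally& t = out[caller];
    ++t.calls;
    t.total_ms += ms;
  }
  return out;
}
```

Summary lines such as "final total LoadLibrary 3.590 ms" are skipped because their third field is not a number.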
Original comment by da...@fries.net
on 19 Jan 2010 at 6:23
Ok, so what's the status now with respect to the three questions I've asked:
(1) how the perftools with my latest patch compares to windows malloc at
program
start up
(2) how the perftools with my latest patch compares to windows malloc at frame
time
(3) how many times PatchAllFunctions() is called with my latest patch vs your
patch
If I understood your last comment correctly, for (3) it's the same number of
times
(though each call to PatchAllFunctions is faster in your case). What about (1)
and
(2)?
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 6:37
1. The time difference doesn't show up on a hand held stopwatch.
2. Microsoft 4.088ms, comment 67 tcmalloc 4.042 ms, average of 30 frames for
each,
but both jitter around.
3. PatchAllFunctions calls, comment 67 tcmalloc 41 times, my patch 65 times
Original comment by da...@fries.net
on 19 Jan 2010 at 5:51
I believe the comment-67 patch had a bug: I had an == and != reversed in
FreeLibrary.
The line:
if (--it->second == 0) {
should actually be
if (--it->second > 0) {
This means it probably wasn't patching as correctly as it ought to. I don't
know how
much of a difference it will make in your timing runs, but it may. I'm sorry I
didn't catch that earlier. :-( It's my fault; I really need better testing for
this
part of the codebase.
But let's assume the numbers you report hold up:
} 1. The time difference doesn't show up on a hand held stopwatch.
} 2. Microsoft 4.088ms, comment 67 tcmalloc 4.042 ms
} 3. PatchAllFunctions calls, comment 67 tcmalloc 41 times, my patch 65 times
It looks like this addresses all the issues that you had originally, is that
right?
My understanding is you had two problems originally that you wanted solved:
1) perftools was much slower than without at startup time
2) perftools was slower than without when making frames
With the latest patch, both of those issues are addressed, no? Is there any
other
existing issue you have, that the patch doesn't address?
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 7:20
if (--it->second > 0) {
Can't be right in FreeLibrary, it's now always calling PatchAllModules.
Original comment by da...@fries.net
on 19 Jan 2010 at 9:15
Ugh, I found another, really stupid bug. :-( In LoadLibraryExW, it says:
if (nopatching_set && nopatching_set->count(szFullPath) > 0)
nopatching_set2->insert(rv);
return rv; // case (1)
There need to be braces around this. Right now it's always returning rv, and
never
trying to patch! No wonder you don't see any timing differences.
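The effect of the missing braces can be reproduced in isolation. This is an illustrative reduction, not the actual LoadLibraryExW hook: the function names are hypothetical and module paths stand in for the handles the real code stores.

```cpp
#include <set>
#include <string>

// Shape of the bug: without braces only the insert is conditional, so the
// function returns before any patching for every path.
bool SkipsPatchingBuggy(const std::set<std::string>& nopatching,
                        const std::string& path,
                        std::set<std::string>* noop) {
  if (nopatching.count(path) > 0)
    noop->insert(path);
  return true;  // not governed by the if: always taken
}

// With braces, the early return applies only to paths already known to need
// no patching; everything else falls through to the patching step.
bool SkipsPatchingFixed(const std::set<std::string>& nopatching,
                        const std::string& path,
                        std::set<std::string>* noop) {
  if (nopatching.count(path) > 0) {
    noop->insert(path);
    return true;
  }
  return false;  // caller continues on to patch
}
```

For a path that is not in the no-patching set, the first version still reports that patching is skipped, which matches the "always returning rv" symptom described above.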
Does this fix make things behave better? If not, I'll attach an instrumented
version
of the file, which does printfs so we can better understand what's going on.
Original comment by csilv...@gmail.com
on 19 Jan 2010 at 9:28
The comment 67 patch came out with PatchAllModules called 41 times for a total
time
spent in PatchAllModules of 144.694 milliseconds. Adding in the comment 73 and
75
changes it now calls PatchAllModules 47 times for a total time spent in
PatchAllModules of 164.971 milliseconds and an average of 3.51 milliseconds per
PatchAllModules call, so technically yes it made a negative time impact. I'll
assume
it is a correctness tradeoff, so no problem there.
To answer your question, the startup time and frame time overhead are both low
enough for our application. You can consider these issues closed, though I
will point out that in the same conditions my version calls PatchAllModules 65
times for a total time spent in PatchAllModules of 15.612 milliseconds and an
average of 0.24 milliseconds per PatchAllModules call, or about ten times less
total time than the above version.
You have access to the tcmalloc repository, you can pick the faster or slower
algorithm, your choice.
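The per-call averages and the "about ten times" ratio follow directly from the quoted call counts and totals. A small sketch of that arithmetic; `RunStats`, `MeanMs`, and `TotalRatio` are hypothetical helper names.

```cpp
// Call count and summed PatchAllModules time for one application run.
struct RunStats {
  int calls;
  double total_ms;
};

// Average time per PatchAllModules call.
double MeanMs(const RunStats& r) { return r.total_ms / r.calls; }

// How many times more total time `slow` spent than `fast`.
double TotalRatio(const RunStats& slow, const RunStats& fast) {
  return slow.total_ms / fast.total_ms;
}
```

With the figures quoted above (47 calls and 164.971 ms versus 65 calls and 15.612 ms) this gives roughly 3.51 ms and 0.24 ms per call, and a total-time ratio of about 10.6.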
Original comment by da...@fries.net
on 19 Jan 2010 at 10:37
Thank you for the data. It sounds like the complexity that your version adds
is
worthwhile, then. I can't take the patch as you have it, because it calls
windows
functions while holding the spinlock, and that can lead to deadlocks (though
obviously it doesn't for your application) -- I have comments about that where
the
lock is defined in the source file.
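One way to avoid holding a lock across an OS call is to take the snapshot first and only then enter a short critical section. The sketch below is an assumption-laden illustration, not the perftools code: `ModuleTracker` and `Refresh` are hypothetical, `std::mutex` stands in for the spinlock, and `enumerate` stands in for whatever Windows query lists the loaded modules.

```cpp
#include <functional>
#include <mutex>
#include <set>
#include <vector>

using Handle = int;  // hypothetical stand-in for HMODULE

class ModuleTracker {
 public:
  // Calls `enumerate` with no lock held, then records the result under the
  // lock. Returns the modules that were not known before this call.
  std::vector<Handle> Refresh(
      const std::function<std::vector<Handle>()>& enumerate) {
    std::vector<Handle> snapshot = enumerate();  // slow call, lock not held
    std::vector<Handle> added;
    std::lock_guard<std::mutex> hold(mu_);       // short critical section
    for (Handle h : snapshot) {
      if (known_.insert(h).second) added.push_back(h);
    }
    return added;
  }

 private:
  std::mutex mu_;
  std::set<Handle> known_;
};
```

The trade-off is that the snapshot can be stale by the time the lock is taken, so the bookkeeping has to tolerate modules appearing or disappearing between the two steps.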
However, I've taken your ideas and attempted to come up with a patch that
incorporates them but is not subject to the lock inversion problem. I've
attached it
below. Try it out and see how it works for you. I can't guarantee it's
bug-free,
but maybe you can use it as a starting point to get to something that is as
good as
your patch.
I'm hoping to release a new perftools tomorrow (before I go on vacation), so
there's
a bit of a time crunch, unfortunately. But if we don't get this change in for
this
release, that's ok, we'll get it in the next one.
Let me know how this works, and/or if you have any changes to suggest to it.
Original comment by csilv...@gmail.com
on 20 Jan 2010 at 3:36
Attachments:
Looks like in PatchAllModules
patched_modules->insert(libc_info.hmodule());
needs to be moved outside of the if statement in the last for loop; otherwise
it only holds the modules that were patched instead of all currently loaded
modules. I see in LoadLibrary it adds any new libraries that were in the
nopatching_set. I see FreeLibrary is erasing from patch_set when the reference
goes to zero. What I'm seeing is that all the modules that were loaded before
LoadLibrary was patched, or modules that a module was linked to (and don't get
LoadLibrary called on), aren't in the list and will be searched every time.
Now, just moving it outside the if statement will not scrub from
patched_modules the modules that are unloaded at the same time as another
module because they were brought in by that module.
The other issue is in our case OPENGL32 is already loaded. The first
LoadLibrary
will leave a reference count of 1 in load_map, fail the nopatching_set, invoke
PatchAllModules, add to the nopatching_set, and return. FreeLibrary will then
decrement the reference count to 0, erase from load_map and patch_set and call
PatchAllModules. After that, the reference count will be bouncing between 0
and 1,
and while it will be skipping the PatchAllModules because of the nopatching_set
and
noop_patch_set, it is going to be searching the nopatching_set and being
inserted
and removed from patch_set and noop_patch_set.
In my version the first FreeLibrary call would also see a 0 reference count and
call PatchAllModules, but that would see the module still in memory and set the
reference count to 1, and from there on it would be going between 1 and 2. At
least that's how it's supposed to work.
By the way, I liked your swap erasing of hModules to get the current set of
added
modules, that qualifies as low cost in my book.
I'll plan on getting some numbers first thing in the morning. As I expect you
are
needing to do other things to get ready for vacation (I always do), I would
suggest
shipping with this or the previous version, whichever is faster, and leave
looking
at the patch until when you get back.
Original comment by da...@fries.net
on 20 Jan 2010 at 4:45
} The first LoadLibrary will leave a reference count of 1 in load_map, [...]
Thanks, that's a very nice explanation of what you're seeing. That helps me
understand the control flow now.
} call PatchAllModules, but that would see the module still in memory and set
the
} reference count to 1, and from there on it would be going between 1 and 2. At
} least that's how it's supposed to work.
I saw that, but I'm not sure it's safe. The problem is that once you start
incrementing refcounts in PatchAllModules for the dependent libraries (that is,
the
libraries brought in by whatever you passed to LoadLibrary), you have to run
PatchAllModules again to decrement the refcounts. Otherwise, your refcount
gets too
high. I worked through a complicated case where we think someone has a
refcount of
1, when it's actually been unloaded, unless we call PatchAllModules more often
in
FreeLibrary. I think the set lookups will end up being cheaper overall.
I hope so anyway! See what your timings show. And thanks for finding the bug
with
the if.
Original comment by csilv...@gmail.com
on 20 Jan 2010 at 6:54
nopatching_set is never allocated; for OPENGL32 this is fatal. In
PatchAllModules, 2nd for loop, every time a g_module_libcs[i] is not found in
the currently_loaded_modules, set_is_valid is set to false and made_changes is
set to true, which causes PatchAllModules to return true and so forth.
g_module_libcs[i] will never be found if it's already invalid.
I don't see any tests that are loading dynamic libraries. Time for a windows
specific test that starts up, dynamically loads a few common windows DLLs that
would not otherwise be loaded into the program, loops a specific number of
times on LoadLibrary and FreeLibrary for a library that would have been loaded
with the program, such as USER32 or GDI32, and then unloads the previously
dynamically loaded libraries. It also needs to time each LoadLibrary and
FreeLibrary. Yes, it specifically mirrors our use case, but it will let you
test your own changes.
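The timing loop of such a test might look like the sketch below. It is deliberately platform-neutral so it can be compiled anywhere: `LoopTimes` and `TimeLoadFreeLoop` are hypothetical names, and the load and free operations are passed in as callables rather than hard-coded.

```cpp
#include <chrono>
#include <functional>

struct LoopTimes {
  double load_ms = 0.0;
  double free_ms = 0.0;
  int iterations = 0;
};

// Times `iterations` load/free pairs. `load` returns an opaque handle that is
// handed back to `unload`; both are supplied by the caller.
LoopTimes TimeLoadFreeLoop(const std::function<void*()>& load,
                           const std::function<void(void*)>& unload,
                           int iterations) {
  using Clock = std::chrono::steady_clock;
  using Ms = std::chrono::duration<double, std::milli>;
  LoopTimes out;
  for (int i = 0; i < iterations; ++i) {
    Clock::time_point t0 = Clock::now();
    void* handle = load();
    Clock::time_point t1 = Clock::now();
    unload(handle);
    Clock::time_point t2 = Clock::now();
    out.load_ms += Ms(t1 - t0).count();
    out.free_ms += Ms(t2 - t1).count();
    ++out.iterations;
  }
  return out;
}
```

On Windows the two callables could wrap `LoadLibraryA("USER32")` and `FreeLibrary` with a cast between `HMODULE` and `void*`; the surrounding setup that first loads and later unloads a few otherwise unused DLLs is left out here.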
Original comment by da...@fries.net
on 20 Jan 2010 at 3:29
Ok, here's a new version of the file, which fixes the two bugs you've pointed
out.
} I don't see any tests that are loading dynamic libraries.
That's right. It would be great if you'd like to write one to contribute! I
don't know windows programming at all really, and don't feel qualified to
write anything. In particular, I'd love a test where we load a dll which
brings in other ones as dependencies, and then we load some of the
dependencies explicitly too, and have different load/free orderings. I feel
that will test corner cases we don't really see right now.
Original comment by csilv...@gmail.com
on 20 Jan 2010 at 3:59
For my next trick, I will actually attach the new file...
Original comment by csilv...@gmail.com
on 20 Jan 2010 at 4:05
Attachments:
01-20-2010 comment 81 update
PatchAllModules called 73, sum 154.851 ms
I reworked my patch to avoid calling GetModuleInfo with the spinlock held.
PatchAllModules called 65, sum 14.219 ms
Original comment by da...@fries.net
on 20 Jan 2010 at 6:29
Attachments:
I'm still a bit nervous about the complexity of this patch. In particular, I'm
worried that it's dangerous to be incrementing the refcount in PatchAllModules,
when
there's no visibility into when it should be decremented. But I can't think of
any
situations where it will cause trouble, so I'm willing to go with it.
There are a few minor-ish reasons I'd prefer to use my patch as the basis
rather than yours: as part of the patch, I was able to clean up
ModuleEntryCopy and LibcModules and I'd like to keep that, and also I have the
noop_patch_set optimization that I think will help in some cases (though maybe
not yours). So I've updated my patch to have all the functionality that's in
your patch (as far as I can tell). I've attached it here. Do you mind giving
it a try and seeing if it performs the same as your patch?
If we do end up going with your patch, we'll need to make a few small changes.
For instance, I notice that you set loaded_modules early in PatchAllModules,
but then actually do the patching later. I'm concerned that's not safe: if two
threads try to run this at the same time, it's possible one will think the
module has already been handled when it hasn't been yet. It's easy enough to
move the increment and the actual patching to be under the same lock
critical-section.
While you're running the numbers, I'll try to think a bit more through the
locking
issues, and try to convince myself this change is safe. If I can't think of
any
problems, I'll try to get one of the two versions of this patch committed today.
Original comment by csilv...@gmail.com
on 20 Jan 2010 at 8:07
Attachments:
I chickened out -- these patches just aren't well enough tested yet. I
released
without any of the speedup patches. We'll keep working on it so we can get it
right
for the next release.
Original comment by csilv...@gmail.com
on 20 Jan 2010 at 11:07
comment 84 version timing
PatchAllModules called 74, sum 96.115 ms
I came up with the version I was trying to write earlier but hadn't seen any
way to pull off: making use of the Windows reference count to help decide if
PatchAllModules is required. GetModuleHandleEx can find out if a module is
loaded and get the handle as well as increase the reference count; if it isn't
loaded, call LoadLibrary and PatchAllModules, but if it is already loaded
there is no need to call PatchAllModules. The problem is in FreeLibrary. I
tried getting the path before calling FreeLibrary, then checking after to see
if it was still loaded, but that's expensive. I just realized tonight that the
GetModuleHandleEx feature of passing an address and finding out what module
owns it can be used in the FreeLibrary routine. Just give the module base
address, and if it is still loaded it will return the same module handle; if
not, it will be something else.
Calling GetModuleHandleEx with an address is averaging around .003
milliseconds. This version even beat my old version's time spent in
PatchAllModules.
PatchAllModules called 65, sum 11.662 ms
Note: you need
#define _WIN32_WINNT 0x0501
in config.h (Windows XP or later) to use GetModuleHandleEx.
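The decision logic described above can be sketched with a small portable model. This is an illustration under stated assumptions, not the attached patch: `FakeLoader`, `ModelLoadLibrary`, `ModelFreeLibrary`, and `g_patch_all_modules_calls` are hypothetical names, and the "is it loaded?" questions that the real code answers with GetModuleHandleEx are answered here by a plain refcount table.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of the Windows loader: a name -> refcount table.
// Nothing below calls the Windows API.
struct FakeLoader {
  std::map<std::string, int> refcount;
  bool IsLoaded(const std::string& name) const {
    return refcount.count(name) != 0;
  }
};

static int g_patch_all_modules_calls = 0;  // Counts simulated rescans.

// Models the LoadLibrary wrapper: rescan only if the module was not
// loaded before this call.
void ModelLoadLibrary(FakeLoader& loader, const std::string& name) {
  assert(!name.empty());
  const bool was_loaded = loader.IsLoaded(name);
  ++loader.refcount[name];
  if (!was_loaded) ++g_patch_all_modules_calls;
}

// Models the FreeLibrary wrapper: drop one reference, then ask whether the
// module is still there. Rescan only if it really went away.
void ModelFreeLibrary(FakeLoader& loader, const std::string& name) {
  auto it = loader.refcount.find(name);
  if (it == loader.refcount.end()) return;  // Not loaded; nothing to do.
  if (--it->second == 0) loader.refcount.erase(it);
  if (!loader.IsLoaded(name)) ++g_patch_all_modules_calls;
}
```

In this model, a module that stays loaded (such as OPENGL32 in the program discussed in this thread) can be loaded and freed any number of times without triggering another rescan.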
Original comment by da...@fries.net
on 21 Jan 2010 at 12:37
Attachments:
> I released without any of the speedup patches.
Just to let you know that the front page says it was released in 2009.
Original comment by andrey.s...@gmail.com
on 21 Jan 2010 at 6:56
Thanks! That, at least, was an easy bug to fix. :-)
Original comment by csilv...@gmail.com
on 21 Jan 2010 at 9:54
btw, thank you for the latest patch. I'll take a look in a few weeks, when I'm
back
from vacation.
Original comment by csilv...@gmail.com
on 22 Jan 2010 at 11:16
I looked over your patch and it looks very nice and clean. I had to modify it a
little bit (the double-checked locking you were doing wasn't safe, and isn't
really
necessary anyway).
I had to make one big change: I took out the nopatching_set. I just don't see how to
make it safe without being even more expensive. The problem is the following
scenario: let's suppose we load module A, which depends on B and C. B was loaded
before A, and contains a libc routine that we patched. When we load A, we'll add it
to nopatching_set, because we didn't do any work (we didn't patch B because it was
already patched). Now let's suppose we unload A and unload B -- so now everything is
unloaded -- and then reload A. In this case, we'll need to patch B, but we won't,
because A is in nopatching_set.
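The scenario can be replayed with a toy model. This is only a sketch under simplifying assumptions: `PatchModel` and its members are hypothetical, module names stand in for handles, each module has a single dependency, and "patched" is just set membership.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model, not the real patch.
struct PatchModel {
  std::set<std::string> loaded;          // Modules currently in the process.
  std::set<std::string> patched;         // Loaded modules we have patched.
  std::set<std::string> nopatching_set;  // Loads that needed no work last time.

  // Loads `name` plus its dependency `dep` (which holds the libc routine).
  // Returns true if the dependency ends up patched.
  bool Load(const std::string& name, const std::string& dep) {
    assert(name != dep);
    loaded.insert(name);
    loaded.insert(dep);
    if (nopatching_set.count(name)) {
      return patched.count(dep) != 0;  // Shortcut: skip the scan entirely.
    }
    if (patched.count(dep)) {
      nopatching_set.insert(name);  // Nothing to do; remember that.
    } else {
      patched.insert(dep);
    }
    return true;
  }

  // Unloading discards the module and any patch applied to it.
  void Unload(const std::string& name) {
    loaded.erase(name);
    patched.erase(name);
  }
};
```

Starting with B loaded and patched, the sequence Load("A", "B"), Unload("A"), Unload("B"), Load("A", "B") leaves B loaded but unpatched in this model, because the stale nopatching_set entry for A short-circuits the second load.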
The only way around this that I see is to store refcounts in last_loaded
(obtained
via GetModuleHandleExW, I guess), not just a set of modules. Then we can
return true
from PatchAllModules if we don't do any patching/unpatching, *and* no refcount
of a
patched module changes. That seems like a lot of overhead.
So for now, I'm just taking out the nopatching_set variable. If it turns out
to be a
big perf improvement to add it back in, for certain workloads, we can try to
figure
out a way to do it safely.
Then I got to thinking: why are we keeping our own refcount when windows is
doing it
for us? How much are we really gaining? I don't know. So I've put all the
load_map
stuff in an #ifdef, so you can do timing tests both with and without our own
map. To
use, find the line '#define USE_LOAD_MAP' and comment or uncomment it.
Let me know if this version of the patch still gives the good numbers you saw with
your own patch (and what difference it makes with or without USE_LOAD_MAP). Then we
can figure out where to go from there.
(Note, my windows dev machine died, so I haven't been able to compile or test
this
patch, so it may have some minor syntax problems. It's very similar to your
patch,
though, just with stuff taken out, so I don't expect too many difficulties.)
Original comment by csilv...@gmail.com
on 3 Feb 2010 at 10:31
Attachments:
Why not simply compare the snapshots provided by PSAPI and only process the
difference between the ones taken before and after LoadLibrary/FreeLibrary? This
would allow us to keep the full picture of which modules need patching and which do
not, without the dependency headache.
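A minimal sketch of the comparison step, assuming the snapshots are already available as sorted sets. `ModuleSnapshot`, `SnapshotDiff`, and `DiffSnapshots` are hypothetical names; a real version would fill the snapshots from EnumProcessModules and hold module handles rather than names.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical snapshot type: the set of modules present at one moment.
using ModuleSnapshot = std::set<std::string>;

struct SnapshotDiff {
  ModuleSnapshot added;    // In `after` only: candidates for patching.
  ModuleSnapshot removed;  // In `before` only: patches to forget.
};

// Computes what changed between two snapshots. std::set iterates in sorted
// order, which is what std::set_difference requires.
SnapshotDiff DiffSnapshots(const ModuleSnapshot& before,
                           const ModuleSnapshot& after) {
  SnapshotDiff diff;
  std::set_difference(after.begin(), after.end(),
                      before.begin(), before.end(),
                      std::inserter(diff.added, diff.added.end()));
  std::set_difference(before.begin(), before.end(),
                      after.begin(), after.end(),
                      std::inserter(diff.removed, diff.removed.end()));
  return diff;
}
```

Only the modules in `added` would need to be examined for patching, and only those in `removed` would need their bookkeeping dropped. Whether this beats the other approaches depends on the cost of taking the snapshot, which this sketch does not measure.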
Original comment by andrey.s...@gmail.com
on 4 Feb 2010 at 3:11
andrey.semashev, EnumProcessModules is taking .038 ms to .051 ms, and it would need
to be run twice, before and after; then the lists of modules must be compared to see
what changed. My last version (which comment 90 builds on) uses GetModuleInformation
to determine if the module was actually loaded or unloaded (and which can be free if
the module was loaded on LoadLibrary), and if it was, calls PatchAllModules.
PatchAllModules uses the result of the previous PatchAllModules call to get the list
of modules to check. I think what's in is the safest method.
There's just the question of how much does keeping a reference count help?
Unfortunately the program I'm using has OPENGL32 loaded already, and from then
on
it's LoadLibrary, FreeLibrary, so this program will not show any benefit,
because it
doesn't use it. I'm okay with using it or not.
csilvers,
// TODO(csilvers): for this to be reliable, we need to also take
// into account if we *would* have patched any modules had they not
// already been loaded. (That is, made_changes should ignore
// g_last_loaded.)
The return value is ignored in this patch anyway.
> the double-checked locking you were doing wasn't safe, and isn't
> really necessary anyway
>
> if (!load_map) {
>   // Allocate with the lock held, but we didn't have the lock when we
>   // checked (to avoid the cost of getting it all the time), so check again.
>   SpinLockHolder h(&patch_all_modules_lock);
>   if(!load_map) {
>     nopatching_set = new std::set<std::string>;
>     load_map = new std::map<HMODULE, int>;
>   }
> }
How's that not safe? It puts the allocations all in one place for
that function avoiding other bugs, like forgetting to duplicate the
allocation in every place required in the function.
Now this is not safe. Note the NULL dereference in the else case if the map
hasn't already been allocated.
if (::GetModuleHandleExW(0, lpFileName, &rv)) {
  ...
  if (!g_load_map) g_load_map = new std::map<HMODULE, int>;
  ++(*g_load_map)[rv];
  ...
} else {
  ...
  (*g_load_map)[rv] = 1;
  ...
}
This patch,
PatchAllModules called 82, 0.050 ms, sum 13.976 ms
Original comment by da...@fries.net
on 4 Feb 2010 at 5:11
> andrey.semashev, EnumProcessModules is taking .038 ms to .051 ms, which would need
> to be run twice, before and after, then the list of modules must be compared to see
> what changed.
It's once: one may cache the snapshot from the last call. And the comparison can be
optimized, in the LoadLibrary case for sure.
However, I don't have numbers to argue with. I just think that this solution might
be simpler and more reliable, while not being very different in terms of performance.
Original comment by andrey.s...@gmail.com
on 4 Feb 2010 at 6:02
} I just think that this solution might be simpler and more reliable
Sorry, what is "this solution"?
} There's just the question of how much does keeping a reference count help?
OK, let's keep it out for simplicity.
I think the patch in comment 90 is pretty simple, all told -- and even simpler
once
we get rid of refcounting. I am able to look at it and say it's "obviously"
correct
(which means it has lots of bugs probably, only subtle ones :-) ), so I'm happy
to go
with it. The only question is performance.
Can both of you say the performance is good on this patch? Can you compare
timing
runs for this patch with your own favorite patch (if different)? I've attached
a new
version of patch_functions.cc below, with refcounting removed, just to make
life easier.
} > the double-checked locking you were doing wasn't safe
} How's that not safe?
It's a moot point now, since we got rid of this data structure (for now), but if
you're curious, here's a brief explanation:
http://www.drdobbs.com/184405726
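For readers who want a pattern that avoids the unsynchronized first check, here is one possible sketch. It assumes a C++11 compiler, which guarantees thread-safe initialization of function-local statics; that guarantee was not available in the toolchains of this thread's era. `LoadMap` and `GetLoadMap` are hypothetical names, and module handles are modeled as plain pointers.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the handle -> refcount map from the patch.
using LoadMap = std::map<const void*, int>;

// Returns the shared map, creating it on first use. With C++11 or later,
// the initialization of a function-local static runs exactly once, even if
// several threads call this concurrently. The map is deliberately never
// deleted, so it stays valid during process shutdown.
LoadMap& GetLoadMap() {
  static LoadMap* load_map = new LoadMap;
  return *load_map;
}
```

Note that this only makes the creation of the map safe. Reads and writes of the map's contents still need to happen under a lock. Routing every access through one accessor also means no code path can dereference the map before it has been allocated.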
Original comment by csilv...@gmail.com
on 4 Feb 2010 at 7:47
Attachments:
I tested it, looks good.
Original comment by da...@fries.net
on 4 Feb 2010 at 10:58
Great! I'll check it into svn so it can get wider exposure.
Original comment by csilv...@gmail.com
on 11 Feb 2010 at 1:26
This should be fixed in perftools 1.6, just released.
Original comment by csilv...@gmail.com
on 5 Aug 2010 at 8:46
Original issue reported on code.google.com by
andrey.s...@gmail.com
on 24 Dec 2009 at 7:16