Closed GoogleCodeExporter closed 9 years ago
I managed to work around the problem with CreateToolhelp32Snapshot by replacing
its
usage with PSAPI. I did not eliminate Toolhelp32 completely from sources, but
it
should be doable, since the only thing left of it is a dummy usage of
MODULEENTRY32
and MAX_MODULE_NAME32.
However, the race condition in PatchAllModules I described is still present. It
shows
with the new PSAPI-related code as occasional failures of the
GetModuleInformation
function when multiple threads call LoadLibrary/FreeLibrary. It appears that
FreeLibrary may invalidate module handle that is about to be passed to
GetModuleInformation in another thread.
I could not work around this race condition, because when I tried to move the
spin
lock acquisition above the EnumProcessModules call I began to observe
deadlocks. I
don't know what causes it, but it looks like it blocks on acquisition of some
OS
resource during various API calls, such as GetModuleInformation and
GetProcAddress.
This resource somehow comes already acquired by another thread that is running
LoadLibrary/FreeLibrary and gets blocked in the patch_all_modules_lock. It
looks like
some lock-free approach is needed to solve this problem.
Along with my changes in PatchAllModules with regard to PSAPI I also replaced
std::vector with a stack-based solution in order to remove the dependency on
the
dynamic memory allocation that is actually being patched. I'm not sure how
correct it
is to allocate memory before patching and free it after patching. Besides, it's
a bit
of a speedup. However, I didn't check all other functions that are called
during the
patching process in order to fix it there, so there may be more work to do in
this
direction.
One thing I'm not sure about is the reason of the PatchMainExecutableLocked
function
and if it is used correctly now, in context of PSAPI. AFAIU, all modules,
including
the executable, are patched by PatchOneModuleLocked, since EnumProcessModules
will
return it as the first element of the hModules array. Is the call to
PatchMainExecutableLocked needed then?
Anyway, please find my patch against 1.4 attached. Besides the described
changes, the
patch also includes two significant changes:
1. In atomicops-internals-x86-msvc.h I enabled MSVC atomic intrinsics to
squeeze as
much performance as possible.
2. In windows/port.h I added #include <cstdio>, which is required to compile
with
STLPort. The reason for it is that perftools defines nasty macros, such as
vsnprintf,
which interfere with STLPort. The include allows STLPort to define vsnprintf as
intended before redefining it with the macro.
Original comment by andrey.s...@gmail.com
on 24 Dec 2009 at 2:29
Attachments:
Oh, I forgot to mention that in order to compile perftools with the patch,
psapi.lib
should be added to MSVC project.
Original comment by andrey.s...@gmail.com
on 24 Dec 2009 at 2:32
Thank you for the detailed bug report!
} It looks like some lock-free approach is needed to solve this problem.
The easiest solution, I think, is to just to copy all the me32 data structures
inside
the CreateToolhelp32Snapshot -- I should have done that in the first place (the
current collect-then-patch structure was an attempt to get rid of the
winapi-internal
deadlocks you observed). It should be enough to just copy the information
needed for
SameAsME32. Then we can iterate over the copies, outside the CreateToolhelp
loop,
with the patch_all_modules spinlock.
Do you think this is maybe the same issue as in issue 135?
I'm not sure I understand the pros and cons of moving away from toolhelp API to
the
psapi. Do we know that's more reliable? I wonder if it would just be better to
retry CreateToolhelp32Snapshot a few times if it fails.
Original comment by csilv...@gmail.com
on 24 Dec 2009 at 10:37
Also, some of the authors of the windows-patching code think it will never be
that
reliable, especially in multi-threaded contexts. You may want to consider
following
the approach chromium uses, of patching mscrt and using that; it's a bit more
work on
your part, but a lot more dependable. There are some details here:
http://groups.google.com/group/google-perftools/browse_thread/thread/41cd3710af8
5e57b/71113ef1e9d41ba2?lnk=gst&q=mbelshe#71113ef1e9d41ba2
If you go this route, and it works well for you, let me know; once this
technique has
been used and understood enough, I plan to add some instructions to the
README.windows file.
Original comment by csilv...@gmail.com
on 24 Dec 2009 at 10:45
> Do you think this is maybe the same issue as in issue 135?
I think, it is related but hardly the same. My original problem was with
CreateToolhelp32Snapshot, which is not covered there. Besides, that issue dates
back
to 1.2, and was fixed in 1.3, and I'm using 1.4.
> I'm not sure I understand the pros and cons of moving away from toolhelp API
to the
psapi. Do we know that's more reliable? I wonder if it would just be better
to
retry CreateToolhelp32Snapshot a few times if it fails.
Well, PSAPI never failed during my tests. Soon I'll have more robust testing
done on
my modified version, which will show whether my solution was final. But
honestly, I
think it showed better already.
As for retrying CreateToolhelp32Snapshot, I had that thought at first but
rejected it
for a few reasons. First, on what error codes would you retry? MSDN mentions
ERROR_BAD_LENGTH, but on my setup I get ERROR_NOT_ENOUGH_MEMORY. Who knows what
other
errors it may return under these unstable conditions? Second, for how long to
spin?
There is no ultimate answer for this because of different thread numbers in
applications and different hardware matters. And finally, what to do if after
spinning CreateToolhelp32Snapshot it doesn't succeed? Bailing out the
application or
failing LoadLibrary is not acceptable for me, nor I think will be acceptable
for
other users. Therefore I decided to get rid of CreateToolhelp32Snapshot
entirely.
The PSAPI has no drawbacks, AFAICT. The only downside I see is that it's not
available on Windows 9x, but who cares?
> The easiest solution, I think, is to just to copy all the me32 data
structures
inside the CreateToolhelp32Snapshot -- I should have done that in the first
place
(the current collect-then-patch structure was an attempt to get rid of the
winapi-
internal deadlocks you observed). It should be enough to just copy the
information
needed for SameAsME32. Then we can iterate over the copies, outside the
CreateToolhelp loop, with the patch_all_modules spinlock.
I'm not sure I understand. Do you mean that you want to patch
CreateToolhelp32Snapshot too and duplicate module_libcs? You would have to
acquire
the spinlock in order to do that, and in order to avoid deadlocks, you would
have to
release it before passing on to the real CreateToolhelp32Snapshot. This is
where race
condition happens - another thread may get through to the part of
PatchAllModules
where module_libcs is modified, and your duplicates become inactual.
What I meant by "some lock-free approach" is removing patch_all_modules_lock
completely. One can keep a pointer to the actual module_libcs array, read it
with
acquire semantics before the "collect" part of PatchAllModules, then create a
duplicate of it, then collect and modify it, and then store the new pointer
with CAS.
If it fails, then retry the whole loop. Something like that.
> You may want to consider following the approach chromium uses, of patching
mscrt
and using that
It describes patching the static CRT, and I'm using dynamic. I have a large
number of
projects here, many of them exchange memory between each other, so patching the
static CRT is not suitable. Also I feel very uneasy about modifying dynamic
CRT, as
it is also used in other projects, many of them are out of my control. Mixing
around
with modified and original CRT is a dangerous thing I would not want to try.
Original comment by andrey.s...@gmail.com
on 25 Dec 2009 at 6:49
} The PSAPI has no drawbacks, AFAICT. The only downside I see is that it's not
} available on Windows 9x, but who cares?
The other downside of using psapi is having to add a dependency on a new
library. I
don't know very much about windows development, so I don't know how much
complexity
that adds, if any, to using perftools. My goal is to make it as easy as
possible.
} Do you mean that you want to patch
} CreateToolhelp32Snapshot too and duplicate module_libcs?
No, that wasn't it. I was just suggesting storing the output of the toolhelp
iteration (the me32 structures) in our own data structure. We can do that
without
the lock. Then we can examine that output within the lock. This separates our
use
of the lock from the windows system calls.
It's pretty straightforward, so I can just do it when I get back from vacation.
} What I meant by "some lock-free approach" is removing patch_all_modules_lock
} completely
Right, I understand the proposal. But I think it's more complicated than
needed to
address the problems here.
Original comment by csilv...@gmail.com
on 25 Dec 2009 at 4:26
> The other downside of using psapi is having to add a dependency on a new
library. I
> don't know very much about windows development, so I don't know how much
complexity
> that adds, if any, to using perftools. My goal is to make it as easy as
possible.
I don't think it poses any problem since psapi.dll is shipped with Windows. On
NT6
(Vista and later) PSAPI v2 is even implemented by kernel32.dll, so I don't
expect
this API to go away any time soon.
Original comment by andrey.s...@gmail.com
on 25 Dec 2009 at 8:46
> No, that wasn't it. I was just suggesting storing the output of the toolhelp
> iteration (the me32 structures) in our own data structure. We can do that
without
> the lock. Then we can examine that output within the lock. This separates
our use
> of the lock from the windows system calls.
Ah, I see. You're right, this is a simpler solution.
> It's pretty straightforward, so I can just do it when I get back from
vacation.
Ok, have a good holiday!
Original comment by andrey.s...@gmail.com
on 25 Dec 2009 at 8:52
I think it's safest to try this patch in two stages: one to fix the race
condition,
and one to move to psapi. I've done the first one. I don't have great
multi-threaded tests, but it passes all the unittests, so I think it's not any
worse
than what came before. Would you like to try it and see how it works for you?
I've
attached the patch.
Original comment by csilv...@google.com
on 31 Dec 2009 at 3:32
Oops, really attached this time.
Original comment by csilv...@google.com
on 31 Dec 2009 at 3:34
Attachments:
Thanks for the patch but I won't be able to experiment with it until after the
New
Year holidays. I took a look at it though, and it seems it does resolve race
condition over module_libcs.
Still, I'd like to repeat my concern with regard to using std::vector in the
patching
code. Is it correct to use it? IIUC, it allocates memory before patching and
releases
after. Does it not pose a problem for tcmalloc?
Meanwhile, the extensive tests on my patched version (with PSAPI) passed, there
were
no errors similar to CreateToolhelp32Snapshot failures that I whitnessed
earlier. I
think, after the holidays I will try to integrate your copying approach into my
version with PSAPI.
Original comment by andrey.s...@gmail.com
on 31 Dec 2009 at 1:05
Ah, I must have missed your concerns about vector the first time around.
Interesting; I hadn't thought of that. I'm not sure it can be a problem,
though. We
already have code in perftools that supports tcmalloc freeing memory allocated
by
windows; we need it anyway because windows is not totally consistent in the
order it
allocates and frees things itself. I think the vector code would be handled in
the
same way. It should be fine. Have you seen any actual problems there?
} Meanwhile, the extensive tests on my patched version (with PSAPI) passed,
there were
} no errors similar to CreateToolhelp32Snapshot failures that I whitnessed
earlier.
Great. Once I've verified this part of the patch isn't broken, I'll work on
applying
the psapi part. Another patch would be great. But I wouldn't worry about the
vector
stuff for this patch, yet. If it turns out to be a problem, we can handle it
then.
Original comment by csilv...@gmail.com
on 31 Dec 2009 at 5:49
OK, I got the race-condition patch reviewed and checked in, and managed to get
a
change to use psapi reviewed and checked in as well. Both are in the svn
repository
now. Would you like to try the perftools version in svn-trunk now and see if
it fixes
the problems you have? If not, let me know (perhaps in the form of a patch)
and I'll
try to fix it up. All perftools unittests pass, but they don't test module
loading/unloading very well.
Original comment by csilv...@gmail.com
on 6 Jan 2010 at 12:35
Here's another version of patch-functions.cc -- the whole file, not just a
patch. Can
you try dropping this in and seeing how it works? It should speed up
performance a
bit on module load/unload, but I don't have very good tests of those functions
(since
I don't really program in windows), so I'm not sure how well they work. The
unittests
all pass, at any rate. Let me know how this works for you.
Original comment by csilv...@gmail.com
on 9 Jan 2010 at 3:49
Attachments:
I'm back from holidays.
Thanks for all your work. I looked at patch-functions.cc and will give it a
try.
However, I have a few notes about it:
1. In PatchAllModules, you seem to have missed the check if cbNeeded is within
hModules boundaries. Theoretically, EnumProcessModules may return more than
hModules
size.
2. I'm not sure why dwFlags and lpFileName should be taken into account in the
MapKey
ordering. The nolibc_set container contains information about modules loaded
into the
application, and modules are really identified by handles (or by complete file
names,
but since LoadLibrary may accept relative and absolute paths, this is much less
reliable), but not by flags.
3. Does this code really patch all modules loaded by LoadLibrary? If I
LoadLibrary("A.dll"), and A.dll is linked with B.dll, is B.dll patched by
PatchOneModule? AFAIK, linked modules are not loaded via subsequent calls to
LoadLibrary.
Original comment by andrey.s...@gmail.com
on 11 Jan 2010 at 6:57
BTW, I noticed you have added a note to README.windows, that when linking with
static
builds of tcmalloc users should add a dependency on psapi.lib. You can remove
this
requirement if you add #pragma comment(lib, "psapi.lib") to the
Windows-specific source
code. Of course, this will work with MSVC-compatible compilers, but I think it
would be
nice anyway.
Original comment by andrey.s...@gmail.com
on 11 Jan 2010 at 7:13
I verified my note #3 from the previous post and it seems I was right.
LibcInfo::PopulateWindowsFn don't get called for the dependent dlls that are
loaded
implicitly.
Original comment by andrey.s...@gmail.com
on 11 Jan 2010 at 8:31
Thank you for your comments. I know very little about how windows DLL works,
so I
appreciate your careful thinking about all the issues involved.
The dependent DLLs is a good point, and must be why we wrote LoadLibraryEx to
call
PatchAllFunctions originally. I'll have to move back to that.
If A.dll depends on B.dll, and one calls LoadLibraryEx("A.dll"), and then calls
FreeLibrary(A_dll_handle), does that free B.dll as well? If so, it looks like
I'll
have to call PatchAllFunctions for FreeLibrary as well; that is, the code we
had
before was about as efficient as it gets.
I'll have to figure out another way of making the patching/unpatching more
efficient,
then.
In the meantime, can you try out the version in svn? I want to make sure that
version, at least, is working as intended.
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 3:04
} 2. I'm not sure why dwFlags and lpFileName should be taken into account in
the
MapKey
} ordering. The nolibc_set container contains information about modules loaded
into
the
} application, and modules are really identified by handles (or by complete
file
names,
} but since LoadLibrary may accept relative and absolute paths, this is much
less
} reliable), but not by flags.
I was storing the dwFlags because I thought that a module-name + flags would
unique
identify a dll, even if the dll search path changed. But reading the docs more
carefully I see that's not true, and I'll just have to get the absolute path
(via a
windows call) before caching.
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 3:36
> If A.dll depends on B.dll, and one calls LoadLibraryEx("A.dll"), and then
calls
> FreeLibrary(A_dll_handle), does that free B.dll as well?
If B.dll is no longer used (that is, its reference count drops to zero) then
it's
unloaded. Similarly, there will be no subsequent call to FreeLibrary("B.dll")
in that
case.
>I'll have to figure out another way of making the patching/unpatching more
> efficient, then.
You can try enumerating modules before and after the genuine
LoadLibrary/FreeLibrary
and only process the difference. As for the LoadLibrary case, I suspect that
the
difference between the snapshots will be in the end of the list, but that's not
documented anywhere.
> In the meantime, can you try out the version in svn?
I'll try it out tomorrow.
> I was storing the dwFlags because I thought that a module-name + flags would
unique
> identify a dll, even if the dll search path changed. But reading the docs
more
> carefully I see that's not true, and I'll just have to get the absolute path
(via a
> windows call) before caching.
I don't think that messing with file names and paths is a good idea in the
first
place. Think about case differences, forward/backward slashes, subst drives and
junctions. This will be unreliable. Module handles are much more reliable and
simpler
to work with, IMHO.
Original comment by andrey.s...@gmail.com
on 11 Jan 2010 at 6:34
} Module handles are much more reliable and simpler to work with, IMHO.
I don't think that will work for this use-case. If I load a module, unload it,
and
then load it again, am I guaranteed to get the same module handle both times?
I
assume not.
In any case, while all the problems you worry about are real, they're not an
impediment in this case. I don't care if I say two modules are different when
they're actually the same: all I lose is a little efficiency in that I can't
short-
circuit the patching process quite as often. What's important to me is that if
I say
two modules are the same, they really are. It looks like an absolute path name
should be enough to ensure that.
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 6:38
> I don't think that will work for this use-case. If I load a module, unload
it, and
> then load it again, am I guaranteed to get the same module handle both times?
I
> assume not.
But in that case the module will be unpatched in the FreeLibrary, won't it? So
it has
to be patched again.
> In any case, while all the problems you worry about are real, they're not an
> impediment in this case.
Fair enough.
Original comment by andrey.s...@gmail.com
on 11 Jan 2010 at 6:52
} But in that case the module will be unpatched in the FreeLibrary, won't it?
So it
} has to be patched again.
The module may not have required patching at all. The bad case to worry about
it
while (user_types_a_key()) {
h = LoadLibrary("mydll");
DoCoolThingInMyDll();
FreeLibrary(h);
}
In the common case, mydll has no libc functions and doesn't require patching.
But
storing h in a data structure doesn't help us make shortcut in LoadLibraryExW.
We
need the filename.
I've written pretty verbose comments in this version of the file (attached), so
hopefully it will make things a bit clearer. I think it addresses all the
problems
you've pointed out so far. Let me know how it works for you.
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 10:48
Looks like the attach didn't go through.
Original comment by andrey.s...@gmail.com
on 12 Jan 2010 at 2:54
There's another copy in comment 9 of issue 206. Does it work to get it from
there?
Original comment by csilv...@gmail.com
on 12 Jan 2010 at 3:22
I tried patch-functions.cc from that issue. It works, according to a few of my
tests.
However the optimization doesn't give much in my case, although I didn't do
time
measurements. For example, in one of my tests LibcInfo::PopulateWindowsFn is
called
1941 times before the optimization and 1776 after. I think the main reason for
this
is that many of our dlls depend on other dlls implicitly, which are not stored
in the
patching_set/nopatching_set containers. One of the dependent dlls is
msvcrt.dll,
which actually defines the functions to patch. This makes every dll loaded with
LoadLibrary that depends on msvcrt.dll (which basically means _every_ dll in my
case)
to be considered as patchable. In other words, patching_set contains A.dll,
B.dll,
C.dll, msvcrt.dll, where A.dll, B.dll and C.dll depend on msvcrt.dll. And
nopatching_set is nearly empty.
I think, these sets could do a better job if they were maintained by
PatchAllModules,
and if patching_set only contained modules that have to be patched _themselves_
(not
these depend on the modules to be patched). In the described example,
patching_set
should only contain msvcrt.dll, and the rest of modules should be in the
nopatching_set.
Again, I didn't profile the test, so I don't know how faster it have become
with the
optimization, and would my proposal make it any faster.
Original comment by andrey.s...@gmail.com
on 12 Jan 2010 at 1:43
Also, a few minor comments, if I may:
* In PatchAllModules, the check "if (i >= kMaxModules)" could be made before
the
loop.
* In Perftools_LoadLibraryExW, the szFullPath can be sized with the PATH_MAX
constant.
* In Perftools_FreeLibrary, you can erase the element from patching_set within
the
first lock scope of patch_all_modules_lock in order to avoid the second lock.
Also,
count+erase can be replaced by a more optimal find+erase(by iterator).
Don't take that as a nitpicking, though. :)
Original comment by andrey.s...@gmail.com
on 12 Jan 2010 at 1:54
> I think, these sets could do a better job if they were maintained by
> PatchAllModules, and if patching_set only contained modules that have to be
patched
> _themselves_ (not these depend on the modules to be patched). In the described
> example, patching_set should only contain msvcrt.dll, and the rest of modules
should
> be in the nopatching_set.
On the second thought, no, that would be incorrect as it would lead to missing
to
unpatch modules on FreeLibrary.
Original comment by andrey.s...@gmail.com
on 12 Jan 2010 at 2:03
} Also, a few minor comments, if I may:
Thanks for the review! I've made the relevant changes.
} In other words, patching_set contains A.dll, B.dll,
} C.dll, msvcrt.dll, where A.dll, B.dll and C.dll depend on msvcrt.dll. And
} nopatching_set is nearly empty.
When you call FreeLibrary(A), does this cause msvcrt.dll to unload? I would
have
thought the main executable would depend on msvcrt directly, so its refcount
would be
high enough it stays loaded even after you unload A.dll (along with B.dll and
C.dll).
If that's the case, then nopatching_set should still include A, since A didn't
cause
any patching to happen: msvcrt was already patched when the it was loaded by
the main
executable. I guess that's not happening in your case? Do you know why not?
In any case, as long as this patch doesn't degrade performance for you, I'm
happy.
It's intended to help for certain cases where the current behavior is too slow
(in
issue 206), and if you don't fall into that category, then I'm not totally
surprised
if the patch doesn't help you a lot.
Original comment by csilv...@gmail.com
on 12 Jan 2010 at 7:13
> When you call FreeLibrary(A), does this cause msvcrt.dll to unload? I would
have
> thought the main executable would depend on msvcrt directly, so its refcount
would
> be high enough it stays loaded even after you unload A.dll (along with B.dll
and
> C.dll).
That's right.
> If that's the case, then nopatching_set should still include A, since A didn't
> cause any patching to happen: msvcrt was already patched when the it was
loaded
> by the main executable. I guess that's not happening in your case? Do you
know
> why not?
You are right. I misinterpreted my test results when I wrote that. Actually,
it's
patching_set that is empty, and nopatching_set contains all the dlls.
Original comment by andrey.s...@gmail.com
on 13 Jan 2010 at 7:04
Today I tried some other tests with the recent patch_functions.cc and observed
a
deadlock when FreeLibrary is called concurrently. One thread is in the
PopulateWindowsFn attempting to call GetProcAddress, and the other one is in
the
PatchAllModules trying to acquire patch_all_modules_lock scoped lock. I think
that a
similar deadlock can occur when calling LoadLibrary concurrently as well.
Original comment by andrey.s...@gmail.com
on 13 Jan 2010 at 8:06
Do you think it would be possible to collect ProcAddresses before the
patch_all_modules_lock in PatchAllModules is acquired? Perhaps, the collected
addresses
could be stored in the modules vector elements and later used to apply patching?
Original comment by andrey.s...@gmail.com
on 13 Jan 2010 at 8:24
BTW, SVN trunk is also affected with the deadlock.
Original comment by andrey.s...@gmail.com
on 13 Jan 2010 at 8:38
I'm confused why there would be deadlock for calling ProcAddresses and
acquiring the
patch_all_modules lock. Is the theory that they're both trying to acquire the
same
windows lock internally? But I think we do our own spinlock implementation on
windows, which doesn't make any windows syscalls. Can you give a bit more
detail of
the stacktraces when these routines are deadlocking?
Oh, maybe you're running with your own patch that uses the InterlockedFoo
methods for
spinlocks, rather than our own custom thing? Try taking that out and seeing if
the
deadlock goes away. I'm not convinced those speed things up in any case. (At
least,
according to our internal tests, they don't.)
Original comment by csilv...@gmail.com
on 13 Jan 2010 at 6:44
> Is the theory that they're both trying to acquire the same windows lock
internally?
Yes. I think one of the LoadLibrary calls is made by system. Not sure how that
happens, though.
> Can you give a bit more detail of the stacktraces when these routines are
> deadlocking?
There's not much more I can tell as I don't see Windows symbols. One thread is:
ntdll.dll!7c90e514()
ntdll.dll!7c90df5a()
ntdll.dll!7c91b24b()
ntdll.dll!7c94b1b0()
ntdll.dll!7c927784()
ntdll.dll!7c927573()
ntdll.dll!7c91005d()
ntdll.dll!7c94bc4c()
ntdll.dll!7c901046()
ntdll.dll!7c917e50()
ntdll.dll!7c90da0a()
kernel32.dll!7c8021eb()
psapi.dll!76bf186c()
ntdll.dll!7c90da0a()
psapi.dll!76bf20f7()
ntdll.dll!7c91005d()
ntdll.dll!7c917ec0()
kernel32.dll!7c80ae7e()
> libtcmalloc_minimal.dll!`anonymous
namespace'::LibcInfo::PopulateWindowsFn(const
`anonymous-namespace'::ModuleEntryCopy
& me32={...}) Line 465 C++
libtcmalloc_minimal.dll!`anonymous namespace'::PatchAllModules() Line 683 +
0xa C++
libtcmalloc_minimal.dll!`anonymous
namespace'::WindowsInfo::Perftools_LoadLibraryExW(const unsigned short *
lpFileName=0x7ffdec00, void * hFile=0x00000000, unsigned long dwFlags=0) Line
966
C++
kernel32.dll!7c801d72()
kernel32.dll!7c801da8()
FwcWsp.dll!5560e8d6()
The other one is this:
ntdll.dll!7c90e514()
ntdll.dll!7c90d21a()
kernel32.dll!7c8023f1()
kernel32.dll!7c802455()
> libtcmalloc_minimal.dll!SpinLock::SlowLock() Line 117 + 0x23 C++
libtcmalloc_minimal.dll!`anonymous namespace'::PatchAllModules() Line 654 +
0x1f C++
libtcmalloc_minimal.dll!`anonymous
namespace'::WindowsInfo::Perftools_LoadLibraryExW(const unsigned short *
lpFileName=0x7ffdfc00, void * hFile=0x00000000, unsigned long dwFlags=0) Line
966
C++
kernel32.dll!7c801d72()
kernel32.dll!7c801da8()
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!...
VAS_ResultNavigatorBuilder.dll!_DllMainCRTStartup(void *
hDllHandle=0x01e70000, unsigned long dwReason=1, void * lpreserved=0x00000000)
Line
272 + 0xd C
ntdll.dll!7c90118a()
ntdll.dll!7c91c4fa()
ntdll.dll!7c91c285()
ntdll.dll!7c91d01e()
ntdll.dll!7c91d31b()
ntdll.dll!7c91c22c()
ntdll.dll!7c916371()
VAS_ResultNavigatorBuilder.dll is one of our dlls, it eventually calls
LoadLibrary. I
can't tell much of the other thread origins as it seems not to contain any of
our
dlls. By the FwcWsp.dll I found that it relates to Microsoft Internet Security
and
Acceleration Server, whatever that is.
> Oh, maybe you're running with your own patch that uses the InterlockedFoo
methods
> for spinlocks, rather than our own custom thing?
What custom thing? I use perftools 1.4 as a base version + my initial patch I
attached here + patch_functions.cc from trunk. If you mean those "#pragma
intrinsic"
in my patch, then yes, they are there. But they do not generate any
Interlocked*
function calls but rather inline assembly (which is faster according to tests
on my
modules).
Original comment by andrey.s...@gmail.com
on 13 Jan 2010 at 7:09
OK, I understand what's going on now, I think. It's a lock inversion problem:
the
call to LoadLibraryExW (and/or FreeLibrary) acquires some windows lock, and
then
acquires my global spinlock. In the meantime, we acquire my global spinlock,
and
then try to acquire some windows lock in GetProcAddress. I think this is
basically
what you were saying.
This is going to be tricky to address. One way to address it is to not make
*any*
windows calls in places I currently hold the patch spinlock. That may not be
very
practical; I'll check. The other solution would be to not hold the patch
spinlock as
much during actual patching. I'm not sure that's practical either. I'll take
a look
and see what we can do.
In the meantime, it would be really useful if we could add an example program
to our
test suite that exercises this code: loads and frees modules a lot in multiple
threads, while making allocations of various sorts. Do you think you could
write
such a thing (or perhaps strip down some of your existing example programs)?
If you
can find a small program that tickles this bug, and can be made into a
unittest, I'd
be very glad to add it to the test suite.
Original comment by csilv...@gmail.com
on 13 Jan 2010 at 7:23
It turns out there was only one windows system call that I had to move (as far
as I
can tell from my audit), so I've done so. I had to rearrange the file a bit to
get
everything defined where it needed to be, but it's all working now. Again: it
passes
unittests, but i don't know how it will work on your testcase. The new version
of
patch_functions.cc is attached.
Original comment by csilv...@gmail.com
on 13 Jan 2010 at 10:11
Attachments:
Looks like the deadlock is fixed by this. It didn't appear on the test case
that
showed it before. I noticed however that the check for cbNeeded in
PatchAllModules is
still inside the loop, but that's not critical. Thanks a lot for fixing it, I
think,
I'll try more extensive testing in the coming few days.
As for the standalone test, it won't be easy. I did some research on the thread
origins that was apparently holding the system lock. It turns out that it is a
part
of ISA firewall. This thread is injected in any application that uses sockets
(and my
test does so), so in order to truly reproduce the issue the test has to load
libraries, use sockets and run with ISA firewall in action. This is quite a
complex
environment.
I can try to write an artifical multithreaded test that loads/unloads libraries
concurrently, but I'm not sure it will show the deadlock. It will take some
time
also, not before the next week.
Original comment by andrey.s...@gmail.com
on 14 Jan 2010 at 6:04
Great!, I'm glad the deadlock is fixed. It looks like the problems that
triggered
your initial bug report are all fixed as well? If this looks like it's all
good to
you, I'll get it into the next release (hopefully soon).
Original comment by csilv...@gmail.com
on 14 Jan 2010 at 4:17
Yes, I have no complaints left. In a few days I'll have my extensive testing
results on
this version and see if something else comes up (I don't think it will). Thanks
for all
your commitment!
Original comment by andrey.s...@gmail.com
on 14 Jan 2010 at 4:29
Here's a patch relative to 'Comment 38 by csilvers, Jan 13'
It fixes the startup time and the OPENGL32 runtime performance problems we were
seeing. It uses loaded_map as a referenced counted list of modules loaded,
that way
if the module is listed in LoadLibrary it can quickly skip any other test,
including
the nopatching_set string compare. The FreeLibrary will skip the
PatchAllModules if
it still has a reference count. If the library was loaded before the patching
was
applied and a libary is referenced/unreferenced by LoadLibrary/FreeLibrary like
OPENGL32 was in our case, it will get a reference count of 1 in the first
PatchAllModules and go between 1 and 2 after that.
I'm copying the loaded_map to a set to do set comparisons in PatchAllModules to
compare the last set of modules with the current, then dealing with the added or
removed modules and not looking at anything that was previously loaded. I've
only
tested it in our application. I don't know if there is a faster way than
copying
over the loaded_map to a set to do the set operation to get the added or removed
modules, feel free to check on that.
Original comment by da...@fries.net
on 15 Jan 2010 at 11:24
Attachments:
The attached file is empty.
Original comment by andrey.s...@gmail.com
on 16 Jan 2010 at 9:35
I'm not sure what happened there, here's the full file.
I didn't update all the file comments, so it will need some more work. I added
some extra print messages behind #ifdef DEBUG to help track what is happening.
Original comment by da...@fries.net
on 16 Jan 2010 at 4:35
Attachments:
It seems like you've reworked PatchAllModules() more than is necessary: moving
everything to sets etc instead of just keeping the bool and vector like we had
before. Is there a reason you did that? It's hard for me to tell, but it
seems like
the major change for this patch is to updating refcounts of loaded libraries in
PatchAllModules, not just in LoadLibraryExW. Is that right? If so, can't you
just
make the change by adding one line to PatchAllModules, that updates the refcount
there? But maybe I'm not understanding what you're trying to do, or where the
time
win comes from your patch.
Original comment by csilv...@gmail.com
on 16 Jan 2010 at 5:11
In short, the version I posted scans only the modules that were loaded since
the
last PatchAllModules run, and avoids calling PatchAllModules in FreeLibrary on
a
library that didn't require any patching.
The patching_map reference count method only keeps track of libraries when
PatchAllModules returns true, ie they or a library they were linked against and
pulled in at the same time needed to be patched. The nopatching_set helps on
LoadLibrary to skip out on ones that don't need patching, but that check will
be a
series of string compares. The FreeLibrary routine will only skip
PatchAllModules
if the library called for Free is in the patching_map and results in a greater
than
0 reference count. Other modules that didn't need to be patched (80 to what? 3
in
our program), invoke PatchAllModules such as OPENGL32 in our case every
FreeLibrary
call.
PatchAllModules modules vector builds up a list of the currently loaded
modules.
Then it goes through the set of libc modules doing a vector erase on only the
libc
patched modules (which is a linear operation every time you erase from the
middle
because everything after has to be moved, not that it is probably contributing
much
to the total time). Every time it goes though PatchAllModules, every module
which
isn't in libc gets scanned for malloc symbols, I expect that is the big
timesink
short of skipping PatchAllModules altogether when it isn't required.
In the version I posted, the loaded_map is the previous list of modules.
PatchAllModules get the current list EnumProcessModules (and skips calling
GetModuleInformation by the way on every module) and the set operations
produces
the modules that were add or removed since the previous PatchAllModule. The
loaded_map is updated for those module changes. Previously finding out if any
libc
functions were unloaded required traversing the modules vector, now it's a set
find
and only on the modules that were removed, which half of the time will be zero,
so
maybe that for loop needs to be in an 'if (!removed.empty())' check now that I
think about it. Then just the added modules have GetModuleInformation called
on
them and only the added modules are scanned, which the first time through will
be
all of them and afterwards will be one or a very small number of them.
Original comment by da...@fries.net
on 16 Jan 2010 at 5:49
OK, let's not focus on the vector-traversal efficiency just yet: as you say,
these
objects are all very small, and O(n) vs O(n^2) is likely not a big deal. In any
case, that's easy to optimize later.
For now, can you make a minimally invasive patch that adds the functionality
that is
actually helping you? (The data structures that keep track of what is and isn't
currently loaded, it sounds like.) That is, no new set objects -- in fact new
data
structures kept to as minimum as you can. That will make it clearer what the
patch
is doing. Then, in a separate step, we can optimize further if it's called for.
Original comment by csilv...@gmail.com
on 16 Jan 2010 at 6:17
I think this is the minimally invasive patch required to not have a performance
regression as compared to the Microsoft allocator. Can you suggest an
alternative
to the std::set container or set_difference operation (or why are you opposed
to
them)?
For PatchAllModules to be efficient it needs to only look at the modules loaded
since the last scan. To do that it needs a list of modules that were loaded
last.
If it was using a vector it would need to sort both first to find the added
list.
That would make it more complicated and leave the rest of the logic
substantially
the same as the version I posed.
Original comment by da...@fries.net
on 16 Jan 2010 at 6:44
I'm not opposed to the way you're doing it now, I just don't understand it.
There's
lots of state stored in global variables that are shared between multiple
routines,
you are copying data around from one data structure to another; overall it just
seems
very tangled to me. I feel a more minimal change would keep the tangling to a
minimum.
In lieu of that, I think I need another version of the patch with much more
extensive
comments around the parts you've added. I'd like to get a new version of
perftools
out next week, but my time is tight, so I'm hoping it won't be too difficult to
puzzle out what's going on here.
} To do that it needs a list of modules that were loaded last.
You just need a list of modules that had been loaded at some point, no? It
doesn't
matter if it's the last scan or an earlier scan. I'd think you can just reuse
the
existing patched_modules data structure. I'm not suggesting you store this
information in a vector, I'm suggesting you try to fit what you're doing in
with the
existing control flow and data structures, rather than changing it all about.
Original comment by csilv...@gmail.com
on 16 Jan 2010 at 6:55
I'll go through and update the comments and post those. As to code changes,
all my
computers at home run Linux, so I'm not going to be able to do any code changes
this weekend anyway.
I think PatchAllModules does need the last list of loaded modules
PatchAllModules
scanned to clear out the list of modules not loaded in a FreeLibrary in case a
later LoadLibrary loads one at the same location. I'm not sure what you were
getting at for that question.
Which data structure is patched_modules? The patching_map or loaded_map I
renamed
it to and reuse it in which context?
Original comment by da...@fries.net
on 16 Jan 2010 at 7:12
I didn't remove the #ifdef DEBUG print messages, but I was expecting them to be
removed before the final commit. I can produce a version with them removed if
you
want.
Looks like I had a bug between PatchAllModules and FreeLibrary. If the
reference
count reaches 0 (but the reference count was wrong and it is still loaded), it
would stay zero through PatchAllModules. Erase the entry in FreeLibrary before
calling PatchAllModules so it can detect this condition and add it back with a
reference count of 1.
I think loaded_copy could be removed, it would be ugly though. std::map can't
be
used for set operations, std::set only has a key, but if that key was a class
that
held both a module handle and a reference count, but the comparison was only on
the
module handle and the reference count was marked mutable, I think it would
work.
It sound too ugly for me though.
Slight nitpick note, the description for nopatching_set may not be completely
correct. If both library A and B are linked against library C and only C has
malloc routines to patch and the application isn't linked against any of the
libraries, then I would think the following could be a problem. Load A, which
brings in C and is patched. Load B which is marked as not needing to be
patched
because C was already loaded, unload A and B which unloads C. Load B which
brings
in C, but since B was added to the nopatching_set this time C doesn't get
patched.
I don't like how likely it is to bite someone, or if it is worth looking at
even
more complicated ways to detect that problem.
Original comment by da...@fries.net
on 16 Jan 2010 at 10:51
Attachments:
Original issue reported on code.google.com by
andrey.s...@gmail.com
on 24 Dec 2009 at 7:16