CreateToolhelp32Snapshot fails spuriously in an MT application

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

I have a multithreaded applicaion that calls LoadLibrary from multiple 
threads. Sometimes the library fails to load with the following error on 
the console:

Check failed: sidestep::SIDESTEP_SUCCESS == 
PreamblePatcher::Patch(windows_fn_[i], perftools_fn_[i], &origstub_fn_[i])

What is the expected output? What do you see instead?

The expected behavior is consistent loading of the libraries, with the same 
result as if without tcmalloc.

What version of the product are you using? On what operating system?

Windows XP, MSVC 7.1, STLPort 5.1.4, perftools 1.4 + patch from issue 199 
and minor modifications for STLPort.

Please provide any additional information below.

The problem is related to issue 199: origstub_fn_[i] is dirty when calling 
PreamblePatcher::RawPatch from LibcInfoWithPatchFunctions<T>::Patch. 
However, the proposed patch from that issue does not help in my case 
because after cleaning origstub_fn_[i] and patching the functions, the 
resulting functions become double-patched. This leads to an infinite loop 
when calling, e.g. operator delete - it calls 
LibcInfoWithPatchFunctions<0>::Perftools_delete, which calls 
LibcInfoWithPatchFunctions<1>::Perftools_delete, which calls 
LibcInfoWithPatchFunctions<0>::Perftools_delete and so on.

The root of the problem is in the PatchAllModules function, which is called 
from LoadLibrary. It appears that CreateToolhelp32Snapshot can fail 
occasionally, which leads to assuming the process has no modules at all and 
marking all module_libcs as invalid. The next call to LoadLibrary, for 
which CreateToolhelp32Snapshot succeeds, will double-patch the modules and 
lead to the described behavior.

I'm not sure what causes CreateToolhelp32Snapshot to fail (in my case 
GetLastError returns 8, Not enough storage is available to process this 
command, but I clearly have enough memory). MSDN mentions that it may 
spuriously return ERROR_BAD_LENGTH (24). Probably, when multiple threads 
call to LoadLibrary, CreateToolhelp32Snapshot fails to synchronize with the 
library manager properly, even though the call is for the current process.

One other thing I noticed. The patch_all_modules_lock in PatchAllModules is 
locked after traversing the modules snapshot. But module_libcs is used 
during the traverse, which may lead to a race condition. However, moving 
patch_all_modules_lock before it does not help my situation.

Original issue reported on code.google.com by andrey.s...@gmail.com on 24 Dec 2009 at 7:16

GoogleCodeExporter commented 9 years ago

I managed to work around the problem with CreateToolhelp32Snapshot by replacing 
its 
usage with PSAPI. I did not eliminate Toolhelp32 completely from sources, but 
it 
should be doable, since the only thing left of it is a dummy usage of 
MODULEENTRY32 
and MAX_MODULE_NAME32.

However, the race condition in PatchAllModules I described is still present. It 
shows 
with the new PSAPI-related code as occasional failures of the 
GetModuleInformation 
function when multiple threads call LoadLibrary/FreeLibrary. It appears that 
FreeLibrary may invalidate module handle that is about to be passed to 
GetModuleInformation in another thread.

I could not work around this race condition, because when I tried to move the 
spin 
lock acquisition above the EnumProcessModules call I began to observe 
deadlocks. I 
don't know what causes it, but it looks like it blocks on acquisition of some 
OS 
resource during various API calls, such as GetModuleInformation and 
GetProcAddress. 
This resource somehow comes already acquired by another thread that is running 
LoadLibrary/FreeLibrary and gets blocked in the patch_all_modules_lock. It 
looks like 
some lock-free approach is needed to solve this problem.

Along with my changes in PatchAllModules with regard to PSAPI I also replaced 
std::vector with a stack-based solution in order to remove the dependency on 
the 
dynamic memory allocation that is actually being patched. I'm not sure how 
correct it 
is to allocate memory before patching and free it after patching. Besides, it's 
a bit 
of a speedup. However, I didn't check all other functions that are called 
during the 
patching process in order to fix it there, so there may be more work to do in 
this 
direction.

One thing I'm not sure about is the reason of the PatchMainExecutableLocked 
function 
and if it is used correctly now, in context of PSAPI. AFAIU, all modules, 
including 
the executable, are patched by PatchOneModuleLocked, since EnumProcessModules 
will 
return it as the first element of the hModules array. Is the call to 
PatchMainExecutableLocked needed then?

Anyway, please find my patch against 1.4 attached. Besides the described 
changes, the 
patch also includes two significant changes:

1. In atomicops-internals-x86-msvc.h I enabled MSVC atomic intrinsics to 
squeeze as 
much performance as possible.
2. In windows/port.h I added #include <cstdio>, which is required to compile 
with 
STLPort. The reason for it is that perftools defines nasty macros, such as 
vsnprintf, 
which interfere with STLPort. The include allows STLPort to define vsnprintf as 
intended before redefining it with the macro.

Original comment by andrey.s...@gmail.com on 24 Dec 2009 at 2:29

Attachments:

PerfTools_1.4.patch

GoogleCodeExporter commented 9 years ago

Oh, I forgot to mention that in order to compile perftools with the patch, 
psapi.lib 
should be added to MSVC project.

Original comment by andrey.s...@gmail.com on 24 Dec 2009 at 2:32

GoogleCodeExporter commented 9 years ago

Thank you for the detailed bug report!  

} It looks like some lock-free approach is needed to solve this problem.

The easiest solution, I think, is to just to copy all the me32 data structures 
inside
the CreateToolhelp32Snapshot -- I should have done that in the first place (the
current collect-then-patch structure was an attempt to get rid of the 
winapi-internal
deadlocks you observed).  It should be enough to just copy the information 
needed for
SameAsME32.  Then we can iterate over the copies, outside the CreateToolhelp 
loop,
with the patch_all_modules spinlock.

Do you think this is maybe the same issue as in issue 135?

I'm not sure I understand the pros and cons of moving away from toolhelp API to 
the
psapi.  Do we know that's more reliable?  I wonder if it would just be better to
retry CreateToolhelp32Snapshot a few times if it fails.

Original comment by csilv...@gmail.com on 24 Dec 2009 at 10:37

Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

Also, some of the authors of the windows-patching code think it will never be 
that
reliable, especially in multi-threaded contexts.  You may want to consider 
following
the approach chromium uses, of patching mscrt and using that; it's a bit more 
work on
your part, but a lot more dependable.  There are some details here:

http://groups.google.com/group/google-perftools/browse_thread/thread/41cd3710af8
5e57b/71113ef1e9d41ba2?lnk=gst&q=mbelshe#71113ef1e9d41ba2

If you go this route, and it works well for you, let me know; once this 
technique has
been used and understood enough, I plan to add some instructions to the
README.windows file.

Original comment by csilv...@gmail.com on 24 Dec 2009 at 10:45

GoogleCodeExporter commented 9 years ago

> Do you think this is maybe the same issue as in issue 135?

I think, it is related but hardly the same. My original problem was with 
CreateToolhelp32Snapshot, which is not covered there. Besides, that issue dates 
back 
to 1.2, and was fixed in 1.3, and I'm using 1.4.

> I'm not sure I understand the pros and cons of moving away from toolhelp API 
to the
psapi.  Do we know that's more reliable?  I wonder if it would just be better 
to 
retry CreateToolhelp32Snapshot a few times if it fails.

Well, PSAPI never failed during my tests. Soon I'll have more robust testing 
done on 
my modified version, which will show whether my solution was final. But 
honestly, I 
think it showed better already.

As for retrying CreateToolhelp32Snapshot, I had that thought at first but 
rejected it 
for a few reasons. First, on what error codes would you retry? MSDN mentions 
ERROR_BAD_LENGTH, but on my setup I get ERROR_NOT_ENOUGH_MEMORY. Who knows what 
other 
errors it may return under these unstable conditions? Second, for how long to 
spin? 
There is no ultimate answer for this because of different thread numbers in 
applications and different hardware matters. And finally, what to do if after 
spinning CreateToolhelp32Snapshot it doesn't succeed? Bailing out the 
application or 
failing LoadLibrary is not acceptable for me, nor I think will be acceptable 
for 
other users. Therefore I decided to get rid of CreateToolhelp32Snapshot 
entirely.

The PSAPI has no drawbacks, AFAICT. The only downside I see is that it's not 
available on Windows 9x, but who cares?

> The easiest solution, I think, is to just to copy all the me32 data 
structures 
inside the CreateToolhelp32Snapshot -- I should have done that in the first 
place 
(the current collect-then-patch structure was an attempt to get rid of the 
winapi-
internal deadlocks you observed).  It should be enough to just copy the 
information 
needed for SameAsME32.  Then we can iterate over the copies, outside the 
CreateToolhelp loop, with the patch_all_modules spinlock.

I'm not sure I understand. Do you mean that you want to patch 
CreateToolhelp32Snapshot too and duplicate module_libcs? You would have to 
acquire 
the spinlock in order to do that, and in order to avoid deadlocks, you would 
have to 
release it before passing on to the real CreateToolhelp32Snapshot. This is 
where race 
condition happens - another thread may get through to the part of 
PatchAllModules 
where module_libcs is modified, and your duplicates become inactual.

What I meant by "some lock-free approach" is removing patch_all_modules_lock 
completely. One can keep a pointer to the actual module_libcs array, read it 
with 
acquire semantics before the "collect" part of PatchAllModules, then create a 
duplicate of it, then collect and modify it, and then store the new pointer 
with CAS. 
If it fails, then retry the whole loop. Something like that.

> You may want to consider following the approach chromium uses, of patching 
mscrt 
and using that

It describes patching the static CRT, and I'm using dynamic. I have a large 
number of 
projects here, many of them exchange memory between each other, so patching the 
static CRT is not suitable. Also I feel very uneasy about modifying dynamic 
CRT, as 
it is also used in other projects, many of them are out of my control. Mixing 
around 
with modified and original CRT is a dangerous thing I would not want to try.

Original comment by andrey.s...@gmail.com on 25 Dec 2009 at 6:49

GoogleCodeExporter commented 9 years ago

} The PSAPI has no drawbacks, AFAICT. The only downside I see is that it's not 
} available on Windows 9x, but who cares?

The other downside of using psapi is having to add a dependency on a new 
library.  I
don't know very much about windows development, so I don't know how much 
complexity
that adds, if any, to using perftools.  My goal is to make it as easy as 
possible.

} Do you mean that you want to patch
} CreateToolhelp32Snapshot too and duplicate module_libcs?

No, that wasn't it.  I was just suggesting storing the output of the toolhelp
iteration (the me32 structures) in our own data structure.  We can do that 
without
the lock.  Then we can examine that output within the lock.  This separates our 
use
of the lock from the windows system calls.

It's pretty straightforward, so I can just do it when I get back from vacation.

} What I meant by "some lock-free approach" is removing patch_all_modules_lock
} completely

Right, I understand the proposal.  But I think it's more complicated than 
needed to
address the problems here.

Original comment by csilv...@gmail.com on 25 Dec 2009 at 4:26

GoogleCodeExporter commented 9 years ago

> The other downside of using psapi is having to add a dependency on a new 
library.  I
> don't know very much about windows development, so I don't know how much 
complexity
> that adds, if any, to using perftools.  My goal is to make it as easy as 
possible.

I don't think it poses any problem since psapi.dll is shipped with Windows. On 
NT6
(Vista and later) PSAPI v2 is even implemented by kernel32.dll, so I don't 
expect
this API to go away any time soon.

Original comment by andrey.s...@gmail.com on 25 Dec 2009 at 8:46

GoogleCodeExporter commented 9 years ago

> No, that wasn't it.  I was just suggesting storing the output of the toolhelp
> iteration (the me32 structures) in our own data structure.  We can do that 
without
> the lock.  Then we can examine that output within the lock.  This separates 
our use
> of the lock from the windows system calls.

Ah, I see. You're right, this is a simpler solution.

> It's pretty straightforward, so I can just do it when I get back from 
vacation.

Ok, have a good holiday!

Original comment by andrey.s...@gmail.com on 25 Dec 2009 at 8:52

GoogleCodeExporter commented 9 years ago

I think it's safest to try this patch in two stages: one to fix the race 
condition,
and one to move to psapi.  I've done the first one.  I don't have great
multi-threaded tests, but it passes all the unittests, so I think it's not any 
worse
than what came before.  Would you like to try it and see how it works for you?  
I've
attached the patch.

Original comment by csilv...@google.com on 31 Dec 2009 at 3:32

GoogleCodeExporter commented 9 years ago

Oops, really attached this time.

Original comment by csilv...@google.com on 31 Dec 2009 at 3:34

Attachments:

p

GoogleCodeExporter commented 9 years ago

Thanks for the patch but I won't be able to experiment with it until after the 
New 
Year holidays. I took a look at it though, and it seems it does resolve race 
condition over module_libcs.

Still, I'd like to repeat my concern with regard to using std::vector in the 
patching 
code. Is it correct to use it? IIUC, it allocates memory before patching and 
releases 
after. Does it not pose a problem for tcmalloc?

Meanwhile, the extensive tests on my patched version (with PSAPI) passed, there 
were 
no errors similar to CreateToolhelp32Snapshot failures that I whitnessed 
earlier. I 
think, after the holidays I will try to integrate your copying approach into my 
version with PSAPI.

Original comment by andrey.s...@gmail.com on 31 Dec 2009 at 1:05

GoogleCodeExporter commented 9 years ago

Ah, I must have missed your concerns about vector the first time around. 
Interesting; I hadn't thought of that.  I'm not sure it can be a problem, 
though.  We
already have code in perftools that supports tcmalloc freeing memory allocated 
by
windows; we need it anyway because windows is not totally consistent in the 
order it
allocates and frees things itself.  I think the vector code would be handled in 
the
same way.  It should be fine.  Have you seen any actual problems there?

} Meanwhile, the extensive tests on my patched version (with PSAPI) passed, 
there were 
} no errors similar to CreateToolhelp32Snapshot failures that I whitnessed 
earlier.

Great.  Once I've verified this part of the patch isn't broken, I'll work on 
applying
the psapi part.  Another patch would be great.  But I wouldn't worry about the 
vector
stuff for this patch, yet.  If it turns out to be a problem, we can handle it 
then.

Original comment by csilv...@gmail.com on 31 Dec 2009 at 5:49

Changed state: Started

GoogleCodeExporter commented 9 years ago

OK, I got the race-condition patch reviewed and checked in, and managed to get 
a 
change to use psapi reviewed and checked in as well.  Both are in the svn 
repository 
now.  Would you like to try the perftools version in svn-trunk now and see if 
it fixes 
the problems you have?  If not, let me know (perhaps in the form of a patch) 
and I'll 
try to fix it up.  All perftools unittests pass, but they don't test module 
loading/unloading very well.

Original comment by csilv...@gmail.com on 6 Jan 2010 at 12:35

GoogleCodeExporter commented 9 years ago

Here's another version of patch-functions.cc -- the whole file, not just a 
patch.  Can 
you try dropping this in and seeing how it works?  It should speed up 
performance a 
bit on module load/unload, but I don't have very good tests of those functions 
(since 
I don't really program in windows), so I'm not sure how well they work.  The 
unittests 
all pass, at any rate.  Let me know how this works for you.

Original comment by csilv...@gmail.com on 9 Jan 2010 at 3:49

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

I'm back from holidays.

Thanks for all your work. I looked at patch-functions.cc and will give it a 
try. 
However, I have a few notes about it:

1. In PatchAllModules, you seem to have missed the check if cbNeeded is within 
hModules boundaries. Theoretically, EnumProcessModules may return more than 
hModules 
size.
2. I'm not sure why dwFlags and lpFileName should be taken into account in the 
MapKey 
ordering. The nolibc_set container contains information about modules loaded 
into the 
application, and modules are really identified by handles (or by complete file 
names, 
but since LoadLibrary may accept relative and absolute paths, this is much less 
reliable), but not by flags.
3. Does this code really patch all modules loaded by LoadLibrary? If I 
LoadLibrary("A.dll"), and A.dll is linked with B.dll, is B.dll patched by 
PatchOneModule? AFAIK, linked modules are not loaded via subsequent calls to 
LoadLibrary.

Original comment by andrey.s...@gmail.com on 11 Jan 2010 at 6:57

GoogleCodeExporter commented 9 years ago

BTW, I noticed you have added a note to README.windows, that when linking with 
static 
builds of tcmalloc users should add a dependency on psapi.lib. You can remove 
this 
requirement if you add #pragma comment(lib, "psapi.lib") to the 
Windows-specific source 
code. Of course, this will work with MSVC-compatible compilers, but I think it 
would be 
nice anyway.

Original comment by andrey.s...@gmail.com on 11 Jan 2010 at 7:13

GoogleCodeExporter commented 9 years ago

I verified my note #3 from the previous post and it seems I was right. 
LibcInfo::PopulateWindowsFn don't get called for the dependent dlls that are 
loaded 
implicitly.

Original comment by andrey.s...@gmail.com on 11 Jan 2010 at 8:31

GoogleCodeExporter commented 9 years ago

Thank you for your comments.  I know very little about how windows DLL works, 
so I 
appreciate your careful thinking about all the issues involved.

The dependent DLLs is a good point, and must be why we wrote LoadLibraryEx to 
call 
PatchAllFunctions originally.  I'll have to move back to that.

If A.dll depends on B.dll, and one calls LoadLibraryEx("A.dll"), and then calls 
FreeLibrary(A_dll_handle), does that free B.dll as well?  If so, it looks like 
I'll 
have to call PatchAllFunctions for FreeLibrary as well; that is, the code we 
had 
before was about as efficient as it gets.

I'll have to figure out another way of making the patching/unpatching more 
efficient, 
then.

In the meantime, can you try out the version in svn?  I want to make sure that 
version, at least, is working as intended.

Original comment by csilv...@gmail.com on 11 Jan 2010 at 3:04

GoogleCodeExporter commented 9 years ago

} 2. I'm not sure why dwFlags and lpFileName should be taken into account in 
the 
MapKey 
} ordering. The nolibc_set container contains information about modules loaded 
into 
the 
} application, and modules are really identified by handles (or by complete 
file 
names, 
} but since LoadLibrary may accept relative and absolute paths, this is much 
less 
} reliable), but not by flags.

I was storing the dwFlags because I thought that a module-name + flags would 
unique 
identify a dll, even if the dll search path changed.  But reading the docs more 
carefully I see that's not true, and I'll just have to get the absolute path 
(via a 
windows call) before caching.

Original comment by csilv...@gmail.com on 11 Jan 2010 at 3:36

GoogleCodeExporter commented 9 years ago

> If A.dll depends on B.dll, and one calls LoadLibraryEx("A.dll"), and then 
calls 
> FreeLibrary(A_dll_handle), does that free B.dll as well?

If B.dll is no longer used (that is, its reference count drops to zero) then 
it's 
unloaded. Similarly, there will be no subsequent call to FreeLibrary("B.dll") 
in that 
case.

>I'll have to figure out another way of making the patching/unpatching more
> efficient, then.

You can try enumerating modules before and after the genuine 
LoadLibrary/FreeLibrary 
and only process the difference. As for the LoadLibrary case, I suspect that 
the 
difference between the snapshots will be in the end of the list, but that's not 
documented anywhere.

> In the meantime, can you try out the version in svn?

I'll try it out tomorrow.

> I was storing the dwFlags because I thought that a module-name + flags would 
unique 
> identify a dll, even if the dll search path changed.  But reading the docs 
more 
> carefully I see that's not true, and I'll just have to get the absolute path 
(via a 
> windows call) before caching.

I don't think that messing with file names and paths is a good idea in the 
first 
place. Think about case differences, forward/backward slashes, subst drives and 
junctions. This will be unreliable. Module handles are much more reliable and 
simpler 
to work with, IMHO.

Original comment by andrey.s...@gmail.com on 11 Jan 2010 at 6:34

GoogleCodeExporter commented 9 years ago

} Module handles are much more reliable and simpler to work with, IMHO.

I don't think that will work for this use-case.  If I load a module, unload it, 
and 
then load it again, am I guaranteed to get the same module handle both times?  
I 
assume not.

In any case, while all the problems you worry about are real, they're not an 
impediment in this case.  I don't care if I say two modules are different when 
they're actually the same: all I lose is a little efficiency in that I can't 
short-
circuit the patching process quite as often.  What's important to me is that if 
I say 
two modules are the same, they really are.  It looks like an absolute path name 
should be enough to ensure that.

Original comment by csilv...@gmail.com on 11 Jan 2010 at 6:38

GoogleCodeExporter commented 9 years ago

> I don't think that will work for this use-case.  If I load a module, unload 
it, and 
> then load it again, am I guaranteed to get the same module handle both times? 
 I 
> assume not.

But in that case the module will be unpatched in the FreeLibrary, won't it? So 
it has 
to be patched again.

> In any case, while all the problems you worry about are real, they're not an 
> impediment in this case.

Fair enough.

Original comment by andrey.s...@gmail.com on 11 Jan 2010 at 6:52

GoogleCodeExporter commented 9 years ago

} But in that case the module will be unpatched in the FreeLibrary, won't it? 
So it
} has to be patched again.

The module may not have required patching at all.  The bad case to worry about 
it
   while (user_types_a_key()) {
      h = LoadLibrary("mydll");
      DoCoolThingInMyDll();
      FreeLibrary(h);
   }
In the common case, mydll has no libc functions and doesn't require patching.  
But 
storing h in a data structure doesn't help us make shortcut in LoadLibraryExW.  
We 
need the filename.

I've written pretty verbose comments in this version of the file (attached), so 
hopefully it will make things a bit clearer.  I think it addresses all the 
problems 
you've pointed out so far.  Let me know how it works for you.

Original comment by csilv...@gmail.com on 11 Jan 2010 at 10:48

GoogleCodeExporter commented 9 years ago

Looks like the attach didn't go through.

Original comment by andrey.s...@gmail.com on 12 Jan 2010 at 2:54

GoogleCodeExporter commented 9 years ago

There's another copy in comment 9 of issue 206.  Does it work to get it from 
there?

Original comment by csilv...@gmail.com on 12 Jan 2010 at 3:22

GoogleCodeExporter commented 9 years ago

I tried patch-functions.cc from that issue. It works, according to a few of my 
tests. 

However the optimization doesn't give much in my case, although I didn't do 
time 
measurements. For example, in one of my tests  LibcInfo::PopulateWindowsFn is 
called 
1941 times before the optimization and 1776 after. I think the main reason for 
this 
is that many of our dlls depend on other dlls implicitly, which are not stored 
in the 
patching_set/nopatching_set containers. One of the dependent dlls is 
msvcrt.dll, 
which actually defines the functions to patch. This makes every dll loaded with 
LoadLibrary that depends on msvcrt.dll (which basically means _every_ dll in my 
case) 
to be considered as patchable. In other words, patching_set contains A.dll, 
B.dll, 
C.dll, msvcrt.dll, where A.dll, B.dll and C.dll depend on msvcrt.dll. And 
nopatching_set is nearly empty.

I think, these sets could do a better job if they were maintained by 
PatchAllModules, 
and if patching_set only contained modules that have to be patched _themselves_ 
(not 
these depend on the modules to be patched). In the described example, 
patching_set 
should only contain msvcrt.dll, and the rest of modules should be in the 
nopatching_set.

Again, I didn't profile the test, so I don't know how faster it have become 
with the 
optimization, and would my proposal make it any faster.

Original comment by andrey.s...@gmail.com on 12 Jan 2010 at 1:43

GoogleCodeExporter commented 9 years ago

Also, a few minor comments, if I may:

* In PatchAllModules, the check "if (i >= kMaxModules)" could be made before 
the 
loop.
* In Perftools_LoadLibraryExW, the szFullPath can be sized with the PATH_MAX 
constant.
* In Perftools_FreeLibrary, you can erase the element from patching_set within 
the 
first lock scope of patch_all_modules_lock in order to avoid the second lock. 
Also, 
count+erase can be replaced by a more optimal find+erase(by iterator).

Don't take that as a nitpicking, though. :)

Original comment by andrey.s...@gmail.com on 12 Jan 2010 at 1:54

GoogleCodeExporter commented 9 years ago

> I think, these sets could do a better job if they were maintained by
> PatchAllModules, and if patching_set only contained modules that have to be 
patched
> _themselves_ (not these depend on the modules to be patched). In the described
> example, patching_set should only contain msvcrt.dll, and the rest of modules 
should
> be in the nopatching_set.

On the second thought, no, that would be incorrect as it would lead to missing 
to 
unpatch modules on FreeLibrary.

Original comment by andrey.s...@gmail.com on 12 Jan 2010 at 2:03

GoogleCodeExporter commented 9 years ago

} Also, a few minor comments, if I may:

Thanks for the review!  I've made the relevant changes.

} In other words, patching_set contains A.dll, B.dll, 
} C.dll, msvcrt.dll, where A.dll, B.dll and C.dll depend on msvcrt.dll. And 
} nopatching_set is nearly empty.

When you call FreeLibrary(A), does this cause msvcrt.dll to unload?  I would 
have 
thought the main executable would depend on msvcrt directly, so its refcount 
would be 
high enough it stays loaded even after you unload A.dll (along with B.dll and 
C.dll).

If that's the case, then nopatching_set should still include A, since A didn't 
cause 
any patching to happen: msvcrt was already patched when the it was loaded by 
the main 
executable.  I guess that's not happening in your case?  Do you know why not?

In any case, as long as this patch doesn't degrade performance for you, I'm 
happy.  
It's intended to help for certain cases where the current behavior is too slow 
(in 
issue 206), and if you don't fall into that category, then I'm not totally 
surprised 
if the patch doesn't help you a lot.

Original comment by csilv...@gmail.com on 12 Jan 2010 at 7:13

GoogleCodeExporter commented 9 years ago

> When you call FreeLibrary(A), does this cause msvcrt.dll to unload?  I would 
have 
> thought the main executable would depend on msvcrt directly, so its refcount 
would
> be high enough it stays loaded even after you unload A.dll (along with B.dll 
and
> C.dll).

That's right.

> If that's the case, then nopatching_set should still include A, since A didn't
> cause any patching to happen: msvcrt was already patched when the it was 
loaded
> by the main executable.  I guess that's not happening in your case?  Do you 
know
> why not?

You are right. I misinterpreted my test results when I wrote that. Actually, 
it's 
patching_set that is empty, and nopatching_set contains all the dlls.

Original comment by andrey.s...@gmail.com on 13 Jan 2010 at 7:04

GoogleCodeExporter commented 9 years ago

Today I tried some other tests with the recent patch_functions.cc and observed 
a 
deadlock when FreeLibrary is called concurrently. One thread is in the 
PopulateWindowsFn attempting to call GetProcAddress, and the other one is in 
the 
PatchAllModules trying to acquire patch_all_modules_lock scoped lock. I think 
that a 
similar deadlock can occur when calling LoadLibrary concurrently as well.

Original comment by andrey.s...@gmail.com on 13 Jan 2010 at 8:06

GoogleCodeExporter commented 9 years ago

Do you think it would be possible to collect ProcAddresses before the 
patch_all_modules_lock in PatchAllModules is acquired? Perhaps, the collected 
addresses 
could be stored in the modules vector elements and later used to apply patching?

Original comment by andrey.s...@gmail.com on 13 Jan 2010 at 8:24

GoogleCodeExporter commented 9 years ago

BTW, SVN trunk is also affected with the deadlock.

Original comment by andrey.s...@gmail.com on 13 Jan 2010 at 8:38

GoogleCodeExporter commented 9 years ago

I'm confused why there would be deadlock for calling ProcAddresses and 
acquiring the 
patch_all_modules lock.  Is the theory that they're both trying to acquire the 
same 
windows lock internally?  But I think we do our own spinlock implementation on 
windows, which doesn't make any windows syscalls.  Can you give a bit more 
detail of 
the stacktraces when these routines are deadlocking?

Oh, maybe you're running with your own patch that uses the InterlockedFoo 
methods for 
spinlocks, rather than our own custom thing?  Try taking that out and seeing if 
the 
deadlock goes away.  I'm not convinced those speed things up in any case.  (At 
least, 
according to our internal tests, they don't.)

Original comment by csilv...@gmail.com on 13 Jan 2010 at 6:44

GoogleCodeExporter commented 9 years ago

> Is the theory that they're both trying to acquire the same windows lock 
internally?

Yes. I think one of the LoadLibrary calls is made by system. Not sure how that 
happens, though.

> Can you give a bit more detail of the stacktraces when these routines are
> deadlocking?

There's not much more I can tell as I don't see Windows symbols. One thread is:

    ntdll.dll!7c90e514()    
    ntdll.dll!7c90df5a()    
    ntdll.dll!7c91b24b()    
    ntdll.dll!7c94b1b0()    
    ntdll.dll!7c927784()    
    ntdll.dll!7c927573()    
    ntdll.dll!7c91005d()    
    ntdll.dll!7c94bc4c()    
    ntdll.dll!7c901046()    
    ntdll.dll!7c917e50()    
    ntdll.dll!7c90da0a()    
    kernel32.dll!7c8021eb()     
    psapi.dll!76bf186c()    
    ntdll.dll!7c90da0a()    
    psapi.dll!76bf20f7()    
    ntdll.dll!7c91005d()    
    ntdll.dll!7c917ec0()    
    kernel32.dll!7c80ae7e()     
>   libtcmalloc_minimal.dll!`anonymous 
namespace'::LibcInfo::PopulateWindowsFn(const 
`anonymous-namespace'::ModuleEntryCopy 
& me32={...})  Line 465 C++
    libtcmalloc_minimal.dll!`anonymous namespace'::PatchAllModules()  Line 683 + 
0xa C++
    libtcmalloc_minimal.dll!`anonymous 
namespace'::WindowsInfo::Perftools_LoadLibraryExW(const unsigned short * 
lpFileName=0x7ffdec00, void * hFile=0x00000000, unsigned long dwFlags=0)  Line 
966 
C++
    kernel32.dll!7c801d72()     
    kernel32.dll!7c801da8()     
    FwcWsp.dll!5560e8d6()   

The other one is this:

    ntdll.dll!7c90e514()
    ntdll.dll!7c90d21a()
    kernel32.dll!7c8023f1()
    kernel32.dll!7c802455()
>   libtcmalloc_minimal.dll!SpinLock::SlowLock()  Line 117 + 0x23   C++
    libtcmalloc_minimal.dll!`anonymous namespace'::PatchAllModules()  Line 654 + 
0x1f    C++
    libtcmalloc_minimal.dll!`anonymous 
namespace'::WindowsInfo::Perftools_LoadLibraryExW(const unsigned short * 
lpFileName=0x7ffdfc00, void * hFile=0x00000000, unsigned long dwFlags=0)  Line 
966 
C++
    kernel32.dll!7c801d72()
    kernel32.dll!7c801da8()
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!...
    VAS_ResultNavigatorBuilder.dll!_DllMainCRTStartup(void * 
hDllHandle=0x01e70000, unsigned long dwReason=1, void * lpreserved=0x00000000)  
Line 
272 + 0xd   C
    ntdll.dll!7c90118a()
    ntdll.dll!7c91c4fa()
    ntdll.dll!7c91c285()
    ntdll.dll!7c91d01e()
    ntdll.dll!7c91d31b()
    ntdll.dll!7c91c22c()
    ntdll.dll!7c916371()

VAS_ResultNavigatorBuilder.dll is one of our dlls, it eventually calls 
LoadLibrary. I 
can't tell much of the other thread origins as it seems not to contain any of 
our 
dlls. By the FwcWsp.dll I found that it relates to Microsoft Internet Security 
and 
Acceleration Server, whatever that is.

> Oh, maybe you're running with your own patch that uses the InterlockedFoo 
methods
> for spinlocks, rather than our own custom thing?

What custom thing? I use perftools 1.4 as a base version + my initial patch I 
attached here + patch_functions.cc from trunk. If you mean those "#pragma 
intrinsic" 
in my patch, then yes, they are there. But they do not generate any 
Interlocked* 
function calls but rather inline assembly (which is faster according to tests 
on my 
modules).

Original comment by andrey.s...@gmail.com on 13 Jan 2010 at 7:09

GoogleCodeExporter commented 9 years ago

OK, I understand what's going on now, I think.  It's a lock inversion problem: 
the 
call to LoadLibraryExW (and/or FreeLibrary) acquires some windows lock, and 
then 
acquires my global spinlock.  In the meantime, we acquire my global spinlock, 
and 
then try to acquire some windows lock in GetProcAddress.  I think this is 
basically 
what you were saying.

This is going to be tricky to address.  One way to address it is to not make 
*any* 
windows calls in places I currently hold the patch spinlock.  That may not be 
very 
practical; I'll check.  The other solution would be to not hold the patch 
spinlock as 
much during actual patching.  I'm not sure that's practical either.  I'll take 
a look 
and see what we can do.

In the meantime, it would be really useful if we could add an example program 
to our 
test suite that exercises this code: loads and frees modules a lot in multiple 
threads, while making allocations of various sorts.  Do you think you could 
write 
such a thing (or perhaps strip down some of your existing example programs)?  
If you 
can find a small program that tickles this bug, and can be made into a 
unittest, I'd 
be very glad to add it to the test suite.

Original comment by csilv...@gmail.com on 13 Jan 2010 at 7:23

GoogleCodeExporter commented 9 years ago

It turns out there was only one windows system call that I had to move (as far 
as I 
can tell from my audit), so I've done so.  I had to rearrange the file a bit to 
get 
everything defined where it needed to be, but it's all working now.  Again: it 
passes 
unittests, but i don't know how it will work on your testcase.  The new version 
of 
patch_functions.cc is attached.

Original comment by csilv...@gmail.com on 13 Jan 2010 at 10:11

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

Looks like the deadlock is fixed by this. It didn't appear on the test case 
that 
showed it before. I noticed however that the check for cbNeeded in 
PatchAllModules is 
still inside the loop, but that's not critical. Thanks a lot for fixing it, I 
think, 
I'll try more extensive testing in the coming few days.

As for the standalone test, it won't be easy. I did some research on the thread 
origins that was apparently holding the system lock. It turns out that it is a 
part 
of ISA firewall. This thread is injected in any application that uses sockets 
(and my 
test does so), so in order to truly reproduce the issue the test has to load 
libraries, use sockets and run with ISA firewall in action. This is quite a 
complex 
environment.

I can try to write an artifical multithreaded test that loads/unloads libraries 
concurrently, but I'm not sure it will show the deadlock. It will take some 
time 
also, not before the next week.

Original comment by andrey.s...@gmail.com on 14 Jan 2010 at 6:04

GoogleCodeExporter commented 9 years ago

Great!, I'm glad the deadlock is fixed.  It looks like the problems that 
triggered 
your initial bug report are all fixed as well?  If this looks like it's all 
good to 
you, I'll get it into the next release (hopefully soon).

Original comment by csilv...@gmail.com on 14 Jan 2010 at 4:17

GoogleCodeExporter commented 9 years ago

Yes, I have no complaints left. In a few days I'll have my extensive testing 
results on 
this version and see if something else comes up (I don't think it will). Thanks 
for all 
your commitment!

Original comment by andrey.s...@gmail.com on 14 Jan 2010 at 4:29

GoogleCodeExporter commented 9 years ago

Here's a patch relative to 'Comment 38  by csilvers, Jan 13'
It fixes the startup time and the OPENGL32 runtime performance problems we were
seeing.  It uses loaded_map as a referenced counted list of modules loaded, 
that way
if the module is listed in LoadLibrary it can quickly skip any other test, 
including
the nopatching_set string compare.  The FreeLibrary will skip the 
PatchAllModules if
it still has a reference count.  If the library was loaded before the patching 
was
applied and a libary is referenced/unreferenced by LoadLibrary/FreeLibrary like
OPENGL32 was in our case, it will get a reference count of 1 in the first
PatchAllModules and go between 1 and 2 after that.

I'm copying the loaded_map to a set to do set comparisons in PatchAllModules to
compare the last set of modules with the current, then dealing with the added or
removed modules and not looking at anything that was previously loaded.  I've 
only
tested it in our application.  I don't know if there is a faster way than 
copying
over the loaded_map to a set to do the set operation to get the added or removed
modules, feel free to check on that.

Original comment by da...@fries.net on 15 Jan 2010 at 11:24

Attachments:

patch_loaded_map_01-13-2010.patch

GoogleCodeExporter commented 9 years ago

The attached file is empty.

Original comment by andrey.s...@gmail.com on 16 Jan 2010 at 9:35

GoogleCodeExporter commented 9 years ago

I'm not sure what happened there, here's the full file.

I didn't update all the file comments, so it will need some more work.  I added 
some extra print messages behind #ifdef DEBUG to help track what is happening.

Original comment by da...@fries.net on 16 Jan 2010 at 4:35

Attachments:

patch_functions.cc

GoogleCodeExporter commented 9 years ago

It seems like you've reworked PatchAllModules() more than is necessary: moving
everything to sets etc instead of just keeping the bool and vector like we had
before.  Is there a reason you did that?  It's hard for me to tell, but it 
seems like
the major change for this patch is to updating refcounts of loaded libraries in
PatchAllModules, not just in LoadLibraryExW.  Is that right?  If so, can't you 
just
make the change by adding one line to PatchAllModules, that updates the refcount
there?  But maybe I'm not understanding what you're trying to do, or where the 
time
win comes from your patch.

Original comment by csilv...@gmail.com on 16 Jan 2010 at 5:11

GoogleCodeExporter commented 9 years ago

In short, the version I posted scans only the modules that were loaded since 
the 
last PatchAllModules run, and avoids calling PatchAllModules in FreeLibrary on 
a 
library that didn't require any patching.

The patching_map reference count method only keeps track of libraries when 
PatchAllModules returns true, ie they or a library they were linked against and 
pulled in at the same time needed to be patched.  The nopatching_set helps on 
LoadLibrary to skip out on ones that don't need patching, but that check will 
be a 
series of string compares.  The FreeLibrary routine will only skip 
PatchAllModules 
if the library called for Free is in the patching_map and results in a greater 
than 
0 reference count.  Other modules that didn't need to be patched (80 to what? 3 
in 
our program), invoke PatchAllModules such as OPENGL32 in our case every 
FreeLibrary 
call.

PatchAllModules modules vector builds up a list of the currently loaded 
modules.  
Then it goes through the set of libc modules doing a vector erase on only the 
libc 
patched modules (which is a linear operation every time you erase from the 
middle 
because everything after has to be moved, not that it is probably contributing 
much 
to the total time).  Every time it goes though PatchAllModules, every module 
which 
isn't in libc gets scanned for malloc symbols, I expect that is the big 
timesink 
short of skipping PatchAllModules altogether when it isn't required.

In the version I posted, the loaded_map is the previous list of modules.  
PatchAllModules get the current list EnumProcessModules (and skips calling 
GetModuleInformation by the way on every module) and the set operations 
produces 
the modules that were add or removed since the previous PatchAllModule.  The 
loaded_map is updated for those module changes.  Previously finding out if any 
libc 
functions were unloaded required traversing the modules vector, now it's a set 
find 
and only on the modules that were removed, which half of the time will be zero, 
so 
maybe that for loop needs to be in an 'if (!removed.empty())' check now that I 
think about it.  Then just the added modules have GetModuleInformation called 
on 
them and only the added modules are scanned, which the first time through will 
be 
all of them and afterwards will be one or a very small number of them.

Original comment by da...@fries.net on 16 Jan 2010 at 5:49

GoogleCodeExporter commented 9 years ago

OK, let's not focus on the vector-traversal efficiency just yet: as you say, 
these
objects are all very small, and O(n) vs O(n^2) is likely not a big deal.  In any
case, that's easy to optimize later.

For now, can you make a minimally invasive patch that adds the functionality 
that is
actually helping you?  (The data structures that keep track of what is and isn't
currently loaded, it sounds like.)  That is, no new set objects -- in fact new 
data
structures kept to as minimum as you can.  That will make it clearer what the 
patch
is doing.  Then, in a separate step, we can optimize further if it's called for.

Original comment by csilv...@gmail.com on 16 Jan 2010 at 6:17

GoogleCodeExporter commented 9 years ago

I think this is the minimally invasive patch required to not have a performance 
regression as compared to the Microsoft allocator.  Can you suggest an 
alternative 
to the std::set container or set_difference operation (or why are you opposed 
to 
them)?

For PatchAllModules to be efficient it needs to only look at the modules loaded 
since the last scan.  To do that it needs a list of modules that were loaded 
last.  
If it was using a vector it would need to sort both first to find the added 
list.  
That would make it more complicated and leave the rest of the logic 
substantially 
the same as the version I posed.

Original comment by da...@fries.net on 16 Jan 2010 at 6:44

GoogleCodeExporter commented 9 years ago

I'm not opposed to the way you're doing it now, I just don't understand it.  
There's
lots of state stored in global variables that are shared between multiple 
routines,
you are copying data around from one data structure to another; overall it just 
seems
very tangled to me.  I feel a more minimal change would keep the tangling to a 
minimum.

In lieu of that, I think I need another version of the patch with much more 
extensive
comments around the parts you've added.  I'd like to get a new version of 
perftools
out next week, but my time is tight, so I'm hoping it won't be too difficult to
puzzle out what's going on here.

} To do that it needs a list of modules that were loaded last.

You just need a list of modules that had been loaded at some point, no?  It 
doesn't
matter if it's the last scan or an earlier scan.  I'd think you can just reuse 
the
existing patched_modules data structure.  I'm not suggesting you store this
information in a vector, I'm suggesting you try to fit what you're doing in 
with the
existing control flow and data structures, rather than changing it all about.

Original comment by csilv...@gmail.com on 16 Jan 2010 at 6:55

GoogleCodeExporter commented 9 years ago

I'll go through and update the comments and post those.  As to code changes, 
all my 
computers at home run Linux, so I'm not going to be able to do any code changes 
this weekend anyway.

I think PatchAllModules does need the last list of loaded modules 
PatchAllModules 
scanned to clear out the list of modules not loaded in a FreeLibrary in case a 
later LoadLibrary loads one at the same location.  I'm not sure what you were 
getting at for that question.

Which data structure is patched_modules?  The patching_map or loaded_map I 
renamed 
it to and reuse it in which context?

Original comment by da...@fries.net on 16 Jan 2010 at 7:12

GoogleCodeExporter commented 9 years ago

I didn't remove the #ifdef DEBUG print messages, but I was expecting them to be 
removed before the final commit.  I can produce a version with them removed if 
you 
want.

Looks like I had a bug between PatchAllModules and FreeLibrary.  If the 
reference 
count reaches 0 (but the reference count was wrong and it is still loaded), it 
would stay zero through PatchAllModules.  Erase the entry in FreeLibrary before 
calling PatchAllModules so it can detect this condition and add it back with a 
reference count of 1.

I think loaded_copy could be removed, it would be ugly though.  std::map can't 
be 
used for set operations, std::set only has a key, but if that key was a class 
that 
held both a module handle and a reference count, but the comparison was only on 
the 
module handle and the reference count was marked mutable, I think it would 
work.  
It sound too ugly for me though.

Slight nitpick note, the description for nopatching_set may not be completely 
correct.  If both library A and B are linked against library C and only C has 
malloc routines to patch and the application isn't linked against any of the 
libraries, then I would think the following could be a problem.  Load A, which 
brings in C and is patched.  Load B which is marked as not needing to be 
patched 
because C was already loaded, unload A and B which unloads C.  Load B which 
brings 
in C, but since B was added to the nopatching_set this time C doesn't get 
patched.  
I don't like how likely it is to bite someone, or if it is worth looking at 
even 
more complicated ways to detect that problem.

Original comment by da...@fries.net on 16 Jan 2010 at 10:51

Attachments:

patch_functions_comments.patch

caohaiwd / gperftools

CreateToolhelp32Snapshot fails spuriously in an MT application #200