Closed GoogleCodeExporter closed 9 years ago
Original comment by `` on 18 Jan 2012 at 9:46
One thing we could do is cache module information -- either the module name, or
ideally, the full path to the module (I don't know if that's available via
LoadLibraryExW). The idea is that if we see once that a library doesn't have
any
libc symbols in it, we can mark that down so we ignore it at subsequent loads
and
unloads.
The current LoadLibraryExW was written under the assumption it would be called
rarely. Clearly that's not always true, and in the situations it's called
frequently, it's the same module being loaded and unloaded again.
(btw, thanks for tracking down what the problem is! I would have been stumped
otherwise.)
Would this caching idea work for you?
Original comment by csilv...@gmail.com
on 7 Jan 2010 at 2:03
I discovered that tcmalloc was causing GL SwapBuffers to take 4.17ms instead
of 0.17ms with the default allocator, and not expecting it to need to allocate
any memory I put a breakpoint at swap buffers then tcmalloc to discover the
module request/free issue.
I'm looking into the different options to address the performance issue.
Original comment by da...@fries.net
on 7 Jan 2010 at 7:58
Try the following version of patch_functions.cc. I've changed it to only try
to
patch/unpatch the module in question, when it gets the LoadLibrary/FreeLibrary
windows
calls. I also keep a cache of modules that do not have any libc info, so it's
fast to
load those after the first time.
All unittests pass, but I don't have good test that load and unload lots of
windows
modules, so no promises. But try it out and let me know how it works.
Original comment by csilv...@gmail.com
on 9 Jan 2010 at 3:47
Attachments:
There's some discussion of this patch going on in issue 201. It looks like
it's not
quite ready to try yet. I hope to have a new candidate patch (or rather, a new
version of patch_functions.cc) up shortly.
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 3:56
I did test out your patch_functions.cc from January 8, and it does fix our
application start up time issue. With the Microsoft malloc the program was
taking 7
seconds, with the 1.4 tcmalloc it was 11 to 12 with the subversion tcmalloc and
above
file it was back to 6.21 seconds. The caching is making a big difference.
As far as OPENGL32 frame time load/reference/unreference I'm looking into it.
LoadLibrary, FreeLibrary is no longer the bottleneck, but the performance isn't
what
the Microsoft allocator gives. As in 3.7 ms frame time vs 6.6 ms for tcmalloc.
Original comment by da...@fries.net
on 11 Jan 2010 at 7:02
Looking over the shoulder of my colleague, I noticed that you don't call
PatchAllModules on a module that is new to the system. The only reason that I
state
this is that DLLs can implicitly load other modules into the system that they
depend
on. Implicitly loaded modules are handled by the OS loader and do not call to
the
LoadLibrary routine. It might be best to call a function that compares the
cache to
the current and only patch those functions that need to be patched.
Ultimately, I think this problem should be addressed by Microsoft. I am going
to try
and submit some information detailing this problem, but I'm pretty sure that
this
problem will not be addressed on their side. Loading and unloading the same
library
just to call a function seems like a bad idea on their part, but I am sure they
had
their reasons for doing what they did. There are still valid reasons for
having the
cache though, as we did notice an impact on application initialization time
without
the cache in place.
Original comment by Ryan.H.K...@gmail.com
on 11 Jan 2010 at 7:02
On second thought after I look at both our window make current and swap times
the
total is compariable to the Microsoft allocator. Sometimes when starting the
frame
time is greater in the 2D vs 3D window, and sometimes it switches, I shouldn't
have
looked at just one. Ignore the above comment about a possible performance
issue for
the frame time.
Original comment by da...@fries.net
on 11 Jan 2010 at 10:23
Ryan's comment is right on. Here's a new version of patch_functions.cc, that
should
work better. Again, it passes all the unittests, but I don't have good
module-loading
and -unloading test cases to test it out on, so your feedback is appreciated.
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 10:39
Let me try that attaching again...
Original comment by csilv...@gmail.com
on 11 Jan 2010 at 10:49
Attachments:
Over in issue 201, there's been some reports of deadlock in the current
patch_functions.cc, when multiple threads are loading/unloading libraries at
the same
time. Here's a new version of patch_functions.cc, that I think addresses that.
When
you get a chance, can you try this patch and see how it works for you?
Original comment by csilv...@gmail.com
on 13 Jan 2010 at 10:12
Boy, what is it with me and attaching files? Trying again...
Original comment by csilv...@gmail.com
on 13 Jan 2010 at 10:13
Attachments:
01-08-2009 update, startup 6 to 7 seconds, current+swap 2d&3d (from notes) =
5.71 ms
This was the PatchOneModule, ModuleEntryCopy version.
01-13-2009 update, startup 10 seconds, current+swap 2d&3d = 17.71 ms
Original comment by da...@fries.net
on 14 Jan 2010 at 4:11
Can you remind me what the goal timings are?
Unfortunately, the 01-08 update got its speed (partially) by being incorrect.
:-(
I wonder if the optimizations I've put in are working. Can you perhaps add
printfs to
the LoadLibraryExW and FreeLibrary overload methods, to verify that the module
you
keep loading/unloading actually makes it into the nopatch_set? If not, let's
try to
understand why.
Original comment by csilv...@gmail.com
on 14 Jan 2010 at 4:28
We have about three times we are looking at. There's startup time, static
screen
update time, and dynamic screen update time. The default visual studio malloc
was
tripping over all the allocations in the dynamic screen update time. Switching
to
tcmalloc addresses the dynamic, but the last released version is doubling the
startup
time and trippling our base (static) screen update. It's slower but more
deterministic.
I can see a problem in free library. It is only using the patching_map which
looks
like it is only when the load library needed to patch something for that
module. I
don't think the reference count will work anyway. The reference counting
doesn't
take into account if the library was already loaded before the first tcmalloc
LoadLibrary, so it sees load ref=1, free ref=0. Or if a loaded library was
linked
against another library and brought it in and the original reference free
library was
called, and it was still loaded.
I have seen a deadlock in some of my test modifications. From what I was
seeing one
of the routines in PatchAllModules is executed with the spinlock held, but was
calling a routine that called LoadLibrary which would find the spinlock held.
In
that case the stack trace didn't show this recursion, I had to step until it
recoursed. On Windows their mutex locks (critical sections) are recursive and
would
solve that recursion deadlock if the data structures wouldn't be corrupted in
that case.
What deadlock was being seen?
Original comment by da...@fries.net
on 14 Jan 2010 at 6:44
} I can see a problem in free library. It is only using the patching_map which
looks
} like it is only when the load library needed to patch something for that
module.
Why is that a problem?
Did you manage to run with the printf's inserted, like I mentioned? That would
be
good to see.
} The reference counting doesn't take into account if the library was
} already loaded before the first tcmalloc LoadLibrary
That's true, but it's ok. The reference count is just a small optimization;
without
it FreeLibrary may do a bit of unnecessary work trying to unpatch necessarily,
when
it could have known the unpatch was a noop. In the situation you describe, one
might
see a bit of extra work from time to time along those lines, as well.
However, it shouldn't affect the use-case you've described in your original bug
report, which is why I'm confused.
} What deadlock was being seen?
I document it in patch_functions.cc now. It wasn't a problem with recursive
locking,
it was a problem with lock inversion with respect to some internal windows lock.
Original comment by csilv...@gmail.com
on 14 Jan 2010 at 6:53
Early Load exit OPENGL32 handle 5ed00000 time 0.007 ms
Late free exit handle 5ed00000 time 3.825 ms
Early Load exit OPENGL32 handle 5ed00000 time 0.008 ms
Late free exit handle 5ed00000 time 3.820 ms
Early Load exit OPENGL32 handle 5ed00000 time 0.006 ms
Late free exit handle 5ed00000 time 3.752 ms
Early Load exit OPENGL32 handle 5ed00000 time 0.008 ms
Late free exit handle 5ed00000 time 3.925 ms
I included the changes to produce the above output. As you can see the
FreeLibrary
is expensive.
Original comment by da...@fries.net
on 14 Jan 2010 at 7:31
Attachments:
timing.patch is empty. Can you try attaching it again?
Original comment by csilv...@gmail.com
on 15 Jan 2010 at 1:54
Merging with issue 201, where the discussion seems to have moved.
Original comment by csilv...@gmail.com
on 18 Jan 2010 at 11:03
Original issue reported on code.google.com by
Ryan.H.K...@gmail.com
on 7 Jan 2010 at 1:42