ddio / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

OpenGL Performance Issue In MakeCurrent And SwapBuffers #205

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create an OpenGL application.
2. Our application will call MakeCurrent twice and Swapbuffers once during
the render pass.
3. A very noticeable drop in frame time is noticed.  An average frame time
for our application is around 3 to 4 ms.  Using TCMalloc, our application
would see 17 - 20 ms frame times.
4. A colleague investigated further and noticed that GDI32.dll was calling
to load and unload OpenGL32.  Becuase of this, the following functions
would be called: Perftools_LoadLibraryExW and Perftools_FreeLibrary.  The
two functions being called each frame is a large performance hit.  A test
was conducted and the functions were changed to not conduct the patch if
OPENGL32 was being loaded or unloaded.  This solved the performance issue
we were seeing.

What is the expected output? What do you see instead?
These functions are going to get hit each frame.  Traversing all the
modules each frame and trying to patch them can be time consuming.  The
WIN32 API, to my knowledge, does not have an easy way to get to the current
reference count of the module, but it might be better to check the
reference count of the module or move patching to be done in the DllMain
entry point.  The DllMain will not solve for statically linking TCMalloc
though.

What version of the product are you using? On what operating system?
TCMalloc 1.4
Windows XP SP3

Original issue reported on code.google.com by Ryan.H.K...@gmail.com on 7 Jan 2010 at 1:42

GoogleCodeExporter commented 9 years ago

Original comment by `` on 18 Jan 2012 at 9:46

GoogleCodeExporter commented 9 years ago
One thing we could do is cache module information -- either the module name, or 
ideally, the full path to the module (I don't know if that's available via 
LoadLibraryExW).  The idea is that if we see once that a library doesn't have 
any 
libc symbols in it, we can mark that down so we ignore it at subsequent loads 
and 
unloads.

The current LoadLibraryExW was written under the assumption it would be called 
rarely.  Clearly that's not always true, and in the situations it's called 
frequently, it's the same module being loaded and unloaded again.

(btw, thanks for tracking down what the problem is!  I would have been stumped 
otherwise.)

Would this caching idea work for you?

Original comment by csilv...@gmail.com on 7 Jan 2010 at 2:03

GoogleCodeExporter commented 9 years ago
I discovered that tcmalloc was causing GL SwapBuffers to take 4.17ms instead
of 0.17ms with the default allocator, and not expecting it to need to allocate
any memory I put a breakpoint at swap buffers then tcmalloc to discover the
module request/free issue.

I'm looking into the different options to address the performance issue.

Original comment by da...@fries.net on 7 Jan 2010 at 7:58

GoogleCodeExporter commented 9 years ago
Try the following version of patch_functions.cc.  I've changed it to only try 
to 
patch/unpatch the module in question, when it gets the LoadLibrary/FreeLibrary 
windows 
calls.  I also keep a cache of modules that do not have any libc info, so it's 
fast to 
load those after the first time.

All unittests pass, but I don't have good test that load and unload lots of 
windows 
modules, so no promises.  But try it out and let me know how it works.

Original comment by csilv...@gmail.com on 9 Jan 2010 at 3:47

Attachments:

GoogleCodeExporter commented 9 years ago
There's some discussion of this patch going on in issue 201.  It looks like 
it's not 
quite ready to try yet.  I hope to have a new candidate patch (or rather, a new 
version of patch_functions.cc) up shortly.

Original comment by csilv...@gmail.com on 11 Jan 2010 at 3:56

GoogleCodeExporter commented 9 years ago
I did test out your patch_functions.cc from January 8, and it does fix our
application start up time issue.  With the Microsoft malloc the program was 
taking 7
seconds, with the 1.4 tcmalloc it was 11 to 12 with the subversion tcmalloc and 
above
file it was back to 6.21 seconds.  The caching is making a big difference.

As far as OPENGL32 frame time load/reference/unreference I'm looking into it. 
LoadLibrary, FreeLibrary is no longer the bottleneck, but the performance isn't 
what
the Microsoft allocator gives.  As in 3.7 ms frame time vs 6.6 ms for tcmalloc.

Original comment by da...@fries.net on 11 Jan 2010 at 7:02

GoogleCodeExporter commented 9 years ago
Looking over the shoulder of my colleague, I noticed that you don't call
PatchAllModules on a module that is new to the system.  The only reason that I 
state
this is that DLLs can implicitly load other modules into the system that they 
depend
on.  Implicitly loaded modules are handled by the OS loader and do not call to 
the
LoadLibrary routine.  It might be best to call a function that compares the 
cache to
the current and only patch those functions that need to be patched.

Ultimately, I think this problem should be addressed by Microsoft.  I am going 
to try
and submit some information detailing this problem, but I'm pretty sure that 
this
problem will not be addressed on their side.  Loading and unloading the same 
library
just to call a function seems like a bad idea on their part, but I am sure they 
had
their reasons for doing what they did.  There are still valid reasons for 
having the
cache though, as we did notice an impact on application initialization time 
without
the cache in place.

Original comment by Ryan.H.K...@gmail.com on 11 Jan 2010 at 7:02

GoogleCodeExporter commented 9 years ago
On second thought after I look at both our window make current and swap times 
the
total is compariable to the Microsoft allocator.  Sometimes when starting the 
frame
time is greater in the 2D vs 3D window, and sometimes it switches, I shouldn't 
have
looked at just one.  Ignore the above comment about a possible performance 
issue for
the frame time.

Original comment by da...@fries.net on 11 Jan 2010 at 10:23

GoogleCodeExporter commented 9 years ago
Ryan's comment is right on.  Here's a new version of patch_functions.cc, that 
should 
work better.  Again, it passes all the unittests, but I don't have good 
module-loading 
and -unloading test cases to test it out on, so your feedback is appreciated.

Original comment by csilv...@gmail.com on 11 Jan 2010 at 10:39

GoogleCodeExporter commented 9 years ago
Let me try that attaching again...

Original comment by csilv...@gmail.com on 11 Jan 2010 at 10:49

Attachments:

GoogleCodeExporter commented 9 years ago
Over in issue 201, there's been some reports of deadlock in the current 
patch_functions.cc, when multiple threads are loading/unloading libraries at 
the same 
time.  Here's a new version of patch_functions.cc, that I think addresses that. 
 When 
you get a chance, can you try this patch and see how it works for you?

Original comment by csilv...@gmail.com on 13 Jan 2010 at 10:12

GoogleCodeExporter commented 9 years ago
Boy, what is it with me and attaching files?  Trying again...

Original comment by csilv...@gmail.com on 13 Jan 2010 at 10:13

Attachments:

GoogleCodeExporter commented 9 years ago
01-08-2009 update, startup 6 to 7 seconds, current+swap 2d&3d (from notes) = 
5.71 ms
This was the PatchOneModule, ModuleEntryCopy version.
01-13-2009 update, startup 10 seconds, current+swap 2d&3d = 17.71 ms

Original comment by da...@fries.net on 14 Jan 2010 at 4:11

GoogleCodeExporter commented 9 years ago
Can you remind me what the goal timings are?

Unfortunately, the 01-08 update got its speed (partially) by being incorrect. 
:-(

I wonder if the optimizations I've put in are working.  Can you perhaps add 
printfs to 
the LoadLibraryExW and FreeLibrary overload methods, to verify that the module 
you 
keep loading/unloading actually makes it into the nopatch_set?  If not, let's 
try to 
understand why.

Original comment by csilv...@gmail.com on 14 Jan 2010 at 4:28

GoogleCodeExporter commented 9 years ago
We have about three times we are looking at.  There's startup time, static 
screen
update time, and dynamic screen update time.  The default visual studio malloc 
was
tripping over all the allocations in the dynamic screen update time.  Switching 
to
tcmalloc addresses the dynamic, but the last released version is doubling the 
startup
time and trippling our base (static) screen update.  It's slower but more 
deterministic.

I can see a problem in free library.  It is only using the patching_map which 
looks
like it is only when the load library needed to patch something for that 
module.  I
don't think the reference count will work anyway.  The reference counting 
doesn't
take into account if the library was already loaded before the first tcmalloc
LoadLibrary, so it sees load ref=1, free ref=0.  Or if a loaded library was 
linked
against another library and brought it in and the original reference free 
library was
called, and it was still loaded.

I have seen a deadlock in some of my test modifications.  From what I was 
seeing one
of the routines in PatchAllModules is executed with the spinlock held, but was
calling a routine that called LoadLibrary which would find the spinlock held.  
In
that case the stack trace didn't show this recursion, I had to step until it
recoursed.  On Windows their mutex locks (critical sections) are recursive and 
would
solve that recursion deadlock if the data structures wouldn't be corrupted in 
that case.

What deadlock was being seen?

Original comment by da...@fries.net on 14 Jan 2010 at 6:44

GoogleCodeExporter commented 9 years ago
} I can see a problem in free library.  It is only using the patching_map which 
looks
} like it is only when the load library needed to patch something for that 
module.

Why is that a problem?

Did you manage to run with the printf's inserted, like I mentioned?  That would 
be 
good to see.

} The reference counting doesn't take into account if the library was
} already loaded before the first tcmalloc LoadLibrary

That's true, but it's ok.  The reference count is just a small optimization; 
without 
it FreeLibrary may do a bit of unnecessary work trying to unpatch necessarily, 
when 
it could have known the unpatch was a noop.  In the situation you describe, one 
might 
see a bit of extra work from time to time along those lines, as well.

However, it shouldn't affect the use-case you've described in your original bug 
report, which is why I'm confused.

} What deadlock was being seen?

I document it in patch_functions.cc now.  It wasn't a problem with recursive 
locking, 
it was a problem with lock inversion with respect to some internal windows lock.

Original comment by csilv...@gmail.com on 14 Jan 2010 at 6:53

GoogleCodeExporter commented 9 years ago
Early Load exit OPENGL32 handle 5ed00000 time  0.007 ms
Late free exit handle 5ed00000 time  3.825 ms
Early Load exit OPENGL32 handle 5ed00000 time  0.008 ms
Late free exit handle 5ed00000 time  3.820 ms
Early Load exit OPENGL32 handle 5ed00000 time  0.006 ms
Late free exit handle 5ed00000 time  3.752 ms
Early Load exit OPENGL32 handle 5ed00000 time  0.008 ms
Late free exit handle 5ed00000 time  3.925 ms

I included the changes to produce the above output.  As you can see the 
FreeLibrary
is expensive.

Original comment by da...@fries.net on 14 Jan 2010 at 7:31

Attachments:

GoogleCodeExporter commented 9 years ago
timing.patch is empty.  Can you try attaching it again?

Original comment by csilv...@gmail.com on 15 Jan 2010 at 1:54

GoogleCodeExporter commented 9 years ago
Merging with issue 201, where the discussion seems to have moved.

Original comment by csilv...@gmail.com on 18 Jan 2010 at 11:03