Use GLX_EXT_buffer_age and/or GLX_OML_swap_method

Hi,

I don't use compton, but I was passing by and noticed that you care about performance, and also some laments that development is stalling. So here's something to keep you busy :)

With GLX, you now have basically three presentation strategies with different disadvantages:

Redraw everything and swap (extraneous redraws)
Blit from old framebuffer by hand, redraw what changed, and swap (nouveau seems to hit a software fallback — probably a driver weakness, but you get punished)
Use CopySubBufferMESA (does not vsync)

(it also seems that with --glx-swap-method (undocumented in the man page) you are making assumptions about how the driver is implementing swaps)

Please have a look into using

GLX_OML_swap_method — an older extension, exposed by Mesa but not nVidia drivers, lets you ask for SwapBuffers semantics (copy or swap) explicitly
and/or GLX_EXT_buffer_age — the newer extension, exposed by nVidia drivers (but not by Mesa GLX, hopefully in future), lets you know how old the contents of the buffer you're drawing into are

Either should allow you to use SwapBuffers (thus getting VSync) and optimize redrawing.
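To make the buffer-age idea concrete, here is a minimal sketch in C of how a compositor might pick the area to redraw once it knows the age of the back buffer. With GLX_EXT_buffer_age the age would come from glXQueryDrawable() with GLX_BACK_BUFFER_AGE_EXT; everything else here (box_t, damage_history_t, repaint_box, history_push, HISTORY_MAX) is a hypothetical name chosen for illustration, not compton's code. It also assumes damage is tracked as one bounding box per frame, which is coarser than a real region.

```c
#include <assert.h>
#include <stdbool.h>

/* Axis-aligned box; x2/y2 are exclusive. Empty if x2 <= x1 or y2 <= y1. */
typedef struct { int x1, y1, x2, y2; } box_t;

#define HISTORY_MAX 8 /* hypothetical cap on remembered frames */

/* Damage of already presented frames; frames[0] is the most recent one. */
typedef struct {
    box_t frames[HISTORY_MAX];
    int count; /* number of valid entries */
} damage_history_t;

static bool box_empty(box_t b) { return b.x2 <= b.x1 || b.y2 <= b.y1; }

static box_t box_union(box_t a, box_t b) {
    if (box_empty(a)) return b;
    if (box_empty(b)) return a;
    box_t r = {
        a.x1 < b.x1 ? a.x1 : b.x1, a.y1 < b.y1 ? a.y1 : b.y1,
        a.x2 > b.x2 ? a.x2 : b.x2, a.y2 > b.y2 ? a.y2 : b.y2
    };
    return r;
}

/* Area to redraw into a back buffer of the given age.
 * age 0: contents undefined, redraw the whole screen.
 * age 1: buffer holds the previous frame, only the new damage is needed.
 * age n: additionally redraw the damage of the n-1 frames presented since. */
static box_t repaint_box(const damage_history_t *h, int age,
                         box_t current, box_t screen) {
    assert(age >= 0);
    if (age == 0 || age - 1 > h->count)
        return screen; /* unknown contents or not enough history */
    box_t r = current;
    for (int i = 0; i < age - 1; i++)
        r = box_union(r, h->frames[i]);
    return r;
}

/* Remember the damage of the frame that was just presented. */
static void history_push(damage_history_t *h, box_t damage) {
    int n = h->count < HISTORY_MAX ? h->count : HISTORY_MAX - 1;
    for (int i = n; i > 0; i--)
        h->frames[i] = h->frames[i - 1];
    h->frames[0] = damage;
    if (h->count < HISTORY_MAX)
        h->count++;
}
```

Each frame would call repaint_box() before drawing and history_push() after the swap, so the swap itself can stay a plain SwapBuffers call.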

I don't use compton, but I was passing by and noticed that you care about performance, and also some laments that development is stalling. So here's something to keep you busy :)

Thanks for helping, really! :-)

Blit from old framebuffer by hand, redraw what changed, and swap (nouveau seems to hit a software fallback — probably a driver weakness, but you get punished)

Yeah, this is --glx-copy-from-front. With nvidia-drivers the performance looks okay but for nouveau it's horrifying.

Use CopySubBufferMESA (does not vsync)

Yes, --glx-use-copysubbuffermesa. Some people say it doesn't break VSync, but others complain that it does.

(it also seems that with --glx-swap-method (undocumented in the man page) you are making assumptions about how the driver is implementing swaps)

Please have a look into using

GLX_OML_swap_method — an older extension, exposed by Mesa but not nVidia drivers, lets you ask for SwapBuffers semantics (copy or swap) explicitly

and/or GLX_EXT_buffer_age: the newer extension, exposed by nVidia drivers (but not by Mesa GLX, hopefully in future), lets you know how old the contents of the buffer you're drawing into are

Either should allow you to use SwapBuffers (thus getting VSync) and optimize redrawing.

Thanks for the extensive information, firstly. :-)

Yep, the man page isn't updated very frequently. I think I've gotten the man page up to date in the richardgv-dev branch. Our documentation quality has always been pretty poor, though.

We could use GLX_OML_swap_method, but there's one issue: as far as I know many drivers use triple buffering by default, and GLX_OML_swap_method cannot report this. I also fear that it may report a bad value to us (broken drivers), so what we eventually chose is to let the user manually specify the number of buffers (or rather, the maximum number of buffers) the driver has through --glx-swap-method. In the code in the richardgv-dev branch, --glx-swap-method supports 0 (undefined), 1 (copy), 2 (exchange), 3, 4, 5, or 6 buffers. (We could support more, but I guess the era of 4,096 buffers hasn't arrived yet. :-) )

In the code in the richardgv-dev branch we already have support for GLX_EXT_buffer_age, as --glx-swap-method buffer-age. nvidia-drivers doesn't seem to behave too well with this, though. It always tells me the buffer age is 2, but with that I occasionally see things broken and have to use --glx-swap-method 3.
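As a rough illustration of how those --glx-swap-method values could feed into damage tracking, the sketch below maps a swap method to the buffer age the compositor would assume. The numeric values follow the list above; reading "copy" as age 1 and "exchange" as age 2 is an interpretation in buffer-age terms, and the names (SWAPM_*, assumed_buffer_age) are hypothetical rather than taken from compton's source.

```c
#include <assert.h>

/* 0 undefined, 1 copy, 2 exchange, 3..6 buffers, as listed above.
 * SWAPM_BUFFER_AGE is a hypothetical marker meaning "ask the driver". */
enum {
    SWAPM_BUFFER_AGE = -1,
    SWAPM_UNDEFINED  = 0,
    SWAPM_COPY       = 1,
    SWAPM_EXCHANGE   = 2,
    SWAPM_MAX        = 6
};

/* Age to assume for the back buffer; 0 means "contents unknown, repaint
 * everything". driver_age is only consulted for SWAPM_BUFFER_AGE. */
static int assumed_buffer_age(int swap_method, int driver_age) {
    assert(swap_method >= SWAPM_BUFFER_AGE && swap_method <= SWAPM_MAX);
    if (swap_method == SWAPM_BUFFER_AGE)
        return driver_age > 0 ? driver_age : 0;
    /* copy: back buffer equals the last frame (age 1);
     * exchange or n buffers in rotation: contents are n frames old. */
    return swap_method;
}
```

A user who finds that the driver-reported age is unreliable, as described above, would simply pass a fixed method such as 3 instead.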

I've just merged those features from richardgv-dev branch to master.

By the way, there are a few questions that I wish to ask. I know just too little about OpenGL...

Currently we do region operations through X Fixes (even with the OpenGL backend). This adds a few roundtrips to X and thus decreases performance to a certain degree. (No stress for the GPU, though, just the CPU.) Compiz uses its own region operation implementation to speed this up and to eliminate any possibility of overlapping rectangles in a region. (X Fixes doesn't guarantee that the rectangles it returns don't overlap, although in reality xorg-server's implementation is probably trying to prevent this.) We could use Xlib's client-side region implementation, but its format is not completely compatible with X Fixes (specifically, X Fixes uses x, y, width, height to define a rectangle while Xlib uses x1, y1, x2, y2), and XCB probably doesn't have one at all; correct me if I'm wrong. Do you have any suggestions about this?
Is using a stencil buffer always inefficient? We use the stencil buffer by default to set the paint region because of my fear of overlapping rectangles in regions, stated above, but it seems to be 10%-15% slower than --glx-no-stencil mode.
Another problem is related to blurring. Blur calculation involves nearby pixels, and the pixels just outside the repaint area (1px or 2px, depending on blur radius) are involved in the calculation as well (with either the GLX backend + --glx-swap-buffer/--glx-use-copysubbuffermesa or the X Render backend). Yet those pixels don't always have correct values, which frequently causes a "linear" corruption issue around the damaged area (reported in #104). My workaround is to enlarge the repaint area by one or two pixels but not actually paint the extra area to the screen, using it only for the blur calculation. With --glx-use-copysubbuffermesa or the X Render backend this works pretty well, but it cannot be done with --glx-swap-buffer: we have no buffer other than the back buffer to paint on, and --glx-swap-buffer uses glXSwapBuffers() to swap the whole buffer out, so we cannot prevent the enlarged one- or two-pixel region from being swapped to the front, and the result is still linear corruption around the enlarged repainted region. Any suggestions, please? Should we just read those pixels from the front buffer with... Huh, glCopyPixels()?
I saw some people who have VSync enabled at the driver level: compton (GLX backend) is indeed throttled to the refresh rate, yet there's still tearing unless they enable VSync inside compton. Worse yet, sometimes with both driver VSync and compton VSync enabled, compton is throttled below the refresh rate (60Hz -> 49Hz with the radeon driver), which creates an artificial appearance of slowness. I guess I need some professional guidance on this issue.
We use OpenGL fixed-pipeline functions almost exclusively, that legacy OpenGL 1/2 stuff, just to keep compatibility with older cards. We use them for color inversion, blending, painting textures with opacity, etc. Are those fixed-pipeline functions significantly slower than modern (GLSL-based) solutions? (Okay, ARB assembly might be faster but I don't want to touch those nasty things...)
Presently, our window background blur is done this way: Create a new texture, use glCopyTexSubImage2D() to read current backbuffer into the texture, paint this texture back to back buffer with a GLSL blurring shader. Are there any faster ways?
We implemented support for GL_TEXTURE_RECTANGLE (we fall back to it if non-power-of-two GL_TEXTURE_2D textures are not supported) and non-y-inverted textures purely based on guesses. No reported breakages so far, but I have not the slightest idea whether they work. I just wonder if they are widespread enough to cause problems, and whether there's something special that I must take care of with those things.
I guess I went pretty far looking for different OpenGL debuggers but I've found no satisfactory OpenGL profiler so far. I can't get buGLe running, and the older gDebugger from Gremedy doesn't seem to do much profiling. I don't have an ATi graphics card, and AMD's new gDebugger and CodeXL don't seem to do OpenGL profiling here. apitrace supports profiling but not glXCreatePixmap(), glXBindTexImageEXT(), etc., which are pretty important for a compositor.
Do abundant OpenGL state changes (like glEnable(XXX); glDisable(XXX);) actually affect performance?

(github is doing something odd with formatting for email replies; I've removed my previous response that got all garbled)

Heh, this is what I get for not checking the branches first. :)

First of all, I have zero experience with Xlib, xcb, and compositing, and my experience with OpenGL is not that great either. I've omitted some questions; that means I have nothing to say on the matter.

Currently we do region operations through X Fixes ... Do you have any suggestions about this?

I don't even understand why you need that — for performance (prevent overdraw), or for correctness?

Is using a stencil buffer always inefficient? We use the stencil buffer by default to set the paint region because of my fear of overlapping rectangles in regions, stated above, but it seems to be 10%-15% slower than --glx-no-stencil mode.

Obviously stencil is not free (10-15% loss does not seem that bad, actually). And you can't even use it with CopySubBufferMESA (it's ignored when blitting, right?).

Another problem is related to blurring. ... Any suggestions, please?

Again, I'll admit that I failed to build a good mental model of the problem you were explaining. However let me point out that blurring is contradictory to performance. Also, blur that propagates breadth-wise forces you to redraw transparent windows back-to-front when something changes beneath, i.e. you sacrifice even more performance or get artifacts (imagine a white dot in a black window, and then dragging semi-transparent windows on top — each makes the bright spot wider).

Presently, our window background blur is done this way: Create a new texture, use glCopyTexSubImage2D() to read current backbuffer into the texture, paint this texture back to back buffer with a GLSL blurring shader. Are there any faster ways?

Eeek. I don't see how you can avoid a copy, but you can save one draw by feeding the shader with two textures (to-be-blurred background, and foreground). By the way, using texture2DOffset in the shader should be a bit better for performance.

We implemented support for GL_TEXTURE_RECTANGLE (we fall back to it if non-power-of-two GL_TEXTURE_2D textures are not supported) and non-y-inverted textures purely based on guesses. No reported breakages so far, but I have not the slightest idea whether they work. I just wonder if they are widespread enough to cause problems, and whether there's something special that I must take care of with those things.

You need to account for a different coordinate range (0-w,0-h, instead of 0-1,0-1) in the shader for texture_rectangle. I wouldn't implement such things before somebody asked. Well, in any case you can test them yourself: just patch the code to pretend that the extensions are missing to test your fallbacks.
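The coordinate difference can be sketched on the CPU side as follows. This is only an illustration: tex_coords and texcoords_t are hypothetical names, and the handling of non-y-inverted textures (flipping t around the texture height) is an assumption about how such textures would be sampled, not something verified against a real driver.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { float s1, t1, s2, t2; } texcoords_t;

/* Texture coordinates for the sub-rectangle (x, y, w, h), given in pixels,
 * of a tex_w x tex_h texture.
 * rectangle:  true for GL_TEXTURE_RECTANGLE (coordinates in texels,
 *             0..tex_w and 0..tex_h), false for GL_TEXTURE_2D (0..1).
 * y_inverted: true if pixel row 0 sits at t = 0; otherwise t is flipped. */
static texcoords_t tex_coords(int x, int y, int w, int h,
                              int tex_w, int tex_h,
                              bool rectangle, bool y_inverted) {
    assert(tex_w > 0 && tex_h > 0);
    texcoords_t c = { (float)x, (float)y, (float)(x + w), (float)(y + h) };
    if (!y_inverted) {
        c.t1 = (float)tex_h - c.t1;
        c.t2 = (float)tex_h - c.t2;
    }
    if (!rectangle) {
        /* GL_TEXTURE_2D expects normalized coordinates. */
        c.s1 /= (float)tex_w; c.s2 /= (float)tex_w;
        c.t1 /= (float)tex_h; c.t2 /= (float)tex_h;
    }
    return c;
}
```

Forcing rectangle to true in a debug build is one way to exercise the fallback on hardware that supports both targets.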

I guess I went pretty far looking for different OpenGL debuggers but I've found no satisfactory OpenGL profiler so far. I can't get buGLe running

I could. Feel free to contact me by e-mail to discuss this further.

Regarding apitrace, if it really doesn't support those calls, file a bug; its author is usually responsive.

Do abundant OpenGL state changes (like glEnable(XXX); glDisable(XXX);) actually affect performance?

I think the important question is how many separate draw calls you're doing. The state changes per se don't cost that much — you can write a simple benchmark to test how many such calls you can do per second.

I'm really really sorry, amonakov. Today my whole day is filled with boring schoolwork. I did read your reply but I have no time to finish my lengthy reply. (My replies are usually lengthy. :-D ) I just want to tell you that I'm still alive and still care about the problems, but I need some extra time.

Heh, this is what I get for not checking the branches first. :)

Never mind. :-)

First of all, I have zero experience with Xlib, xcb, and compositing, and my experience with OpenGL is not that great either. I've omitted some questions; that means I have nothing to say on the matter.

Still, thanks for your answers! :-)

Currently we do region operations through X Fixes ... Do you have any suggestions about this?

I don't even understand why you need that — for performance (prevent overdraw), or for correctness?

Both for performance and correctness.

Compositing is not drawing a single texture to a buffer but a huge number of them, each with different options (opacity, negation, etc.). To avoid drawing the same area twice (with 100% opacity) or drawing outside the region that actually needs to be repainted, we do a lot of region operations to get a painting region for each texture. We then paint the texture to the buffer either by setting up a mask in the stencil buffer and painting one large rectangle (default mode), or by painting each rectangle in the region separately (--glx-no-stencil).

Correctness matters because not all windows are rectangular. For strangely shaped windows we have to paint according to the window region reported by the X Shape extension.
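For readers unfamiliar with the two rectangle formats mentioned earlier in the thread, here is a small client-side sketch in C of converting between them, clipping a rectangle to a paint region, and checking a rectangle list for overlaps (the case the stencil path guards against). rect_t mirrors the field layout of Xlib's XRectangle, which is what XFixesFetchRegion() returns; the helper names are hypothetical, and a real region implementation would be considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* x, y, width, height: same field layout as Xlib's XRectangle. */
typedef struct { short x, y; unsigned short width, height; } rect_t;

/* x1, y1, x2, y2 with exclusive x2/y2, as in Xlib's client-side regions. */
typedef struct { int x1, y1, x2, y2; } box_t;

static box_t rect_to_box(rect_t r) {
    box_t b = { r.x, r.y, r.x + r.width, r.y + r.height };
    return b;
}

static rect_t box_to_rect(box_t b) {
    assert(b.x2 >= b.x1 && b.y2 >= b.y1);
    rect_t r = { (short)b.x1, (short)b.y1,
                 (unsigned short)(b.x2 - b.x1),
                 (unsigned short)(b.y2 - b.y1) };
    return r;
}

/* Writes the intersection to *out and returns true if it is non-empty. */
static bool box_intersect(box_t a, box_t b, box_t *out) {
    box_t r = {
        a.x1 > b.x1 ? a.x1 : b.x1, a.y1 > b.y1 ? a.y1 : b.y1,
        a.x2 < b.x2 ? a.x2 : b.x2, a.y2 < b.y2 ? a.y2 : b.y2
    };
    if (r.x2 <= r.x1 || r.y2 <= r.y1)
        return false;
    *out = r;
    return true;
}

/* True if any two boxes in the list overlap (simple O(n^2) check). */
static bool boxes_overlap(const box_t *boxes, int n) {
    box_t tmp;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (box_intersect(boxes[i], boxes[j], &tmp))
                return true;
    return false;
}
```

If a list is known to be overlap-free, painting each rectangle separately draws no pixel twice, which is the property the --glx-no-stencil path relies on.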

Is using a stencil buffer always inefficient? We use the stencil buffer by default to set the paint region because of my fear of overlapping rectangles in regions, stated above, but it seems to be 10%-15% slower than --glx-no-stencil mode.

Obviously stencil is not free (10-15% loss does not seem that bad, actually). And you can't even use it with CopySubBufferMESA (it's ignored when blitting, right?).

Huh, as stated above we use stencil buffer not when swapping but for defining the paint region when painting each and every window texture to back buffer.

Another problem is related to blurring. ... Any suggestions, please?

Again, I'll admit that I failed to build a good mental model of the problem you were explaining. However let me point out that blurring is contradictory to performance. Also, blur that propagates breadth-wise forces you to redraw transparent windows back-to-front when something changes beneath, i.e. you sacrifice even more performance or get artifacts (imagine a white dot in a black window, and then dragging semi-transparent windows on top — each makes the bright spot wider).

Huh, no, not really. The artifact only appears on the border of the repaint area. And it will go away once the pixel is repainted, so I imagine it's very unlikely to produce a large artifact this way.
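The workaround described in the original question (enlarge the repaint area by the blur radius, but present only the original area) boils down to a clamped box expansion. A minimal sketch, with hypothetical names and a single bounding box standing in for a full region:

```c
#include <assert.h>

typedef struct { int x1, y1, x2, y2; } box_t; /* x2/y2 exclusive */

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Grow a damage box by `radius` pixels on every side so the blur kernel
 * reads freshly painted neighbours, without leaving the screen. */
static box_t box_expand_clamped(box_t damage, int radius, box_t screen) {
    assert(radius >= 0);
    box_t r = {
        clampi(damage.x1 - radius, screen.x1, screen.x2),
        clampi(damage.y1 - radius, screen.y1, screen.y2),
        clampi(damage.x2 + radius, screen.x1, screen.x2),
        clampi(damage.y2 + radius, screen.y1, screen.y2)
    };
    return r;
}
```

The compositor would paint into the expanded box and then copy only the original damage box to the front, which is why this fits a sub-buffer copy but not a whole-buffer swap.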

Presently, our window background blur is done this way: Create a new texture, use glCopyTexSubImage2D() to read current backbuffer into the texture, paint this texture back to back buffer with a GLSL blurring shader. Are there any faster ways?

Eeek. I don't see how you can avoid a copy, but you can save one draw by feeding the shader with two textures (to-be-blurred background, and foreground). By the way, using texture2DOffset in the shader should be a bit better for performance.

Combining those two stages sounds like a good idea, thanks! I'm considering using linear sampling for the blur, as well.

As for texture2DOffset(), it's provided by GL_EXT_gpu_shader4, which is available only since the GeForce 8 Series (November 2006), right? Actually I'm not particularly comfortable with adding it. Huh, let me think...

We implemented support for GL_TEXTURE_RECTANGLE (we fall back to it if non-power-of-two GL_TEXTURE_2D textures are not supported) and non-y-inverted textures purely based on guesses. No reported breakages so far, but I have not the slightest idea whether they work. I just wonder if they are widespread enough to cause problems, and whether there's something special that I must take care of with those things.

You need to account for a different coordinate range (0-w,0-h, instead of 0-1,0-1) in the shader for texture_rectangle. I wouldn't implement such things before somebody asked. Well, in any case you can test them yourself: just patch the code to pretend that the extensions are missing to test your fallbacks.

Oops, I never realized my card actually supports GL_TEXTURE_RECTANGLE. I must be an idiot... I will push the fix out later. Looks like they are almost equal in speed on my GTX 670. I still have no way to test non-y-inverted textures, though. And thanks!

I guess I went pretty far looking for different OpenGL debuggers but I've found no satisfactory OpenGL profiler so far. I can't get buGLe running

I could. Feel free to contact me by e-mail to discuss this further.

Oh, I guess the ebuild I wrote has some problems. I modified it and it works right now. But thanks. :-)

Regarding apitrace, if it really doesn't support those calls, file a bug; its author is usually responsive.

Huh, I guess it isn't that easy. glXBindTexImageEXT() is tied to an X pixmap, and with apitrace's architecture this pixmap no longer exists when replaying. I'm not completely sure if it's possible for apitrace to preserve some of its information.

Do abundant OpenGL state changes (like glEnable(XXX); glDisable(XXX);) actually affect performance?

I think the important question is how many separate draw calls you're doing. The state changes per se don't cost that much — you can write a simple benchmark to test how many such calls you can do per second.

Oh, I see. :-)

Update: I have no idea why this would happen, but texture2DOffset() seems to cause a performance degradation here, ranging from 2% to 5%, according to my benchmark.

Yes, --glx-use-copysubbuffermesa. Some people say it doesn't break VSync, but others complain that it does.

On recent nvidia and intel hardware/drivers I believe it "partly" breaks vsync: most of the screen has no tearing, but near the top inch or so there is tearing.

The intel developers no longer recommend compositors use this (quote from an intel developer from phoronix):

Yes, the tearing you see in Kwin's compositor is because they use MESA_copy_sub_buffer rather than composite the whole framebuffer and swap. MESA_copy_sub_buffer was originally created as an optimisation and all the compositors were encouraged to use it. In retrospect, it was a bad idea, hurting the bandwidth constrained IGP devices the most. The very same devices that it claimed to be designed for. And now continued use of that extension is an extreme pessimation.

@bwat47:

On recent nvidia and intel hardware/drivers I believe it "partly" breaks vsync: most of the screen has no tearing, but near the top inch or so there is tearing.

Ah, I see, thanks for the info. Yes, in theory we often call glXCopySubBufferMESA() many times in a single repaint just to paint the content to the front buffer (many rectangles), while with glXSwapBuffers() it's only one call, so it seems reasonable that glXCopySubBufferMESA() breaks VSync in some cases.

The intel developers no longer recommend compositors use this (quote from an intel developer from phoronix):

MESA_copy_sub_buffer was originally created as an optimisation and all the compositors were encouraged to use it. In retrospect, it was a bad idea, hurting the bandwidth constrained IGP devices the most. The very same devices that it claimed to be designed for. And now continued use of that extension is an extreme pessimation.

Intel developers, or Intel driver developers? If they are Intel developers, they are speaking as if they have produced chips awesome enough to make full-screen repaints acceptable! :-D Just count how much work I've done mostly for their crappy products...

And if MESA_copy_sub_buffer is not good enough in their eyes, what is an alternative?

That was a post from one intel driver developer :)

From what I can see GLX_EXT_buffer_age will be the main alternative. It's not yet supported in mesa glx but probably will be in the future (afaik mesa EGL already has patches for the equivalent extension). I remember reading somewhere that it probably won't be implemented in mesa until DRI3 because it's tricky to implement properly in DRI2.

At the moment I think compiz just does fullscreen repaints, and mutter and kwin still use mesa_copy_sub_buffer so they both have the issue where there's tearing at the top. I'm not sure about this, but it seems that kwin's solution to this issue on intel seems to be similar to compton's --glx-copy-from-front option (correct me if I'm wrong, I haven't yet tried this compton option): https://bugs.kde.org/show_bug.cgi?id=307965#c99

edit: added --glx-copy-from-front to my compton commandline on my intel system. Seems to perform fine, can't say it performs noticeably better or worse and there's still no tearing (performance with compton on this intel card was quite good to begin with). I think firefox's smooth scrolling feels slightly more responsive with it though.

EDIT: Here's a post from another intel developer from the comments here (Robert Bragg): http://shnatsel.blogspot.com/2013/03/why-your-desktop-wont-be-running.html

To further comment on this point in particular, I'd like point out that I originally developed and implemented the EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage extensions for Wayland. You can see the history here: https://github.com/rib/gl-extensions (The buffer age extension starts off as EGL_INTEL_start_frame) With the architecture of the current open source X11 drivers these extensions basically can't be supported so it's going to take a considerable overhaul (I.e. DRI3) to enable these, whereas it's a piece of cake to support these with Wayland.

So it looks like the extension won't be implemented in X until DRI3.

@richardgv, that's the opensource way, removing support for features before an alternative is finished! :D

@bwat47:

Thanks for the info!

From what I can see GLX_EXT_buffer_age will be the main alternative. It's not yet supported in mesa glx but probably will be in the future (afaik mesa EGL already has patches for the equivalent extension). I remember reading somewhere that it probably won't be implemented in mesa until DRI3 because it's tricky to implement properly in DRI2.

We have support for GLX_EXT_buffer_age and we have a way to handle this without GLX_EXT_buffer_age: --glx-swap-method 3, which should work for most triple-buffer implementations including Intel chips, so this isn't a major concern at least for compton, I suppose.

it seems that kwin's solution to this issue on intel seems to be similar to compton's --glx-copy-from-front option (correct me if I'm wrong, I haven't yet tried this compton option): https://bugs.kde.org/show_bug.cgi?id=307965#c99

The kwin patch seems to use triple-buffer damage tracking and glCopyPixels(), somewhat resembling a mixture of --glx-swap-method 3 and --glx-copy-from-front, which is indeed more efficient than compton's approach. Let me see if I can implement this in compton as well.

that's the opensource way, removing support for features before an alternative is finished!

I suppose it wouldn't have happened if Linux had a 90% market share. I guess the key problem is still that Windows doesn't need this, so Intel doesn't care about it.

Regarding apitrace, if it really doesn't support those calls, file a bug; its author is usually responsive.

Huh, I guess it isn't that easy. glXBindTexImageEXT() is tied to an X pixmap, and with apitrace's architecture this pixmap no longer exists when replaying. I'm not completely sure if it's possible for apitrace to preserve some of its information.

It's actually been on my todo list for quite some time, as apitrace already supports the EGL equivalent of GLX_EXT_texture_from_pixmap, by emitting fake glTexImage2D calls. Try https://github.com/apitrace/apitrace/commit/151c370e340598cc2279e14e4513fa3247add12a . I only tested with a trivial EXT_texture_from_pixmap demo, and not a full-fledged compositor.

@jrfonseca:

Thanks for the fast work!

I just tried and started getting this:

apitrace: tracing to /home/richard/git/compton/compton.5.trace
*** Error in `./compton': double free or corruption (!prev): 0x00000000020fc180 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7cade)[0x7f2143a0cade]
/lib64/libc.so.6(+0x7d7d7)[0x7f2143a0d7d7]
./compton[0x41ac02]
./compton[0x4120a1]
./compton[0x411388]
./compton[0x40aaa0]
./compton[0x408ce3]
./compton[0x407dd6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f21439b4bb5]
./compton[0x406ad9]
======= Memory map: ========
00400000-00433000 r-xp 00000000 08:07 3019047                            /home/richard/git/compton/compton
00433000-00434000 rw-p 00033000 08:07 3019047                            /home/richard/git/compton/compton
01d70000-03f46000 rw-p 00000000 00:00 0                                  [heap]
40665000-40667000 r-xs 00000000 08:07 3059575                            /home/richard/.nvidia/gl2P6p6p (deleted)
415ce000-4166b000 rw-p 00000000 00:00 0
7f213ef21000-7f213f021000 rw-s 104948000 00:05 369                       /dev/nvidia0
7f213f3d6000-7f213f8cb000 rw-p 00000000 00:00 0
7f213f8cb000-7f213facb000 rw-s 1dbc59000 00:05 369                       /dev/nvidia0
7f213facb000-7f213facd000 rw-s 00000000 08:07 3059575                    /home/richard/.nvidia/gl2P6p6p (deleted)
7f213facd000-7f213fccd000 rw-s 198561000 00:05 369                       /dev/nvidia0
7f213fccd000-7f213fecd000 rw-s 1041a0000 00:05 369                       /dev/nvidia0
7f213fecd000-7f213ffcd000 rw-s 15e1e5000 00:05 369                       /dev/nvidia0
7f213ffcd000-7f213ffd4000 rw-s 18081e000 00:05 369                       /dev/nvidia0
7f213ffd4000-7f213fff4000 rw-s e8060000 00:05 369                        /dev/nvidia0
7f213fff4000-7f2140034000 rw-s 180a63000 00:05 369                       /dev/nvidia0
7f2140034000-7f2140054000 rw-s 1eb2e7000 00:05 369                       /dev/nvidia0
7f2140054000-7f2140094000 rw-s 1a09a1000 00:05 369                       /dev/nvidia0
7f2140094000-7f21400b4000 rw-s 103210000 00:05 369                       /dev/nvidia0
7f21400b4000-7f21402b5000 rw-p 00000000 00:00 0
7f21402b5000-7f2140858000 r--p 00000000 08:06 2386175                    /usr/lib64/locale/locale-archive
7f2140858000-7f2140a89000 rw-p 00000000 00:00 0
7f2140a89000-7f2140a8e000 r-xp 00000000 08:06 2364866                    /usr/lib64/libXdmcp.so.6.0.0
7f2140a8e000-7f2140a8f000 rw-p 00004000 08:06 2364866                    /usr/lib64/libXdmcp.so.6.0.0
7f2140a8f000-7f2140a90000 rw-p 00000000 00:00 0
7f2140a90000-7f2140a93000 r-xp 00000000 08:06 2364863                    /usr/lib64/libXau.so.6.0.0
7f2140a93000-7f2140a94000 rw-p 00002000 08:06 2364863                    /usr/lib64/libXau.so.6.0.0
7f2140a94000-7f2140a9b000 r-xp 00000000 08:06 549635                     /lib64/librt-2.17.so
7f2140a9b000-7f2140c9a000 ---p 00007000 08:06 549635                     /lib64/librt-2.17.so
7f2140c9a000-7f2140c9b000 r--p 00006000 08:06 549635                     /lib64/librt-2.17.so
7f2140c9b000-7f2140c9c000 rw-p 00007000 08:06 549635                     /lib64/librt-2.17.so
7f2140c9c000-7f2140c9d000 rw-p 00000000 00:00 0
7f2140c9d000-7f2140cc1000 r-xp 00000000 08:06 2377887                    /usr/lib64/libxcb.so.1.1.0
7f2140cc1000-7f2140cc2000 rw-p 00023000 08:06 2377887                    /usr/lib64/libxcb.so.1.1.0
7f2140cc2000-7f2140cd8000 r-xp 00000000 08:06 526983                     /lib64/libz.so.1.2.8
7f2140cd8000-7f2140cd9000 rw-p 00015000 08:06 526983                     /lib64/libz.so.1.2.8
7f2140cd9000-7f2140cda000 rw-p 00000000 00:00 0
7f2140cda000-7f2142720000 r-xp 00000000 08:06 2408162                    /usr/lib64/libnvidia-glcore.so.319.17
7f2142720000-7f214291f000 ---p 01a46000 08:06 2408162                    /usr/lib64/libnvidia-glcore.so.319.17
7f214291f000-7f2143216000 rwxp 01a45000 08:06 2408162                    /usr/lib64/libnvidia-glcore.so.319.17
7f2143216000-7f214322f000 rwxp 00000000 00:00 0
7f214322f000-7f2143232000 r-xp 00000000 08:06 2408163                    /usr/lib64/libnvidia-tls.so.319.17
7f2143232000-7f2143431000 ---p 00003000 08:06 2408163                    /usr/lib64/libnvidia-tls.so.319.17
7f2143431000-7f2143432000 rw-p 00002000 08:06 2408163                    /usr/lib64/libnvidia-tls.so.319.17
7f2143432000-7f2143447000 r-xp 00000000 08:06 410736                     /usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.0/libgcc_s.so.1
7f2143447000-7f2143448000 rw-p 00014000 08:06 410736                     /usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.0/libgcc_s.so.1
7f2143448000-7f2143449000 rw-p 00000000 00:00 0
7f2143449000-7f214354f000 r-xp 00000000 08:06 410745                     /usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.0/libstdc++.so.6.0.18
7f214354f000-7f2143550000 ---p 00106000 08:06 410745                     /usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.0/libstdc++.so.6.0.18
7f2143550000-7f214355a000 r--p 00106000 08:06 410745                     /usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.0/libstdc++.so.6.0.18
7f214355a000-7f214355b000 rw-p 00110000 08:06 410745                     /usr/lib64/gcc/x86_64-pc-linux-gnu/4.8.0/libstdc++.so.6.0.18
7f214355b000-7f214356f000 rw-p 00000000 00:00 0
7f214356f000-7f2143571000 r-xp 00000000 08:06 549625                     /lib64/libdl-2.17.so
7f2143571000-7f2143771000 ---p 00002000 08:06 549625                     /lib64/libdl-2.17.so
7f2143771000-7f2143772000 r--p 00002000 08:06 549625                     /lib64/libdl-2.17.so
7f2143772000-7f2143773000 rw-p 00003000 08:06 549625                     /lib64/libdl-2.17.so
7f2143773000-7f214378b000 r-xp 00000000 08:06 549637                     /lib64/libpthread-2.17.so
7f214378b000-7f214398a000 ---p 00018000 08:06 549637                     /lib64/libpthread-2.17.so
7f214398a000-7f214398b000 r--p 00017000 08:06 549637                     /lib64/libpthread-2.17.so
7f214398b000-7f214398c000 rw-p 00018000 08:06 549637                     /lib64/libpthread-2.17.so
7f214398c000-7f2143990000 rw-p 00000000 00:00 0
7f2143990000-7f2143b32000 r-xp 00000000 08:06 549652                     /lib64/libc-2.17.so
7f2143b32000-7f2143d32000 ---p 001a2000 08:06 549652                     /lib64/libc-2.17.so
7f2143d32000-7f2143d36000 r--p 001a2000 08:06 549652                     /lib64/libc-2.17.so
7f2143d36000-7f2143d38000 rw-p 001a6000 08:06 549652                     /lib64/libc-2.17.so
7f2143d38000-7f2143d3c000 rw-p 00000000 00:00 0
7f2143d3c000-7f2143d83000 r-xp 00000000 08:06 2381260                    /usr/lib64/libdbus-1.so.3.7.3
7f2143d83000-7f2143d85000 rw-p 00046000 08:06 2381260                    /usr/lib64/libdbus-1.so.3.7.3
7f2143d85000-7f2143d91000 r-xp 00000000 08:06 2377542                    /usr/lib64/libconfig.so.9.1.3
7f2143d91000-7f2143d92000 rw-p 0000c000 08:06 2377542                    /usr/lib64/libconfig.so.9.1.3
7f2143d92000-7f2143edb000 r-xp 00000000 08:06 2413174                    /usr/lib64/libX11.so.6.3.0
7f2143edb000-7f2143ee1000 rw-p 00149000 08:06 2413174                    /usr/lib64/libX11.so.6.3.0
7f2143ee1000-7f2143fd6000 r-xp 00000000 08:06 549634                     /lib64/libm-2.17.so
7f2143fd6000-7f21441d5000 ---p 000f5000 08:06 549634                     /lib64/libm-2.17.so
7f21441d5000-7f21441d6000 r--p 000f4000 08:06 549634                     /lib64/libm-2.17.so
7f21441d6000-7f21441d7000 rw-p 000f5000 08:06 549634                     /lib64/libm-2.17.so
7f21441d7000-7f21442b0000 r-xp 00000000 08:06 2408159                    /usr/lib64/opengl/nvidia/lib/libGL.so.319.17
7f21442b0000-7f21444b0000 ---p 000d9000 08:06 2408159                    /usr/lib64/opengl/nvidia/lib/libGL.so.319.17
7f21444b0000-7f21444ef000 rwxp 000d9000 08:06 2408159                    /usr/lib64/opengl/nvidia/lib/libGL.so.319.17
7f21444ef000-7f2144505000 rwxp 00000000 00:00 0
7f2144505000-7f21446f6000 r-xp 00000000 08:06 5849                       /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so
7f21446f6000-7f214472f000 rw-p 001f1000 08:06 5849                       /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so
7f214472f000-7f2144735000 rw-p 00000000 00:00 0
7f2144735000-7f2144744000 r-xp 00000000 08:06 549650                     /lib64/ld-2.17.so
7f2144744000-7f2144745000 r-xp 0000f000 08:06 549650                     /lib64/ld-2.17.so
7f2144745000-7f2144757000 r-xp 00010000 08:06 549650                     /lib64/ld-2.17.so
7f2144757000-7f2144758000 rw-p 00000000 00:00 0
7f2144758000-7f214475e000 r-xp 00000000 08:06 2364348                    /usr/lib64/libsnappy.so.1.1.4
7f214475e000-7f214475f000 rw-p 00005000 08:06 2364348                    /usr/lib64/libsnappy.so.1.1.4
7f214475f000-7f2144760000 rw-p 00000000 00:00 0
7f2144760000-7f214476a000 r-xp 00000000 08:06 2375684                    /usr/lib64/libXrandr.so.2.2.0
7f214476a000-7f214476b000 rw-p 00009000 08:06 2375684                    /usr/lib64/libXrandr.so.2.2.0
7f214476b000-7f214477d000 r-xp 00000000 08:06 2384812                    /usr/lib64/libXext.so.6.4.0
7f214477d000-7f214477e000 rw-p 00012000 08:06 2384812                    /usr/lib64/libXext.so.6.4.0
7f214477e000-7f214477f000 rw-p 00000000 00:00 0
7f214477f000-7f2144789000 r-xp 00000000 08:06 2384833                    /usr/lib64/libXrender.so.1.3.0
7f2144789000-7f214478a000 rw-p 00009000 08:06 2384833                    /usr/lib64/libXrender.so.1.3.0
7f214478a000-7f214478f000 r-xp 00000000 08:06 2381875                    /usr/lib64/libXfixes.so.3.1.0
7f214478f000-7f2144790000 rw-p 00005000 08:06 2381875                    /usr/lib64/libXfixes.so.3.1.0
7f2144790000-7f2144791000 rw-p 00000000 00:00 0
7f2144791000-7f2144793000 r-xp 00000000 08:06 2385350                    /usr/lib64/libXdamage.so.1.1.0
7f2144793000-7f2144794000 rw-p 00001000 08:06 2385350                    /usr/lib64/libXdamage.so.1.1.0
7f2144794000-7f2144796000 r-xp 00000000 08:06 2375888                    /usr/lib64/libXcomposite.so.1.0.0
7f2144796000-7f2144797000 rw-p 00002000 08:06 2375888                    /usr/lib64/libXcomposite.so.1.0.0
7f2144797000-7f214489c000 r-xp 00000000 08:06 272807                     /usr/lib64/binutils/x86_64-pc-linux-gnu/2.23.1/libbfd-2.23.1.so
7f214489c000-7f21448b3000 rw-p 00105000 08:06 272807                     /usr/lib64/binutils/x86_64-pc-linux-gnu/2.23.1/libbfd-2.23.1.so
7f21448b3000-7f21448b8000 rw-p 00000000 00:00 0
7f21448b8000-7f214491c000 r-xp 00000000 08:06 526793                     /lib64/libpcre.so.1.2.0
7f214491c000-7f214491d000 rw-p 00064000 08:06 526793                     /lib64/libpcre.so.1.2.0
7f214491d000-7f214491e000 rw-p 00000000 00:00 0
7f214491e000-7f214491f000 rw-s 180812000 00:05 369                       /dev/nvidia0
7f214491f000-7f2144923000 rw-s 104536000 00:05 369                       /dev/nvidia0
7f2144923000-7f2144924000 rw-s efd60000 00:05 369                        /dev/nvidia0
7f2144924000-7f2144925000 rw-s efd60000 00:05 369                        /dev/nvidia0
7f2144925000-7f2144926000 rw-s f6641000 00:05 369                        /dev/nvidia0
7f2144926000-7f2144927000 rw-s 209865000 00:05 369                       /dev/nvidia0
7f2144927000-7f2144928000 rw-s 198671000 00:05 369                       /dev/nvidia0
7f2144928000-7f2144929000 rw-s f6060000 00:05 369                        /dev/nvidia0
7f2144929000-7f214493e000 rw-s 20f3e4000 00:05 369                       /dev/nvidia0
7f214493e000-7f2144956000 rw-p 00000000 00:00 0
7f2144956000-7f2144957000 r--p 00021000 08:06 549650                     /lib64/ld-2.17.so
7f2144957000-7f2144958000 rw-p 00022000 08:06 549650                     /lib64/ld-2.17.so
7f2144958000-7f2144959000 rw-p 00000000 00:00 0
7fff2a25c000-7fff2a2a1000 rw-p 00000000 00:00 0                          [stack]
7fff2a2df000-7fff2a2e0000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
apitrace: warning: caught signal 6
apitrace: flushing trace due to an exception
apitrace: warning: caught signal 11
apitrace: warning: recursion handling signal 11
apitrace: info: taking default action for signal 11

GDB backtrace:

(gdb) bt full
#0  0x00007ffff706e299 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 12084
        selftid = 12084
#1  0x00007ffff706f698 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7fffffffcf2a, sa_sigaction = 0x7fffffffcf2a},
          sa_mask = {__val = {6, 140737339059413, 2, 140737488342846, 2, 140737339048484, 1,
              140737339059409, 3, 140737488342820, 12, 140737339059413, 2, 140737488343760,
              140737488343760, 140737488345520}}, sa_flags = 11, sa_restorer = 0x7fffffffd4d0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff70ad0e5 in __libc_message (do_abort=do_abort@entry=2,
    fmt=fmt@entry=0x7ffff71a0f08 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffffffd9c0,
            reg_save_area = 0x7fffffffd8d0}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffffffd9c0,
            reg_save_area = 0x7fffffffd8d0}}
        fd = 17
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007ffff70b2ade in malloc_printerr (action=3,
    str=0x7ffff71a1010 "double free or corruption (!prev)", ptr=<optimized out>) at malloc.c:4902
        buf = "000000000114cd80"
        cp = <optimized out>
#4  0x00007ffff70b37d7 in _int_free (av=<optimized out>, p=0x114cd70, have_lock=0) at malloc.c:3758
        size = <optimized out>
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = <optimized out>
        locked = <optimized out>
        __func__ = "_int_free"
#5  0x000000000041ac02 in glx_bind_pixmap (ps=0x473790, pptex=0x5bb508, pixmap=52429331,
    width=331, height=402, depth=24) at src/opengl.c:572
        ptex = 0x5bda80
        need_release = false
        GLX_TEX_DEF = {texture = 0, glpixmap = 0, pixmap = 0, target = 0, width = 0, height = 0,
          depth = 0, y_inverted = false}
#6  0x00000000004120a1 in paint_bind_tex (ps=0x473790, ppaint=0x5bb4f8, wid=0, hei=0, depth=0,
    force=true) at src/compton.h:192
No locals.
#7  0x0000000000411388 in win_paint_win (ps=0x473790, w=0x5bb440, reg_paint=52429328,
    pcache_reg=0x7fffffffddd8) at src/compton.c:1428
        y = -8928
        pict = 0
        x = 32767
        wid = 0
        hei = 0
        dopacity = 0
#8  0x000000000040aaa0 in paint_all (ps=0x473790, region=52429327, region_real=52429327,
    t=0x5bb440) at src/compton.c:1753
        cache_reg = {rects = 0x5f0180, nrects = 1}
        w = 0x5bb440
        reg_paint = 52429328
        reg_tmp = 52429328
        reg_tmp2 = 52429330
#9  0x0000000000408ce3 in session_run (ps=0x473790) at src/compton.c:6614
        all_damage_orig = 0
        t = 0x5bb440
        paint = 0
#10 0x0000000000407dd6 in main (argc=6, argv=0x7fffffffe0e8) at src/compton.c:6670
        ps_old = 0x0
(gdb) frame 5
#5  0x000000000041ac02 in glx_bind_pixmap (ps=0x473790, pptex=0x5bb508, pixmap=52429331,
    width=331, height=402, depth=24) at src/opengl.c:572
572       ps->glXBindTexImageProc(ps->dpy, ptex->glpixmap, GLX_FRONT_LEFT_EXT, NULL);

Something wrong when calling glXBindTexImageEXT?

By the way, is it possible to strip certain calls out of an apitrace trace file or to edit it directly? I would prefer to be able to strip some texture render calls out to debug a painting region issue.

@jrfonseca:

Valgrind log, in addition:

apitrace: tracing to /home/richard/git/compton/compton.11.trace
==15507== Invalid write of size 2
==15507==    at 0x404CF9F: ??? (in /home/richard/.nvidia/gl02u5e6 (deleted))
==15507==    by 0xC050A5F: ???
==15507==    by 0xBC80EFF: ???
==15507==    by 0x7FEFFF66F: ???
==15507==    by 0x190: ???
==15507==    by 0x744C475: ??? (in /usr/lib64/libnvidia-glcore.so.319.17)
==15507==    by 0x10BFCF59F: ???
==15507==  Address 0xbc81212 is 0 bytes after a block of size 399,186 alloc'd
==15507==    at 0x4C2C60B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15507==    by 0x4FA8FFF: glXBindTexImageEXT (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x41AC01: glx_bind_pixmap (opengl.c:572)
==15507==    by 0x4120A0: paint_bind_tex (compton.h:192)
==15507==    by 0x411387: win_paint_win (compton.c:1428)
==15507==    by 0x40AA9F: paint_all (compton.c:1753)
==15507==    by 0x408CE2: session_run (compton.c:6614)
==15507==    by 0x407DD5: main (compton.c:6670)
==15507==
==15507== Invalid write of size 1
==15507==    at 0x404CFA3: ??? (in /home/richard/.nvidia/gl02u5e6 (deleted))
==15507==    by 0xC050A5F: ???
==15507==    by 0xBC80EFF: ???
==15507==    by 0x7FEFFF66F: ???
==15507==    by 0x190: ???
==15507==    by 0x744C475: ??? (in /usr/lib64/libnvidia-glcore.so.319.17)
==15507==    by 0x10BFCF59F: ???
==15507==  Address 0xbc81214 is 2 bytes after a block of size 399,186 alloc'd
==15507==    at 0x4C2C60B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15507==    by 0x4FA8FFF: glXBindTexImageEXT (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x41AC01: glx_bind_pixmap (opengl.c:572)
==15507==    by 0x4120A0: paint_bind_tex (compton.h:192)
==15507==    by 0x411387: win_paint_win (compton.c:1428)
==15507==    by 0x40AA9F: paint_all (compton.c:1753)
==15507==    by 0x408CE2: session_run (compton.c:6614)
==15507==    by 0x407DD5: main (compton.c:6670)
==15507==
==15507== Invalid write of size 2
==15507==    at 0x404CF9F: ??? (in /home/richard/.nvidia/gl02u5e6 (deleted))
==15507==    by 0xC050F8B: ???
==15507==    by 0xBC812E3: ???
==15507==    by 0x7FEFFF66F: ???
==15507==    by 0x191: ???
==15507==    by 0x744C475: ??? (in /usr/lib64/libnvidia-glcore.so.319.17)
==15507==    by 0x10BFCF59F: ???
==15507==  Address 0xbc812e4 is not stack'd, malloc'd or (recently) free'd
==15507==
==15507== Invalid write of size 1
==15507==    at 0x404CFA3: ??? (in /home/richard/.nvidia/gl02u5e6 (deleted))
==15507==    by 0xC050F8B: ???
==15507==    by 0xBC812E3: ???
==15507==    by 0x7FEFFF66F: ???
==15507==    by 0x191: ???
==15507==    by 0x744C475: ??? (in /usr/lib64/libnvidia-glcore.so.319.17)
==15507==    by 0x10BFCF59F: ???
==15507==  Address 0xbc812e6 is not stack'd, malloc'd or (recently) free'd
==15507==
==15507== Invalid read of size 1
==15507==    at 0x4C2E15F: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15507==    by 0x4FD0C34: ??? (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x4FCD571: ??? (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x4FA923F: glXBindTexImageEXT (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x41AC01: glx_bind_pixmap (opengl.c:572)
==15507==    by 0x4120A0: paint_bind_tex (compton.h:192)
==15507==    by 0x411387: win_paint_win (compton.c:1428)
==15507==    by 0x40AA9F: paint_all (compton.c:1753)
==15507==    by 0x408CE2: session_run (compton.c:6614)
==15507==    by 0x407DD5: main (compton.c:6670)
==15507==  Address 0xbc81212 is 0 bytes after a block of size 399,186 alloc'd
==15507==    at 0x4C2C60B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15507==    by 0x4FA8FFF: glXBindTexImageEXT (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x41AC01: glx_bind_pixmap (opengl.c:572)
==15507==    by 0x4120A0: paint_bind_tex (compton.h:192)
==15507==    by 0x411387: win_paint_win (compton.c:1428)
==15507==    by 0x40AA9F: paint_all (compton.c:1753)
==15507==    by 0x408CE2: session_run (compton.c:6614)
==15507==    by 0x407DD5: main (compton.c:6670)
==15507==
==15507== Invalid read of size 1
==15507==    at 0x4C2E150: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15507==    by 0x4FD0C34: ??? (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x4FCD571: ??? (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x4FA923F: glXBindTexImageEXT (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x41AC01: glx_bind_pixmap (opengl.c:572)
==15507==    by 0x4120A0: paint_bind_tex (compton.h:192)
==15507==    by 0x411387: win_paint_win (compton.c:1428)
==15507==    by 0x40AA9F: paint_all (compton.c:1753)
==15507==    by 0x408CE2: session_run (compton.c:6614)
==15507==    by 0x407DD5: main (compton.c:6670)
==15507==  Address 0xbc81213 is 1 bytes after a block of size 399,186 alloc'd
==15507==    at 0x4C2C60B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15507==    by 0x4FA8FFF: glXBindTexImageEXT (in /usr/lib64/x86_64-linux-gnu/apitrace/wrappers/glxtrace.so)
==15507==    by 0x41AC01: glx_bind_pixmap (opengl.c:572)
==15507==    by 0x4120A0: paint_bind_tex (compton.h:192)
==15507==    by 0x411387: win_paint_win (compton.c:1428)
==15507==    by 0x40AA9F: paint_all (compton.c:1753)
==15507==    by 0x408CE2: session_run (compton.c:6614)
==15507==    by 0x407DD5: main (compton.c:6670)
==15507==
--15507-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--15507-- si_code=80;  Faulting address: 0x0;  sp: 0x402895de0

Something wrong when calling glXBindTexImageEXT?

Should be fixed with https://github.com/apitrace/apitrace/commit/d87aabdd0be10c2765c4b4e84508257b8e45a642

BTW, when you profile with apitrace, note that the fake TexImage2D calls might bias the results a little.

By the way, is it possible to strip certain calls out of an apitrace trace file or to edit it directly? I would prefer to be able to strip some texture render calls out to debug a painting region issue.

There is an apitrace trim command that lets you prune calls, manually or automatically. qapitrace also lets you edit arguments (though not add or remove calls).

Also, apitrace does not retrace GLX calls literally: it always emulates them, so that traces can be replayed on different OSes. So it won't be particularly useful for profiling or debugging GLX calls, in particular the GLX_OML_swap_method/GLX_EXT_buffer_age extensions. It should still be useful for debugging/profiling the GL calls, though.

We use OpenGL fixed-pipeline functions almost exclusively, that legacy OpenGL 1/2 stuff, just to keep compatibility with older cards. We use them for color inversion, blending, painting textures with opacity, etc. Are those fixed-pipeline functions significantly slower than modern (GLSL-based) solutions? (Okay, ARB assembly might be faster but I don't want to touch those nasty things...)

Immediate mode (glBegin/glEnd) is very inefficient. You can save CPU overhead by using vertex arrays. But the ideal is really to use VBOs.

Presently, our window background blur is done this way: create a new texture, use glCopyTexSubImage2D() to read the current back buffer into the texture, then paint this texture back to the back buffer with a GLSL blurring shader. Are there any faster ways?

The way blur is typically done (e.g., Windows 7 Aero, XAML, etc) is to decompose the 2D blur into two 1D blur operations, using one texture as intermediate:

first you blur along one direction, into the temporary texture
then you blur along the orthogonal direction

You could store the blurred result in a 2nd temporary texture if you're confident it won't change frequently

Texture samples are expensive, so you should choose the smallest number that gives you the intended effect. Instead of using more texture samples for a wider kernel, simply use the same number of texture samples but sample them farther apart.

Also, you should avoid the divide inside the shader -- instead put the inverse in the uniform and multiply against that.
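As a rough illustration of the last point, the normalization factor can be computed once on the CPU and handed to the shader as a uniform, so the shader only multiplies. This is a minimal sketch, not compton's actual code: the function name kernel_inv_sum is hypothetical, and it assumes the kernel weights are available as a plain float array.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum the convolution kernel weights on the CPU and return 1/sum.
 * The result is meant to be uploaded once as a uniform, so the shader
 * can write "color * inv_sum" instead of "color / sum".
 * (kernel_inv_sum is a hypothetical helper, not part of compton.)
 */
float kernel_inv_sum(const float *kernel, size_t n) {
    assert(kernel != NULL && n > 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += kernel[i];
    assert(sum != 0.0f); /* an all-zero kernel cannot be normalized */
    return 1.0f / sum;
}
```

The returned value would then be set with a call such as glUniform1f() whenever the kernel changes, rather than being recomputed per fragment.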

@jrfonseca:

Sorry for the latency. I took quite a bit of time testing.

Should be fixed with https://github.com/apitrace/apitrace/commit/d87aabdd0be10c2765c4b4e84508257b8e45a642

Brilliant! I could confirm it works here during my brief test.

There are a few small issues that don't hurt much:

1. Binding RGBA textures seemingly doesn't work here (nvidia-drivers-319.17 + GTX 670): they always appear purely white. I'm using apitrace/apitrace@33da20b484. I should upload a trace file, but it's a bit too large.
2. qapitrace displays GL_INVALID_ENUM errors for the generated fake glTexImage2D() calls for glXBindTexImageEXT() here during replay. Again, unfortunately the trace file is a bit too large, but you might be able to reproduce this with compton.
3. During replay I found the bottom part is either not painted or not painted correctly. Because of the window decoration and other things, the bottom part of the glretrace window is off-screen here, and nvidia-drivers probably doesn't bother to paint that part. One piece of evidence is that when I disable window decoration on the glretrace window, the broken part shrinks. Not a big deal, and I could use some workaround here, but maybe others will find this annoying.

BTW, when you profile with apitrace, note that the fake TexImage2D calls might bias the results a little.

I see. Thanks. :-)

There is an apitrace trim command that lets you prune calls, manually or automatically. qapitrace also lets you edit arguments (though not add or remove calls).

Oh, when I run "Trim" from qapitrace I get no option to select, so I thought trimming is always automatic. Silly me. :-D

Also, apitrace does not retrace GLX calls literally: it always emulates them, so that traces can be replayed on different OSes. So it won't be particularly useful for profiling or debugging GLX calls, in particular the GLX_OML_swap_method/GLX_EXT_buffer_age extensions. It should still be useful for debugging/profiling the GL calls, though.

Ah, I see. :-)

Immediate mode (glBegin/glEnd) is very inefficient. You can save CPU overhead by using vertex arrays. But the ideal is really to use VBOs.

Ah, thanks for the info. :-) I guess I will need some extra work if I wish to keep compatibility with OpenGL 1.x while supporting VBOs in parallel, but I will do as much as I can. Presently I don't see CPU usage as a problem for compton, though.

By the way, is glBegin()/glEnd() still less efficient if the number of rectangles/triangles is rather small?

The way blur is typically done (e.g., Windows 7 Aero, XAML, etc) is to decompose the 2D blur into two 1D blur operations, using one texture as intermediate:

first you blur along one direction, into the temporary texture

then you blur along the orthogonal direction

I see. I implemented it in one-pass for simplicity, because I'm a pure novice on OpenGL. I will work on that later.

You could store the blurred result in a 2nd temporary texture if you're confident it won't change frequently

Huh, but my fear is that caching content will somehow be costly (particularly if I want to do this with our X Render backend). I guess we have a bit too many textures to predict whether caching specific painting results would be efficient, so I'm only trying to reduce the painting region, not caching any results.

Texture samples are expensive, so you should choose the smallest number that gives you the intended effect. Instead of using more texture samples for a wider kernel, simply use the same number of texture samples but sample them farther apart.

Sorry, but my knowledge about those image operations is close to nothing. You mean, like this convolution kernel?

Users are able to specify a convolution kernel itself, and the GLSL compiler should be able to optimize texture2D(XXX) * 0 calls out, so I suppose it's already possible now.

Also, you should avoid the divide inside the shader -- instead put the inverse in the uniform and multiply against that.

Ah, you mean division is slow on the GPU? Thanks for the info. :-) But we are using factor_center in two places in the shader, once for multiplying with the color of the center pixel, once for dividing the sum of colors. So if I want to get rid of the division in the shader I would have to use two uniforms. Is that going to be slower?

Thanks for letting me know of the failures. I'll investigate them when I find a bit of time to look at this again.

My comments on blur apply only to GL. With CPU things would be done slightly

Sorry, but my knowledge about those image operations is close to nothing. You mean, like this convolution kernel?

1 1 1 1 1
1 0 0 0 1
1 0 1 0 1
1 0 0 0 1
1 1 1 1 1

Yeah. Take this small kernel:

1 2 1
2 4 2
1 2 1

You'd implement it by applying a horizontal convolution filter

1 2 1

and then a vertical convolution filter

1
2
1

If you want a wider blur effect, you could use a longer vector:

1 2 4 2 1

but this is probably wasteful. Instead you keep the same coefficient vector, but sample the taps farther apart

1 _ 2 _ 1

effectively sampling with zero weights in between.
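To make the two-pass idea concrete, here is a CPU-side sketch in C. It is only an illustration under stated assumptions: the function names are hypothetical, the image is a single-channel float array, and edges are clamped. Applying 1 2 1 horizontally and then vertically is equivalent to one 3x3 pass whose weights are the outer product of the two vectors (1 2 1 / 2 4 2 / 1 2 1), and the step parameter gives the "1 _ 2 _ 1" spacing. In a GL implementation each pass would be a shader invocation writing to a texture instead of a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index into [0, n-1], so edge pixels are repeated. */
int clamp_index(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/*
 * One 1D pass of the 3-tap filter 1 2 1 over a w*h single-channel image.
 * (dx, dy) selects the direction: (1, 0) is horizontal, (0, 1) is vertical.
 * step is the distance between taps: 1 gives "1 2 1", 2 gives "1 _ 2 _ 1".
 * The result is normalized by multiplying with 1/4 (no division per pixel).
 */
void blur_pass_121(const float *src, float *dst, int w, int h,
                   int dx, int dy, int step) {
    assert(src != NULL && dst != NULL && src != dst);
    assert(w > 0 && h > 0 && step > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int xa = clamp_index(x - dx * step, w);
            int ya = clamp_index(y - dy * step, h);
            int xb = clamp_index(x + dx * step, w);
            int yb = clamp_index(y + dy * step, h);
            dst[y * w + x] = (src[ya * w + xa]
                              + 2.0f * src[y * w + x]
                              + src[yb * w + xb]) * 0.25f;
        }
    }
}

/* Two-pass blur: horizontal pass into tmp, then vertical pass into dst. */
void blur_two_pass(const float *src, float *tmp, float *dst,
                   int w, int h, int step) {
    blur_pass_121(src, tmp, w, h, 1, 0, step);
    blur_pass_121(tmp, dst, w, h, 0, 1, step);
}
```

Each output pixel costs 3 + 3 samples here instead of the 9 a single 3x3 pass would need, and a larger step widens the blur without adding samples.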

Users are able to specify a convolution kernel itself, and the GLSL compiler should be able to optimize texture2D(XXX) * 0 calls out, so I suppose it's already possible now.

Yes, in general GLSL compilers would optimize this out, but I'm not entirely sure they always do. You should double check with Mesa.

There are even further optimizations: Windows Aero leverages the GPU's built-in linear filters so that it does half the samples. Take the convolution vector

1 1 1 1

The trivial way of doing this is to sample with nearest filtering at position 0, 1, 2, and 3, and then multiply the sum by 0.25.

The optimal way is to sample with linear filtering at position 0.5 and 2.5, and then multiply sum by 0.5. It's quite tricky to get this right though.
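The equivalence of the two approaches can be checked on the CPU with a small C sketch. This only emulates what the GPU's nearest and linear filters would return; the function names are hypothetical, and positions are in the same texel-index units as above (texel i sits at position i). Real GL texture coordinates place texel centers at half-integer offsets, which is part of what makes this tricky to get right.

```c
#include <assert.h>
#include <stddef.h>

/* Emulated nearest filtering: read texel i directly. */
float sample_nearest(const float *tex, int n, int i) {
    assert(tex != NULL && i >= 0 && i < n);
    return tex[i];
}

/*
 * Emulated linear filtering at position p (texel-index units):
 * blends the two neighbouring texels by the fractional part of p.
 */
float sample_linear(const float *tex, int n, float p) {
    assert(tex != NULL && p >= 0.0f && p <= (float)(n - 1));
    int i0 = (int)p;                     /* p >= 0, so truncation is floor */
    int i1 = i0 + 1 < n ? i0 + 1 : i0;   /* clamp at the right edge */
    float f = p - (float)i0;
    return tex[i0] * (1.0f - f) + tex[i1] * f;
}

/* Box filter 1 1 1 1 with four nearest samples, scaled by 0.25. */
float box4_nearest(const float *tex, int n) {
    return (sample_nearest(tex, n, 0) + sample_nearest(tex, n, 1) +
            sample_nearest(tex, n, 2) + sample_nearest(tex, n, 3)) * 0.25f;
}

/* Same filter with two linear samples at 0.5 and 2.5, scaled by 0.5. */
float box4_linear(const float *tex, int n) {
    return (sample_linear(tex, n, 0.5f) + sample_linear(tex, n, 2.5f)) * 0.5f;
}
```

Each linear sample at a half position already averages its two neighbours, which is why two samples and a factor of 0.5 reproduce the four-sample result.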

Yeah. Take this small kernel:
1 2 1
2 4 2
1 2 1
You'd implement it by applying a horizontal convolution filter
1 2 1
and then a vertical convolution filter
1
2
1
If you want a wider blur effect, you could use a longer vector:
1 2 4 2 1
but this is probably wasteful. Instead you keep the same coefficient vector, but sample the taps farther apart
1 _ 2 _ 1
effectively sampling with zero weights in between.

I see, thanks. :-) Let me see if I could implement this.

Yes, in general GLSL compilers would optimize this out, but I'm not entirely sure they always do. You should double check with Mesa.

Oh, okay, I will try to handle this in shader generation, then.

There are even further optimizations: Windows Aero leverages the GPU's built-in linear filters so that it does half the samples. Take the convolution vector

1 1 1 1

The trivial way of doing this is to sample with nearest filtering at position 0, 1, 2, and 3, and then multiply the sum by 0.25.

The optimal way is to sample with linear filtering at position 0.5 and 2.5, and then multiply sum by 0.5. It's quite tricky to get this right though.

Looks like a cool plan! :-) I will try later.

There are a few small issues that don't hurt much:

I fixed issues 1-2.

During replay I found the bottom part is either not painted or not painted correctly. Because of the window decoration and other things, the bottom part of the glretrace window is off-screen here, and nvidia-drivers probably doesn't bother to paint that part. One piece of evidence is that when I disable window decoration on the glretrace window, the broken part shrinks. Not a big deal, and I could use some workaround here, but maybe others will find this annoying.

Seems to be an NVIDIA-specific issue. It doesn't happen with Mesa's llvmpipe. I'm surprised I haven't seen this before. I don't think there is any quick fix. Probably the only way would be to use GLX pbuffers instead of windows when dumping state.

BTW, make NO_LIBCONFIG=1 fails. This fixes it:

diff --git a/src/compton.c b/src/compton.c
index 1111c19..59605ef 100644
--- a/src/compton.c
+++ b/src/compton.c
@@ -4468,6 +4468,7 @@ open_config_file(char *cpath, char **ppath) {

   return NULL;
 }
+#endif

 /**
  * Parse a floating-point number in matrix.
@@ -4597,6 +4598,7 @@ parse_conv_kern(session_t *ps, const char *src) {
   return parse_matrix(ps, src);
 }

+#ifdef CONFIG_LIBCONFIG
 /**
  * Parse a condition list in configuration file.
  */

I fixed issues 1-2.

Confirmed working. That's quick! :-)

Seems to be an NVIDIA-specific issue. It doesn't happen with Mesa's llvmpipe. I'm surprised I haven't seen this before. I don't think there is any quick fix. Probably the only way would be to use GLX pbuffers instead of windows when dumping state.

Oh, probably. Anyway, it isn't too important. Don't worry.

BTW, make NO_LIBCONFIG=1 fails. This fixes it:

Ah, thanks. Yesterday Spaulding reported the issue to me and it has been fixed in richardgv-dev branch, though.

Heh, sorry for the late reply. Still busy with my thesis. Despite all my efforts, my first implementation of multi-pass blur seemingly introduces a roughly 7% drop in performance here: a 3x3 box blur takes 10.35 seconds to paint 10,000 frames, while a two-pass 1x3 + 3x1 blur takes 11.15 seconds. (The good thing is that my texture reuse seemingly brings a 10% performance boost for blur in general, though.) I have no idea what's wrong with this. Probably VBO will change something, let me try later...

The multi-pass blur code is now in the richardgv-dev branch. I did some further tests. It brings a performance boost for 3x3 box blur with the X Render backend (but benchmark results for the X Render backend may not be accurate). For the GLX backend, it makes 3x3 box blur about 12% slower, but for 5x5 box blur it brings a 9% performance boost.

@jrfonseca:

(Continuing the replies above...)

I just tried to implement drawing with GL_ARB_vertex_buffer_object, and seemingly it's causing a major performance degradation (~40%). I tried conditionally applying it only to cases where we need to paint more than 4 rectangles, and it results in an almost negligible performance improvement (1%?). Probably it's because of some issues in my implementation or because of the uncacheable nature of painting regions. Callgrind shows glBufferDataARB() takes 13% of the total time and glDeleteBuffersARB() takes 27%, pretty much intolerable in my eyes. Huh, possibly I will give up supporting VBO.

Callgrind shows glBufferDataARB() takes 13% of the total time and glDeleteBuffersARB() takes 27%, pretty much intolerable in my eyes. Huh, possibly I will give up supporting VBO.

Creation (glBufferDataARB) and destruction tend to be heavyweight operations, and must be avoided. The ideal is to create, fill once, and then re-use (glDraw) many times. The OpenGL driver will keep the data inside GPU memory where it can be quickly used.

In your case you'll likely need to modify the vertices now and then (e.g., when windows move), so you might need to use the techniques described in http://www.opengl.org/wiki/Buffer_Object_Streaming .
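One commonly described streaming approach is to allocate a single buffer up front, append each frame's vertices at an increasing offset, and re-specify ("orphan") the buffer only when it fills up. The C sketch below covers just the offset bookkeeping, under that assumption; the stream_buf_t type and stream_reserve function are hypothetical and make no GL calls themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one long-lived streaming vertex buffer. */
typedef struct {
    size_t capacity; /* total bytes allocated once, up front */
    size_t head;     /* first byte not yet handed out */
} stream_buf_t;

/*
 * Reserve `size` bytes for new vertex data and return their byte offset.
 * If the request does not fit in the remaining space, wrap to offset 0
 * and set *orphan to 1, telling the caller to re-specify the buffer
 * storage before writing; otherwise *orphan is set to 0.
 */
size_t stream_reserve(stream_buf_t *sb, size_t size, int *orphan) {
    assert(sb != NULL && orphan != NULL);
    assert(size > 0 && size <= sb->capacity);
    *orphan = 0;
    if (sb->head + size > sb->capacity) {
        sb->head = 0;
        *orphan = 1;
    }
    size_t offset = sb->head;
    sb->head += size;
    return offset;
}
```

When *orphan is set, the caller would typically call glBufferDataARB() with a NULL data pointer and the same size and usage hint to get fresh storage, then upload at the returned offset with glBufferSubDataARB(). That way no buffer object is created or deleted per frame.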

If you use glDrawArrays without a VBO, then the OpenGL driver will do something like this internally. It will send the data all the time (even when it doesn't change), but it is still better than glBegin/glEnd.
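For the plain vertex array route, the CPU side mostly consists of packing the rectangles to paint into one contiguous float array. The sketch below shows that step only; the rect_t type and rects_to_quads function are hypothetical stand-ins, not compton's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rectangle type: top-left corner plus size. */
typedef struct {
    float x, y, w, h;
} rect_t;

/*
 * Write the four corners of each rectangle as (x, y) pairs into `out`,
 * in the order expected for GL_QUADS. `out_floats` is the capacity of
 * `out` in floats; 8 floats are needed per rectangle.
 * Returns the number of vertices written.
 */
size_t rects_to_quads(const rect_t *rects, size_t nrects,
                      float *out, size_t out_floats) {
    assert(rects != NULL && out != NULL);
    assert(out_floats >= nrects * 8);
    size_t k = 0;
    for (size_t i = 0; i < nrects; i++) {
        float x0 = rects[i].x, y0 = rects[i].y;
        float x1 = x0 + rects[i].w, y1 = y0 + rects[i].h;
        out[k++] = x0; out[k++] = y0; /* top-left */
        out[k++] = x1; out[k++] = y0; /* top-right */
        out[k++] = x1; out[k++] = y1; /* bottom-right */
        out[k++] = x0; out[k++] = y1; /* bottom-left */
    }
    return k / 2;
}
```

The array would then be handed to GL with glEnableClientState(GL_VERTEX_ARRAY), glVertexPointer(2, GL_FLOAT, 0, out) and a single glDrawArrays(GL_QUADS, 0, nverts) call, replacing one glBegin/glEnd pair per rectangle. Texture coordinates would need a second array filled the same way.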

Creation (glBufferDataARB) and destruction tend to be heavyweight operations, and must be avoided. The ideal is to create, fill once, and then re-use (glDraw) many times. The OpenGL driver will keep the data inside GPU memory where it can be quickly used. In your case you'll likely need to modify the vertices now and then (e.g., when windows move), so you might need to use the techniques described in http://www.opengl.org/wiki/Buffer_Object_Streaming .

Thanks for your guidance. :-) Yes, I know that fill-once-and-reuse would be ideal, yet if the user is using --glx-no-stencil combined with --glx-use-copysubbuffermesa or --glx-swap-method (which usually brings the best performance right now), the painting region for each window texture is dependent on the full damaged region, which would be constantly changing even if windows are not being moved... I could do caching in other cases, but I'm afraid that isn't going to be too useful.

Well, I tried implementing this with a shared GL_STREAM_DRAW_ARB VBO, updating the buffer data on each paint with glBufferSubDataARB(). Probably because I made some mistakes again, this time it brings something like a 20% decrease in performance. Callgrind shows glBufferSubDataARB() takes about 10% of the time, while glDrawArrays() takes 44%. Huh... What should I do now?

If you use glDrawArrays without a VBO, then the OpenGL driver will do something like this internally. It will send the data all the time (even when it doesn't change), but it is still better than glBegin/glEnd.

My ugly implementation using vertex array shows a 2% slowdown... Well, maybe I did something stupid in code again.

Update: Here's how I implemented this: compton-varray.patch uses a vertex array while compton-vbo-stream.patch uses a stream VBO. The code is slightly broken because I'm only testing performance right now: https://gist.github.com/richardgv/5627338

chjj / compton

Use GLX_EXT_buffer_age and/or GLX_OML_swap_method #107