Xpra-org / xpra

Persistent remote applications for X11; screen sharing for X11, MacOS and MSWindows.
https://xpra.org/
GNU General Public License v2.0
1.98k stars 169 forks source link

ghost windows #258

Closed totaam closed 11 years ago

totaam commented 11 years ago

Issue migrated from trac ticket # 258

component: core | priority: critical | resolution: fixed

2013-02-13 10:24:17: antoine created the issue


I can reproduce it in kmail by moving the mouse cursor in a circle over the list of emails long enough. Previously this also happened with "gtkperf -a".

Related info:

  • r2688 tried to fix this problem already by undoing parts of r2281, but this is apparently not enough or not the real/unique source of the problem
  • r2281 caused problems when exceptions would abort parts of the cleanup function
  • windows info (corral, client, widget, ..)
  • 232: "deal with damage request storms"

  • 224: "synchronized X11 calls and performance"

  • r2304 caches window models for failed windows so we avoid trying to re-create them too many times
  • r2291: more OR window fixes
  • r2261: detecting dead OR windows

Clues:

  • this seems to be a race which is not always easy to trigger
  • occurs on override-redirect windows only (though I think this may just be a coincidence)
  • happens when the windows are destroyed very quickly after being shown

What may be happening (since this problem has been reported a while back and is still not fixed - this may well be wrong):

  • despite r2688, we fail to cleanup the window properly (unlikely for OR windows..)
  • we somehow miss the "unmap" event and therefore do not even try to cleanup the window
  • because of damage batching or other X11 calls, we may process calls on this window when it is already gone, maybe the "window is gone" error/event gets swallowed up as a generic XError then - if that's the case, we're in trouble!

What to do next:

  • finding a way to reproduce this reliably would really help - especially to ensure this bug is properly fixed, not just pushing the race deeper: maybe write some gtk code to do what gtkperf -a does with OR windows? (smaller/faster windows?), maybe force-enable batching?
  • maybe add some debug so we catch all the BadWindow or BadDamage calls and see if these correlate with the problem
totaam commented 11 years ago

2013-02-13 14:23:33: onlyjob commented


Probably this is a regression since it is not reproducible in 0.7.8

totaam commented 11 years ago

2013-02-13 15:46:00: onlyjob commented


I was looking for application where we could reproduce this issue easier and found filelight where I see a problem that is not exactly the same but might be related.

In filelight when I move mouse over the folders there is a semi-transparent rectangle hover-tooltip with folder name, size and number of files.

If filelight is running within Xpra session there are number of problems with popup tooltips:

  • They are not transparent (feature test ?)

  • They are often incompletely removed/drawn (another manifestation of #252 ?)

  • Soon enough they are stop displaying completely (bug).

When those popups are displayed the following appears in server log:

2013-02-14 02:40:57,084 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:57,210 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:57,683 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:58,093 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:58,121 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628

Just before those tooltips disappear completely the following appears in log (and the above log records are stop appearing):

2013-02-14 02:41:03,811 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
2013-02-14 02:41:03,966 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
2013-02-14 02:41:06,098 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
totaam commented 11 years ago

2013-02-13 22:37:49: onlyjob commented


I tried to reproduce by connecting to 0.8.3 from client 0.7.8: instead of expected "sticky" indestructible popup window on client side the server-side Xpra crashed:

found large packet (7874 bytes): new-window, argument types:[<type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'dict'>, <type 
'dict'>], sizes: [1, 1, 2, 4, 3, 22146, 2], packet head=['new-window', 1, 0, 63, 1280, 737, {'size-constraints': {'minimum-size': (307, 293)}, 'window-type': ['_NET_WM_WINDOW_T
YPE_NORMAL'], 'modal': False, 'title': 'XXXXXX@gmail.com/@FSF \xe2\x80\x93 KMail', 'class-instance': ['kmail', 'Kmail'], 'client-machine': 'debstor', 'pid': 12610, 'icon': ....
totaam commented 11 years ago

2013-02-13 22:46:23: onlyjob commented


Unreproducible with Xpra-0.8.4 on client side and 0.7.8 on server-side.

totaam commented 11 years ago

2013-02-14 02:10:51: onlyjob changed priority from major to critical

totaam commented 11 years ago

2013-02-14 02:10:51: onlyjob commented


Further testing revealed that it is possible to crash Xpra-server 0.8.4 from client 0.8.4 by circular mouse motion over list of emails in kmail.

r2281 is related to the problem. After hours of trying miscellaneous patches to partially undo r2281 I ended up with "ghost-popup.patch" (attached):

--- a/wimpiggy/composite.py
+++ b/wimpiggy/composite.py
@@ -69,8 +69,9 @@
             trap.swallow_synced(xcomposite_unredirect_window, self._window)
         if self._damage_handle:
             trap.swallow_synced(xdamage_stop, self._window, self._damage_handle)
             self._damage_handle = None
+            self._contents_handle = None        # needed to avoid crash
         self._window = None
.
     def acknowledge_changes(self):
         if self._damage_handle is not None and self._window is not None:

I have no idea how it work, but it eliminate server-side crashes with client-0.8.4 and reduces probability of indestructible ghost-popups in kmail. Unfortunately even with this patch vigorous testing still reproduces ghost-popups even though it is harder. Also patch does not help to avoid server-side crash (0.8.4) with client 0.7.8.

I'm raising this ticket's priority as server-side crashes is a serious issue.

totaam commented 11 years ago

2013-02-14 02:11:47: onlyjob uploaded file ghost-popup.patch (0.8 KiB)

totaam commented 11 years ago

2013-02-14 02:14:11: onlyjob commented


Apparently patch have no effect on filelight behaviour.

totaam commented 11 years ago

2013-02-14 02:54:20: antoine commented


Hang on, I thought the server-side crashes from comment:3 were with a 0.7.8 server? Or did you apply the patch in comment:5 on top of 0.7.x? I really don't see how it would make any difference going from 0.8.3 to 0.8.4 - sounds like you just got (un)lucky there. It looks like a separate issue to me, one for a new ticket. How does this crash manifest itself? Anything in the logs? The sample in comment:3 is just a warning message (it shouldn't be there - but still), not a fatal error. Does it reject the client connection or actually crash the server? Do you have more complete log messages, or even better, a gdb backtrace?

[[BR]]

As for the composite change, it's an odd one: invalidate_pixmap (which does clear the self._contents_handle) is already called from destroy and this is all done from the UI thread, so I don't see how it could be anything but None already... That is, unless:

  • a race: if we somehow call one of these methods from a non-UI thread (unlikely but will check)
  • the trap.swallow_synced X11 calls end up calling the gtk code (I didn't think was the case - but maybe it is)

The only way this would help is if we use window.get_property("client-contents") afterwards, and the only place we do this is from the UI thread (in window_source.process_damage_region), so by that point, the window.do_unmanaged function should have finished (since it also runs from the UI thread) and the reference to the composite window should be gone...

If anything it should be a call to self._cleanup_listening(), and indented to run in all cases (or maybe even keep both invalidate_pixmap calls since they're cheap):

--- src/wimpiggy/composite.py (revision 2715)
+++ src/wimpiggy/composite.py (working copy)
@@ -64,12 +64,12 @@
             log.warn("composite window %s already destroyed!", self)
             return
         remove_event_receiver(self._window, self)
-        self.invalidate_pixmap()
         if not self._already_composited:
             trap.swallow_synced(xcomposite_unredirect_window, self._window)
         if self._damage_handle:
             trap.swallow_synced(xdamage_stop, self._window, self._damage_handle)
             self._damage_handle = None
+        self.invalidate_pixmap()
         self._window = None

     def acknowledge_changes(self):

I would also consider adding this to do_get_property_contents_handle to prevent trying to get the pixels once we've destroyed the reference to the window (although, like I said above, this should never happen):

    def do_get_property_contents_handle(self, name):
        if self._window is None:
            self._contents_handle = None
            return None
totaam commented 11 years ago

2013-02-14 03:18:20: onlyjob commented


No, Server was always 0.8.4 (except for comment where I didn't upgrade it from 0.8.3 yet). Why would I test old server?

I don't know python and I don't understand the code so I have no idea how it work. However test results are reliable.

As for crash, the comment 3 already have fragment from server's log. How does it manifest? I move mouse over list of emails in kmail for a little while and server dies (doesn't respond to "attach" until "xpra upgrade"). It crashes all the time if client is 0.7.8. When client is 0.8.4 it doesn't crash when my patch is applied.

Unfortunately I can't provide GDB backtrace for this (yet). I'll try to think how can it can be done. My email (and therefore xpra server) is on the machine where I can't install *-dbg packages. I don't know how to reproduce in other applications but kmail. I need to build a dedicated test environment for this...

It looks like ghost-window make server very fragile so the crash itself perhaps is a standalone issue but somehow they are related...

totaam commented 11 years ago

2013-02-14 03:41:15: onlyjob commented


Last record in server log after crash is this:

The program 'xpra' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadWindow (invalid Window parameter)'.
  (Details: serial 301725 error_code 3 request_code 12 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
totaam commented 11 years ago

2013-02-14 16:07:48: antoine commented


My initial confusion was over what caused the crash, the log sample you had provided only showed the hello packet so I assumed this was a crash on connection - now we got that cleared!

Forgive me if I am being a bit slow, but are you saying that if the client is 0.7.8 then the patch does not prevent the crash? That would be quite odd, and the sign of a race.

A gdb backtrace would really help, even one without the debug symbols might give us a clue (better than nothing I guess).

totaam commented 11 years ago

2013-02-15 08:04:25: antoine commented


r2735 does something similar to your patch, but without being racy (I think), it also fixes an important bug (which should be backported to v0.7.x too)

This should improve things, though from what you're saying this may not solve the ghost window problem and/or crashes. Please let me know if this is a step in the right direction at least.. (and any kind of gdb bt would be awesome)

totaam commented 11 years ago

2013-02-15 08:06:17: onlyjob uploaded file xpra-0.8.4-backtrace.txt.xz (8.0 KiB)

gdk_x_error fired upon potentially problematic popup in kmail

totaam commented 11 years ago

2013-02-15 08:07:45: onlyjob commented


Please forgive me for providing incomplete information.

It is correct that patch did not prevent crash with client-0.7.8 connecting to server-0.8.4. However I think it could be that patch had little or no effect to stability of 0.8.4... because I don't have automated test the crash is hard to reproduce (it takes time) so it is possible that I just didn't try hard enough... Sometimes it takes longer to reproduce so I'm not convinced that patch actually fixes the problem.

Today I tried to get backtrace in "controlled environment" with *-gdb packages. The problem is that there is no backtrace on crash (application exit before gdb can get anything?) so I tried to set breakpoint to gdk_x_error but it fires too often. For what's it worth I'm attaching backtrace I've taken from interrupt on gdk_x_error that fired on popup, but it is not a crash backtrace.

I tried number of builds going down to r2276 and in all of them I can reproduce indestructible popups. Either regression was introduced earlier or maybe we just can't track it to one particular commit?

totaam commented 11 years ago

2013-02-15 08:46:14: onlyjob commented


By the way after crash Xpra cannot start on the same screen until I manually terminate process /usr/bin/Xorg-for-Xpra-:13... Can you detect this?

I tried xpra-0.8.4+r2735 but it crashed pretty quick so it probably didn't help. Again I couldn't get backtrace from crash (please advise); Attaching kmail popup's backtrace from breakpoint on gdk_x_error (I skipped number of breaks to let application start).

totaam commented 11 years ago

2013-02-15 08:47:25: onlyjob uploaded file xpra-0.8.4+r2275_backtrace.txt.xz (8.0 KiB)

break on gdk_x_error fired from potentially problematic popup in kmail

totaam commented 11 years ago

2013-02-15 08:48:41: onlyjob uploaded file xpra-0.8.4+r2735_backtrace.txt.xz (8.0 KiB)

break on gdk_x_error fired from potentially problematic popup in kmail

totaam commented 11 years ago

2013-02-15 08:50:25: onlyjob commented


Attachments xpra-0.8.4+r2275_backtrace.txt.xz​ and xpra-0.8.4+r2735_backtrace.txt.xz​ are same, sorry for typo in rev. number.

totaam commented 11 years ago

2013-02-15 09:25:08: antoine commented


I've looked at both stacktraces and although there are X errors there, these should not be fatal (xcomposite_unredirect_window can fail if the window is already destroyed I guess, we catch them using a trap.swallow call)

This probably isn't what is causing the server crash. So.. we need to find the real cause, and not having gdb to help complicates things - I don't understand why it wouldn't catch the crash. Can you rule out encoding issues by testing with png/rgb24 only? There was another report of an x264 crash today (#261) which makes me quite suspicious of the libav upgrade and the patching it requires to build in some cases.

totaam commented 11 years ago

2013-02-15 12:06:02: onlyjob commented


I'm not experienced with GDB so perhaps I don't know how to catch a crash... Your ideas are welcome... I'm following the procedure we discussed previously when we were troubleshooting another crash: gdb attach, set breakpoint etc.

I ruled out the encodings. I can reproduce at least with 3 of them: x264, png, rgb24; as well as with local mmap connection.

I've seen message in mail list and your answer. As you know I'm not testing your "official" packages. I'm very confident regarding our (Debian) packaging, libav and encodings. We just have a genuine but little understood problem...

totaam commented 11 years ago

2013-02-15 12:17:07: antoine changed status from new to accepted

totaam commented 11 years ago

2013-02-15 12:17:07: antoine changed owner from * to antoine*

totaam commented 11 years ago

2013-02-15 12:17:07: antoine commented


The gdb stuff is now [/wiki/Debugging#GettingaBacktrace documented here], you should use gdb python followed by attach $PID (with breakpoints as needed). Though the backtraces you provided looked good enough already.

You also said you could not start xpra again ("xpra upgrade" I assume) against the display after it has crashed? That would imply that the X11 server is left in an unusable state - which is a relatively hard thing to do. Do you have logs (-d all) for this case? Can you start an application against that display still? (ie: "DISPLAY=:NN xlsclients")

Maybe we're doing something totally illegal/buggy, causing the X11 server to crash. It would be worth trying other distros/versions to see if the problem is also present there (may help in narrowing things down).

totaam commented 11 years ago

2013-02-15 12:55:36: onlyjob commented


Yep, those GDB instructions is pretty much how I've taken those dumps.

Definitely xpra upgrade couldn't continue, but X was alive and well -- I could run apps or even xpra shadow and see that apps are still there working. Yes I could start new applications on that DISPLAY.

I'm attaching new xpra.log taken with "-d all". It is a complete log from start to crash where I reproduce the crash in less than a minute on 0.8.4+r2735.

totaam commented 11 years ago

2013-02-15 12:56:13: onlyjob uploaded file xpra-0.8.4+r2735_deblab-13.log.xz (39.4 KiB)

totaam commented 11 years ago

2013-02-15 12:59:01: onlyjob commented


The following was printed to terminal when Xpra crashed:

X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 10 (X_UnmapWindow)
  Resource id:  0x800009
X Error: RenderBadPicture (invalid Picture parameter) 173
  Extension:    149 (RENDER)
  Minor opcode: 7 (RenderFreePicture)
  Resource id:  0x44
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 18 (X_ChangeProperty)
  Resource id:  0x800009
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 4 (X_DestroyWindow)
  Resource id:  0x800009
totaam commented 11 years ago

2013-02-15 13:35:06: antoine commented


Good idea to use "xpra shadow" or "xpra screenshot" to verify.

0x800009 is 8388617 in decimal. Now if we grep for this we find:

dock_tray(8388617) window=<gtk.gdk.Window object at 0x25c6460 (GdkWindow at 0x252f480)>, \
    geometry=(0, 0, 2560, 1280, 24), visual.depth=24
dock_tray(8388617) setting tray properties
dock_tray(8388617) resizing and reparenting
dock_tray(8388617) new tray container window 4194356
do_wimpiggy_child_map_event(<AdHocStruct object, contents: \
   {'delivered_to': <gtk.gdk.Window object at 0x241c6e0 (GdkWindow at 0x2008360)>, \ 
    'send_event': 0, 'override_redirect': 1, \
    'window': <gtk.gdk.Window object at 0x25c6500 (GdkWindow at 0x252f5a0)>, \
    'serial': 983L, 'type': 19, \
    'display': <gtk.gdk.Display object at 0x2189eb0 (GdkDisplayX11 at 0x2278230)>}>)
2013-02-15 23:44:02,455 Discovered new override-redirect window: 4194356 (tray=8388617)

So this looks like something is complaining about the tray window disappearing (which happens when xpra crashes or is upgraded - this needs to be moved to a separate process, but that's for another ticket). I don't think the system tray has anything to do with this bug, but it's worth trying the server with "-no-system-tray" just to be sure.


Now, the log also shows that before the actual crash, all we have are "will process ui packet pointer-position". Now it could just be that this is all that was happening at the time, or that the UI thread is already dead at this point. Either way, we're none the wiser. Time for the big guns, if you set (as per Debugging):

  • XPRA_X11_LOG=1 and maybe also:
  • XPRA_X11_DEBUG=1 Before starting xpra, the server will produce a very verbose logfile, but we are only interested in what happens just before the crash (say, the last second or so). Hopefully there will be some insight in there, because at the moment I just don't know.
totaam commented 11 years ago

2013-02-16 03:50:40: onlyjob uploaded file deblab-13.log.xz (11.1 KiB)

XPRA_X11_LOG=1 XPRA_X11_DEBUG=1 xpra start :13 -d all --no-system-tray

totaam commented 11 years ago

2013-02-16 03:52:41: onlyjob commented


Thank you for debbugging hints and fantastic interpretation of existing data. I attached tail of crash log with last ~4 seconds. I hope it makes sense to you.

totaam commented 11 years ago

2013-02-16 04:33:26: antoine commented


Hmmm, no definitive answer in the logs - though there are some clues: cursors and damage are the events that stand out. For cursors, see the patch below, for damage I will have to review by hand...

Does this help at all? (I don't see why the call would trigger window errors since windows are involved directly with cursors... but you never know):

--- src/xpra/server.py    (revision 2737)
+++ src/xpra/server.py    (working copy)
@@ -209,7 +209,7 @@
         self.send_cursor_pending = False
         self.cursor_data = None
         def get_default_cursor():
-            self.default_cursor_data = get_cursor_image()
+            self.default_cursor_data = trap.call_synced(get_cursor_image)
             log("get_default_cursor=%s", self.default_cursor_data)
         trap.swallow_synced(get_default_cursor)
         self._wm.enableCursors(True)
@@ -387,7 +387,7 @@

     def send_cursor(self):
         self.send_cursor_pending = False
-        self.cursor_data = get_cursor_image()
+        self.cursor_data = trap.call_synced(get_cursor_image)
         if self.cursor_data:
             pixels = self.cursor_data[7]
             if self.default_cursor_data and pixels==self.default_cursor_data[7]:

(afterthought: an easier way to test if this is the cause of the problem is to disable cursors with "--no-cursors")

If it does not, and maybe even if it does and if you have time, can you try with the modified error.py as per [/wiki/Debugging#X11errors Debugging X11 errors]? (setting XPRA_LOG_ALL # True and XPRA_TRACE_ALLTrue)

totaam commented 11 years ago

2013-02-16 09:50:03: onlyjob commented


Reproduced crash with "--no-cursors". Will get more detailed log soon.

totaam commented 11 years ago

2013-02-16 10:34:30: onlyjob commented


I can't reproduce neither ghost-window nor crash with modified "error.py"... Looks like synchronous mode is preventing the problem.

Please advise if there is anything I could do to troubleshoot this...

Also I've noticed that I was able to recover from crash (prior to applying debugging error.py) using xpra upgrade... Could it be effect of r2735 that I have applied?

totaam commented 11 years ago

2013-02-16 10:51:42: antoine commented


Well, that's excellent news at last. It means the problem was caused by #224.

That makes it very easy to spot where I've mistakenly used unsynced calls as there are only a few left outside areas which don't need them (in window.py those are followed by synced calls before returning and therefore safe):

  • in server_base._process_mouse_common:
    trap.swallow_unsynced(self._move_pointer, pointer)

    I think this one is OK as I can't see how it would fail - but then again the X11 API might well be different from the GTK API.

  • same method in server.py looks very suspicious as it uses the window instance (which may have disappeared already and therefore cause an X11 error which we do not catch and crashes GTK later..) - I think this may well be the one.
  • trap.swallow(get_targets, targets) in clipboard_base.py - not our problem here, but this is the last instance of unspecified synced/unsynced call and should probably be changed for a synced call (not to self)

Please replace trap.swallow_unsynced by trap.swallow_synced and see if the problem is solved. Fingers crossed, and thank you very much for your time!

totaam commented 11 years ago

2013-02-16 12:25:59: antoine commented


The clipboard one was ok actually (though I've cleaned it up in r2738)

Even before getting any feedback, I am convinced that we should use a synced call for raise_and_move (well, at least the raise portion which requires the window to still exist), so r2739 now uses synced X11 calls.

Does this help?

totaam commented 11 years ago

2013-02-16 23:43:54: onlyjob commented


Yes it is OK now. :) :) r2739 is totally fixed it: I can't reproduce neither indestructible "ghost" popups nor crashes. Fantastic, I'm so relieved that this testing marathon is over. :) Finally the issue has gone, hopefully forever. :)

I also tested with xpra-0.7.8 as client and couldn't crash latest Xpra-server any more. Great job.

Issues related to filelight hasn't changed so they are unrelated: I've made a new ticket #262.

As far as I'm concerned this ticket can be closed. Does it qualify for 0.8.5 release? ;)

totaam commented 11 years ago

2013-02-17 01:58:05: antoine changed status from accepted to closed

totaam commented 11 years ago

2013-02-17 01:58:05: antoine changed resolution from * to fixed*

totaam commented 11 years ago

2013-02-17 01:58:05: antoine commented


Definitely qualifies for 0.8.5, will try to push this today.

Thanks again for your time.