This is probably a regression, since it is not reproducible in 0.7.8.
I was looking for an application where the issue is easier to reproduce and found filelight, where I see a problem that is not exactly the same but might be related. In filelight, when I move the mouse over the folders there is a semi-transparent rectangular hover tooltip showing the folder name, size and number of files. When filelight runs within an Xpra session there are a number of problems with the popup tooltips:
They are not transparent (feature test ?)
They are often incompletely removed/drawn (another manifestation of #252 ?)
Soon enough they stop being displayed completely (bug).
When those popups are displayed the following appears in server log:
2013-02-14 02:40:57,084 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:57,210 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:57,683 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:58,093 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:58,121 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
Just before those tooltips disappear completely, the following appears in the log (and the above log records stop appearing):
2013-02-14 02:41:03,811 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
2013-02-14 02:41:03,966 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
2013-02-14 02:41:06,098 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
I tried to reproduce this by connecting to 0.8.3 from client 0.7.8: instead of the expected "sticky" indestructible popup window on the client side, the server-side Xpra crashed:
found large packet (7874 bytes): new-window, argument types:[<type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'dict'>, <type 'dict'>], sizes: [1, 1, 2, 4, 3, 22146, 2], packet head=['new-window', 1, 0, 63, 1280, 737, {'size-constraints': {'minimum-size': (307, 293)}, 'window-type': ['_NET_WM_WINDOW_TYPE_NORMAL'], 'modal': False, 'title': 'XXXXXX@gmail.com/@FSF \xe2\x80\x93 KMail', 'class-instance': ['kmail', 'Kmail'], 'client-machine': 'debstor', 'pid': 12610, 'icon': ....
Unreproducible with Xpra-0.8.4 on client side and 0.7.8 on server-side.
Further testing revealed that it is possible to crash Xpra server 0.8.4 from client 0.8.4 by circular mouse motion over the list of emails in kmail.
r2281 is related to the problem. After hours of trying miscellaneous patches to partially undo r2281 I ended up with "ghost-popup.patch" (attached):
--- a/wimpiggy/composite.py
+++ b/wimpiggy/composite.py
@@ -69,8 +69,9 @@
             trap.swallow_synced(xcomposite_unredirect_window, self._window)
         if self._damage_handle:
             trap.swallow_synced(xdamage_stop, self._window, self._damage_handle)
             self._damage_handle = None
+            self._contents_handle = None # needed to avoid crash
         self._window = None
 
     def acknowledge_changes(self):
         if self._damage_handle is not None and self._window is not None:
I have no idea how it works, but it eliminates server-side crashes with client 0.8.4 and reduces the probability of indestructible ghost popups in kmail. Unfortunately, even with this patch vigorous testing still reproduces ghost popups, although it is harder. Also, the patch does not prevent the server-side crash (0.8.4) with client 0.7.8.
I'm raising this ticket's priority as server-side crashes are a serious issue.
ghost-popup.patch (0.8 KiB)
Apparently the patch has no effect on filelight behaviour.
Hang on, I thought the server-side crashes from comment:3 were with a 0.7.8 server? Or did you apply the patch in comment:5 on top of 0.7.x? I really don't see how it would make any difference going from 0.8.3 to 0.8.4 - sounds like you just got (un)lucky there. It looks like a separate issue to me, one for a new ticket. How does this crash manifest itself? Anything in the logs? The sample in comment:3 is just a warning message (it shouldn't be there - but still), not a fatal error. Does it reject the client connection or actually crash the server? Do you have more complete log messages, or even better, a gdb backtrace?
As for the composite change, it's an odd one: invalidate_pixmap (which does clear the self._contents_handle) is already called from destroy and this is all done from the UI thread, so I don't see how it could be anything but None already... That is, unless:
- a race: if we somehow call one of these methods from a non-UI thread (unlikely but will check)
- the trap.swallow_synced X11 calls end up calling the gtk code (I didn't think this was the case - but maybe it is)
The only way this would help is if we use window.get_property("client-contents") afterwards, and the only place we do this is from the UI thread (in window_source.process_damage_region), so by that point, the window.do_unmanaged function should have finished (since it also runs from the UI thread) and the reference to the composite window should be gone...
If anything it should be a call to self._cleanup_listening(), and indented to run in all cases (or maybe even keep both invalidate_pixmap calls since they're cheap):
--- src/wimpiggy/composite.py (revision 2715)
+++ src/wimpiggy/composite.py (working copy)
@@ -64,12 +64,12 @@
             log.warn("composite window %s already destroyed!", self)
             return
         remove_event_receiver(self._window, self)
-        self.invalidate_pixmap()
         if not self._already_composited:
             trap.swallow_synced(xcomposite_unredirect_window, self._window)
         if self._damage_handle:
             trap.swallow_synced(xdamage_stop, self._window, self._damage_handle)
             self._damage_handle = None
+        self.invalidate_pixmap()
         self._window = None
 
     def acknowledge_changes(self):
I would also consider adding this to do_get_property_contents_handle to prevent trying to get the pixels once we've destroyed the reference to the window (although, like I said above, this should never happen):
    def do_get_property_contents_handle(self, name):
        if self._window is None:
            self._contents_handle = None
            return None
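To make the intent of that guard concrete, here is a minimal, self-contained sketch. The class name CompositeHelperSketch and the placeholder pixmap tuple are hypothetical and are not the real wimpiggy code; only the method names come from the snippets above.

```python
class CompositeHelperSketch:
    """Hypothetical stand-in for the composite helper discussed above."""

    def __init__(self, window):
        self._window = window
        self._contents_handle = None

    def do_get_property_contents_handle(self, name):
        # Guard from the comment above: once the window reference is gone,
        # never hand out a (possibly stale) pixmap handle.
        if self._window is None:
            self._contents_handle = None
            return None
        if self._contents_handle is None:
            # Placeholder for the real pixmap lookup (not shown in the ticket).
            self._contents_handle = ("pixmap-for", self._window)
        return self._contents_handle

    def invalidate_pixmap(self):
        self._contents_handle = None

    def destroy(self):
        # Invalidate unconditionally, then drop the window reference.
        self.invalidate_pixmap()
        self._window = None


helper = CompositeHelperSketch(0x800009)
print(helper.do_get_property_contents_handle("contents-handle"))
helper.destroy()
print(helper.do_get_property_contents_handle("contents-handle"))
```

With the guard in place, a late caller gets None rather than a handle for a window that no longer exists, whichever code path cleared the window reference.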
No, the server was always 0.8.4 (except for the comment where I hadn't yet upgraded it from 0.8.3). Why would I test an old server?
I don't know Python and I don't understand the code, so I have no idea how it works. However, the test results are reliable.
As for the crash, comment 3 already has a fragment from the server's log. How does it manifest? I move the mouse over the list of emails in kmail for a little while and the server dies (it doesn't respond to "attach" until "xpra upgrade"). It crashes all the time if the client is 0.7.8. When the client is 0.8.4 it doesn't crash when my patch is applied.
Unfortunately I can't provide a GDB backtrace for this (yet). I'll try to think how it can be done. My email (and therefore the xpra server) is on a machine where I can't install *-dbg packages. I don't know how to reproduce this in any application other than kmail. I need to build a dedicated test environment for this...
It looks like the ghost window makes the server very fragile, so the crash itself is perhaps a standalone issue, but somehow they are related...
Last record in server log after crash is this:
The program 'xpra' received an X Window System error. This probably reflects a bug in the program. The error was 'BadWindow (invalid Window parameter)'. (Details: serial 301725 error_code 3 request_code 12 minor_code 0) (Note to programmers: normally, X errors are reported asynchronously; that is, you will receive the error a while after causing it. To debug your program, run it with the --sync command line option to change this behavior. You can then get a meaningful backtrace from your debugger if you break on the gdk_x_error() function.)
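For reference, the numeric fields in the Details line can be decoded by hand. A minimal sketch, assuming only the core X11 protocol numbering; the tables are deliberately partial and the name decode_x_error is made up for illustration (it is not part of xpra):

```python
import re

# Partial tables taken from the core X11 protocol numbering.
CORE_REQUESTS = {
    4: "DestroyWindow",
    8: "MapWindow",
    10: "UnmapWindow",
    12: "ConfigureWindow",
    18: "ChangeProperty",
}
CORE_ERRORS = {
    1: "BadRequest",
    2: "BadValue",
    3: "BadWindow",
    4: "BadPixmap",
    5: "BadAtom",
    9: "BadDrawable",
}

DETAILS_RE = re.compile(
    r"serial (\d+) error_code (\d+) request_code (\d+) minor_code (\d+)"
)


def decode_x_error(details):
    """Return a dict describing the GDK 'Details:' line, or None if absent."""
    match = DETAILS_RE.search(details)
    if match is None:
        return None
    serial, error_code, request_code, minor_code = map(int, match.groups())
    return {
        "serial": serial,
        "error": CORE_ERRORS.get(error_code, "unknown(%d)" % error_code),
        "request": CORE_REQUESTS.get(request_code, "unknown(%d)" % request_code),
        "minor_code": minor_code,
    }


print(decode_x_error(
    "(Details: serial 301725 error_code 3 request_code 12 minor_code 0)"
))
```

In the core protocol, request code 12 is ConfigureWindow, the request used to move, resize or restack a window, so this error points at a geometry or stacking call made on a window that no longer exists. That is only a hint, not proof of which caller issued it.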
My initial confusion was over what caused the crash, the log sample you had provided only showed the hello packet so I assumed this was a crash on connection - now we got that cleared!
Forgive me if I am being a bit slow, but are you saying that if the client is 0.7.8 then the patch does not prevent the crash? That would be quite odd, and the sign of a race.
A gdb backtrace would really help, even one without the debug symbols might give us a clue (better than nothing I guess).
r2735 does something similar to your patch, but without being racy (I think). It also fixes an important bug (which should be backported to v0.7.x too).
This should improve things, though from what you're saying this may not solve the ghost window problem and/or crashes. Please let me know if this is a step in the right direction at least... (and any kind of gdb bt would be awesome)
xpra-0.8.4-backtrace.txt.xz (8.0 KiB): gdk_x_error fired upon potentially problematic popup in kmail
Please forgive me for providing incomplete information.
It is correct that the patch did not prevent the crash with client 0.7.8 connecting to server 0.8.4. However, it could be that the patch had little or no effect on the stability of 0.8.4... Because I don't have an automated test, the crash is hard to reproduce (it takes time), so it is possible that I just didn't try hard enough... Sometimes it takes longer to reproduce, so I'm not convinced that the patch actually fixes the problem.
Today I tried to get a backtrace in a "controlled environment" with *-gdb packages. The problem is that there is no backtrace on crash (the application exits before gdb can get anything?), so I tried to set a breakpoint on gdk_x_error but it fires too often. For what it's worth, I'm attaching a backtrace taken from an interrupt on gdk_x_error that fired on a popup, but it is not a crash backtrace.
I tried a number of builds going down to r2276 and in all of them I can reproduce indestructible popups. Either the regression was introduced earlier or maybe we just can't track it to one particular commit?
By the way, after a crash Xpra cannot start on the same screen until I manually terminate the process /usr/bin/Xorg-for-Xpra-:13 ... Can you detect this?
I tried xpra-0.8.4+r2735 but it crashed pretty quickly, so it probably didn't help. Again I couldn't get a backtrace from the crash (please advise). Attaching the kmail popup's backtrace from a breakpoint on gdk_x_error (I skipped a number of breaks to let the application start).
xpra-0.8.4+r2275_backtrace.txt.xz (8.0 KiB): break on gdk_x_error fired from potentially problematic popup in kmail
xpra-0.8.4+r2735_backtrace.txt.xz (8.0 KiB): break on gdk_x_error fired from potentially problematic popup in kmail
Attachments xpra-0.8.4+r2275_backtrace.txt.xz and xpra-0.8.4+r2735_backtrace.txt.xz are the same; sorry for the typo in the revision number.
I've looked at both stacktraces and although there are X errors there, these should not be fatal (xcomposite_unredirect_window can fail if the window is already destroyed I guess, we catch them using a trap.swallow call).
This probably isn't what is causing the server crash. So... we need to find the real cause, and not having gdb to help complicates things - I don't understand why it wouldn't catch the crash. Can you rule out encoding issues by testing with png/rgb24 only? There was another report of an x264 crash today (#261) which makes me quite suspicious of the libav upgrade and the patching it requires to build in some cases.
I'm not experienced with GDB, so perhaps I don't know how to catch a crash... Your ideas are welcome... I'm following the procedure we discussed previously when we were troubleshooting another crash: gdb attach, set a breakpoint, etc.
I ruled out the encodings. I can reproduce with at least three of them: x264, png, rgb24; as well as with a local mmap connection.
I've seen the message on the mailing list and your answer. As you know I'm not testing your "official" packages. I'm very confident regarding our (Debian) packaging, libav and encodings. We just have a genuine but little understood problem...
The gdb stuff is now [/wiki/Debugging#GettingaBacktrace documented here], you should use gdb python followed by attach $PID (with breakpoints as needed). Though the backtraces you provided looked good enough already.
You also said you could not start xpra again ("xpra upgrade" I assume) against the display after it has crashed? That would imply that the X11 server is left in an unusable state - which is a relatively hard thing to do. Do you have logs (-d all) for this case? Can you still start an application against that display? (ie: "DISPLAY=:NN xlsclients")
Maybe we're doing something totally illegal/buggy, causing the X11 server to crash. It would be worth trying other distros/versions to see if the problem is also present there (may help in narrowing things down).
Yep, those GDB instructions are pretty much how I took those dumps.
xpra upgrade definitely couldn't continue, but X was alive and well -- I could run apps or even xpra shadow and see that the apps are still there and working. Yes, I could start new applications on that DISPLAY.
I'm attaching a new xpra.log taken with "-d all". It is a complete log from start to crash, where I reproduce the crash in less than a minute on 0.8.4+r2735.
xpra-0.8.4+r2735_deblab-13.log.xz (39.4 KiB)
The following was printed to the terminal when Xpra crashed:
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 10 (X_UnmapWindow)
  Resource id: 0x800009
X Error: RenderBadPicture (invalid Picture parameter) 173
  Extension: 149 (RENDER)
  Minor opcode: 7 (RenderFreePicture)
  Resource id: 0x44
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 18 (X_ChangeProperty)
  Resource id: 0x800009
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 4 (X_DestroyWindow)
  Resource id: 0x800009
Good idea to use "xpra shadow" or "xpra screenshot" to verify.
0x800009 is 8388617 in decimal. Now if we grep for this we find:
dock_tray(8388617) window=<gtk.gdk.Window object at 0x25c6460 (GdkWindow at 0x252f480)>, \
    geometry=(0, 0, 2560, 1280, 24), visual.depth=24
dock_tray(8388617) setting tray properties
dock_tray(8388617) resizing and reparenting
dock_tray(8388617) new tray container window 4194356
do_wimpiggy_child_map_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x241c6e0 (GdkWindow at 0x2008360)>, \
    'send_event': 0, 'override_redirect': 1, \
    'window': <gtk.gdk.Window object at 0x25c6500 (GdkWindow at 0x252f5a0)>, \
    'serial': 983L, 'type': 19, \
    'display': <gtk.gdk.Display object at 0x2189eb0 (GdkDisplayX11 at 0x2278230)>}>)
2013-02-15 23:44:02,455 Discovered new override-redirect window: 4194356 (tray=8388617)
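The lookup above (X errors report resource ids in hex, the xpra log prints them in decimal) can be sketched in a few lines. The helper names xid_to_decimal and grep_xid are made up for illustration:

```python
import re


def xid_to_decimal(xid_hex):
    """Convert an X resource id such as '0x800009' to its decimal value."""
    return int(xid_hex, 16)


def grep_xid(lines, xid_hex):
    """Return the log lines mentioning the XID in decimal form.

    The lookarounds avoid matching the id inside a longer number.
    """
    pattern = re.compile(r"(?<!\d)%d(?!\d)" % xid_to_decimal(xid_hex))
    return [line for line in lines if pattern.search(line)]


sample = [
    "dock_tray(8388617) setting tray properties",
    "Discovered new override-redirect window: 4194356 (tray=8388617)",
    "unrelated window 83886170",
]
print(xid_to_decimal("0x800009"))
for line in grep_xid(sample, "0x800009"):
    print(line)
```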
So this looks like something is complaining about the tray window disappearing (which happens when xpra crashes or is upgraded - this needs to be moved to a separate process, but that's for another ticket). I don't think the system tray has anything to do with this bug, but it's worth trying the server with "--no-system-tray" just to be sure.
Now, the log also shows that before the actual crash, all we have are "will process ui packet pointer-position". It could just be that this is all that was happening at the time, or that the UI thread is already dead at this point. Either way, we're none the wiser. Time for the big guns. If you set (as per Debugging):
XPRA_X11_LOG=1
and maybe also:
XPRA_X11_DEBUG=1
before starting xpra, the server will produce a very verbose logfile, but we are only interested in what happens just before the crash (say, the last second or so). Hopefully there will be some insight in there, because at the moment I just don't know.
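Since only the last second or so of a very verbose log matters, a small filter helps when trimming it for attachment. This is a sketch that assumes the timestamp layout seen in the excerpts in this ticket ("2013-02-14 02:40:57,084" at the start of a line) and chronological ordering; last_seconds is a made-up helper name:

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # len("2013-02-14 02:40:57,084")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def last_seconds(lines, seconds=1.0):
    """Keep everything from the first record within `seconds` of the last one.

    Lines without a timestamp (tracebacks, X error output) are kept when they
    come after that point.
    """
    stamps = [parse_timestamp(line) for line in lines]
    timed = [ts for ts in stamps if ts is not None]
    if not timed:
        return list(lines)
    cutoff = timed[-1] - timedelta(seconds=seconds)
    for index, ts in enumerate(stamps):
        if ts is not None and ts >= cutoff:
            return list(lines[index:])
    return []


log = [
    "2013-02-14 02:40:57,084 not found transient_for=...",
    "2013-02-14 02:40:57,210 not found transient_for=...",
    "2013-02-14 02:40:58,121 not found transient_for=...",
    "X Error: BadWindow (invalid Window parameter) 3",
]
for line in last_seconds(log, seconds=1.0):
    print(line)
```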
deblab-13.log.xz (11.1 KiB): XPRA_X11_LOG=1 XPRA_X11_DEBUG=1 xpra start :13 -d all --no-system-tray
Thank you for the debugging hints and the fantastic interpretation of the existing data. I attached the tail of the crash log with the last ~4 seconds. I hope it makes sense to you.
Hmmm, no definitive answer in the logs - though there are some clues: cursors and damage are the events that stand out. For cursors, see the patch below, for damage I will have to review by hand...
Does this help at all? (I don't see why the call would trigger window errors since windows aren't involved directly with cursors... but you never know):
--- src/xpra/server.py (revision 2737)
+++ src/xpra/server.py (working copy)
@@ -209,7 +209,7 @@
         self.send_cursor_pending = False
         self.cursor_data = None
         def get_default_cursor():
-            self.default_cursor_data = get_cursor_image()
+            self.default_cursor_data = trap.call_synced(get_cursor_image)
             log("get_default_cursor=%s", self.default_cursor_data)
         trap.swallow_synced(get_default_cursor)
         self._wm.enableCursors(True)
@@ -387,7 +387,7 @@
 
     def send_cursor(self):
         self.send_cursor_pending = False
-        self.cursor_data = get_cursor_image()
+        self.cursor_data = trap.call_synced(get_cursor_image)
         if self.cursor_data:
             pixels = self.cursor_data[7]
             if self.default_cursor_data and pixels==self.default_cursor_data[7]:
(afterthought: an easier way to test if this is the cause of the problem is to disable cursors with "--no-cursors")
If it does not, and maybe even if it does and if you have time, can you try with the modified error.py as per [/wiki/Debugging#X11errors Debugging X11 errors]? (setting XPRA_LOG_ALL = True and XPRA_TRACE_ALL = True)
Reproduced the crash with "--no-cursors". Will get a more detailed log soon.
I can reproduce neither the ghost window nor the crash with the modified "error.py"... It looks like synchronous mode prevents the problem.
Please advise if there is anything I could do to troubleshoot this...
Also, I've noticed that I was able to recover from the crash (prior to applying the debugging error.py) using xpra upgrade... Could it be an effect of r2735, which I have applied?
Well, that's excellent news at last. It means the problem was caused by #224.
That makes it very easy to spot where I've mistakenly used unsynced calls as there are only a few left outside areas which don't need them (in window.py those are followed by synced calls before returning and are therefore safe):
- in server_base._process_mouse_common: trap.swallow_unsynced(self._move_pointer, pointer)
  I think this one is OK as I can't see how it would fail - but then again the X11 API might well be different from the GTK API.
- the same method in server.py looks very suspicious as it uses the window instance (which may have disappeared already and therefore cause an X11 error which we do not catch and which crashes GTK later...) - I think this may well be the one.
- trap.swallow(get_targets, targets) in clipboard_base.py - not our problem here, but this is the last instance of an unspecified synced/unsynced call and should probably be changed to a synced call (note to self)
Please replace trap.swallow_unsynced by trap.swallow_synced and see if the problem is solved. Fingers crossed, and thank you very much for your time!
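Why a synced call matters can be shown with a toy model. This is not wimpiggy's implementation: FakeXConnection and ErrorTrapSketch are hypothetical classes, and the sketch assumes (as the discussion above suggests) that an unsynced trap is removed before the X server has processed the request, so the error only surfaces later, outside any trap.

```python
class FakeXConnection:
    """Toy X connection: requests are queued, errors only surface on sync()."""

    def __init__(self, live_windows):
        self.live_windows = set(live_windows)
        self._pending = []

    def request(self, name, window):
        # Nothing is checked yet, just like an asynchronous X request.
        self._pending.append((name, window))

    def sync(self):
        """Process the queue and return the errors it produced."""
        errors = [(name, window) for name, window in self._pending
                  if window not in self.live_windows]
        self._pending = []
        return errors


class ErrorTrapSketch:
    def __init__(self, conn):
        self.conn = conn
        self.untrapped = []  # errors that would reach the fatal error handler

    def swallow_synced(self, fn, *args):
        fn(*args)
        self.conn.sync()  # errors arrive while the trap is active: discarded

    def swallow_unsynced(self, fn, *args):
        fn(*args)  # trap removed right away; the request is still queued

    def main_loop_iteration(self):
        # A later round trip, outside any trap.
        self.untrapped.extend(self.conn.sync())


conn = FakeXConnection(live_windows={1})
trap = ErrorTrapSketch(conn)

trap.swallow_synced(conn.request, "ConfigureWindow", 2)    # window 2 is gone
trap.main_loop_iteration()
print("after synced call:", trap.untrapped)

trap.swallow_unsynced(conn.request, "ConfigureWindow", 2)
trap.main_loop_iteration()
print("after unsynced call:", trap.untrapped)
```

In this model the synced call absorbs the BadWindow-style error, while the unsynced one lets it escape on a later round trip, which is consistent with the crash disappearing when everything runs in synchronous mode.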
The clipboard one was OK actually (though I've cleaned it up in r2738).
Even before getting any feedback, I am convinced that we should use a synced call for raise_and_move (well, at least the raise portion, which requires the window to still exist), so r2739 now uses synced X11 calls.
Does this help?
Yes, it is OK now. :) :) r2739 has totally fixed it: I can reproduce neither the indestructible "ghost" popups nor the crashes. Fantastic, I'm so relieved that this testing marathon is over. :) Finally the issue has gone, hopefully forever. :)
I also tested with xpra-0.7.8 as client and couldn't crash the latest Xpra server any more. Great job.
The issues related to filelight haven't changed, so they are unrelated: I've made a new ticket, #262.
As far as I'm concerned this ticket can be closed. Does it qualify for the 0.8.5 release? ;)
Definitely qualifies for 0.8.5, will try to push this today.
Thanks again for your time.
Issue migrated from trac ticket # 258
component: core | priority: critical | resolution: fixed
2013-02-13 10:24:17: antoine created the issue