VirtualGL / virtualgl

Main VirtualGL repository
https://VirtualGL.org
Other
690 stars 104 forks source link

ERROR: in isDisplayExcluded: Unexpected NULL condition #153

Closed peci1 closed 3 years ago

peci1 commented 3 years ago

Hello, I'm running a setup consisting of TurboVNC server and VGL, and I've noticed the following issue when running git gui:

$ git gui [VGL] Shared memory segment ID for vglconfig: 753716 [VGL] VirtualGL v2.6.80 64-bit (Build 20200904) [VGL] ERROR: in isDisplayExcluded-- [VGL] 75: Unexpected NULL condition

I run the setup as following: The system has an NVidia GPU. X server :0 is GDM greeter, nothing else. X server :1 is standard Ubuntu desktop run by logging in from GDM. Both :0 and :1 are accelerated.

Then I remotely login to the machine over SSH and run the following command to start a remotely accessible accelerated X server :2 with a window manager:

SESSION_MANAGER= DBUS_SESSION_BUS_ADDRESS= vglrun +v -d :1 /opt/TurboVNC/bin/vncserver -fg -log /dev/stdout -geometry 1920x1080 -vgl -xstartup 'dbus-run-session xfce4-session'

The session starts, I can connect to it with a VNC client and then I use the started window manager to run other programs as needed. Most programs work (glxgears, darktable, libreoffice with HW acceleration), but git gui shows the error described above.

TurboVNC Server v2.2.80 (build 20200807).

Here is a result of glreadtest run on X server :2

$ /opt/VirtualGL/bin/glreadtest

GLreadtest v2.6.80 (Build 20200904)

/opt/VirtualGL/bin/glreadtest -h for advanced usage.
[VGL] Shared memory segment ID for vglconfig: 983096
[VGL] VirtualGL v2.6.80 64-bit (Build 20200904)
[VGL] Opening connection to 3D X server :1
FB Config = 0xb1
Rendering to Pbuffer (size = 701 x 701 pixels)
Using 1-byte row alignment

>>>>>>>>>>  PIXEL FORMAT:  RGB  <<<<<<<<<<
glDrawPixels():   209.7 Mpixels/sec
glReadPixels():   107.5 Mpixels/sec (min = 101.0, max = 110.9, sdev = 1.237)
glReadPixels() accounted for 100.00% of total readback time

>>>>>>>>>>  PIXEL FORMAT:  RGBA  <<<<<<<<<<
glDrawPixels():   215.2 Mpixels/sec
glReadPixels():   84.35 Mpixels/sec (min = 78.93, max = 84.94, sdev = 0.8140)
glReadPixels() accounted for 100.00% of total readback time

>>>>>>>>>>  PIXEL FORMAT:  BGR  <<<<<<<<<<
glDrawPixels():   213.7 Mpixels/sec
glReadPixels():   107.5 Mpixels/sec (min = 101.1, max = 110.9, sdev = 1.160)
glReadPixels() accounted for 100.00% of total readback time

>>>>>>>>>>  PIXEL FORMAT:  BGRA  <<<<<<<<<<
glDrawPixels():   215.1 Mpixels/sec
glReadPixels():   84.42 Mpixels/sec (min = 77.00, max = 84.96, sdev = 0.7847)
glReadPixels() accounted for 100.00% of total readback time

>>>>>>>>>>  PIXEL FORMAT:  ABGR  <<<<<<<<<<
glDrawPixels():   215.1 Mpixels/sec
glReadPixels():   84.21 Mpixels/sec (min = 78.11, max = 84.94, sdev = 0.9959)
glReadPixels() accounted for 100.00% of total readback time

>>>>>>>>>>  PIXEL FORMAT:  ARGB  <<<<<<<<<<
glDrawPixels():   214.8 Mpixels/sec
glReadPixels():   84.42 Mpixels/sec (min = 78.82, max = 84.96, sdev = 0.7817)
glReadPixels() accounted for 100.00% of total readback time

>>>>>>>>>>  PIXEL FORMAT:  RED  <<<<<<<<<<
glDrawPixels():   213.5 Mpixels/sec
glReadPixels():   293.8 Mpixels/sec (min = 229.4, max = 302.8, sdev = 5.570)
glReadPixels() accounted for 100.00% of total readback time
peci1 commented 3 years ago

The same happens when I delegate rendering via EGL to /dev/dri/card1 (when launching the :2 X server).

dcommander commented 3 years ago

I can't reproduce the issue using the latest stable release of both VirtualGL and TurboVNC or using the latest evolving code. You seem to be using an older evolving build of both VirtualGL and TurboVNC. Please try upgrading to the latest code.

peci1 commented 3 years ago

I've updated to latest dev versions (the build debs), and the error is still there. The basic problem is with the Tcl/Tk interpreter:

$ wish
[VGL] Shared memory segment ID for vglconfig: 7831620
[VGL] VirtualGL v2.6.80 64-bit (Build 20201209)
[VGL] ERROR: in isDisplayExcluded--
[VGL]    75: Unexpected NULL condition

$ ltrace -fS wish
...
[pid 30043] SYS_read(4, "COMBINING MACRON\n\n<dead_circumfl"..., 4096)                                                            = 4096
[pid 30043] SYS_read(4, "ALL LETTER O WITH COMBINING MACR"..., 4096)                                                              = 4096
[pid 30043] SYS_read(4, "C SMALL LETTER ER WITH COMBINING"..., 4096)                                                              = 4096
[pid 30043] SYS_read(4, "                     : "\342\212\226"   U"..., 4096)                                                     = 4096
[pid 30043] SYS_read(4, "\204"   U2344   # \342\216\225 > APL FUNCTIO"..., 4096)                                                  = 4096
[pid 30043] SYS_read(4, "NE\n<Multi_key> <minus> <U2193>  "..., 4096)                                                             = 4096
[pid 30043] SYS_read(4, "ONAL SYMBOL LEFT SHOE STILE\n<Mul"..., 4096)                                                             = 4096
[pid 30043] SYS_read(4, "        : "\306\220\314\202"   # LATIN CAPIT"..., 4096)                                                  = 4096
[pid 30043] SYS_read(4, "           \t: "\306\216\314\204"   # LATIN C"..., 4096)                                                 = 4092
[pid 30043] SYS_read(4, "", 4096)                                                                                                 = 0
[pid 30043] SYS_close(4)                                                                                                          = 0
[pid 30043] SYS_futex(0x7f6a94f599f0, 129, 0x7fffffff, 0)                                                                         = 0
[pid 30043] SYS_write(2, "[VGL] ERROR: in isDisplayExclude"..., 75[VGL] ERROR: in isDisplayExcluded--
[VGL]    75: Unexpected NULL condition
)                                                               = 75
[pid 30043] SYS_shmdt(0x7f6a95361000, 0, 0x755b, 30)                                                                              = 0
[pid 30043] SYS_shmctl(0x778053, 0, 0, 30)                                                                                        = -22
[pid 30043] SYS_getpid()                                                                                                          = 30043
[pid 30043] SYS_getpid()                                                                                                          = 30043
[pid 30043] SYS_exit_group(1 <no return ...>
[pid 30043] +++ exited (status 1) +++

$ strace -f wish
read(4, "  # LATIN SMALL LETTER U WITH HO"..., 4096) = 4096
read(4, "SMALL LETTER O WITH CIRCUMFLEX A"..., 4096) = 4096
read(4, "LL LETTER IOTA WITH MACRON\n<dead"..., 4096) = 4096
brk(0x55c65a15f000)                     = 0x55c65a15f000
read(4, "COMBINING MACRON\n\n<dead_circumfl"..., 4096) = 4096
read(4, "ALL LETTER O WITH COMBINING MACR"..., 4096) = 4096
read(4, "C SMALL LETTER ER WITH COMBINING"..., 4096) = 4096
read(4, "                     : \"\342\212\226\"   U"..., 4096) = 4096
read(4, "\204\"   U2344   # \342\216\225 > APL FUNCTIO"..., 4096) = 4096
read(4, "NE\n<Multi_key> <minus> <U2193>  "..., 4096) = 4096
read(4, "ONAL SYMBOL LEFT SHOE STILE\n<Mul"..., 4096) = 4096
read(4, "        : \"\306\220\314\202\"   # LATIN CAPIT"..., 4096) = 4096
read(4, "           \t: \"\306\216\314\204\"   # LATIN C"..., 4096) = 4092
read(4, "", 4096)                       = 0
close(4)                                = 0
futex(0x7ff737f469f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "[VGL] ERROR: in isDisplayExclude"..., 75[VGL] ERROR: in isDisplayExcluded--
[VGL]    75: Unexpected NULL condition
) = 75
shmdt(0x7ff73834e000)                   = 0
shmctl(7864391, IPC_RMID, NULL)         = -1 EINVAL (Invalid argument)
getpid()                                = 30145
getpid()                                = 30145
exit_group(1)                           = ?
+++ exited with 1 +++

But:

LD_PRELOAD= wish

works.

I couldn't find a reasonable way to set breakpoint to the VGL calls, as symbols seem to be missing

$ gdb -q wish
Reading symbols from wish...(no debugging symbols found)...done.
(gdb) set follow-fork-mode child
(gdb) break isDisplayStringExcluded(char*)
Function "isDisplayStringExcluded(char*)" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (isDisplayStringExcluded(char*)) pending.
(gdb) start
Function "main" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Temporary breakpoint 2 (main) pending.
Starting program: /usr/bin/wish 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[VGL] Shared memory segment ID for vglconfig: 8028195
[VGL] VirtualGL v2.6.80 64-bit (Build 20201209)
[VGL] ERROR: in isDisplayExcluded--
[VGL]    75: Unexpected NULL condition
[Inferior 1 (process 31023) exited with code 01]
(gdb) 

Do you have an idea how I could debug this further?

dcommander commented 3 years ago

isDisplayExcluded() is an inline function, so it might not be showing up in the debugger. You can set a breakpoint at faker.h:75, though.

Which O/S and version of Git are you using? That may help me to reproduce the issue.

dcommander commented 3 years ago

Note that the underlying problem has to do with new functionality in VirtualGL 3.0. VGL 3.0 now hangs certain internal VirtualGL Faker structures on the 2D X server connection, in order to avoid hashes. But I don't understand why that mechanism is failing in your case. I need to be able to reproduce the failure so I can develop that understanding.

peci1 commented 3 years ago

It's Ubuntu 18.04.5 LTS. tk versoin 8.6.0+9 (that provides the wish binary), git version 2.24.0.

Setting the breakpoint to faker.h:75 didn't help, the library has probably all debugging symbols stripped down, so the debugger doesn't know which assembly line(s) come from this source. I'll try to get a local debug build of VGL and figure out what happens.

dcommander commented 3 years ago

If you're using our official DEB package, then it does not contain debug symbols. You will have to use a custom build. I'll try to reproduce the issue with my Ubuntu 18.04 VM in the meantime.

peci1 commented 3 years ago

Ok, so the failing point is here:

https://github.com/tcltk/tk/blob/7570fbe3b848f0d05b9b16849ec6d24ca5ef899f/unix/tkUnixEmbed.c#L243

Some info from GDB:

#0  vglfaker::isDisplayExcluded (dpy=0x5555557e8e10) at /home/peci1/apps/virtualgl/server/faker.h:75
#1  XCreateWindow (dpy=0x5555557e8e10, parent=702, x=0, y=0, width=1, height=1, border_width=0, depth=24, c_class=1, visual=0x555555804230, valuemask=10768, attributes=0x5555557cec50)
    at /home/peci1/apps/virtualgl/server/faker-x11.cpp:224
#2  0x00007ffff75e410a in TkpMakeWindow () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#3  0x00007ffff75550fe in Tk_MakeWindowExist () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#4  0x00007ffff75f0d8b in ?? () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#5  0x00007ffff75eeed9 in Tk_SetAppName () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#6  0x00007ffff7554261 in TkCreateMainWindow () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#7  0x00007ffff755fcc7 in ?? () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#8  0x00007ffff755dc7b in ?? () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#9  0x00007ffff75566c5 in ?? () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#10 0x0000555555554acb in ?? ()
#11 0x00007ffff75458a3 in Tk_MainEx () from /usr/lib/x86_64-linux-gnu/libtk8.6.so
#12 0x000055555555498c in ?? ()
#13 0x00007ffff6d71bf7 in __libc_start_main (main=0x555555554960, argc=1, argv=0x7fffffffaff8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffafe8)
    at ../csu/libc-start.c:310
#14 0x00005555555549ca in _start ()
(gdb) p extData
$1 = (XExtData *) 0x0
(gdb) p minExtensionNumber 
$2 = 1
(gdb) p obj
$3 = {display = 0x5555557e8e10, gc = 0x5555557e8e10, visual = 0x5555557e8e10, screen = 0x5555557e8e10, pixmap_format = 0x5555557e8e10, font = 0x5555557e8e10}
peci1 commented 3 years ago
(gdb) p *dpy
$1 = {ext_data = 0x0, free_funcs = 0x5555557ea1e0, fd = 3, conn_checker = 0, proto_major_version = 11, proto_minor_version = 0, vendor = 0x5555557eb260 "The X.Org Foundation", resource_base = 52428800, 
  resource_mask = 2097151, resource_id = 0, resource_shift = 0, resource_alloc = 0x7ffff63b24a0 <_XAllocID>, byte_order = 0, bitmap_unit = 32, bitmap_pad = 32, bitmap_bit_order = 0, nformats = 7, 
  pixmap_format = 0x5555557ea230, vnumber = 11, release = 12008000, head = 0x0, tail = 0x0, qlen = 0, last_request_read = 8, request = 9, last_req = 0x555555800220 "\002", buffer = 0x555555800220 "\002", 
  bufptr = 0x555555800230 "\377\377\377", bufmax = 0x555555804220 "", max_request_size = 65535, db = 0x0, synchandler = 0x0, display_name = 0x5555557ea060 ":2.0", default_screen = 0, nscreens = 1, 
  screens = 0x5555557ea2e0, motion_buffer = 256, flags = 0, min_keycode = 8, max_keycode = 255, keysyms = 0x0, modifiermap = 0x0, keysyms_per_keycode = 0, 
  xdefaults = 0x5555557eac90 "\nXft.hintstyle:\thintnone\nXft.rgba:\tnone\nXcursor.theme:\t\nXcursor.size:\t0\nXcursor.theme_core:\t1\n", scratch_buffer = 0x0, scratch_length = 0, ext_number = 1, 
  ext_procs = 0x5555557eada0, event_vec = {0x7ffff63b4050 <_XUnknownWireEvent>, 0x7ffff63b4050 <_XUnknownWireEvent>, 0x7ffff63b4090 <_XWireToEvent> <repeats 34 times>, 
    0x7ffff63b4050 <_XUnknownWireEvent> <repeats 49 times>, 0x7ffff6406870, 0x7ffff63b4050 <_XUnknownWireEvent> <repeats 42 times>}, wire_vec = {0x7ffff63b4080 <_XUnknownNativeEvent>, 
    0x7ffff63b4080 <_XUnknownNativeEvent>, 0x0 <repeats 34 times>, 0x7ffff63b4080 <_XUnknownNativeEvent> <repeats 92 times>}, lock_meaning = 0, lock = 0x0, async_handlers = 0x0, bigreq_size = 4194303, 
  lock_fns = 0x0, idlist_alloc = 0x7ffff63b24f0 <_XAllocIDs>, key_bindings = 0x0, cursor_font = 0, atoms = 0x0, mode_switch = 0, num_lock = 0, context_db = 0x0, error_vec = 0x0, cms = {defaultCCCs = 0x0, 
    clientCmaps = 0x0, perVisualIntensityMaps = 0x0}, im_filters = 0x0, qfree = 0x0, next_event_serial_num = 1, flushes = 0x0, im_fd_info = 0x0, im_fd_length = 0, conn_watchers = 0x0, watcher_count = 0, 
  filedes = 0x5555557ea110 "\003", savedsynchandler = 0x0, resource_max = 2097146, xcmisc_opcode = 0, xkb_info = 0x5555557ead00, trans_conn = 0x0, xcb = 0x5555557ea080, next_cookie = 0, generic_event_vec = {
    0x0 <repeats 128 times>}, generic_event_copy_vec = {0x0 <repeats 128 times>}, cookiejar = 0x0}
peci1 commented 3 years ago

Weird is that if I edit faker.h to always return true, it works (I'd expect a segfault).

peci1 commented 3 years ago

Replicating the bug should be pretty simple in the end. Forget about the extra TurboVNC X server. I now logged directly onto the problematic machine and a plain vglrun +v -d :1 wish fails. The same with EGL.

peci1 commented 3 years ago

I've also tried it on another machine where VirtualGL was never installed and is closer to a clean Ubuntu installation, and it behaves the same.

dcommander commented 3 years ago

Reproduced

dcommander commented 3 years ago

Symptomatically, it appears from VirtualGL's point of view that Tk is calling XCreateWindow() without first calling XOpenDisplay(). However, examining the Tk source code reveals that it calls XkbOpenDisplay(), which wraps XOpenDisplay(), so I'm still unsure why the issue is occurring. It's easy enough to work around the issue, but since it's an issue that shouldn't occur, I really need to understand it first. At the moment, I'm stumped. Any additional help you can provide is appreciated.

dcommander commented 3 years ago

The piece of information I was missing was the fact that XkbOpenDisplay() and XOpenDisplay() are both provided by libX11, so the call to XOpenDisplay() is internal to libX11. For reasons I don't understand, VirtualGL intercepts the internal XOpenDisplay() call on CentOS but not on Ubuntu, which is why I was unable to reproduce the issue until I fired up an Ubuntu VM (my normal dev/test machines run CentOS.)

The issue was fixed by interposing XkbOpenDisplay() and performing the same faker initialization in the body of that function that is performed in the body of XOpenDisplay(). Commits will be pushed shortly.

peci1 commented 3 years ago

Thanks, that fixed it!