RTimothyEdwards / XCircuit

XCircuit circuit drawing and schematic capture tool
GNU General Public License v2.0

Segmentation fault for any menu item involving xcircuit::popupfilelist #8

Closed QuantumRipple closed 3 years ago

QuantumRipple commented 3 years ago

OS: Manjaro Linux, Kernel 5.7.15-1
XCircuit version: 3.10.28 from distro repositories, but also occurs with 3.10.26 compiled locally from AUR/github (even when compiled --with-python instead of TCL).
TCL/TK version: 8.6.10

The problem occurs inside of a fresh Manjaro install on a VM as well as my primary system, but does not occur on Arch running in Windows Subsystem for Linux using the same versions of XCircuit and TCL.

Specifically, the segmentation fault occurs when wm deiconify .filelist is called, even if done outside of popupfilelist. I am able to deiconify other dialogs and read and write files using the commands the filelist dialogue would have eventually called to do so.

QuantumRipple commented 3 years ago

Associated stack trace:

Process 2954 (wish) of user 1000 dumped core.
Stack trace of thread 2954:
#0  0x00007f7cb7a0466a n/a (xcircuit.so + 0x3966a)
#1  0x00007f7cb7a05174 n/a (xcircuit.so + 0x3a174)
#2  0x00007f7cb7a862c9 n/a (xcircuit.so + 0xbb2c9)
#3  0x00007f7cb8bfc96f Tk_HandleEvent (libtk8.6.so + 0x4796f)
#4  0x00007f7cb8bfcc21 WindowEventProc (libtk8.6.so + 0x47c21)
#5  0x00007f7cb8b0102a Tcl_ServiceEvent (libtcl8.6.so + 0x10a02a)
#6  0x00007f7cb8b012a7 Tcl_DoOneEvent (libtcl8.6.so + 0x10a2a7)
#7  0x00007f7cb8bfd522 Tk_MainLoop (libtk8.6.so + 0x48522)
#8  0x00007f7cb8c0c1d2 Tk_MainEx (libtk8.6.so + 0x571d2)
#9  0x000055d5e2bbf052 n/a (wish8.6 + 0x1052)
#10 0x00007f7cb8856152 __libc_start_main (libc.so.6 + 0x28152)
#11 0x000055d5e2bbf08e _start (wish8.6 + 0x108e)

RTimothyEdwards commented 3 years ago

@QuantumRipple : Is it possible to get a stack trace from a version of xcircuit without (apparently) the symbols stripped, so I can see what line of what routine it was on when it crashed? "WindowEventProc" suggests that it was trying to do something like map the window and possibly got an invalid window passed to it. But I can't see what the callback routine in xcircuit was from that stack trace.

QuantumRipple commented 3 years ago

I'll try and figure out how to do the software build process with debugging symbols included... I can no longer reproduce this on my main Manjaro machine as of yesterday though, and have some very strange results to report.

It's not related to the kernel version - I tried both 4.4 (similar to what Arch+WSL is running) and upgraded to the latest 5.8 on the same system, no change.

I have my main machine, a secondary machine, and a VM (hosted on my main machine) all running Manjaro XFCE configured similarly (plus the Arch WSL w/ VcXsrv on a separate Windows 10 machine that has never exhibited the problem). The VM stopped segfaulting when I went back to re-test. Not sure why. It was still segfaulting every time on my main machine and secondary machine, but it ALSO segfaulted in the same place when my main machine used X forwarding over SSH to the VM or secondary machine, and did not segfault when the VM used X forwarding over SSH to the main or secondary machine. In other words, it follows the X display, not the machine XCircuit is actually running on.

In continuing my blackbox debug I also changed my session on my main machine to KDE Plasma. Lo and behold, no segfault! I changed my session BACK to XFCE and now it doesn't segfault on my main machine either.

Only my secondary machine (which only has XFCE installed despite the fact I normally use it headless) still segfaults. I doubt it's a bug in XCircuit at this point, but maybe TCL/TK or XFCE.

RTimothyEdwards commented 3 years ago

That view (that it's a bug in Tcl/Tk or XFCE) is undermined by the fact that your stack trace above shows that the process was somewhere in an xcircuit routine when it crashed. What I can tell from the stack trace is that "WindowEventProc" was called, which was probably a callback from the "wm deiconify" Tcl command, most likely from the Expose event or Map event generated by X11. It is most likely caused by a timing issue between the window manager and the program, probably where xcircuit tries to get a pointer to the window from Tk. If Tk has not yet gotten any information about the window from X11, then it can end up with a NULL pointer to the window, and that causes the crash. I vaguely recall this happening before with XFCE and if I remember it correctly, the problem was a significant delay in XFCE reporting new windows to Tcl/Tk. If I know where the issue occurs, it's easy to stop the crash; it's a bit harder to make the behavior correct, but usually something like a "wait" command in Tcl/Tk will suffice to force Tcl/Tk to wait for valid information about the window before it makes a call to a callback function.

QuantumRipple commented 3 years ago

Stack trace with debug symbols. To trigger this I'm always starting xcircuit.sh then clicking File->Read XCircuit File.

System error log:

Stack trace of thread 5750:
#0  0x00007f08b1090176 listfiles (/scratch/xcircuit/lib/tcl/xcircuit.so + 0x3a176)
#1  0x00007f08b1090c56 newfilelist (/scratch/xcircuit/lib/tcl/xcircuit.so + 0x3ac56)
#2  0x00007f08b110ff20 xctk_listfiles (/scratch/xcircuit/lib/tcl/xcircuit.so + 0xb9f20)
#3  0x00007f08b219896f Tk_HandleEvent (libtk8.6.so + 0x4796f)
#4  0x00007f08b2198c21 WindowEventProc (libtk8.6.so + 0x47c21)
#5  0x00007f08b209d02a Tcl_ServiceEvent (libtcl8.6.so + 0x10a02a)
#6  0x00007f08b209d2a7 Tcl_DoOneEvent (libtcl8.6.so + 0x10a2a7)
#7  0x00007f08b2199522 Tk_MainLoop (libtk8.6.so + 0x48522)
#8  0x00007f08b21a81d2 Tk_MainEx (libtk8.6.so + 0x571d2)
#9  0x00005631dcae0052 n/a (wish8.6 + 0x1052)
#10 0x00007f08b1df2152 __libc_start_main (libc.so.6 + 0x28152)
#11 0x00005631dcae008e _start (wish8.6 + 0x108e)

GDB:

Thread 1 "wish" received signal SIGSEGV, Segmentation fault.
listfiles (w=0x55c1878f74a0, okaystruct=0x55c187a38760, calldata=0x0) at filelist.c:388
388       values.font = appdata.filefont->fid;
(gdb) bt
#0  listfiles (w=0x55c1878f74a0, okaystruct=0x55c187a38760, calldata=0x0) at filelist.c:388
#1  0x00007fd176457c56 in newfilelist (w=0x55c1878f74a0, okaystruct=0x55c187a38760) at filelist.c:547
#2  0x00007fd1764d6f20 in xctk_listfiles (clientData=0x55c187a38760, eventPtr=0x55c187d0bf10) at tclxcircuit.c:9696
#3  0x00007fd17755f96f in Tk_HandleEvent () from /usr/lib/libtk8.6.so
#4  0x00007fd17755fc21 in WindowEventProc () from /usr/lib/libtk8.6.so
#5  0x00007fd17746402a in Tcl_ServiceEvent () from /usr/lib/libtcl8.6.so
#6  0x00007fd1774642a7 in Tcl_DoOneEvent () from /usr/lib/libtcl8.6.so
#7  0x00007fd177560522 in Tk_MainLoop () from /usr/lib/libtk8.6.so
#8  0x00007fd17756f1d2 in Tk_MainEx () from /usr/lib/libtk8.6.so
#9  0x000055c185aef052 in ?? ()
#10 0x00007fd1771b9152 in __libc_start_main () from /usr/lib/libc.so.6
#11 0x000055c185aef08e in _start ()

RTimothyEdwards commented 3 years ago

Mmm, it's a font thing. . . Probably has to do with whether or not certain fonts exist for X11 on the system. There should be a way to avoid crashing under that condition. . .
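For illustration only, here is a minimal sketch of the kind of guard that would avoid the NULL dereference reported at filelist.c:388 (`values.font = appdata.filefont->fid;`). It is not XCircuit's actual fix. The types `FontInfo` and `GCValuesSketch`, the flag `SKETCH_GC_FONT`, and the function `set_gc_font` are all hypothetical stand-ins so that the example compiles without X11 headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Xlib types, so the sketch builds without X11. */
typedef unsigned long FontId;
typedef struct { FontId fid; } FontInfo;            /* models the fid field of a loaded font */
typedef struct { FontId font; unsigned long mask; } GCValuesSketch;

#define SKETCH_GC_FONT (1UL << 14)  /* hypothetical "font field is valid" flag */

/* Copy the font id into the GC values only if a font was actually loaded.
 * Returns 1 if the font was set, 0 if the caller should rely on the
 * default font instead (the flag is cleared in that case). */
int set_gc_font(GCValuesSketch *values, const FontInfo *filefont)
{
    assert(values != NULL);
    if (filefont == NULL) {
        /* The font lookup failed earlier; do not dereference it. */
        values->mask &= ~SKETCH_GC_FONT;
        return 0;
    }
    values->font = filefont->fid;
    values->mask |= SKETCH_GC_FONT;
    return 1;
}
```

With a guard like this, a failed font lookup degrades to "use whatever default font is available" rather than a segmentation fault, although the file list may then render with a different font than intended.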

RTimothyEdwards commented 3 years ago

On the other hand, the GUI initialization routine is calling the application database setup routine which is trying every which way to find a compatible X11 font, and as a 3rd attempt should accept any X11 font available on the system. On the system that crashes, if you run "xlsfonts", do you get anything at all? I guess I can go one step further and do a 4th attempt because the 3rd attempt is still looking for fonts of the type with various characteristics separated by dashes, so it would still not accept some of the fixed-bitmap fonts like "7x13".
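The "try progressively looser patterns" idea described above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in tclxcircuit.c: `load_first_font`, `font_loader`, `stub_loader`, and `sketch_patterns` are hypothetical names, the loader is a test double standing in for a call such as XLoadQueryFont, and only the first pattern string is taken from this thread; the rest are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loader signature: returns NULL when no font matches.
 * In a real program a wrapper around XLoadQueryFont() could play this role. */
typedef void *(*font_loader)(const char *pattern);

/* Try each pattern in order and return the first font that loads.
 * If `used` is non-NULL, it receives the pattern that succeeded,
 * or NULL when every attempt failed. */
void *load_first_font(font_loader load, const char *const *patterns,
                      size_t count, const char **used)
{
    assert(load != NULL && patterns != NULL);
    for (size_t i = 0; i < count; i++) {
        void *font = load(patterns[i]);
        if (font != NULL) {
            if (used != NULL) *used = patterns[i];
            return font;
        }
    }
    if (used != NULL) *used = NULL;
    return NULL;
}

/* Illustrative pattern list, most specific first. The last two entries
 * model the proposed extra attempt that also accepts names without dashes. */
const char *const sketch_patterns[] = {
    "-*-*-medium-r-normal--14-*",  /* pattern quoted in this thread */
    "-*-*-*-*-*--*-*",             /* assumed all-wildcard dash pattern */
    "fixed",                       /* assumed plain font name */
    "*",                           /* assumed catch-all */
};

/* Test double: simulates a display where every dash-style pattern fails
 * and only other names resolve to a font. */
static int dummy_font;
void *stub_loader(const char *pattern)
{
    return (pattern[0] == '-') ? NULL : (void *)&dummy_font;
}
```

Calling `load_first_font(stub_loader, sketch_patterns, 4, &used)` with this stub skips the two dash-style patterns and succeeds on the third entry. The caller still has to handle a NULL return, since even the catch-all can fail.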

Maybe just doing "apt install xfonts-terminus" would correct the issue. . .

QuantumRipple commented 3 years ago

Further debug: xlsfonts lists 1023 fonts, of which 1020 match the expected dash format. Notably, it has -xos4-terminus-medium-r-normal--14-140-72-72-c-80-iso10646-1, which should match your second try on tclxcircuit.c:9856.

I switched from xcircuit.sh to xcircexec as the font stuff happens on startup. However, both the call on tclxcircuit.c:9856 and the more generic 3rd try on tclxcircuit.c:9858 return null. I compiled libX11.so.6 (version 1.6.12) with debugging symbols to dig deeper, although I had to delete RPATH out of the compiled xcircexec ELF to get it to care about LD_LIBRARY_PATH pointing to my newly compiled version of libX11.so.6 at /usr/local/lib.

XLoadQueryFont is defined on line 92 of libX11's Font.c. The returned font pointer should come from either Font.c:105's _XF86LoadQueryLocaleFont(dpy, name, &font_result, (Font *)0) or Font.c:122's font_result = _XQueryFont(dpy, fid, seq).

_XF86LoadQueryLocaleFont(...) (Font.c:651) returns 0 because the locale portion of the charset ("UTF" from "UTF-8") does not match the end of the font search string before the trailing "-*" ("-14"). I think this is normal.

_XQueryFont(...) (Font.c:182) when examined in the debugger is falling into some kind of protection case on line 209 and returning null, but I don't really understand what it's doing at this level.

209    if (!_XReply (dpy, (xReply *) &reply,
210       ((SIZEOF(xQueryFontReply) - SIZEOF(xReply)) >> 2), xFalse)) {
211    if (seq)
212        DeqAsyncHandler(dpy, &async);
213    return (XFontStruct *)NULL;
214    }

QuantumRipple commented 3 years ago

Problem solved!

Font packages rang a bell. Pacman had installed some font packages on my main machine in an unrelated operation, as dependencies for Xfig, which I uninstalled shortly after. Restarting the X server was the actual cause of the no-more-segfaults in XCircuit (not changing to KDE).

Installing the xorg-fonts-75dpi and/or xorg-fonts-100dpi, xorg-fonts-alias-75dpi, xorg-fonts-alias-100dpi packages (which conflicted with and removed the plain xorg-fonts-alias package) was the change needed to make it work on the secondary machine as well. The working xorg-fonts packages are actually orphaned on my main machine (explicitly installed on the secondary), so I probably need to get in contact with the Arch maintainers to fix the dependency list.

I'm still not sure why XLoadQueryFont(dpy, "-*-*-medium-r-normal--14-*") failed to find -xos4-terminus-medium-r-normal--14-140-72-72-c-80-iso10646-1.

QuantumRipple commented 3 years ago

I have a bit more detail on the root cause. The plain xorg-fonts-alias package was deprecated earlier this year and split into four new packages: xorg-fonts-alias-100dpi, xorg-fonts-alias-75dpi, xorg-fonts-alias-cyrillic, and xorg-fonts-alias-misc.

XCircuit does not actually seem to depend on any of the xorg-fonts packages - you can remove them all and it works just fine. What fixed the font segfault is the removal of xorg-fonts-alias. I can re-introduce problems by installing the new package xorg-fonts-alias-misc. This provides only /usr/share/fonts/misc/fonts.alias. The other 3 parts of the old package can be installed without problems.

misc/fonts.alias provided by my distribution (upstream Arch in this case) has an alias that matches "-*-*-medium-r-normal--14-*", but points to a font that does not exist (not installed). I think these null entries are shadowing the matching terminus font, causing the null return. Then the third attempt with all wildcards has the same problem - there are other aliases pointing to non-present fonts that may shadow the entire set of valid fonts.
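To make the shadowing hypothesis concrete, here is a toy model. It is not how the X server actually resolves fonts; it only shows how a "take the first match" lookup can return nothing when that first match is an alias whose target is not installed. `FontEntry`, `font_installed`, `resolve_first_match`, and the two `-example-...` names are hypothetical; the pattern and the terminus font name are the ones quoted in this thread. Matching uses POSIX `fnmatch`.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* One entry in a toy font directory. target == NULL marks a real font;
 * otherwise the entry is an alias pointing at the named font. */
typedef struct {
    const char *name;
    const char *target;
} FontEntry;

/* Return 1 if a real (non-alias) font with this exact name exists. */
int font_installed(const FontEntry *dir, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (dir[i].target == NULL && strcmp(dir[i].name, name) == 0)
            return 1;
    return 0;
}

/* Toy resolver: stop at the FIRST entry whose name matches the pattern.
 * A dangling alias therefore yields NULL even if a later entry is a
 * perfectly good real font matching the same pattern. */
const char *resolve_first_match(const FontEntry *dir, size_t n,
                                const char *pattern)
{
    for (size_t i = 0; i < n; i++) {
        if (fnmatch(pattern, dir[i].name, 0) != 0)
            continue;
        if (dir[i].target == NULL)
            return dir[i].name;
        return font_installed(dir, n, dir[i].target) ? dir[i].target : NULL;
    }
    return NULL;
}

/* Made-up directory: a dangling alias listed ahead of the real font. */
const FontEntry sketch_dir[] = {
    { "-example-alias-medium-r-normal--14-130-75-75-c-70-iso8859-1",
      "-example-missing-medium-r-normal--14-130-75-75-c-70-iso8859-1" },
    { "-xos4-terminus-medium-r-normal--14-140-72-72-c-80-iso10646-1", NULL },
};
```

In this model, looking up "-*-*-medium-r-normal--14-*" over both entries fails, while the same lookup over the terminus entry alone succeeds. That mirrors the observation that removing the alias-only package, or installing the fonts the aliases point to, makes the problem go away.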

Installing the real fonts package (xorg-fonts-misc, which depends on xorg-fonts-alias-misc, but not vice versa) instead of just the aliases also resolves the issue.

All of my machines, except my recently-reinstalled VM, had the old xorg-fonts-alias explicitly installed from base, even though nothing depended on it.