Hedgewars/hwengine 0.9.23 crashes on exit

hobbes1069 commented 6 years ago

I don't know if this problem is directly related to libglvnd or not but I'm guessing someone here can read the stack trace and give me some hints. I'm not sure which distros have implemented libglvnd but the crash is only being reported on Fedora and Hedgewars upstream hasn't been able to figure out what's causing it.

The crash is with hwengine in 0.9.23 which uses SDL2, prior to that release SDL 1 was used.

Starting program: /usr/bin/hwengine test.53.hwd
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff7f9b700 (LWP 2779)]
[New Thread 0x7fffd4f22700 (LWP 2780)]
[Thread 0x7fffd4f22700 (LWP 2780) exited]
[Thread 0x7ffff7f9b700 (LWP 2779) exited]

Thread 1 "hwengine" received signal SIGSEGV, Segmentation fault.
__GI__dl_catch_error (objname=0x28d2850, errstring=0x28d2858, 
    mallocedp=0x28d2848, operate=0x7ffff63b6060 <dlclose_doit>, args=0x2947b10)
    at dl-error-skeleton.c:187
187   c.objname = objname;

Thread 1 (Thread 0x7ffff7fcc740 (LWP 2760)):
#0  __GI__dl_catch_error (objname=0x28d2850, errstring=0x28d2858, 
    mallocedp=0x28d2848, operate=0x7ffff63b6060 <dlclose_doit>, args=0x2947b10)
    at dl-error-skeleton.c:187
        errcode = 0
        c = {objname = 0x400, errstring = 0x7ffff605b176 <_int_free+2390>, 
          malloced = 0x28d2848, errcode = 0x7fffffffd924, env = {{__jmpbuf = {
                281479271743489, 281479271743489, 281479271677953, 
                281479271743489, 281479271743489, 5721768570084979200, 
                281479271677953, 43303104}, __mask_was_saved = 0, 
              __saved_mask = {__val = {140737297179200, 43303112, 45847904, 0, 
                  5721768570084979200, 0, 43284752, 43284752, 140737297179200, 
                  0, 140737488345912, 140737297176248, 140737320989422, 0, 
                  43303104, 0}}}}}
        old = 0x2913040
#1  0x00007ffff63b6715 in _dlerror_run (
    operate=operate@entry=0x7ffff63b6060 <dlclose_doit>, args=0x2947b10)
    at dlerror.c:163
        result = 0x28d2840
#2  0x00007ffff63b608f in __dlclose (handle=<optimized out>) at dlclose.c:46
No locals.
#3  0x00007ffff47a4dee in CleanupVendorNameEntry (
    pEntry=pEntry@entry=0x2947910, unused=0x0) at libglxmapping.c:321
        vendor = 0x2947910
#4  0x00007ffff47a7361 in __glXMappingTeardown (doReset=doReset@entry=0)
    at libglxmapping.c:1061
        cur__GLXvendorNameHash = 0x2947910
        tmp__GLXvendorNameHash = 0x0
        pEntry = <optimized out>
        tmp = <optimized out>
#5  0x00007ffff479ffd7 in __glXFini () at libglx.c:2099
        glas = <optimized out>
#6  0x00007ffff7de65a3 in _dl_fini () at dl-fini.c:235
        array = 0x7ffff49ad6b0
        i = <optimized out>
        l = 0x28d2d80
        maps = 0x7fffffffdaa8
        i = <optimized out>
        l = <optimized out>
        nmaps = <optimized out>
        nloaded = <optimized out>
        ns = 0
        do_audit = <optimized out>
        __PRETTY_FUNCTION__ = "_dl_fini"
#7  0x00007ffff600bb88 in __run_exit_handlers (status=0, 
    listp=0x7ffff63af6f8 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:83
        atfct = <optimized out>
        onfct = <optimized out>
        cxafct = <optimized out>
        f = <optimized out>
#8  0x00007ffff600bbda in __GI_exit (status=<optimized out>) at exit.c:105
No locals.
#9  0x00007ffff5ff1041 in __libc_start_main (main=0x7fffffffde60, argc=0, 
    argv=0x0, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x0) at ../csu/libc-start.c:342
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {4312347, 140737488346736, 
                4219537, 0, 4212925, 140737320521786, 140737320521786, 2}, 
              mask_was_saved = -8344}}, priv = {pad = {0x200008000, 
              0x404894 <main_stub>, 0x0, 0x2f14a5126df18aa6}, data = {
              prev = 0x200008000, cleanup = 0x404894 <main_stub>, 
              canceltype = 0}}}
        not_first_call = <optimized out>
#10 0x00007fff00000000 in ?? ()
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.

kbrenneman commented 6 years ago

From that stack trace, it looks like it's crashing somewhere in a dlclose() call. It's probably not a bug in libglvnd itself (assuming you're using a relatively recent version). Ordinarily, I might suspect a bug in the vendor library, but it looks like it's not even getting that far.

It's not trying to call dlclose more than once, is it? I've seen some weird cases in Fedora and RHEL where DSO destructors get called multiple times or in unexpected orders, but I've never been able to figure out why.

nwnk commented 6 years ago

I'm not able to reproduce this on my F27 machine (with the 0.9.23 build of hedgewars from updates-testing). Is this specific to a particular configuration? I have both Mesa and NVIDIA's drivers installed but the GPU is an Intel Haswell.

hobbes1069 commented 6 years ago

Gwen Ciesla (limb) reported it first and I've been able to reproduce it on two F26 boxes with Nvidia and my wife's Core i5-8250U w/ intel graphics.

hobbes1069 commented 6 years ago

Does this valgrind output help at all?

==21760== 80 bytes in 1 blocks are possibly lost in loss record 285 of 541
==21760==    at 0x4C2EB6B: malloc (vg_replace_malloc.c:299)
==21760==    by 0x8DB1C66: RegisterStubCallbacks (GLdispatch.c:411)
==21760==    by 0x8DB1E9A: __glDispatchInit (GLdispatch.c:187)
==21760==    by 0x85F3E18: __glXInit (libglx.c:2030)
==21760==    by 0x4010D92: call_init (dl-init.c:72)
==21760==    by 0x4010D92: _dl_init (dl-init.c:120)
==21760==    by 0x4015CE9: dl_open_worker (dl-open.c:564)
==21760==    by 0x69B018E: _dl_catch_error (dl-error-skeleton.c:198)
==21760==    by 0x40151F8: _dl_open (dl-open.c:649)
==21760==    by 0x6657F25: dlopen_doit (dlopen.c:66)
==21760==    by 0x69B018E: _dl_catch_error (dl-error-skeleton.c:198)
==21760==    by 0x66586A4: _dlerror_run (dlerror.c:163)
==21760==    by 0x6657FB0: dlopen@@GLIBC_2.2.5 (dlopen.c:87)

besser82 commented 6 years ago

@hobbes1069 On the very top of that valgrind output, there must be some message about invalid read size… That's the actual useful one to trace this down…

hobbes1069 commented 6 years ago

Here's the full output: https://pastebin.com/yc8EJdn1

besser82 commented 6 years ago

There you can see the reason for the segfault:

==15181==
==15181== HEAP SUMMARY:
==15181==     in use at exit: 2,364,852 bytes in 1,245 blocks
==15181==   total heap usage: 76,248 allocs, 75,003 frees, 576,915,272 bytes allocated
==15181==
==15181== Searching for pointers to 1,245 not-freed blocks
==15181== Checked 44,733,872 bytes
==15181==
==15181== LEAK SUMMARY:
==15181==    definitely lost: 8,353 bytes in 5 blocks
==15181==    indirectly lost: 1,728 bytes in 38 blocks
==15181==      possibly lost: 287,471 bytes in 2 blocks
==15181==    still reachable: 2,067,300 bytes in 1,200 blocks
==15181==                       of which reachable via heuristic:
==15181==                         newarray           : 16 bytes in 1 blocks
==15181==                         multipleinheritance: 104 bytes in 1 blocks
==15181==         suppressed: 0 bytes in 0 blocks
==15181== Rerun with --leak-check=full to see details of leaked memory
==15181==
==15181== ERROR SUMMARY: 4 errors from 1 contexts (suppressed: 0 from 0)
==15181==
==15181== 4 errors in context 1 of 1:
==15181== Syscall param writev(vector[...]) points to uninitialised byte(s)
==15181==    at 0x67438E0: __writev_nocancel (syscall-template.S:84)
==15181==    by 0x9802F86: write_vec (xcb_conn.c:277)
==15181==    by 0x9802F86: _xcb_conn_wait (xcb_conn.c:522)
==15181==    by 0x9803384: _xcb_out_send (xcb_out.c:458)
==15181==    by 0x980340C: xcb_writev (xcb_out.c:406)
==15181==    by 0x9033205: _XSend (xcb_io.c:486)
==15181==    by 0x903372B: _XReply (xcb_io.c:573)
==15181==    by 0x901E1FE: XInternAtom (IntAtom.c:181)
==15181==    by 0x5767F5A: SetWindowBordered (SDL_x11window.c:349)
==15181==    by 0x5768EC2: X11_CreateWindow (SDL_x11window.c:521)
==15181==    by 0x575716C: SDL_CreateWindow_REAL (SDL_video.c:1475)
==15181==    by 0x575774A: ShouldUseTextureFramebuffer (SDL_video.c:220)
==15181==    by 0x575774A: SDL_VideoInit_REAL (SDL_video.c:544)
==15181==    by 0x56D1736: SDL_InitSubSystem_REAL (SDL.c:170)
==15181==  Address 0x87f1723 is 35 bytes inside a block of size 16,384 alloc'd
==15181==    at 0x4C30A1E: calloc (vg_replace_malloc.c:711)
==15181==    by 0x90232D5: XOpenDisplay (OpenDis.c:237)
==15181==    by 0x5766B7D: X11_CreateDevice (SDL_x11video.c:194)
==15181==    by 0x575762E: SDL_VideoInit_REAL (SDL_video.c:505)
==15181==    by 0x56D1736: SDL_InitSubSystem_REAL (SDL.c:170)
==15181==    by 0x40503C: ??? (in /usr/bin/hwengine)
==15181==    by 0x875356F: ???
==15181==    by 0x8752E2F: ???
==15181==    by 0x875356F: ???
==15181==    by 0x1FFEFFF90F: ???
==15181==    by 0x1FFEFFF7CF: ???
==15181==    by 0x1FF: ???
==15181==  Uninitialised value was created by a stack allocation
==15181==    at 0x902E960: XStoreColors (StColors.c:38)
==15181==
==15181== ERROR SUMMARY: 4 errors from 1 contexts (suppressed: 0 from 0)

The problem comes from the XServer…

hobbes1069 commented 6 years ago

I'm not a programmer, but I'm working with someone on the Fedora devel list and it has something to do with "not keeping %rsp 16-byte aligned"

hobbes1069 commented 6 years ago

Ok, looks like it's an FPC bug, but why does it seem to only happen on Fedora?

https://bugzilla.redhat.com/show_bug.cgi?id=1526848

NVIDIA / libglvnd

Hedgewars/hwengine 0.9.23 crashes on exit #141