ddugovic / capablanca

An Internet Chess Server supporting Chess variants (based on Lasker-2.2.3)
http://hgm.nubati.net/cgi-bin/gitweb.cgi?p=capablanca.git;a=summary
GNU General Public License v2.0

Crippling bug (making the ICS segfault) #4

Closed: HGMuller closed this issue 7 years ago

HGMuller commented 7 years ago

Note that I found a crippling bug which, after years of problem-free running, suddenly made the ICS I operate completely unstable: since October it has typically crashed within the hour. The culprit seemed to be a test for bughouse (gameproc.c, line 348), which had to be reversed (if(gl != -1) instead of if(gl == -1)). As it was, the code would, at the end of each game, access the partner game of a bughouse match only when no such partner game existed. This led to an out-of-bounds access on game_globals.garray, and the retrieved data would then be used as an array index again (probably what caused the segfault).
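To illustrate the inverted test, here is a minimal sketch of the guard. It is not the code from gameproc.c: `struct game`, its `link` and `white` fields, `NO_PARTNER_GAME` and `partner_game` are hypothetical names, and only `gl`, the `-1` sentinel and the idea of indexing an array like game_globals.garray come from the description above.

```c
#include <stddef.h>

#define NO_PARTNER_GAME (-1)  /* hypothetical name for the -1 sentinel */

/* Hypothetical, simplified stand-in for one entry of game_globals.garray. */
struct game {
    int link;   /* assumed: index of the bughouse partner game, -1 if none */
    int white;  /* stand-in for data that is later used as an index again */
};

/*
 * Return the partner game of games[g], or NULL when there is none.
 * The guard must be `gl != -1`: only touch the partner when it exists.
 * With the inverted test (`gl == -1`) the code would read games[-1],
 * which is the out-of-bounds access described above.
 */
static const struct game *partner_game(const struct game *games,
                                       size_t count, int g)
{
    if (g < 0 || (size_t)g >= count)
        return NULL;

    int gl = games[g].link;
    if (gl != NO_PARTNER_GAME && gl >= 0 && (size_t)gl < count)
        return &games[gl];

    return NULL;
}
```

The extra range check on `gl` is a defensive addition in this sketch, so that a corrupted link cannot be followed either.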

Also note that the ICS catches its own segfaults, but the SIGSEGV handler called a diagnostic tool, 'backtrace', that no longer seems to exist, so nothing helpful ended up in the segv_PID files it created. To be able to find the above bug I changed the handler to call 'gcore' to create a core dump. It is not clear to me why such a core dump was not created automatically (it wasn't, despite my setting the 'ulimit' to unlimited). Monitoring the ICS process with gdb from another terminal (after disabling the ICS SIGSEGV handler) did not work either. But post-mortem analysis of the dump created by gcore worked fine, and revealed the offending code.
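For reference, a handler that calls gcore could look roughly like the sketch below. This is not the handler from the ICS source, only one possible way to do it: the path `/usr/bin/gcore`, the `segv_core` file prefix and all function names are assumptions. The arguments are formatted before any crash happens, so that the handler itself only uses async-signal-safe calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define GCORE_PATH "/usr/bin/gcore"  /* assumption: where gcore is installed */

static char gcore_name[] = "gcore";
static char gcore_opt_o[] = "-o";
static char gcore_prefix[] = "segv_core";  /* hypothetical output prefix */
static char gcore_pid[32];
static char *gcore_argv[5];

/* Build "gcore -o segv_core <pid>" ahead of time, outside the handler. */
static void prepare_gcore_args(pid_t pid)
{
    snprintf(gcore_pid, sizeof gcore_pid, "%ld", (long)pid);
    gcore_argv[0] = gcore_name;
    gcore_argv[1] = gcore_opt_o;
    gcore_argv[2] = gcore_prefix;
    gcore_argv[3] = gcore_pid;
    gcore_argv[4] = NULL;
}

/* On SIGSEGV: let a child run gcore on us, then die with the default action. */
static void segv_handler(int sig)
{
    pid_t child = fork();
    if (child == 0) {
        execv(GCORE_PATH, gcore_argv);
        _exit(127);  /* exec failed */
    }
    if (child > 0) {
        int status;
        while (waitpid(child, &status, 0) < 0 && errno == EINTR)
            ;  /* retry if interrupted */
    }
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_segv_handler(void)
{
    struct sigaction sa;

    prepare_gcore_args(getpid());
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

With these assumptions gcore writes a file named segv_core.PID in the current directory of the process, which can then be opened in gdb together with the executable for post-mortem analysis.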

[Edit] Oh, I see that you removed the SIGSEGV handler altogether. I suppose that if it produces core dumps spontaneously, this is fine. For me it didn't seem to do that, however.
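One thing that can be checked from inside the process is the core file size limit, which is what 'ulimit -c' controls from the shell. The sketch below raises the soft limit to the hard limit at startup; `raise_core_limit` is a hypothetical helper, not something that exists in the ICS source. Even with the limit raised, where a dump ends up still depends on the system's core pattern setting, so this alone may not explain a missing core file.

```c
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;  /* an unprivileged process may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```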

ddugovic commented 7 years ago

I'll review #2 and see if I can safely introduce a SIGSEGV handler.

[Edit] ... or see whether I can produce core dumps spontaneously.

ddugovic commented 7 years ago

On my system core dumps spontaneously occur (and if I install systemd-coredump I can access them):

$ coredumpctl list chessd
TIME                            PID   UID   GID SIG PRESENT EXE
Wed 2017-01-18 07:36:46 CST     663  1009  1010  11 * /usr/local/chessd/bin/chessd
Wed 2017-01-18 07:49:28 CST    3253  1009  1010  11 * /usr/local/chessd/bin/chessd

$ coredumpctl info 3253
           PID: 3253 (chessd)
           UID: 1009 (chessd)
           GID: 1010 (chessd)
        Signal: 11 (SEGV)
     Timestamp: Wed 2017-01-18 07:49:27 CST (1min 44s ago)
  Command Line: /usr/local/chessd/bin/chessd -p 5000 -f -T /home/chessd/capablanca/timeseal/timeseal_decoder-Linux-ELF-2.4
    Executable: /usr/local/chessd/bin/chessd
 Control Group: /system.slice/chessd.service
          Unit: chessd.service
         Slice: system.slice
       Boot ID: d94a87a0cd7b4990b445b406cd932e2c
    Machine ID: 472e67eae01c4e6cb0653b49973c8ffe
      Hostname: dugovic-host
      Coredump: /var/lib/systemd/coredump/core.chessd.1009.d94a87a0cd7b4990b445b406cd932e2c.3253.1484747367000000000000.xz
       Message: Process 3253 (chessd) of user 1009 dumped core.

                Stack trace of thread 3253:
                #0  0x00000000f74a65b1 move_calculate (chessd.so)
                #1  0x00000000f74a7d1c parse_move (chessd.so)
                #2  0x00000000f7496fc4 process_move (chessd.so)
                #3  0x00000000f7481069 process_prompt (chessd.so)
                #4  0x00000000f74813b6 process_input (chessd.so)
                #5  0x00000000f74ac781 select_loop (chessd.so)
                #6  0x0000000008048cf8 main_event_loop (chessd)
                #7  0x00000000080491f1 main (chessd)
                #8  0x00000000f7527637 __libc_start_main (libc.so.6)
                #9  0x0000000008048a91 _start (chessd)

ddugovic commented 7 years ago

Now I can patch the crippling bug.