YottaDB / YDB

Mirrored from https://gitlab.com/YottaDB/DB/YDB

Handle edge case with multiple kill -9s on online rollback processes #192

Closed (nars1 closed 6 years ago)

nars1 commented 6 years ago

This is a debug-only issue. It requires kill -9s of two online rollback processes while they are in specific functions (one in mur_process_intrpt_recov() and one in wcs_recover()), after which a third online rollback fails an assert.

While the actual fix for the assert failure is a modification of the assert in mutex_salvage() (sr_unix/mutex.c), an additional change was made here to distinguish a kill -9 of a wcs_recover() from a kill -9 of a commit, both of which result in early_tn != curr_tn. The salvage part needs to happen only for the commit cleanup. Not sure whether this change has implications for pro builds (i.e. whether it is user visible), but not spending any more time on this kill -9 use case now since kill -9s are not officially supported anyway.

Below is the test case that demonstrates the issue.

> ver v63003a_r120 d
> source start.csh

> gdb $ydb_dist/mupip
(gdb) b mur_process_intrpt_recov
Function "mur_process_intrpt_recov" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mur_process_intrpt_recov) pending.
(gdb) r journal -rollback -online -back -lost=x.los "*" -resync=19
Starting program: /usr/library/R120/dbg/mupip journal -rollback -online -back -lost=x.los "*" -resync=19
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
%YDB-I-MUJNLSTAT, Initial processing started at Wed Apr  4 10:58:17 2018
%YDB-I-MUJPOOLRNDWNSUC, Jnlpool section (id = 1292697632) belonging to the replication instance /extra1/testarea1/nars/test/temp/tmp/tmp/mumps.repl successfully rundown
%YDB-I-ORLBKSTART, ONLINE ROLLBACK started on instance INSTA corresponding to /extra1/testarea1/nars/test/temp/tmp/tmp/mumps.repl
%YDB-I-MUJNLSTAT, Backward processing started at Wed Apr  4 10:58:18 2018
%YDB-I-RESOLVESEQNO, Resolving until sequence number 19 [0x0000000000000013]
%YDB-I-MUJNLSTAT, Before image applying started at Wed Apr  4 10:58:18 2018
%YDB-I-ORLBKNOSTP, ONLINE ROLLBACK proceeding with database updates. MUPIP STOP will no longer be allowed
Breakpoint 1, mur_process_intrpt_recov () at /Distrib/YottaDB/R120/sr_port/mur_process_intrpt_recov.c:56
56      {
(gdb) b 171
Breakpoint 2 at 0x7ffff6cf0792: file /Distrib/YottaDB/R120/sr_port/mur_process_intrpt_recov.c, line 171.
(gdb) cont
Continuing.
Breakpoint 2, mur_process_intrpt_recov () at /Distrib/YottaDB/R120/sr_port/mur_process_intrpt_recov.c:171
171                     csd->turn_around_point = TRUE;
(gdb) quit
A debugging session is active.
        Inferior 1 [process 19506] will be killed.
Quit anyway? (y or n) y

> gdb $ydb_dist/mupip
(gdb) b wcs_recover
Function "wcs_recover" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (wcs_recover) pending.
(gdb) r journal -rollback -online -back -lost=x.los "*"
Starting program: /usr/library/R120/dbg/mupip journal -rollback -online -back -lost=x.los "*"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
%YDB-I-MUJNLSTAT, Initial processing started at Wed Apr  4 11:04:29 2018
%YDB-I-ORLBKSTART, ONLINE ROLLBACK started on instance INSTA corresponding to /extra1/testarea1/nars/test/temp/tmp/tmp/mumps.repl
Breakpoint 1, wcs_recover (reg=0x62c170) at /Distrib/YottaDB/R120/sr_port/wcs_recover.c:115
115     {
(gdb) b 279
Breakpoint 2 at 0x7ffff6e15114: file /Distrib/YottaDB/R120/sr_port/wcs_recover.c, line 279.
(gdb) cont
Continuing.
Breakpoint 2, wcs_recover (reg=0x62c170) at /Distrib/YottaDB/R120/sr_port/wcs_recover.c:279
279             for (cr = cr_lo, total_rip_wait = 0; cr < cr_hi; cr++, buffptr += blk_size)
(gdb) quit
A debugging session is active.
        Inferior 1 [process 20437] will be killed.
Quit anyway? (y or n) y

> $ydb_dist/mupip journal -rollback -online -back -lost=x.los "*"
%YDB-I-MUJNLSTAT, Initial processing started at Wed Apr  4 11:05:47 2018
%YDB-I-ORLBKSTART, ONLINE ROLLBACK started on instance INSTA corresponding to /extra1/testarea1/nars/test/temp/tmp/tmp/mumps.repl
%YDB-F-ASSERT, Assert failed in /Distrib/YottaDB/R120/sr_unix/mutex.c line 1068 for expression (cnl->update_underway_tn <= csd->trans_hist.curr_tn)
%YDB-F-ASSERT, Assert failed in /Distrib/YottaDB/R120/sr_unix/grab_crit.c line 106 for expression (0 == crit_count)
%YDB-F-NOCHLEFT, Unhandled condition exception (all handlers exhausted) - process terminating

> cat start.csh
setenv gtm_repl_instance mumps.repl
setenv ydb_gbldir mumps.gld
rm -f mumps.gld *.dat *.mjl* *.log* *.repl
gde exit
mupip create
mupip replicate -instance -name=INSTA
mupip set -replication=on -reg "*"
@ port = 5001
mupip replic -source -start -secondary=${HOST}:$port -log=source.log -buf=1 -instsecondary=INSTB
mumps -run ^%XCMD 'for i=1:1:10 set ^x=i'
sleep 2
mumps -run ^%XCMD 'for i=1:1:10 set ^x=i'
@ srcsrvrpid = `grep Pid source.log | sed 's/.*Pid \[//;s/]//g' | awk '{print $1}'`