OpenClovis / SAFplus-Availability-Scalability-Platform

Middleware that provides libraries, GUI, and code generator to design multi-node (clustered) applications that are highly available, redundant, and scalable. Provides sub-second node and application fault detection and failover, and useful application libraries including distributed hash tables (checkpoint), event, logging, and communications. Implements SA-Forum APIs where applicable. Used anywhere reliability is a must -- like telecom, wireless, defense and enterprise computing. Download stable release with installer from: ftp.openclovis.com
www.openclovis.com
GNU General Public License v2.0

AMF healthcheck timer delete should be async to avoid deadlocks #50

Closed karthick18 closed 11 years ago

karthick18 commented 11 years ago

The AMF healthcheck timer delete should use clTimerDeleteAsync instead of clTimerDelete, because the delete is done with the cpmMutex lock held, and the healthcheck timer callback grabs that same lock. Otherwise we can deadlock if the healthcheck timer delete and the timer callback fire at the same time: the synchronous clTimerDelete call waits for any running callbacks to finish, and the callback can't finish because the clTimerDelete context grabbed cpmMutex before trying to delete the healthcheck timer. One such deadlock occurrence:
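The lock ordering described above can be reproduced in a few lines of plain pthreads. This is only a sketch under assumptions: `sketch_timer_t`, `state_lock`, `timer_delete_sync_bounded`, `timer_delete_async` and `run_sketch` are hypothetical names invented for illustration, not SAFplus APIs, and `state_lock` merely plays the role of cpmMutex. The synchronous delete is bounded to a fixed number of polls so that the sketch terminates; an unbounded wait would hang at that point.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not SAFplus APIs. */
typedef struct {
    pthread_mutex_t lock;   /* protects the three flags below */
    bool callback_running;  /* the callback has been dispatched */
    bool delete_pending;    /* an async delete arrived while it ran */
    bool freed;             /* the timer has been released */
} sketch_timer_t;

/* Plays the role of cpmMutex: taken by the deleter and by the callback. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

static bool callback_is_running(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->lock);
    bool running = t->callback_running;
    pthread_mutex_unlock(&t->lock);
    return running;
}

/* Timer callback: marks itself running, then needs the shared lock. */
static void *timer_callback(void *arg)
{
    sketch_timer_t *t = arg;

    pthread_mutex_lock(&t->lock);
    t->callback_running = true;
    pthread_mutex_unlock(&t->lock);

    pthread_mutex_lock(&state_lock);    /* blocks while the deleter holds it */
    /* ... healthcheck work would go here ... */
    pthread_mutex_unlock(&state_lock);

    pthread_mutex_lock(&t->lock);
    t->callback_running = false;
    if (t->delete_pending)
        t->freed = true;                /* the deferred delete completes here */
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Synchronous delete: waits for the callback to finish. Bounded so the
 * sketch terminates. Returns true only if the timer was freed. */
static bool timer_delete_sync_bounded(sketch_timer_t *t, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        pthread_mutex_lock(&t->lock);
        if (!t->callback_running) {
            t->freed = true;
            pthread_mutex_unlock(&t->lock);
            return true;
        }
        pthread_mutex_unlock(&t->lock);
        sched_yield();
    }
    return false;
}

/* Asynchronous delete: never waits; a running callback frees the timer. */
static void timer_delete_async(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->lock);
    if (t->callback_running)
        t->delete_pending = true;
    else
        t->freed = true;
    pthread_mutex_unlock(&t->lock);
}

/* Returns 0 if the sync delete stalled under the lock and the async
 * delete let both sides finish. */
int run_sketch(void)
{
    sketch_timer_t t = { .callback_running = false };
    pthread_t cb;

    pthread_mutex_init(&t.lock, NULL);
    pthread_mutex_lock(&state_lock);            /* deleter holds the shared lock */
    int rc = pthread_create(&cb, NULL, timer_callback, &t);
    assert(rc == 0);
    (void)rc;
    while (!callback_is_running(&t))
        sched_yield();                          /* wait until the callback has fired */

    bool sync_ok = timer_delete_sync_bounded(&t, 1000);  /* cannot succeed here */
    timer_delete_async(&t);                     /* returns at once */
    pthread_mutex_unlock(&state_lock);          /* callback can now proceed */

    pthread_join(cb, NULL);
    pthread_mutex_destroy(&t.lock);
    return (!sync_ok && t.freed) ? 0 : 1;
}
```

Calling `run_sketch()` from a `main` (built with `-pthread`) should return 0: the bounded synchronous delete gives up because the callback is stuck behind `state_lock`, while the asynchronous delete only flags the timer and lets the callback release it once the lock is dropped.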

(gdb) thr 2
[Switching to thread 2 (Thread 0x7f4ff317f700 (LWP 16943))]
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
132 ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) p *mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 17356, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\000\000\000\000\314C\000\000\001", '\000' <repeats 26 times>, __align = 2}
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
#1  0x00007f4ff62d8065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f4ff62d7eba in __pthread_mutex_lock (mutex=0x1f3d780) at pthread_mutex_lock.c:61
#3  0x00007f4ff47efd8e in __cosMutexLock (mutexId=, verbose=) at posix/clLinux.c:329
#4  0x00007f4ff47fc136 in clOsalMutexLock (mutexId=0x1f3d778) at osal.c:532
#5  0x00007f4ff57afc3d in __clAmsMgmtEntityGetConfig (in=, out=0x7f4fd4005780, versionCode=327680,
    data=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at clAmsMgmtServerApi.c:3898
#6  0x00007f4ff4814bdc in clRmdInvoke (func=0x7f4ff57c6920 <_clAmsMgmtEntityGetConfig_5_0_0>, eoArg=0x0, inMsgHdl=0x7f4f88002d00, outMsgHdl=0x7f4fd4005780) at clRmdHandle.c:138
#7  0x00007f4ff470930c in clEoWalkWithVersion (pThis=pThis@entry=0x1eb1578, func=680, version=version@entry=0x7f4ff317e800, pFuncCallout=,
    inMsgHdl=inMsgHdl@entry=0x7f4f88002d00, outMsgHdl=0x7f4fd4005780) at eo.c:2427
#8  0x00007f4ff48181aa in rmdHandleSyncRequest (pThis=pThis@entry=0x1eb1578, pReq=pReq@entry=0x7f4ff317e960, srcAddr=srcAddr@entry=0x7f4ff317e940,
    priority=priority@entry=0 '\000', inMsgHdl=0x7f4f88002d00, protoType=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at clRmdRecv.c:761
#9  0x00007f4ff4818ae9 in clRmdReceiveRequest (pThis=0x1eb1578, rmdRecvMsg=0x7f4f88002d00, priority=0 '\000', protoType=, length=, srcAddr=...)
    at clRmdRecv.c:282
#10 0x00007f4ff470270c in clEoJobHandler (pJob=pJob@entry=0x7f4fd40008f8) at eo.c:3861
#11 0x00007f4ff4861392 in clTaskPoolEntry (pArg=) at clTaskPool.c:277
#12 0x00007f4ff47e31e5 in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:951
#13 0x00007f4ff62d5e9a in start_thread (arg=0x7f4ff317f700) at pthread_create.c:308
#14 0x00007f4ff3a95cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

(gdb) thr 28
[Switching to thread 28 (Thread 0x7f4ff1a41700 (LWP 17356))]
#0  0x00007f4ff62dd52d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
82 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f4ff62dd52d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f4ff47e5a5f in cosPosixTaskDelay (timer=...) at posix/clCommonCos.c:805
#2  0x00007f4ff47f8676 in clOsalTaskDelay (timer=timer@entry=...) at osal.c:270
#3  0x00007f4ff482110c in timerDeleteLocked (pTimer=pTimer@entry=0x7f4fc4035e18, pTimerHandle=pTimerHandle@entry=0x1f85508, asyncFlag=asyncFlag@entry=0,
    pFreeTimer=pFreeTimer@entry=0x7f4ff1a4076e) at clTimerTree.c:893
#4  0x00007f4ff4821f88 in timerDelete (pTimerHandle=0x1f85508, asyncFlag=asyncFlag@entry=0) at clTimerTree.c:943
#5  0x00007f4ff4823257 in clTimerDelete (pTimerHandle=) at clTimerTree.c:962
#6  0x00007f4ff54a81bf in cpmCompHealthcheckStop (pCompName=pCompName@entry=0x7f4fc008344c) at clCpmComponent.c:4986
#7  0x00007f4ff574d69f in clAmsPeCompAssignCSITimeout (timer=) at clAmsPolicyEngine.c:16279
#8  0x00007f4ff5714827 in clAmsEntityTimeout (timer=0x7f4fc0083a68) at clAmsEntities.c:4636
#9  0x00007f4ff482236a in clTimerCallbackTask (invocation=invocation@entry=0x7f4fbc00e878) at clTimerTree.c:1121
#10 0x00007f4ff4861392 in clTaskPoolEntry (pArg=) at clTaskPool.c:277
#11 0x00007f4ff47e31e5 in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:951
#12 0x00007f4ff62d5e9a in start_thread (arg=0x7f4ff1a41700) at pthread_create.c:308
#13 0x00007f4ff3a95cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()
(gdb)