OpenClovis / SAFplus-Availability-Scalability-Platform

Middleware that provides libraries, GUI, and code generator to design multi-node (clustered) applications that are highly available, redundant, and scalable. Provides sub-second node and application fault detection and failover, and useful application libraries including distributed hash tables (checkpoint), event, logging, and communications. Implements SA-Forum APIs where applicable. Used anywhere reliability is a must -- like telecom, wireless, defense and enterprise computing. Download stable release with installer from: ftp.openclovis.com
www.openclovis.com
GNU General Public License v2.0
19 stars 13 forks source link

System crash after calling clLogWriteAsync() #118

Open hungta opened 10 years ago

hungta commented 10 years ago

For code in detail, please refer to bicTests model of SAFplus_Test, testcase BIC-LOG-WRT.TC011.

GDB output:

root@clovis20:~/safplus# gdb bin/TestLogClient var/run/core GNU gdb (GDB) 7.5.91.20130417-cvs-ubuntu Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /root/safplus/bin/TestLogClient...done. [New LWP 27573] [New LWP 27627] [New LWP 27323] [New LWP 27411] [New LWP 27412] [New LWP 27413] [New LWP 27414] [New LWP 27415] [New LWP 27416] [New LWP 27420] [New LWP 27421] [New LWP 27543] [New LWP 27544] [New LWP 27545] [New LWP 27546] [New LWP 27571] [New LWP 27572]

warning: Can't read pathname for load map: Input/output error. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff71bfe000 Core was generated by `/root/safplus/bin/TestLogClient'. Program terminated with signal 6, Aborted.

0 0x00007f3307988037 in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) thr a a bt

Thread 17 (Thread 0x7f330759b700 (LWP 27572)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x11e6e38, mutexId=0x11e6e08, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x11e6e38, mutexId=mutexId@entry=0x11e6e08,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 16 (Thread 0x7f33075cc700 (LWP 27571)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x11e6e38, mutexId=0x11e6e08, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x11e6e38, mutexId=mutexId@entry=0x11e6e08,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 15 (Thread 0x7f33075fd700 (LWP 27546)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x10c2e98, mutexId=0x10c2e68, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x10c2e98, mutexId=mutexId@entry=0x10c2e68,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 14 (Thread 0x7f330762e700 (LWP 27545)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x10c2e98, mutexId=0x10c2e68, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x10c2e98, mutexId=mutexId@entry=0x10c2e68,

---Type to continue, or q to quit--- timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 13 (Thread 0x7f330765f700 (LWP 27544)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x10c2e98, mutexId=0x10c2e68, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x10c2e98, mutexId=mutexId@entry=0x10c2e68,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 12 (Thread 0x7f3307690700 (LWP 27543)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x10c2e98, mutexId=0x10c2e68, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x10c2e98, mutexId=mutexId@entry=0x10c2e68,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 11 (Thread 0x7f33076c1700 (LWP 27421)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x10c2e98, mutexId=0x10c2e68, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x10c2e98, mutexId=mutexId@entry=0x10c2e68,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

---Type to continue, or q to quit---

Thread 10 (Thread 0x7f33076f2700 (LWP 27420)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x10c2e98, mutexId=0x10c2e68, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x10c2e98, mutexId=mutexId@entry=0x10c2e68,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 9 (Thread 0x7f3307737700 (LWP 27416)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x11e9008, mutexId=0x11e8fd8, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x11e9008, mutexId=mutexId@entry=0x11e8fd8,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 8 (Thread 0x7f3309ffe700 (LWP 27415)):

0 0x00007f3307a3e3cd in poll () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007f330773ae01 in poll (timeout=-1, nfds=1, __fds=0x7f3309fedb58)

at /usr/include/x86_64-linux-gnu/bits/poll2.h:46

2 xportRecv (commPort=18674200, pRecvOption=0x7f3309ffdc98, pBuffer=0x7f3309fedbb8 "\001\021", bufSize=,

message=0x103ad10, pRecvParam=0x7f3309ffdce8) at clTipc.c:543

3 0x00007f330992e14c in transportRecv (xport=0x10c2f88, commPort=18674200, pRecvOption=0x7f3309ffdc98, buffer=0x0,

bufSize=0, message=0x103ad10, pRecvParam=0x7f3309ffdce8) at clTransport.c:2852

4 0x00007f33099340aa in clTransportRecv (type=, commPort=, pRecvOption=,

buffer=<optimized out>, bufSize=<optimized out>, message=<optimized out>, pRecvParam=0x7f3309ffdce8)
at clTransport.c:2883

5 0x00007f3309927c94 in clIocReceiveAsync (commPortHdl=, pRecvOption=pRecvOption@entry=0x7f3309ffdcd0,

message=<optimized out>, pRecvParam=pRecvParam@entry=0x7f3309ffdce8) at clIocUserApi.c:2938

6 0x00007f33098e8d05 in clEoIocRecvQueueProcess (pThis=0x11c7b08) at eo.c:3822

---Type to continue, or q to quit---

7 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

8 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

9 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 7 (Thread 0x7f330a02f700 (LWP 27414)):

0 0x00007f3308958ca4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997ed3d in cosPosixCondWait (conditionId=0x7f3309de2fc8 <gEoStaticQueue+8>,

mutexId=0x7f3309de2ff8 <gEoStaticQueue+56>, timer=...) at posix/clCommonCos.c:1553

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x7f3309de2fc8 <gEoStaticQueue+8>,

mutexId=mutexId@entry=0x7f3309de2ff8 <gEoStaticQueue+56>, timer=timer@entry=...) at osal.c:674

3 0x00007f33098e2ad1 in eoStaticQueueProcess (pThis=0x11c7b08) at eo.c:586

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 6 (Thread 0x7f330a060700 (LWP 27413)):

0 0x00007f3308958ca4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997ed3d in cosPosixCondWait (conditionId=0x7f3309e4a300 , mutexId=0x7f3309e4a2d0 ,

timer=...) at posix/clCommonCos.c:1553

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x7f3309e4a300 ,

mutexId=mutexId@entry=0x7f3309e4a2d0 <gLogMutex>, timer=timer@entry=...) at osal.c:674

3 0x00007f33099b05ac in clLogDeferredHandler (pData=) at clLogUtils.c:498

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7f330a091700 (LWP 27412)):

0 0x00007f3307a4b4d3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007f33099306c4 in transportListener (ctxt=0x7f3309c45a20 ) at clTransport.c:890

2 0x00007f33099b5f03 in clTaskPoolEntry (pArg=) at clTaskPool.c:282

3 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

4 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

5 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7f330a0f1700 (LWP 27411)): ---Type to continue, or q to quit---

0 0x00007f330895c43d in nanosleep () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f3309979f5a in cosPosixTaskDelay (timer=...) at posix/clCommonCos.c:807

2 0x00007f33099882d2 in clOsalTaskDelay (timer=timer@entry=...) at osal.c:271

3 0x00007f330999be09 in clTimerTask (pArg=) at clTimerTree.c:1362

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f330a198740 (LWP 27323)):

0 0x00007f3307a43233 in select () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007f330853e422 in SAFplus::App::dispatchForever (this=this@entry=0x7fff71b2a470) at clApp.cxx:238

2 0x0000000000404021 in main (argc=, argv=) at main.cpp:64

Thread 2 (Thread 0x7f3307539700 (LWP 27627)):

0 0x00007f330895905e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

1 0x00007f330997efa6 in cosPosixCondWait (conditionId=0x11e6e38, mutexId=0x11e6e08, timer=...) at posix/clCommonCos.c:1571

2 0x00007f330998bdfa in clOsalCondWait (conditionId=conditionId@entry=0x11e6e38, mutexId=mutexId@entry=0x11e6e08,

timer=timer@entry=...) at osal.c:674

3 0x00007f33099b5dee in clTaskPoolEntry (pArg=) at clTaskPool.c:195

4 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

5 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

6 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f330756a700 (LWP 27573)):

0 0x00007f3307988037 in raise () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007f330798b698 in abort () from /lib/x86_64-linux-gnu/libc.so.6

2 0x00007f3307980e03 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

3 0x00007f3307980eb2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6

4 0x00007f330994ca3c in clLogClntStreamWriteWithHeader (pClntEoEntry=0x11ea738, severity=severity@entry=CL_LOG_SEV_DEBUG,

serviceId=serviceId@entry=10, msgId=msgId@entry=1, 
pMsgHeader=pMsgHeader@entry=0x7f3307567e88 "Tue Jan  7 09:58:41.447 2014 (Worker0.27323 : LogClientI0.---.---", 
args=args@entry=0x7f3307567e70, hClntStreamNode=0x7f32e0001568) at clLogClientStream.c:787

5 0x00007f330993e459 in clLogVWriteAsyncWithHeader (hStream=hStream@entry=9156754310955014,

severity=severity@entry=CL_LOG_SEV_DEBUG, serviceId=serviceId@entry=10, msgId=msgId@entry=1, 
pMsgHeader=pMsgHeader@entry=0x7f3307567e88 "Tue Jan  7 09:58:41.447 2014 (Worker0.27323 : LogClientI0.---.---", 

---Type to continue, or q to quit--- args=args@entry=0x7f3307567e70) at clLogClient.c:897

6 0x00007f330993ecc1 in clLogWriteAsync (hStream=9156754310955014, logSeverity=CL_LOG_SEV_DEBUG,

serviceId=<optimized out>, msgId=<optimized out>) at clLogClient.c:746

7 0x0000000000425455 in clTcAsciiWrite (hStream=9156754310955014, severity=severity@entry=CL_LOG_SEV_DEBUG,

count=count@entry=100) at clTestLogWrite.c:31

8 0x000000000042931b in TC7_clLogWrite () at clTestLogWrite.c:675

9 0x000000000042d1b8 in clTcLogWriteTest () at clTestLogWrite.c:1217

10 0x0000000000407ea1 in clLogTestMain () at clTestLog.c:118

11 0x000000000042e725 in DoWork (this=, ws=) at main.cpp:42

12 MyApp::DoWork (this=, ws=) at main.cpp:36

13 0x00007f330853d668 in SAFplus::WorkAssignmentThreadEntry (w=0x7fff71b2a5d8) at clApp.cxx:431

14 0x00007f3309977ddb in cosPosixTaskWrapper (pArgument=) at posix/clCommonCos.c:953

15 0x00007f3308954f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

16 0x00007f3307a4ae1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

(gdb)

hoangle commented 10 years ago

Hi a Hung,

I think the fixed is at https://github.com/OpenClovis/SAFplus-Availability-Scalability-Platform/commit/e533c8c434c48b045e5f88b10dd8dfe78cf2913f

Please apply and redo the testcase.

Hoang

hungta commented 10 years ago

I tried but it didn't help. Suresh should take a look at this.

hungta commented 10 years ago

After analyzing code of clLogClntStreamWriteWithHeader and test case, I found that the test code causes the issue by inputting parameters of log stream not valid with the assert check of clLogClntStreamWriteWithHeader:

User test code:

streamAttr.fileName = (ClCharT_) clHeapCalloc(1, strlen(TC_LOG_FILE)+1); strcpy(streamAttr.fileName, TC_LOGFILE); streamAttr.fileLocation = (ClCharT) clHeapCalloc(1, strlen(".:/tmp")+1); strcpy(streamAttr.fileLocation,".:/tmp"); streamAttr.fileUnitSize = 0; streamAttr.recordSize = 5120; <<== THIS IS THE ISSUE streamAttr.haProperty = 1; streamAttr.fileFullAction = CL_LOG_FILE_FULL_ACTION_ROTATE; streamAttr.maxFilesRotated = 5; streamAttr.flushFreq = 1; streamAttr.flushInterval = 1000; streamAttr.waterMark.lowLimit = 2; streamAttr.waterMark.highLimit = 100;

While clLogClntStreamWriteWithHeader():

CL_ASSERT(recordSize < 4*1024); // Sanity check the log record size I think this is the exact place where crash occurred.

After adjust the record size of streamAttr to a less value e.g streamAttr.recordSize = 512, I don't see the crash anymore.

But the output of the GDB seems making us to misunderstand where the issue is:

CL_ASSERT(clLogClientStreamMutexUnlock(pClntData) == CL_OK); <-- THIS IS WHERE THE GDB OUTPUT SHOW.

Anyway, despite this issue is from invalid parameters input, but it doesn't make sense for system to crash; should an error code be returned instead?

sureshgajjala commented 10 years ago

This is sanity check added by andrew after taking the normal page size in linux system into consideration. This assert is added to find the root cause immediately if any invalid record size is provided as input.

To make sure GDB bt is pointing to wrong location in Code we can add "recordSize = 5000" before "CL_ASSERT(recordSize < 4*1024);" and bring-up the system. Then we can check whether the core dump back trace is correct or not.