epics-extensions / ca-gateway

Channel Access PV Gateway
http://www.aps.anl.gov/epics/extensions/gateway/

crash #14

Closed MichaelRitzert closed 5 years ago

MichaelRitzert commented 7 years ago

I frequently see the gateway executable crash in our system. Unfortunately, I cannot see a pattern in when exactly it crashes. I enabled core dumps and finally caught a crash as it happened:

(gdb) bt
#0  0x00007f67e7a3ed26 in assertIdenticalMutex (this=0x0, guard=..., chan=..., sidIn=4294967295, typeIn=65535, countIn=0) at ../../../include/epicsGuard.h:81
#1  tcpiiu::installChannel (this=0x0, guard=..., chan=..., sidIn=4294967295, typeIn=65535, countIn=0) at ../tcpiiu.cpp:1911
#2  0x00007f67e7a2c2bb in cac::transferChanToVirtCircuit (this=<value optimized out>, cid=<value optimized out>, sid=4294967295, typeCode=65535, count=0, minorVersionNumber=13, 
    addr=..., currentTime=...) at ../cac.cpp:639
#3  0x00007f67e7a3a4a0 in udpiiu::searchRespAction (this=<value optimized out>, msg=<value optimized out>, addr=<value optimized out>, currentTime=<value optimized out>)
    at ../udpiiu.cpp:690
#4  0x00007f67e7a3a5c2 in udpiiu::postMsg (this=0x242d760, net_addr=..., pInBuf=<value optimized out>, blockSize=48, currentTime=...) at ../udpiiu.cpp:857
#5  0x00007f67e7a3c681 in udpRecvThread::run (this=0x243db88) at ../udpiiu.cpp:394
#6  0x00007f67e77dd249 in epicsThreadCallEntryPoint (pPvt=0x243dba8) at ../../../src/libCom/osi/epicsThread.cpp:83

Can you make any sense of this?

ralphlange commented 7 years ago

This is more an EPICS (Channel Access Server) issue than a Gateway issue, but nonetheless:

this=0x0, sidIn=4294967295, typeIn=65535 are obviously invalid parameters that are NULL or all-bits-one. That is definitely not a valid tcpiiu object that the channel is transferred to.
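To make the "all-bits-one" observation concrete, here is a minimal sketch that checks the two values from frame #1 of the backtrace against the maximum 32-bit and 16-bit unsigned values. The constant and function names are made up for illustration, and the assumption that `sidIn` is 32 bits wide and `typeIn` 16 bits wide is inferred only from the printed values, not from the EPICS headers.

```cpp
#include <cstdint>
#include <limits>

// Values copied from frame #1 of the backtrace above.
constexpr std::uint64_t kSidIn = 4294967295ULL;
constexpr std::uint64_t kTypeIn = 65535ULL;

// True if v equals the largest 32-bit unsigned value (0xFFFFFFFF).
constexpr bool isAllOnes32(std::uint64_t v) {
    return v == std::numeric_limits<std::uint32_t>::max();
}

// True if v equals the largest 16-bit unsigned value (0xFFFF).
constexpr bool isAllOnes16(std::uint64_t v) {
    return v == std::numeric_limits<std::uint16_t>::max();
}

// Assumed widths: 32 bits for sidIn, 16 bits for typeIn.
static_assert(isAllOnes32(kSidIn), "sidIn is 0xFFFFFFFF");
static_assert(isAllOnes16(kTypeIn), "typeIn is 0xFFFF");
```

Both checks pass at compile time, which is consistent with these fields holding "unset" sentinel values rather than data from a real search response.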

Is this happening in the middle of operations, or at a specific Gateway moment (start up, shut down, heavy client start-up)? Which version of the Gateway and - more importantly - which version of Base are you using?

MichaelRitzert commented 7 years ago

EPICS base is 3.14.12.6, gateway 2.1.0.

It happens at random times, random intervals. No pattern at all.

This time, the user the gateway is running under had in the meantime reached its nproc ulimit. Since this led to all sorts of problems all over the system that I haven't observed before, I doubt it is the (only) cause.

ralphlange commented 7 years ago

Huh. Interesting. That could probably be tested with a low nproc setting.
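One way such a test could be set up is sketched below: a small helper lowers the soft `RLIMIT_NPROC` of the calling process, so that a program started from it afterwards inherits the reduced limit. The function names are hypothetical, and `RLIMIT_NPROC` is assumed to be available, as it is on Linux. Nothing here comes from the Gateway or EPICS Base sources.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_NPROC
#include <algorithm>

// Lowers the soft RLIMIT_NPROC of the calling process to at most
// `requested`, clamped to the hard limit. Returns true on success.
bool lowerNprocSoftLimit(rlim_t requested) {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) {
        return false;
    }
    lim.rlim_cur = std::min(requested, lim.rlim_max);
    return setrlimit(RLIMIT_NPROC, &lim) == 0;
}

// Reads the current soft RLIMIT_NPROC into `out`.
// Returns false if the query fails.
bool currentNprocSoftLimit(rlim_t& out) {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) {
        return false;
    }
    out = lim.rlim_cur;
    return true;
}
```

A launcher could call `lowerNprocSoftLimit` with a small value and then exec the gateway binary, because resource limits are inherited across `exec`. Whether this reproduces the crash is an open question in this thread.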

MichaelRitzert commented 7 years ago

Come to think of it, we have the gateway deployed on two PCs. The second has a much lower process count (essentially only the gateway), and also sees the occasional crash. It might of course be a different problem.

anjohnson commented 7 years ago

Actually this appears to be a CA Client library issue, not CAS; the failure is in cac.cpp. It makes more sense that the crash happens in the client code when the user's nproc limit has been reached: the client library may be in the middle of setting up a new CA client connection to an IOC it isn't already talking to, for which it has to start a pair of new threads. IIRC the CAS code is still single-threaded, so it doesn't create new threads for itself after start-up.

I don't know the CA Client code very well myself, but we can ask someone who does to try to replicate the problem. I just filed a bug report for this on Launchpad against the Base-3.14 branch which is where we manage CA Client bugs.

ralphlange commented 7 years ago

Well spotted. I am so used to CAS being the culprit in Gateway matters...

MichaelRitzert commented 7 years ago

OK, I have another crash, this time no other circumstances involved (it's on the second PC, no problems with ulimit for sure), just regular operation of the system. It's in another place, but I'm adding it here, because it also has udpiiu in it:

Program terminated with signal 11, Segmentation fault.
#0  add (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, minorVersionNumber=<value optimized out>, addr=..., 
    currentTime=...) at ../../../include/tsDLList.h:322
322             lastNode.pNext = &item;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64
(gdb) bt
#0  add (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, minorVersionNumber=<value optimized out>, addr=..., 
    currentTime=...) at ../../../include/tsDLList.h:322
#1  cac::transferChanToVirtCircuit (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, 
    minorVersionNumber=<value optimized out>, addr=..., currentTime=...) at ../cac.cpp:616
#2  0x00007f434cc014a0 in udpiiu::searchRespAction (this=<value optimized out>, msg=<value optimized out>, addr=<value optimized out>, currentTime=<value optimized out>)
    at ../udpiiu.cpp:690
#3  0x00007f434cc015c2 in udpiiu::postMsg (this=0xe7cc80, net_addr=..., pInBuf=<value optimized out>, blockSize=24, currentTime=...) at ../udpiiu.cpp:857
#4  0x00007f434cc03681 in udpRecvThread::run (this=0xe8d0a8) at ../udpiiu.cpp:394
#5  0x00007f434c9a4249 in epicsThreadCallEntryPoint (pPvt=0xe8d0c8) at ../../../src/libCom/osi/epicsThread.cpp:83
#6  0x00007f434c9aaed3 in start_routine (arg=0xe8d340) at ../../../src/libCom/osi/os/posix/osdThread.c:389
#7  0x00000030ada07aa1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000030ad2e8aad in clone () from /lib64/libc.so.6

I also have another core still to be examined.

ralphlange commented 7 years ago

Can we move the discussion to https://bugs.launchpad.net/epics-base/+bug/1664302 until we have proof that it really is a Gateway issue? (I already added the new backtrace there.)

MichaelRitzert commented 7 years ago

Thank you. I should/will update my bookmark.

anjohnson commented 5 years ago

The Launchpad CAS bug linked from comments above is marked Fix Released. Can this gateway bug be closed now?

ralphlange commented 5 years ago

Yep. Closing it.