Closed MichaelRitzert closed 5 years ago
This is more an EPICS (Channel Access Server) issue than a Gateway issue, but nonetheless:
this=0x0, sidIn=4294967295, typeIn=65535
are obviously invalid parameters that are NULL or all-bits-one. That is definitely not a valid tcpiiu
object that the channel is transferred to.
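To make the "all-bits-one" observation concrete, here is a minimal sketch that checks the two values against the maximum of an unsigned integer type. The helper name `allBitsSet` is hypothetical, and the field widths (32-bit for `sidIn`, 16-bit for `typeIn`) are an assumption inferred from the printed values, not taken from the CA sources.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of EPICS): true when every bit of an
// unsigned value is set, i.e. the value equals the maximum for its type.
template <typename T>
constexpr bool allBitsSet(T value)
{
    return value == std::numeric_limits<T>::max();
}

// Assumed widths: 32 bits for the server ID, 16 bits for the type code.
static_assert(allBitsSet<std::uint32_t>(4294967295u), "sidIn is 0xFFFFFFFF");
static_assert(allBitsSet<std::uint16_t>(65535u), "typeIn is 0xFFFF");
```

Both assertions hold at compile time, which fits the reading that these fields carry a sentinel or uninitialized pattern rather than real values.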
Is this happening in the middle of operations, or at a specific Gateway moment (start up, shut down, heavy client start-up)? Which version of the Gateway and - more importantly - which version of Base are you using?
EPICS base is 3.14.12.6, gateway 2.1.0.
It happens at random times, random intervals. No pattern at all.
In the meantime I found that, this time, the user the gateway is running under had reached its nproc ulimit. Since this led to all sorts of problems all over the system that I haven't observed before, I doubt it is the (only) cause.
Huh. Interesting. That could probably be tested with a low nproc setting.
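One way such a test could be set up is sketched below, assuming Linux: a small wrapper lowers its own soft `RLIMIT_NPROC` before starting the program under test. The function name `lowerNprocSoftLimit` is hypothetical, and nothing here is a Gateway or EPICS feature. Note that the limit counts all processes and threads of the user, and that it is normally not enforced for root.

```cpp
#include <sys/resource.h>
#include <cassert>

// Hypothetical helper: lower the soft RLIMIT_NPROC of the calling process
// to at most `wanted`. Returns the soft limit now in effect, or -1 on error.
long lowerNprocSoftLimit(rlim_t wanted)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) {
        return -1;
    }

    // The soft limit may never exceed the hard limit, so clamp the request.
    rlim_t target = wanted;
    if (lim.rlim_max != RLIM_INFINITY && target > lim.rlim_max) {
        target = lim.rlim_max;
    }

    lim.rlim_cur = target;  // the hard limit is left unchanged
    if (setrlimit(RLIMIT_NPROC, &lim) != 0) {
        return -1;
    }

    // Read the value back so the caller sees what is actually in effect.
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) {
        return -1;
    }
    return static_cast<long>(lim.rlim_cur);
}
```

Resource limits are inherited across `fork` and `exec`, so a wrapper that calls this function and then executes the gateway binary would run it under the reduced limit. Running `ulimit -u` in the launching shell has the same effect.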
Come to think of it, we have the gateway deployed on two PCs. The second has a much lower process count (essentially only the gateway), and also sees the occasional crash. It might of course be another problem.
Actually this appears to be a CA Client library issue, not CAS; the failure is in cac.cpp. It makes more sense that the crash happens in the client code when the user's nproc limit has been reached: the client library may be in the middle of setting up a new CA client connection to an IOC it isn't already talking to, for which it has to start a pair of new threads. IIRC the CAS code is still single-threaded, so it doesn't create new threads for itself after start-up.
I don't know the CA Client code very well myself, but we can ask someone who does to try to replicate the problem. I just filed a bug report for this on Launchpad against the Base-3.14 branch which is where we manage CA Client bugs.
Well spotted. I am so used to CAS being the culprit in Gateway matters...
OK, I have another crash, this time with no other circumstances involved (it's on the second PC, so definitely no problems with ulimit), just regular operation of the system. It's in another place, but I'm adding it here because it also has udpiiu in it:
Program terminated with signal 11, Segmentation fault.
#0 add (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, minorVersionNumber=<value optimized out>, addr=...,
currentTime=...) at ../../../include/tsDLList.h:322
322 lastNode.pNext = &item;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64
(gdb) bt
#0 add (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, minorVersionNumber=<value optimized out>, addr=...,
currentTime=...) at ../../../include/tsDLList.h:322
#1 cac::transferChanToVirtCircuit (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>,
minorVersionNumber=<value optimized out>, addr=..., currentTime=...) at ../cac.cpp:616
#2 0x00007f434cc014a0 in udpiiu::searchRespAction (this=<value optimized out>, msg=<value optimized out>, addr=<value optimized out>, currentTime=<value optimized out>)
at ../udpiiu.cpp:690
#3 0x00007f434cc015c2 in udpiiu::postMsg (this=0xe7cc80, net_addr=..., pInBuf=<value optimized out>, blockSize=24, currentTime=...) at ../udpiiu.cpp:857
#4 0x00007f434cc03681 in udpRecvThread::run (this=0xe8d0a8) at ../udpiiu.cpp:394
#5 0x00007f434c9a4249 in epicsThreadCallEntryPoint (pPvt=0xe8d0c8) at ../../../src/libCom/osi/epicsThread.cpp:83
#6 0x00007f434c9aaed3 in start_routine (arg=0xe8d340) at ../../../src/libCom/osi/os/posix/osdThread.c:389
#7 0x00000030ada07aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00000030ad2e8aad in clone () from /lib64/libc.so.6
I also have another core still to be examined.
Can we move the discussion to https://bugs.launchpad.net/epics-base/+bug/1664302 until we have proof that it really is a Gateway issue? (I already added the new backtrace there.)
Thank you. I should/will update my bookmark.
The Launchpad CAS bug linked from the comments above is marked Fix Released. Can this gateway bug be closed now?
Yep. Closing it.
I frequently see the gateway executable crash in our system. Unfortunately, I fail to see a pattern in when exactly it crashes. I enabled core dumps, and finally caught a crash when it happened:
Can you make any sense of this?