simon-ess opened 1 year ago
This might be more appropriate at https://github.com/epics-base/pvAccessCPP, I admit.
These are PVA or CA connections?
What does the server report on the IOC say?
These are PVA connections; the server report (`pvasr`, I assume?) lists no clients once all of the threads have exited.
Ok. This seems to be easy enough to replicate.
```
$ cat tick.db
record(calc, "$(P=)cnt") {
    field(INPA, "$(P=)cnt")
    field(CALC, "A+1")
    field(SCAN, "1 second")
}
$ softIocPVA -d tick.db
```
Concurrently run

```sh
for n in `seq 100`
do
    pvmonitor cnt &
done
```
Wait a bit, then

```sh
pkill pvmonitor
```
```
epics> epicsThreadShowAll
NAME EPICS ID LWP ID OSIPRI OSSPRI STATE
_main_ 0x555931365050 356387 0 0 OK
errlog 0x55593141d5b0 356390 10 0 OK
taskwd 0x55593141dac0 356391 10 0 OK
timerQueue 0x55593141e370 356392 70 0 OK
cbLow 0x55593141e6a0 356393 59 0 OK
cbMedium 0x55593141e880 356394 64 0 OK
cbHigh 0x55593141eba0 356395 71 0 OK
dbCaLink 0x55593141efc0 356396 50 0 OK
PVAL 0x55593141f5f0 356397 50 0 OK
PDB-event 0x5559314371a0 356398 19 0 OK
pvAccess-client 0x555931437c70 356399 35 0 OK
UDP-rx 0.0.0.0:0 0x555931459270 356400 50 0 OK
UDP-rx 192.168.1 0x55593149a160 356402 50 0 OK
UDP-rx 192.168.1 0x555931499d30 356401 50 0 OK
UDP-rx 192.168.2 0x5559314dab70 356403 50 0 OK
UDP-rx 192.168.2 0x5559314daf70 356404 50 0 OK
UDP-rx 224.0.0.1 0x5559314fb730 356405 50 0 OK
scanOnce 0x555931502180 356406 67 0 OK
scan-10 0x555931502410 356407 60 0 OK
scan-5 0x555931502670 356408 61 0 OK
scan-2 0x5559315028d0 356409 62 0 OK
scan-1 0x555931502b30 356410 63 0 OK
scan-0.5 0x555931502d90 356411 64 0 OK
scan-0.2 0x555931502ff0 356412 65 0 OK
scan-0.1 0x555931503250 356413 66 0 OK
CAS-TCP 0x55593150bc10 356414 18 0 OK
CAS-UDP 0x55593150be70 356415 16 0 OK
CAS-beacon 0x55593150c0d0 356416 17 0 OK
PVAS timers 0x55593150c710 356417 25 0 OK
TCP-acceptor 0x55593150d300 356418 50 0 OK
UDP-rx 0.0.0.0:0 0x55593152dfa0 356419 50 0 OK
UDP-rx 192.168.1 0x55593156ea40 356420 50 0 OK
UDP-rx 192.168.1 0x55593156ee70 356421 50 0 OK
UDP-rx 192.168.2 0x5559315af880 356422 50 0 OK
UDP-rx 192.168.2 0x5559315afc80 356423 50 0 OK
UDP-rx 224.0.0.1 0x5559315d0440 356424 50 0 OK
TCP-tx 0x7f5ad21dd1d0 358337 20 0 OK
TCP-rx 0x7f5ad21dcdf0 358336 20 0 OK
TCP-rx 0x7f5ad21f7440 358351 20 0 OK
TCP-tx 0x7f5ad21f7820 358352 20 0 OK
TCP-rx 0x7f5ad2211a90 358362 20 0 OK
TCP-tx 0x7f5ad2211e70 358363 20 0 OK
TCP-rx 0x7f5ad222c0e0 358373 20 0 OK
TCP-tx 0x7f5ad222c4c0 358374 20 0 OK
TCP-rx 0x7f5b77f74f20 358389 20 0 OK
TCP-tx 0x7f5ad21c2da0 358390 20 0 OK
TCP-rx 0x7f5b756ba230 358400 20 0 OK
TCP-tx 0x7f5b7635a320 358401 20 0 OK
```
```
$ ps -T -C softIocPVA
PID SPID TTY TIME CMD
356387 356387 pts/9 00:00:00 softIocPVA
356387 356390 pts/9 00:00:00 errlog
356387 356391 pts/9 00:00:00 taskwd
356387 356392 pts/9 00:00:00 timerQueue
356387 356393 pts/9 00:00:00 cbLow
356387 356394 pts/9 00:00:00 cbMedium
356387 356395 pts/9 00:00:00 cbHigh
356387 356396 pts/9 00:00:00 dbCaLink
356387 356397 pts/9 00:00:00 PVAL
356387 356398 pts/9 00:00:00 PDB-event
356387 356399 pts/9 00:00:00 pvAccess-client
356387 356400 pts/9 00:00:00 UDP-rx 0.0.0.0:
356387 356401 pts/9 00:00:00 UDP-rx 192.168.
356387 356402 pts/9 00:00:00 UDP-rx 192.168.
356387 356403 pts/9 00:00:00 UDP-rx 192.168.
356387 356404 pts/9 00:00:00 UDP-rx 192.168.
356387 356405 pts/9 00:00:00 UDP-rx 224.0.0.
356387 356406 pts/9 00:00:00 scanOnce
356387 356407 pts/9 00:00:00 scan-10
356387 356408 pts/9 00:00:00 scan-5
356387 356409 pts/9 00:00:00 scan-2
356387 356410 pts/9 00:00:00 scan-1
356387 356411 pts/9 00:00:00 scan-0.5
356387 356412 pts/9 00:00:00 scan-0.2
356387 356413 pts/9 00:00:00 scan-0.1
356387 356414 pts/9 00:00:00 CAS-TCP
356387 356415 pts/9 00:00:00 CAS-UDP
356387 356416 pts/9 00:00:00 CAS-beacon
356387 356417 pts/9 00:00:00 PVAS timers
356387 356418 pts/9 00:00:00 TCP-acceptor
356387 356419 pts/9 00:00:00 UDP-rx 0.0.0.0:
356387 356420 pts/9 00:00:00 UDP-rx 192.168.
356387 356421 pts/9 00:00:00 UDP-rx 192.168.
356387 356422 pts/9 00:00:00 UDP-rx 192.168.
356387 356423 pts/9 00:00:00 UDP-rx 192.168.
356387 356424 pts/9 00:00:00 UDP-rx 224.0.0.
```
On IOC exit I also see

```
epics>
Leaking ServerContext _responseHandler use_count=14
2023-05-23T08:12:16.704 ServerContextImpl::shutdown() doesn't break all internal ref. loops. use_count=31
```
I suspect an issue with join-able threads and ref-counting of `epicsThreadOSD`. Perhaps triggered by the design of pvAccessCPP, where the worker threads often attempt to self-join.
Actually, it looks like ref-counting of `epicsThreadOSD` is working just fine. Rather, the issue appears to be (yet another) reference loop in pvAccessCPP. sigh...
I am able to capture the issue with rr. I see that for a leaked `epicsThreadOSD`, `free_threadInfo()` is called once after the thread returns. However, it is not called again through `epicsThread::~epicsThread()` (via `epicsThreadMustJoin()`).
Ok... Being tired of untangling a seemingly unending stream of reference loops like this one was a major motivation for writing PVXS. So I'm going to prioritize work on PVXS over this issue. Someone else is welcome to have a go. Looking for `DEBUG_SHARED_PTR` in `pv/sharedPtr.h` might be a place to start. This is one of the tools I developed to hopefully make troubleshooting this sort of situation more tractable.
I will also look at making `epicsThreadShowAll()` flag "zombie" threads, so that it is more obvious when this sort of situation comes up again.
Looking again at this issue and https://github.com/epics-base/epics-base/pull/390, I notice that some of the `pvmonitor` processes seemingly deadlock on exit.
```
Thread 1 (Thread 0x7f68dce22740 (LWP 231493) "pvmonitor"):
#0 __futex_lock_pi64 (futex_word=futex_word@entry=0x55ccbbec8b80, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:180
#1 0x00007f68dca8af86 in __pthread_mutex_lock_full (mutex=0x55ccbbec8b80) at ./nptl/pthread_mutex_lock.c:441
#2 0x00007f68dca8b285 in ___pthread_mutex_lock (mutex=mutex@entry=0x55ccbbec8b80) at ./nptl/pthread_mutex_lock.c:86
#3 0x00007f68dce8809a in mutexLock (id=0x55ccbbec8b80) at ../osi/os/posix/osdMutex.c:116
#4 epicsMutexOsdLock (pmutex=0x55ccbbec8b80) at ../osi/os/posix/osdMutex.c:163
#5 0x00007f68dce8093c in epicsMutex::lock (this=this@entry=0x55ccbbec8aa0) at ../osi/epicsMutex.cpp:276
#6 0x00007f68dd0c15ae in epics::pvData::Lock::Lock (m=..., this=<synthetic pointer>) at /home/dev/epics/pv/pvData/include/pv/lock.h:45
#7 (anonymous namespace)::InternalClientContextImpl::InternalChannelImpl::destroy (this=0x55ccbbec8920) at ../../src/remoteClient/clientContextImpl.cpp:3302
#8 0x00007f68dd0b1479 in epics::pvAccess::Destroyable::cleaner::operator() (this=<optimized out>) at ../../src/utils/pv/destroyable.h:54
...
```
> ... So I'm going to prioritize work on PVXS over this issue. ...
QSRV2 in PVXS does not have this issue, as it does not create dedicated thread(s) for each client. In general with PVXS I have tried to be stricter about resource ownership transfers through the user API. So ownership should be rooted in external objects, and situations like a worker thread holding the last reference to itself should not happen, even momentarily. Also, `pvxmonitor` does not deadlock on exit.
If you create an IOC with a single record, and then create and kill a large number of monitors on that record, `epicsThreadShowAll` reveals a number of threads that no longer exist.
I am not getting 100% consistent numbers, but somewhere between 50 and 100 monitors on a given record will yield these "ghost" threads. Specifically, I run a number of `pvmonitor x` processes, wait 30 seconds, kill them all, and then run `epicsThreadShowAll` in the IOC; I see something like the listing shown earlier. The top threads are obviously expected, but the bottom ones are not. Moreover, they do not seem to exist according to the host OS: running `ps -T -C softIocPVA` does not list any of these threads once the monitors have been killed. If you repeat the process, the list simply gets longer each time.
This is EPICS base 7.0.7 on a CentOS7 VM.