epics-base / pvAccessCPP

pvAccessCPP is an EPICS V4 C++ module
https://epics-base.github.io/pvAccessCPP/
Other
9 stars 22 forks source link

Reference loop leaking TCP worker threads #190

Open simon-ess opened 1 year ago

simon-ess commented 1 year ago

If you create an IOC with a single record, and then create and kill a large number of monitors on that record, then epicsThreadShowAll reveals a number of threads that no longer exist.

I am not getting 100% consistent numbers, but somewhere between 50--100 monitors on a given record will yield these "ghost" threads. Specifically, I run a number of pvmonitor x, wait 30 seconds, and then kill them all, and then run epicsThreadShowAll in the IOC and I see something like

            NAME       EPICS ID   LWP ID   OSIPRI  OSSPRI  STATE
          _main_       0x60b450    23615      0       0       OK
          errlog       0x6c6280    23618     10       0       OK
          taskwd       0x6c6810    23619     10       0       OK
      timerQueue       0x6c7200    23620     70       0       OK
           cbLow       0x6c7650    23621     59       0       OK
        cbMedium       0x6c7840    23622     64       0       OK
          cbHigh       0x6c7c80    23623     71       0       OK
        dbCaLink       0x6d4600    23624     50       0       OK
            PVAL       0x6d4fc0    23625     50       0       OK
       PDB-event       0x6dbad0    23626     19       0       OK
pvAccess-client        0x6dcaa0    23627     35       0       OK
UDP-rx 0.0.0.0:0       0x6de400    23628     50       0       OK
UDP-rx 10.0.2.15       0x73f030    23629     50       0       OK
UDP-rx 224.0.0.1       0x75fdf0    23631     50       0       OK
UDP-rx 10.0.2.25       0x73f490    23630     50       0       OK
        scanOnce       0x766810    23632     67       0       OK
         scan-10       0x766b80    23633     60       0       OK
          scan-5       0x766ef0    23634     61       0       OK
          scan-2       0x767260    23635     62       0       OK
          scan-1       0x7675d0    23636     63       0       OK
        scan-0.5       0x767940    23637     64       0       OK
        scan-0.2       0x767cb0    23638     65       0       OK
        scan-0.1       0x768020    23639     66       0       OK
         CAS-TCP       0x7709e0    23640     18       0       OK
         CAS-UDP       0x770d90    23641     16       0       OK
      CAS-beacon       0x771100    23642     17       0       OK
     PVAS timers       0x771b70    23643     25       0       OK
    TCP-acceptor       0x772a20    23644     50       0       OK
UDP-rx 0.0.0.0:0       0x793390    23645     50       0       OK
UDP-rx 10.0.2.15       0x7d3f70    23646     50       0       OK
UDP-rx 10.0.2.25       0x7d43d0    23647     50       0       OK
UDP-rx 224.0.0.1       0x7f4ce0    23648     50       0       OK
          TCP-rx 0x7fffcc162000    24941     20       0       OK
          TCP-tx 0x7fffcc1624e0    24942     20       0       OK
          TCP-rx 0x7fffcc17c800    24955     20       0       OK
          TCP-tx 0x7fffcc17cce0    24956     20       0       OK
          TCP-tx 0x7fffcc1974f0    25002     20       0       OK
          TCP-rx 0x7fffcc197010    25001     20       0       OK
          TCP-rx 0x7fffcc1b1820    25007     20       0       OK
          TCP-tx 0x7fffcc1b1d00    25008     20       0       OK
          TCP-rx 0x7fffcc083150    25022     20       0       OK
          TCP-tx 0x7fffcc029f70    25049     20       0       OK
          TCP-rx 0x7fffcc0ef820    25164     20       0       OK
          TCP-tx 0x7fffcc0b16b0    25165     20       0       OK
          TCP-rx 0x7fffcc00f8d0    25170     20       0       OK
          TCP-tx 0x7fffcc0b1570    25171     20       0       OK
          TCP-rx 0x7fffcc083cc0    25172     20       0       OK
          TCP-tx 0x7fffcc029e90    25173     20       0       OK
          TCP-rx 0x7fffcc014cf0    25174     20       0       OK
          TCP-tx 0x7fffcc049820    25175     20       0       OK
          TCP-rx 0x7fffcc01f530    25176     20       0       OK
          TCP-tx 0x7fffcc0051b0    25177     20       0       OK
          TCP-rx 0x7fffcc123900    25178     20       0       OK
          TCP-tx 0x7fffcc128d20    25179     20       0       OK
          TCP-rx 0x7fffcc109270    25184     20       0       OK
          TCP-tx 0x7fffcc1190c0    25185     20       0       OK
          TCP-rx 0x7fffcc1475d0    25186     20       0       OK
          TCP-tx 0x7fffcc0ab650    25187     20       0       OK
          TCP-rx 0x7fffcc077d30    25230     20       0       OK
          TCP-tx 0x7fffcc0eedd0    25231     20       0       OK

The top threads are obviously expected, but the bottom ones are not. Moreover, they do not seem to exist according to the host OS: if you run ps -T -C softIocPVA then it does not list any of these threads once the monitors have been killed.

If you repeat the process then the list simply gets longer each time.

This is EPICS base 7.0.7 on a CentOS7 VM.

simon-ess commented 1 year ago

This might be more appropriate at https://github.com/epics-base/pvAccessCPP, I admit.

ralphlange commented 1 year ago

These are PVA or CA connections?

What does the server report on the IOC say?

simon-ess commented 1 year ago

These are PVA connections; the server report (pvasr, I assume?) lists no clients, once all of the threads have exited.

mdavidsaver commented 1 year ago

Ok. This seems to be easy enough to replicate.

$ cat tick.db
record(calc, "$(P=)cnt") {
  field(INPA, "$(P=)cnt")
  field(CALC, "A+1")
  field(SCAN, "1 second")
}
$ softIocPVA -d tick.db

Concurrently run

for n in `seq 100`
do
  pvmonitor cnt &
done

Wait a bit then

pkill pvmonitor
epics> epicsThreadShowAll 
            NAME       EPICS ID   LWP ID   OSIPRI  OSSPRI  STATE
          _main_ 0x555931365050   356387      0       0       OK
          errlog 0x55593141d5b0   356390     10       0       OK
          taskwd 0x55593141dac0   356391     10       0       OK
      timerQueue 0x55593141e370   356392     70       0       OK
           cbLow 0x55593141e6a0   356393     59       0       OK
        cbMedium 0x55593141e880   356394     64       0       OK
          cbHigh 0x55593141eba0   356395     71       0       OK
        dbCaLink 0x55593141efc0   356396     50       0       OK
            PVAL 0x55593141f5f0   356397     50       0       OK
       PDB-event 0x5559314371a0   356398     19       0       OK
pvAccess-client  0x555931437c70   356399     35       0       OK
UDP-rx 0.0.0.0:0 0x555931459270   356400     50       0       OK
UDP-rx 192.168.1 0x55593149a160   356402     50       0       OK
UDP-rx 192.168.1 0x555931499d30   356401     50       0       OK
UDP-rx 192.168.2 0x5559314dab70   356403     50       0       OK
UDP-rx 192.168.2 0x5559314daf70   356404     50       0       OK
UDP-rx 224.0.0.1 0x5559314fb730   356405     50       0       OK
        scanOnce 0x555931502180   356406     67       0       OK
         scan-10 0x555931502410   356407     60       0       OK
          scan-5 0x555931502670   356408     61       0       OK
          scan-2 0x5559315028d0   356409     62       0       OK
          scan-1 0x555931502b30   356410     63       0       OK
        scan-0.5 0x555931502d90   356411     64       0       OK
        scan-0.2 0x555931502ff0   356412     65       0       OK
        scan-0.1 0x555931503250   356413     66       0       OK
         CAS-TCP 0x55593150bc10   356414     18       0       OK
         CAS-UDP 0x55593150be70   356415     16       0       OK
      CAS-beacon 0x55593150c0d0   356416     17       0       OK
     PVAS timers 0x55593150c710   356417     25       0       OK
    TCP-acceptor 0x55593150d300   356418     50       0       OK
UDP-rx 0.0.0.0:0 0x55593152dfa0   356419     50       0       OK
UDP-rx 192.168.1 0x55593156ea40   356420     50       0       OK
UDP-rx 192.168.1 0x55593156ee70   356421     50       0       OK
UDP-rx 192.168.2 0x5559315af880   356422     50       0       OK
UDP-rx 192.168.2 0x5559315afc80   356423     50       0       OK
UDP-rx 224.0.0.1 0x5559315d0440   356424     50       0       OK
          TCP-tx 0x7f5ad21dd1d0   358337     20       0       OK
          TCP-rx 0x7f5ad21dcdf0   358336     20       0       OK
          TCP-rx 0x7f5ad21f7440   358351     20       0       OK
          TCP-tx 0x7f5ad21f7820   358352     20       0       OK
          TCP-rx 0x7f5ad2211a90   358362     20       0       OK
          TCP-tx 0x7f5ad2211e70   358363     20       0       OK
          TCP-rx 0x7f5ad222c0e0   358373     20       0       OK
          TCP-tx 0x7f5ad222c4c0   358374     20       0       OK
          TCP-rx 0x7f5b77f74f20   358389     20       0       OK
          TCP-tx 0x7f5ad21c2da0   358390     20       0       OK
          TCP-rx 0x7f5b756ba230   358400     20       0       OK
          TCP-tx 0x7f5b7635a320   358401     20       0       OK
$ ps -T -C softIocPVA
    PID    SPID TTY          TIME CMD
 356387  356387 pts/9    00:00:00 softIocPVA
 356387  356390 pts/9    00:00:00 errlog
 356387  356391 pts/9    00:00:00 taskwd
 356387  356392 pts/9    00:00:00 timerQueue
 356387  356393 pts/9    00:00:00 cbLow
 356387  356394 pts/9    00:00:00 cbMedium
 356387  356395 pts/9    00:00:00 cbHigh
 356387  356396 pts/9    00:00:00 dbCaLink
 356387  356397 pts/9    00:00:00 PVAL
 356387  356398 pts/9    00:00:00 PDB-event
 356387  356399 pts/9    00:00:00 pvAccess-client
 356387  356400 pts/9    00:00:00 UDP-rx 0.0.0.0:
 356387  356401 pts/9    00:00:00 UDP-rx 192.168.
 356387  356402 pts/9    00:00:00 UDP-rx 192.168.
 356387  356403 pts/9    00:00:00 UDP-rx 192.168.
 356387  356404 pts/9    00:00:00 UDP-rx 192.168.
 356387  356405 pts/9    00:00:00 UDP-rx 224.0.0.
 356387  356406 pts/9    00:00:00 scanOnce
 356387  356407 pts/9    00:00:00 scan-10
 356387  356408 pts/9    00:00:00 scan-5
 356387  356409 pts/9    00:00:00 scan-2
 356387  356410 pts/9    00:00:00 scan-1
 356387  356411 pts/9    00:00:00 scan-0.5
 356387  356412 pts/9    00:00:00 scan-0.2
 356387  356413 pts/9    00:00:00 scan-0.1
 356387  356414 pts/9    00:00:00 CAS-TCP
 356387  356415 pts/9    00:00:00 CAS-UDP
 356387  356416 pts/9    00:00:00 CAS-beacon
 356387  356417 pts/9    00:00:00 PVAS timers
 356387  356418 pts/9    00:00:00 TCP-acceptor
 356387  356419 pts/9    00:00:00 UDP-rx 0.0.0.0:
 356387  356420 pts/9    00:00:00 UDP-rx 192.168.
 356387  356421 pts/9    00:00:00 UDP-rx 192.168.
 356387  356422 pts/9    00:00:00 UDP-rx 192.168.
 356387  356423 pts/9    00:00:00 UDP-rx 192.168.
 356387  356424 pts/9    00:00:00 UDP-rx 224.0.0.

On IOC exit I also see

epics> 
Leaking ServerContext _responseHandler use_count=14
2023-05-23T08:12:16.704 ServerContextImpl::shutdown() doesn't break all internal ref. loops. use_count=31
mdavidsaver commented 1 year ago

I suspect an issue with join-able threads and ref-counting of epicsThreadOSD. Perhaps triggered by the design of pvAccessCPP where the worker threads often attempt to self-join.

mdavidsaver commented 1 year ago

Actually, it looks like ref-counting of epicsThreadOSD is working just fine. Rather, the issue appears to be (yet another) reference loop in pvAccessCPP. sigh...

I am able to capture the issue with RR. I see that for a leaked epicsThreadOSD, free_threadInfo() is called once after the thread returns. However, if is not called again through epicsThread::~epicsThread() (via. epicsThreadMustJoin())

mdavidsaver commented 1 year ago

Ok... My being tired of untangling a seemingly unending stream of reference loops like this one played a major role in motivating me to write PVXS. So I'm going to prioritize work on PVXS over this issue. Someone else is welcomed to have a go. Looking for DEBUG_SHARED_PTR in pv/sharedPtr.h might be a place to start. This is one of the tools I developed to hopefully make troubleshooting this sort of situation more tractable.

I will also look at making epicsThreadShowAll() flag "zombie" threads so that it is more obvious when this sort of situation comes up again.

mdavidsaver commented 1 year ago

Looking again at this issue and https://github.com/epics-base/epics-base/pull/390, I notice that some of the pvmonitor seemingly deadlock on exit.

Thread 1 (Thread 0x7f68dce22740 (LWP 231493) "pvmonitor"):
#0  __futex_lock_pi64 (futex_word=futex_word@entry=0x55ccbbec8b80, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:180
#1  0x00007f68dca8af86 in __pthread_mutex_lock_full (mutex=0x55ccbbec8b80) at ./nptl/pthread_mutex_lock.c:441
#2  0x00007f68dca8b285 in ___pthread_mutex_lock (mutex=mutex@entry=0x55ccbbec8b80) at ./nptl/pthread_mutex_lock.c:86
#3  0x00007f68dce8809a in mutexLock (id=0x55ccbbec8b80) at ../osi/os/posix/osdMutex.c:116
#4  epicsMutexOsdLock (pmutex=0x55ccbbec8b80) at ../osi/os/posix/osdMutex.c:163
#5  0x00007f68dce8093c in epicsMutex::lock (this=this@entry=0x55ccbbec8aa0) at ../osi/epicsMutex.cpp:276
#6  0x00007f68dd0c15ae in epics::pvData::Lock::Lock (m=..., this=<synthetic pointer>) at /home/dev/epics/pv/pvData/include/pv/lock.h:45
#7  (anonymous namespace)::InternalClientContextImpl::InternalChannelImpl::destroy (this=0x55ccbbec8920) at ../../src/remoteClient/clientContextImpl.cpp:3302
#8  0x00007f68dd0b1479 in epics::pvAccess::Destroyable::cleaner::operator() (this=<optimized out>) at ../../src/utils/pv/destroyable.h:54
...
mdavidsaver commented 1 year ago

... So I'm going to prioritize work on PVXS over this issue. ...

QSRV2 in PVXS does not have this issue as it does not create dedicated thread(s) for each client. In general with PVXS I have tried to be stricter about resource ownership transfers through the user API. So ownership should be rooted in external objects, and situations like a worker thread holding the last reference to itself should not happen, even momentarily. Also, pvxmonitor does not deadlock on exit.