epics-base / epics-base

The C/C++ core of the EPICS Base control system toolkit
https://epics-controls.org/
Other
133 stars 136 forks source link

Application hanging for 30 seconds after both client and server are destroyed #477

Open hinxx opened 6 months ago

hinxx commented 6 months ago

I have a pcas example code with a simple CA client which results an application with both CA server and CA client. Initially I was doing this in python (https://github.com/pyepics/pyepics/issues/259) but have since managed to reproduce the following problem in C/C++. If the client is not coupled with the server this problem is not observed. Test made on Linux, with latest epics-base and pcas code from git.

As it turns out if the CA client thread is stopped before the CAS is I hit the situation described here occurs:

https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L205

This results in my application hanging for 30 seconds after both client and server are destroyed and return from main() was called. Ctrl+C does not results in immediate stop at that point. As seen in the comment the problem is related to the call to ca_context_destroy(); on the CA client side; without it there is no hanging, but a Connection reset by peer is printed out and application exits immediately nevertheless. It was also observed that it makes a difference on which is stopped first : client or server.

I sprinkled some printfs in the cac.cpp and tcpiiu.cpp to see what is happening and here are the three application outputs.

Client joined before the server, 30 second hanging is observed, but no errors reported either .. so all is fine?:

$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
../cac.cpp:~cac:292
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 139880078571072 delay 0.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 139880078571072 delay 0.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 139880078571072 elapsed  0.000000
../tcpiiu.cpp:run:205
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 139880077518400 delay 30.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 139880077518400 delay 30.000000
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 139880078571072 delay 1.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 139880078571072 delay 1.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 139880078571072 elapsed  0.000051
../cac.cpp:~cac:313
../cac.cpp:~cac:325 iiuExistenceCount 1

[[[[[[ 30 seconds ]]]]]]

../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 139880077518400 elapsed  30.000120
../tcpiiu.cpp:run:217
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 139880077518400 delay 30.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 139880077518400 delay 30.000000
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 139880077518400 elapsed  0.000083
../tcpiiu.cpp:run:219
../tcpiiu.cpp:run:233
../cac.cpp:destroyIIU:1236
../cac.cpp:destroyIIU:1261
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 139880075417152 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 139880077518400 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 139880077518400 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 139880077518400 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 139880075417152 delay 10.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 139880077518400 delay 10.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 139880077518400 delay 10.000000
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 139880077518400 elapsed  0.000000
../cac.cpp:destroyIIU:1273 iiuExistenceCount 1
../cac.cpp:destroyIIU:1279 iiuExistenceCount 0
../cac.cpp:~cac:327 iiuExistenceCount 0
../cac.cpp:~cac:330
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 139880078571072 delay 0.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 139880078571072 delay 0.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 139880078571072 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 139880078571072 delay 10.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 139880078571072 delay 10.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 139880078571072 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:152 timerQueue 139880079623744 delay 10.000000
../osi/epicsThread.cpp:exitWait:176 timerQueue 139880079623744 delay 10.000000
../osi/epicsThread.cpp:exitWait:190 timerQueue 139880079623744 elapsed  0.000000
clientTaskC: <<<

Client joined after the server, application exits immediately, all seems fine:

$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
../tcpiiu.cpp:run:205
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140679408809536 delay 30.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 140679408809536 delay 30.000000
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 140679408809536 elapsed  0.000000
../tcpiiu.cpp:run:219
../tcpiiu.cpp:run:233
../cac.cpp:destroyIIU:1236
../cac.cpp:~cac:292
../cac.cpp:destroyIIU:1261
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140679409862208 delay 0.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 140679409862208 delay 0.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 140679409862208 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 140679406708288 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140679409862208 delay 1.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 140679409862208 delay 1.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140679408809536 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 140679408809536 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 140679408809536 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 140679409862208 elapsed  0.000000
../cac.cpp:~cac:313
../cac.cpp:~cac:325 iiuExistenceCount 1
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 140679406708288 delay 10.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140679408809536 delay 10.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 140679408809536 delay 10.000000
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 140679408809536 elapsed  0.000000
../cac.cpp:destroyIIU:1273 iiuExistenceCount 1
../cac.cpp:destroyIIU:1279 iiuExistenceCount 0
../cac.cpp:~cac:327 iiuExistenceCount 0
../cac.cpp:~cac:330
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140679409862208 delay 0.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 140679409862208 delay 0.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 140679409862208 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140679409862208 delay 10.000000
../osi/epicsThread.cpp:exitWait:176 CAC-UDP 140679409862208 delay 10.000000
../osi/epicsThread.cpp:exitWait:190 CAC-UDP 140679409862208 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:152 timerQueue 140679410914880 delay 10.000000
../osi/epicsThread.cpp:exitWait:176 timerQueue 140679410914880 delay 10.000000
../osi/epicsThread.cpp:exitWait:190 timerQueue 140679410914880 elapsed  0.000000
clientTaskC: <<<

NOT CALLING the ca_context_destroy(), immediate exit with a non-fatal error:

$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: <<<
../tcpiiu.cpp:run:205
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140300334052928 delay 30.000000
../osi/epicsThread.cpp:exitWait:176 CAC-TCP-recv 140300334052928 delay 30.000000
Unexpected problem with CA circuit to server "bd-srv07.cslab.esss.lu.se:36765" was "Connection reset by peer" - disconnecting
../osi/epicsThread.cpp:exitWait:190 CAC-TCP-recv 140300334052928 elapsed  0.000096
../tcpiiu.cpp:run:219
../tcpiiu.cpp:run:233
../cac.cpp:destroyIIU:1236

Here is pcas main.cc with my modifications that includes a CA client used in above test runs. Should be drop-in replaceable for the CAS example from https://github.com/epics-modules/pcas: main.cc.txt.

I would like to understand what is the reason for hanging ? What is clients CAC-TCP-recv thread waiting for?

I experimented with lowering the 30 second wait to 5 seconds, in the offending line https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L205 and could not see any other difference except for the hanging time becoming 5 seconds only; aka. the wait times out in both cases and we proceed with the rest of CAC destruction.

I tried to apply the same order of stopping the server and client with python pcaspy and pyepics (2nd test run) but I do not see the same results.

hinxx commented 6 months ago

Here the output of the application when I set the wait value from 30 to 0 in

https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L205

I also added the printf indication if the thread was already terminated or not.

$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
../cac.cpp:~cac:292
../tcpiiu.cpp:run:205
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed  0.000000
../tcpiiu.cpp:run:218
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../tcpiiu.cpp:run:220
../tcpiiu.cpp:run:234
../cac.cpp:destroyIIU:1236
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 0.000000, terminated 0
../cac.cpp:destroyIIU:1261
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 140566492866112 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, terminated 0
../osi/epicsThread.cpp:exitWait:160 CAC-TCP-send 140566492866112 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, destroyed 1
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 1.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 1.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, terminated 1
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 1.000000, terminated 1
../cac.cpp:~cac:313
../cac.cpp:~cac:325 iiuExistenceCount 1
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 140566492866112 delay 10.000000, terminated 0
../osi/epicsThread.cpp:exitWait:160 CAC-TCP-send 140566492866112 delay 10.000000, destroyed 1
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 10.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 10.000000, terminated 1
../cac.cpp:destroyIIU:1273 iiuExistenceCount 1
../cac.cpp:destroyIIU:1279 iiuExistenceCount 0
../cac.cpp:~cac:327 iiuExistenceCount 0
../cac.cpp:~cac:330
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 0.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 0.000000, terminated 1
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 10.000000
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:152 timerQueue 140566699206208 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 timerQueue 140566699206208 delay 10.000000
../osi/epicsThread.cpp:exitWait:191 timerQueue 140566699206208 elapsed  0.000000
../osi/epicsThread.cpp:exitWait:208 timerQueue 140566699206208 delay 10.000000, terminated 1
clientTaskC: <<<
hinxx commented 6 months ago

I'm looking at the epicsThread::exitWait()

https://github.com/epics-base/epics-base/blame/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/libcom/src/osi/epicsThread.cpp#L171

that is actually performing the original wait of 30 seconds. The way it does it seems already safe for

https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L205

call to be made with argument of 0 (or a sufficiently low value like 0.3) instead of 30.

If the epicsThread::exitWait() returns 0 (thread not terminated) the caller will do another iteration in the above while loop until the epicsThread::exitWait() returns returns 1 (thread terminated).

From what I've seen what makes the CAC-TCP-recv terminate is actually this line

https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L213

that is called after the wait to exit has been already performed at least once (hanging for 30 sec in our case).

anjohnson commented 6 months ago

Could you try building this against EPICS 7.0.3 or earlier? Between 7.0.3 and 7.0.3.1 we added support for joining threads to the C epicsThread API. The C++ class that the CA code uses already supported joining, and I wonder whether the changes made to the class at that time might have affected this at all.

hinxx commented 6 months ago

The behavior with 7.0.3 is different. Compared to the original main.cc.txt I had to change the parts around the client thread creation and termination due to the lack of thread API for joining the thread. Here is the
main-7.0.3.cc.txt.

Execution output:

$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
main: wait client thread finish
main: wait client thread finish done
Unexpected problem with CA circuit to server "bd-srv07.cslab.esss.lu.se:39715" was "Connection reset by peer" - disconnecting
clientTaskC: <<<

The application terminates as expected without any hanging.

hinxx commented 6 months ago

To avoid misleading the debug efforts on this matter I want to also say that if main-7.0.3.cc.txt is used with 7.0.8 EPICS base I do not see the problem with hanging.

I guess there is a problem only I use the CA client thread that does join (with the new threading API).

mdavidsaver commented 4 months ago

fyi. when encountering a process which is "stuck" for longer than a few seconds, I find it useful to collect stack traces at the time.

Either run with gdb and manually break while the process is "stuck". Or if things behave differently when running with GDB, run normally and from another terminal attach by run gdb --pid <pid>. In either case, issue thread apply all backtrace to GDB.

... have since managed to reproduce the following problem in C/C++ ...

Can you try running this example with valgrind --tool=helgrind and/or --tool=drd?

mdavidsaver commented 4 months ago

Maybe related to #480

hinxx commented 4 months ago

Thanks for the pointer @mdavidsaver . Here is the GDB output after thread apply all backtrace is issued at the time of the "stuck": backtrace-all.txt.

hinxx commented 4 months ago

Can you try running this example with valgrind --tool=helgrind

Here it the output: hellgrind.txt