Open hinxx opened 6 months ago
Here is the output of the application when I set the wait value from 30 to 0 in the offending tcpiiu.cpp line.
I also added a printf indicating whether the thread was already terminated or not.
$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
../cac.cpp:~cac:292
../tcpiiu.cpp:run:205
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed 0.000000
../tcpiiu.cpp:run:218
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 0.000000, terminated 0
../tcpiiu.cpp:run:220
../tcpiiu.cpp:run:234
../cac.cpp:destroyIIU:1236
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 0.000000, terminated 0
../cac.cpp:destroyIIU:1261
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 140566492866112 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, terminated 0
../osi/epicsThread.cpp:exitWait:160 CAC-TCP-send 140566492866112 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, destroyed 1
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 1.000000, terminated 0
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 1.000000
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, terminated 1
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 1.000000, terminated 1
../cac.cpp:~cac:313
../cac.cpp:~cac:325 iiuExistenceCount 1
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-send 140566492866112 delay 10.000000, terminated 0
../osi/epicsThread.cpp:exitWait:160 CAC-TCP-send 140566492866112 delay 10.000000, destroyed 1
../osi/epicsThread.cpp:exitWait:152 CAC-TCP-recv 140566494967360 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-TCP-recv 140566494967360 delay 10.000000
../osi/epicsThread.cpp:exitWait:191 CAC-TCP-recv 140566494967360 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-TCP-recv 140566494967360 delay 10.000000, terminated 1
../cac.cpp:destroyIIU:1273 iiuExistenceCount 1
../cac.cpp:destroyIIU:1279 iiuExistenceCount 0
../cac.cpp:~cac:327 iiuExistenceCount 0
../cac.cpp:~cac:330
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 0.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 0.000000
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 0.000000, terminated 1
../osi/epicsThread.cpp:exitWait:152 CAC-UDP 140566698153536 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 CAC-UDP 140566698153536 delay 10.000000
../osi/epicsThread.cpp:exitWait:191 CAC-UDP 140566698153536 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 CAC-UDP 140566698153536 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:152 timerQueue 140566699206208 delay 10.000000, terminated 1
../osi/epicsThread.cpp:exitWait:177 timerQueue 140566699206208 delay 10.000000
../osi/epicsThread.cpp:exitWait:191 timerQueue 140566699206208 elapsed 0.000000
../osi/epicsThread.cpp:exitWait:208 timerQueue 140566699206208 delay 10.000000, terminated 1
clientTaskC: <<<
I'm looking at the epicsThread::exitWait() that is actually performing the original wait of 30 seconds. The way it is written already seems safe for the call to be made with an argument of 0 (or a sufficiently low value like 0.3) instead of 30. If epicsThread::exitWait() returns 0 (thread not terminated), the caller does another iteration of the above while loop until epicsThread::exitWait() returns 1 (thread terminated). From what I've seen, what actually makes the CAC-TCP-recv thread terminate is this line, which is called after the wait to exit has already been performed at least once (hanging for 30 seconds in our case).
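To make the suggestion concrete, here is a minimal sketch of the polling pattern (the variable names are placeholders, not the actual epics-base source):

    // Hypothetical sketch: replace the single blocking exitWait ( 30.0 )
    // with a short poll. epicsThread::exitWait ( double ) returns false
    // while the thread is still running and true once it has terminated.
    while ( ! recvThread.exitWait ( 0.3 ) ) {
        // not terminated yet; iterate again, just as the caller's
        // existing while loop already does
    }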
Could you try building this against EPICS 7.0.3 or earlier? Between 7.0.3 and 7.0.3.1 we added support for joining threads to the C epicsThread API. The C++ class that the CA code uses already supported joining, and I wonder whether the changes made to the class at that time might have affected this at all.
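For reference, the joinable-thread support added to the C API in 7.0.3.1 is used roughly like this (a minimal sketch; error handling and the actual shutdown signalling are omitted):

    #include <epicsThread.h>

    static void clientTask ( void *arg ) { /* client loop */ }

    void runClient ( void )
    {
        /* Create a joinable thread via the C API (new in 7.0.3.1). */
        epicsThreadOpts opts = EPICS_THREAD_OPTS_INIT;
        opts.joinable = 1;
        epicsThreadId tid = epicsThreadCreateOpt ( "client", clientTask, NULL, &opts );
        /* ... signal the client to stop ... */
        epicsThreadMustJoin ( tid );  /* blocks until clientTask returns */
    }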
The behavior with 7.0.3 is different. Compared to the original main.cc.txt, I had to change the parts around client thread creation and termination due to the lack of a thread API for joining the thread. Here is the main-7.0.3.cc.txt.
Execution output:
$ ./bin/linux-x86_64/excas -t 10
clientTaskC: >>>
clientTaskC: loop
jane 0.061597
clientTaskC: loop
jane 0.158288
clientTaskC: loop
jane 0.059100
clientTaskC: loop
jane 0.063468
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
clientTaskC: loop
jane 0.000000
main: wait client thread finish
main: wait client thread finish done
Unexpected problem with CA circuit to server "bd-srv07.cslab.esss.lu.se:39715" was "Connection reset by peer" - disconnecting
clientTaskC: <<<
The application terminates as expected without any hanging.
To avoid misleading the debugging efforts on this matter, I also want to say that if main-7.0.3.cc.txt is used with EPICS base 7.0.8 I do not see the hanging problem. I guess the problem only occurs when I use a CA client thread that does a join (with the new threading API).
FYI, when encountering a process which is "stuck" for longer than a few seconds, I find it useful to collect stack traces at that time. Either run with gdb and manually break while the process is "stuck", or, if things behave differently when running with GDB, run normally and attach from another terminal by running gdb --pid <pid>. In either case, issue thread apply all backtrace to GDB.
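For example (with 12345 standing in for the stuck process's PID):

    $ gdb --pid 12345
    (gdb) thread apply all backtrace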
... have since managed to reproduce the following problem in C/C++ ...
Can you try running this example with valgrind --tool=helgrind and/or --tool=drd?
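For example, assuming the same test binary as above:

    $ valgrind --tool=helgrind ./bin/linux-x86_64/excas -t 10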
Maybe related to #480
Thanks for the pointer @mdavidsaver. Here is the GDB output after thread apply all backtrace is issued while the process is "stuck": backtrace-all.txt.
Regarding "Can you try running this example with valgrind --tool=helgrind": here is the output: hellgrind.txt
I have pcas example code with a simple CA client, which results in an application with both a CA server and a CA client. Initially I was doing this in Python (https://github.com/pyepics/pyepics/issues/259) but have since managed to reproduce the following problem in C/C++. If the client is not coupled with the server, this problem is not observed. The test was made on Linux, with the latest epics-base and pcas code from git.
As it turns out, if the CA client thread is stopped before the CAS is, I hit the situation described here:
https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L205
This results in my application hanging for 30 seconds after both client and server are destroyed and return from main() has been reached. Ctrl+C does not result in an immediate stop at that point. As seen in the comment, the problem is related to the call to ca_context_destroy(); on the CA client side; without it there is no hanging, but a Connection reset by peer message is printed out and the application nevertheless exits immediately. It was also observed that it makes a difference which is stopped first: client or server. I sprinkled some printfs into cac.cpp and tcpiiu.cpp to see what is happening, and here are the three application outputs.
Client joined before the server: the 30-second hang is observed, but no errors are reported either .. so all is fine?
Client joined after the server: the application exits immediately, all seems fine.
NOT CALLING ca_context_destroy(): immediate exit with a non-fatal error.
Here is the pcas main.cc with my modifications, which includes the CA client used in the above test runs. It should be a drop-in replacement for the CAS example from https://github.com/epics-modules/pcas: main.cc.txt.
I would like to understand what the reason for the hanging is. What is the client's CAC-TCP-recv thread waiting for? I experimented with lowering the 30-second wait to 5 seconds in the offending line https://github.com/epics-base/epics-base/blob/cb1571783b4793cb9c545e5736659ee5ed7d03bb/modules/ca/src/client/tcpiiu.cpp#L205 and could not see any difference other than the hang time becoming 5 seconds; i.e. the wait times out in both cases and we proceed with the rest of the CAC destruction.
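For reference, the shape of the client thread used in these runs is roughly the following (an illustrative reconstruction, not the exact code; see the main.cc.txt attachment for the real thing):

    #include <stdio.h>
    #include <cadef.h>
    #include <epicsThread.h>

    /* Illustrative CA client loop: connect to "jane", read it
       periodically, then tear the CA context down on exit. */
    static void clientTaskC ( void *arg )
    {
        printf ( "clientTaskC: >>>\n" );
        ca_context_create ( ca_disable_preemptive_callback );
        chid ch;
        SEVCHK ( ca_create_channel ( "jane", NULL, NULL, 0, &ch ), "create" );
        SEVCHK ( ca_pend_io ( 5.0 ), "connect" );
        for ( int i = 0; i < 10; i++ ) {
            double val = 0.0;
            printf ( "clientTaskC: loop\n" );
            /* reads fail quietly after the server goes away,
               leaving val at 0.0 as seen in the logs above */
            ca_get ( DBR_DOUBLE, ch, &val );
            ca_pend_io ( 1.0 );
            printf ( "jane %f\n", val );
            epicsThreadSleep ( 1.0 );
        }
        ca_clear_channel ( ch );
        ca_context_destroy ();   /* the 30 s hang originates in here */
        printf ( "clientTaskC: <<<\n" );
    }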
I tried to apply the same order of stopping the server and client with Python pcaspy and pyepics (2nd test run), but I do not see the same results.