MarkRivers closed this issue 3 years ago
When this happens there is a deadlock with 2 EPICS mutexes. It is not clear if the failure to reply is causing the deadlock, or if the deadlock is causing the failure to reply.
The deadlock needs to be tracked down, which should be possible with gdb by finding which threads are blocked and at what line in the source. However, when I run the application under gdb it almost always crashes with the access violation from #9, so I cannot get to the point where the deadlock would occur.
I managed to run gdb by attaching to the EPICS application after it hung.
The thread that calls configure() is this one:
Thread 125 (Thread 0x7f896cbdd700 (LWP 136638)):
#0 0x00007f8971205f97 in pthread_join () from /lib64/libpthread.so.0
#1 0x00007f8971bb8183 in std::thread::join() () from /home/epics/devel/dante-1-0/lib/linux-x86_64/libXGL_DPP.so.1
#2 0x00007f89718e86b9 in std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<AsyncTask> >, void>, std::allocator<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<AsyncTask> >, void> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /home/epics/devel/dante-1-0/lib/linux-x86_64/libXGL_DPP.so.1
#3 0x00007f8971899916 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/epics/devel/dante-1-0/lib/linux-x86_64/libXGL_DPP.so.1
#4 0x00007f89718e1d83 in int_configure(char const*, unsigned short, configuration) () from /home/epics/devel/dante-1-0/lib/linux-x86_64/libXGL_DPP.so.1
#5 0x000000000059aae9 in Dante::setDanteConfiguration (this=this@entry=0x31c95b0, addr=0) at ../dante.cpp:759
#6 0x000000000059ac1c in Dante::writeFloat64 (this=0x31c95b0, pasynUser=0x4889e08, value=0.10000000000000001) at ../dante.cpp:568
#7 0x0000000000d52c5d in writeFloat64 (drvPvt=0x31c95b0, pasynUser=0x4889e08, value=0.10000000000000001) at ../../asyn/asynPortDriver/asynPortDriver.cpp:2370
#8 0x0000000000d6f547 in processCallbackOutput (pasynUser=0x4889e08) at ../../asyn/devEpics/devAsynFloat64.c:354
#9 0x0000000000d495b3 in portThread (pport=0x31cb0c0) at ../../asyn/asynDriver/asynManager.c:913
#10 0x0000000000eb3d5c in start_routine (arg=0x31cc160) at ../osi/os/posix/osdThread.c:412
#11 0x00007f8971204e25 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f89700bdbad in clone () from /lib64/libc.so.6
So the EPICS driver function Dante::setDanteConfiguration has called configure(), and that resulted in a hang inside libXGL_DPP.so, in a call to pthread_join().
Why?
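One plausible explanation, consistent with the backtrace but not confirmed against the XGL_DPP source: frames #2 and #1 are libstdc++ destroying the shared state of a std::async future, which joins the worker thread. If that worker is blocked on a lock held by the thread destroying the future, the join never returns. A minimal standalone sketch (the names are mine, not the library's) that reproduces the same pthread_join() hang:

// Illustrative sketch only, not XGL_DPP code. libstdc++ implements the
// shared state created by std::async as _Async_state_impl; its cleanup
// (frame #2 above) joins the worker thread (frame #1). If the worker is
// blocked on a mutex held by the thread destroying the future, the join
// never returns. Running this program deadlocks in exactly that way.
#include <future>
#include <mutex>

std::mutex configLock;   // stands in for an internal library lock

void configureLike()
{
    std::lock_guard<std::mutex> hold(configLock);      // caller holds the lock
    auto fut = std::async(std::launch::async, [] {
        std::lock_guard<std::mutex> inner(configLock); // worker blocks here
    });
    // fut's destructor waits for the worker (pthread_join under the hood),
    // but the worker is waiting for configLock, which this thread still holds.
}

int main() { configureLike(); }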
Hi Mark, we think this problem may be due to an inter-thread lock inside our library. For this reason, as suggested in the DANTE Library API manual and done in "DPP_Test.cpp", it is recommended to disable the "autoScanSlaves()" function before configuring: autoScanSlaves(false).
Please keep this function disabled during acquisitions as well. It should be enabled only when you want to discover newly attached devices.
I added this call to the EPICS driver constructor, after it finds the boards:
autoScanSlaves(false);
This seems to have fixed the problem.
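For context, a minimal placement sketch; the class layout, the constructor, and findBoards() are placeholders rather than the real dante.cpp code, and the library prototype is assumed from the call above:

// Placement sketch only, not the actual driver source.
bool autoScanSlaves(bool enable);   // DANTE library function (prototype assumed)

class Dante {
public:
    Dante() {
        findBoards();            // placeholder: discover the attached boards first
        autoScanSlaves(false);   // then stop the background slave scan, and keep it
                                 // off during configuration and acquisition
    }
private:
    void findBoards() {}         // placeholder for the real discovery code
};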
The XGL_DPP library frequently fails to reply to the configure() call during startup. I have increased the timeout in dante::wait_reply() from 10 seconds to 20 seconds, but that does not help. I also added a "const char *caller" argument to wait_reply() so I can see which XGL_DPP function was being called when the timeout occurs; a sketch of the idea follows.
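A sketch of that diagnostic, assuming an epicsEvent-based wait; the real wait_reply() body in dante.cpp may differ, and the event handle, return convention, and constant below are illustrative:

// Sketch under assumptions: the point is the extra "caller" argument,
// so a timeout message names the unanswered XGL_DPP call.
#include <cstdio>
#include <epicsEvent.h>

static const double REPLY_TIMEOUT = 20.0;   // raised from 10 s, as described above

static bool wait_reply(epicsEventId replyEvent, const char *caller)
{
    if (epicsEventWaitWithTimeout(replyEvent, REPLY_TIMEOUT) != epicsEventWaitOK) {
        printf("wait_reply: no reply to %s after %.0f seconds\n",
               caller, REPLY_TIMEOUT);
        return false;
    }
    return true;
}

// Hypothetical call site: if (!wait_reply(replyEvent, "configure")) { ... }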
This is what I see when it fails:
The only solution at that point is to restart the IOC. This only happens during startup; once the IOC is completely started it never seems to happen.
Between this issue and #9 the IOC fails to start as much as 90% of the time. This makes development very difficult, because each time I change the code I may have to try for 10-15 minutes before the IOC starts successfully.