KIT-IBPT / epics-open62541

EPICS device support acting as an OPC UA client

Long boot process and crashes if OPCUA server is not reachable #5

Open eddybl opened 1 year ago

eddybl commented 1 year ago

We use one IOC with multiple connections to several OPCUA servers.

One of these OPCUA servers now appears to be down. This seems to crash the IOC after a while, and it also makes startup very slow, because the IOC tries to connect for each individual channel of the unavailable server. As a result, IOC init takes a couple of minutes while it is still working through the unavailable channels, and then the IOC suddenly crashes:

2022/11/15 16:32:18.709832 cbLow <PV_PREFIX_HIDDEN>:PowerMon:F10:Phase:L1 Record processing failed: Error monitoring node: BadInternalError
[2022-11-15 16:32:18.709 (UTC+0100)] info/client Connecting to endpoint opc.tcp://<PLC_URL_HIDDEN>:4840
[2022-11-15 16:32:18.709 (UTC+0100)] info/client SecurityPolicy not specified -> use default #None
[2022-11-15 16:32:18.710 (UTC+0100)] warn/securitypolicy Security policy None is used to create SecureChannel. Accepting all certificates
[2022-11-15 16:32:21.781 (UTC+0100)] warn/network        Connection to opc.tcp://<PLC_URL_HIDDEN>:4840 failed with error: No route to host
[2022-11-15 16:32:21.781 (UTC+0100)] error/client        Opening the TCP socket failed
[2022-11-15 16:32:21.781 (UTC+0100)] error/client        Couldn't connect the client to a TCP secure channel
2022/11/15 16:32:21.781605 non-EPICS_140462526551808 Could not connect to OPC UA server: BadConnectionClosed
[2022-11-15 16:32:21.781 (UTC+0100)] error/network       No connection to server.
[2022-11-15 16:32:21.781 (UTC+0100)] info/client Connecting to endpoint opc.tcp://<PLC_URL_HIDDEN>:4840
[2022-11-15 16:32:21.781 (UTC+0100)] info/client SecurityPolicy not specified -> use default #None
[2022-11-15 16:32:21.781 (UTC+0100)] warn/securitypolicy Security policy None is used to create SecureChannel. Accepting all certificates
2022/11/15 16:32:21.781700 cbLow <PV_PREFIX_HIDDEN>:PowerMon:F10:Phase:L2 Record processing failed: Error monitoring node: BadInternalError
[2022-11-15 16:32:24.857 (UTC+0100)] warn/network        Connection to opc.tcp://<PLC_URL_HIDDEN>:4840 failed with error: No route to host
[2022-11-15 16:32:24.857 (UTC+0100)] error/client        Opening the TCP socket failed
[2022-11-15 16:32:24.857 (UTC+0100)] error/client        Couldn't connect the client to a TCP secure channel
2022/11/15 16:32:24.857576 non-EPICS_140462526551808 Could not connect to OPC UA server: BadConnectionClosed
[2022-11-15 16:32:24.857 (UTC+0100)] error/network       No connection to server.
[2022-11-15 16:32:24.857 (UTC+0100)] info/client Connecting to endpoint opc.tcp://<PLC_URL_HIDDEN>:4840
[2022-11-15 16:32:24.857 (UTC+0100)] info/client SecurityPolicy not specified -> use default #None
[2022-11-15 16:32:24.857 (UTC+0100)] warn/securitypolicy Security policy None is used to create SecureChannel. Accepting all certificates
2022/11/15 16:32:24.857705 cbLow <PV_PREFIX_HIDDEN>:PowerMon:F10:Phase:L3 Record processing failed: Error monitoring node: BadInternalError

Both issues (slow init and crash) seem less than ideal. Is there something that could improve this situation?

smarsching commented 1 year ago

The crashes definitely are a bug that needs to be addressed. Do you have any more details that might help with reproducing this problem (e.g. does it only happen when a server is unavailable or when there is more than one connection defined in the IOC)?

The startup taking so long is the result of the code trying to reconnect if there is no working connection. We could implement a mechanism that blocks reconnection attempts for a certain time after a failed connection attempt. This would probably improve startup times in this scenario, the downside being that this means it might take longer for a connection to be reestablished after the cause of the problem has been resolved.
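To make the trade-off concrete, here is a minimal sketch of such a blocking mechanism. It is written in Python purely for illustration and is not code from this device support; `ReconnectThrottle`, `process_record`, and the `cooldown` value are hypothetical names and parameters. The idea is that after one failed attempt, all further attempts within the cooldown window fail immediately instead of each waiting for the TCP connect to time out.

```python
import time


class ReconnectThrottle:
    """Blocks reconnection attempts for `cooldown` seconds after a failure.

    Hypothetical helper for illustration; not part of epics-open62541.
    """

    def __init__(self, cooldown, clock=time.monotonic):
        self.cooldown = cooldown      # seconds to wait after a failed attempt
        self.clock = clock            # injectable clock, useful for testing
        self.blocked_until = None     # None means attempts are allowed

    def may_attempt(self):
        # Allowed if no failure is recorded or the cooldown has elapsed.
        return self.blocked_until is None or self.clock() >= self.blocked_until

    def record_failure(self):
        self.blocked_until = self.clock() + self.cooldown

    def record_success(self):
        self.blocked_until = None


def process_record(throttle, connect):
    """Try to connect for one record; return True only if connected."""
    if not throttle.may_attempt():
        return False                  # fail fast, no network timeout incurred
    try:
        connect()
    except ConnectionError:
        throttle.record_failure()
        return False
    throttle.record_success()
    return True
```

With a throttle shared per server connection, only the first record processed after the server goes away pays the connect timeout; the remaining records are rejected immediately until the cooldown expires. The cost is the one already mentioned: once the server is back, reconnection can be delayed by up to `cooldown` seconds.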

eddybl commented 1 year ago

I added a branch, "test-opcua-issues", to the IOC for the power monitoring PLCs. It contains only the PLC that is currently offline.

Without all the other channels, the IOC init does seem to complete reasonably quickly. However, the IOC still tries (and fails) to connect for each individual channel, one after another, which takes 3-4 seconds per channel, so right now it takes around 10 minutes to loop through all records. Once all records were processed, further errors show up in non-EPICS lines:

2022/11/15 23:13:05.273562 non-EPICS_139649138407168 Could not connect to OPC UA server: BadConnectionClosed

Without the other channels it does not seem to crash, whereas with all the channels it does seem to crash (as evidenced by the cmk e-mails related to the Power Monitoring PLC IOC).
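As a rough cross-check of these numbers, the sketch below takes the timestamps of two connection attempts from the log in the first comment (from "Connecting to endpoint" to "failed with error") and estimates how many sequential attempts would fit into 10 minutes. It assumes that the attempts on the test branch take about as long as those in that earlier log, which has not been verified.

```python
from datetime import datetime

# (start, end) of two connection attempts, copied from the log in the
# first comment: "Connecting to endpoint" -> "failed with error".
attempts = [
    ("2022-11-15 16:32:18.709", "2022-11-15 16:32:21.781"),
    ("2022-11-15 16:32:21.781", "2022-11-15 16:32:24.857"),
]


def seconds_between(start, end):
    """Return the elapsed time in seconds between two log timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


durations = [seconds_between(start, end) for start, end in attempts]
mean_attempt = sum(durations) / len(durations)

# "around 10 minutes" as reported; an approximate figure, not a measurement.
total_startup = 10 * 60
implied_attempts = total_startup / mean_attempt

print(f"per attempt: {durations} s, mean {mean_attempt:.3f} s")
print(f"implied sequential attempts in 10 min: {implied_attempts:.0f}")
```

Each attempt in that log takes a little over 3 seconds, which matches the reported 3-4 seconds per channel, so a 10-minute loop is consistent with roughly one failed attempt per record for a couple of hundred records.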