epics-modules / opcua

EPICS Device Support for OPC UA

Very large updates on subscriptions kill the connection #113

Open ralphlange opened 3 years ago

ralphlange commented 3 years ago

Reported by James Wilson STFC UKRI:

I’ve got an OPC Server setup that is hosting 600 arrays. Each array is 7500 elements of type Double.

I’ve then got an IOC setup as an OPC UA Client, trying to read all of those variables into PVs. Every time the OPC Server updates the variables I see the error:

OPC UA session OPC1: connection status changed from Connected to ConnectionErrorApiReconnect
OPC UA session OPC1: connection status changed from ConnectionErrorApiReconnect to Connected

It will work fine for up to 288 variables but any more than that causes it to fall over.

Enabling debugging in the opcuaCreateSession command gives more information. Initially the error status I get is:

Session OPC1: (readComplete) for read service (transaction id 2) failed with status BadResponseTooLarge

Which can be fixed by limiting the maximum nodes with either ‘nodes-max’ or ‘read-nodes-max’.

This then changes the error to:

OPC UA Session OPC1: connection status changed from Connected to ConnectionErrorApiReconnect
OPC UA Session OPC1: connection status changed from ConnectionErrorApiReconnect to Connected

Session OPC1: triggering initial read for all 600 items

And the PVs go from a ‘COMM INVALID’ state at the time of the connection status errors being reported in the IOC shell, to being fine and showing the data after the triggering initial read message. So it looks like both the Server and Client are capable of running the connection fine, but that the client for some reason thinks there has been an error every time it gets a trigger from the OPC UA Server but is able to recover, reconnect, and read successfully after that?

My attempts to use the read-timeout-max and -min options just stopped the PVs from coming back to life after the ‘COMM INVALID’ error states; they go into ‘READ INVALID’ states instead. That’s probably irrelevant, though, as it’ll just be that my guesses at acceptable timeout values are wrong.

If I split the OPC UA comms so it uses multiple Subscriptions from the same OPC Session then I can get it to work fine with no errors. So I guess that works as a workaround, but can you think of any other reason why I’d be having these errors come back when I try to use one Subscription for all 600 Waveform PVs? The PVs are 7500 elements of 32-bit floating point type.

ralphlange commented 3 years ago

Googling BadResponseTooLarge shows that this error indeed means the message from the server exceeds client side limits.

I'll find out why and where these limits are and try to add a way to set them.

The rest looks consistent. The nodes-max only affects the direct read and write operations. Subscriptions have updates coming in from the server, where the client has absolutely no control over how big these updates are. I think that you are successfully limiting the size of read operations so that the answers are small enough for the client, but the single subscription still has updates that are too large, and the client cuts the connection. (The ApiReconnect then Connected means a connection loss followed by reconnection.) Splitting things up on multiple subscriptions reduces the maximum size of an update on any of those subscriptions, and all works fine.

The timeout options only slow things down for read and write operations; they affect neither message sizes nor subscriptions. Thanks for your detective work!! Splitting things up as you do is a valid workaround, and I will come up with a candidate fix for you to test very soon (I hope).

ralphlange commented 3 years ago

I have verified that the BadResponseTooLarge status is not generated in the client. It is generated by the server in case the server thinks a response message is exceeding the limits set by the client. However, there seem to be cases where servers are not correctly interpreting the limits a client sets when opening the connection. (See: https://github.com/FreeOpcUa/python-opcua/issues/247 or https://github.com/node-opcua/node-opcua/issues/1004, which clearly are about different clients and servers.)

Have you contacted NI and asked about possible limitations on their server? If you're using the UAExpert client, can you connect to the server and subscribe to all the arrays?

Note that doubles use 64 bits, so the complete data size is about 36 MB (or 18 MB for single-precision floats). Your reported limit of 288 arrays would be roughly equivalent to 17 MB (8.5 MB for floats), which happens to be of the same order of magnitude as the MaxMessageSize that UAExpert seems to set (shown in one of the referenced issues). (UAExpert uses the same low-level client library as the EPICS Device Support.)

ralphlange commented 3 years ago

Those limits (sent in the Hello message when opening a connection) are indeed set in the client, where they're defined in opcua_config.h as:

/*============================================================================
 * binary serializer constraints
 *===========================================================================*/
/** @brief The maximum size of memory allocated by a serializer */
#ifndef OPCUA_SERIALIZER_MAXALLOC
# define OPCUA_SERIALIZER_MAXALLOC                  16777216
#endif /* OPCUA_SERIALIZER_MAXALLOC */

/** @brief Maximum String Length accepted */
#ifndef OPCUA_ENCODER_MAXSTRINGLENGTH
# define OPCUA_ENCODER_MAXSTRINGLENGTH              ((OpcUa_UInt32)16777216)
#endif /* OPCUA_ENCODER_MAXSTRINGLENGTH */

/** @brief Maximum Array Length accepted */
#ifndef OPCUA_ENCODER_MAXARRAYLENGTH
# define OPCUA_ENCODER_MAXARRAYLENGTH               ((OpcUa_UInt32)65536)
#endif /* OPCUA_ENCODER_MAXARRAYLENGTH */

/** @brief Maximum ByteString Length accepted */
#ifndef OPCUA_ENCODER_MAXBYTESTRINGLENGTH
# define OPCUA_ENCODER_MAXBYTESTRINGLENGTH          ((OpcUa_UInt32)16777216)
#endif /* OPCUA_ENCODER_MAXBYTESTRINGLENGTH */

/** @brief Maximum Message Length accepted */
#ifndef OPCUA_ENCODER_MAXMESSAGELENGTH
# define OPCUA_ENCODER_MAXMESSAGELENGTH             ((OpcUa_UInt32)16777216)
#endif /* OPCUA_ENCODER_MAXMESSAGELENGTH */

Why they are set at these values is a good question for UnifiedAutomation. You can experiment with changing them once you have the source code. (opcua_config.h is generated through cmake during the build.)