OPCFoundation / UA-.NETStandard-Samples


Aggregation server does not reconnect to downstream servers when server is restarted or network connection is lost #312

Open: pc-avatar-7076 opened this issue 2 years ago

pc-avatar-7076 commented 2 years ago

Scenario 1: Downstream Server Network Adapter Disabled

Test:

  1. Configure the Aggregator (dev host machine) to connect to the OPCF reference server (VM running on the dev host).
  2. Start the Aggregator (debugging in the IDE) and verify the connection and steady-state behavior of the “metadata” session.
  3. Disable the VM network adapter and observe the behavior of the metadata session.
  4. Wait until the Aggregator keep-alive status is “late”.
  5. Re-enable the VM network adapter and observe the behavior of the metadata session.

Results: TL;DR - Once the adapter is re-enabled, the Aggregator keeps attempting to “renew” the secure channel, but each OpenSecureChannelRequest fails with a ServiceFault (BadTcpSecureChannelUnknown). See the attached capture “opcf-reference-aggregator-southbound-disable-adapter-opcf-reference-server.pcapng” (zipped).
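The failure mode can be illustrated with a small model. In OPC UA TCP, an OpenSecureChannelRequest carries a request type of either Issue (create a new channel) or Renew (refresh the token of an existing channel identified by its SecureChannelId). If the server no longer knows that channel, presumably because it discarded it during the outage or lost all state on restart, a Renew cannot succeed. That matches the BadTcpSecureChannelUnknown fault in the capture. The sketch below is a simplified Python model, not UA-.NETStandard code. `ModelServer`, `open_secure_channel`, and `reconnect_channel` are hypothetical names. Falling back to Issue after BadTcpSecureChannelUnknown is one possible client policy, not necessarily the fix the stack adopts.

```python
from enum import Enum
import itertools


class RequestType(Enum):
    # Mirrors the two OPC UA SecurityTokenRequestType values.
    ISSUE = 0
    RENEW = 1


class ServiceFault(Exception):
    def __init__(self, status: str):
        super().__init__(status)
        self.status = status


class ModelServer:
    """Hypothetical server that keeps secure channels only in memory."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.channels = set()

    def restart(self):
        # A restart (or channel cleanup after a long outage) drops all channel state.
        self.channels.clear()

    def open_secure_channel(self, request_type, channel_id=None):
        if request_type is RequestType.RENEW:
            # Renew only works for a channel the server still remembers.
            if channel_id not in self.channels:
                raise ServiceFault("BadTcpSecureChannelUnknown")
            return channel_id
        new_id = next(self._ids)
        self.channels.add(new_id)
        return new_id


def reconnect_channel(server, channel_id):
    """Try to renew the old channel; fall back to issuing a new one (hypothetical policy)."""
    try:
        return server.open_secure_channel(RequestType.RENEW, channel_id)
    except ServiceFault as fault:
        if fault.status != "BadTcpSecureChannelUnknown":
            raise
        return server.open_secure_channel(RequestType.ISSUE)


server = ModelServer()
channel = server.open_secure_channel(RequestType.ISSUE)  # channel 1
server.restart()

# Renew alone keeps failing after the restart:
try:
    server.open_secure_channel(RequestType.RENEW, channel)
except ServiceFault as fault:
    print(fault.status)  # BadTcpSecureChannelUnknown

# Renew with an Issue fallback recovers with a fresh channel id:
print(reconnect_channel(server, channel))  # 2
```

In this model, a client that only ever sends Renew loops on the same fault indefinitely, which is the behavior the capture shows.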

Note: Tested with UA-.NETStandard tag 1.4.368.53 and UA-.NETStandard-Samples tag preview at commit e17387d791d836d1e16b9392b282a688f3f12d90.

Wireshark Capture: opcf-reference-aggregator-southbound-disable-adapter-opcf-reference-server.zip

Scenario 2: Downstream Server Stop/Start

Test:

  1. Configure the Aggregator (dev host machine) to connect to the OPCF reference server (VM running on the dev host).
  2. Start the Aggregator (debugging in the IDE) and verify the connection and steady-state behavior of the “metadata” session.
  3. Stop the OPCF reference server and observe the behavior of the metadata session.
  4. Wait until the Aggregator keep-alive status is “late”.
  5. Start the OPCF reference server and observe the behavior of the metadata session.

Preliminary Note: This investigation required an OPCF UA stack bug fix for the following issues:

The bug fix is available in UA-.NETStandard tag 1.4.368.53 and UA-.NETStandard-Samples tag preview on commit e17387d791d836d1e16b9392b282a688f3f12d90.

Results: TL;DR - The Aggregator has a keep-alive handler with logic to re-establish the Session once the keep-alive is “late”. That logic is never executed because the secure channel cannot be re-established after the downstream server restarts. This appears to be because the Aggregator attempts to “renew” the channel instead of creating a new one.

Technical Details: In steady state (Aggregator connected to the OPCF reference server), the keep-alive timer calls BeginRead, which increments m_outstandingRequests to 1. The counter is decremented back to 0 when the read response is received.
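The steady-state bookkeeping can be modeled as follows. This is a simplified, single-threaded Python model, not the Session class. `ModelSession`, `begin_read`, `async_request_started`, `async_request_completed`, and the `outstanding_requests` attribute (standing in for m_outstandingRequests) are hypothetical names. The model shows that the counter is raised in the request-started callback, so it only moves once the read has actually been dispatched.

```python
class ModelSession:
    """Hypothetical stand-in for the session's keep-alive bookkeeping."""

    def __init__(self):
        self.outstanding_requests = 0  # plays the role of m_outstandingRequests
        self.pending = []              # reads that were sent but not yet answered

    def async_request_started(self, request_id):
        # Counterpart of AsyncRequestStarted: runs only once the request is sent.
        self.outstanding_requests += 1
        self.pending.append(request_id)

    def async_request_completed(self, request_id):
        # Called when the read response arrives.
        self.pending.remove(request_id)
        self.outstanding_requests -= 1

    def begin_read(self, request_id):
        # In steady state the send succeeds, so the started callback fires.
        self.async_request_started(request_id)

    def on_keep_alive_timer(self, request_id):
        self.begin_read(request_id)


session = ModelSession()
session.on_keep_alive_timer(request_id=1)
print(session.outstanding_requests)  # 1 while the read is in flight
session.async_request_completed(1)   # read response received
print(session.outstanding_requests)  # 0
```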

When the downstream server is shut down, Session.OnKeepAlive calls BeginRead and the operation fails with an exception ("Could not send keep alive request: Opc.Ua.ServiceResultException BadConnectionClosed"). AsyncRequestStarted is never called, so m_outstandingRequests is never incremented.

After the downstream server has been restarted, the client cannot re-establish the secure channel (it tries to “renew” the existing channel rather than create a fresh one). Session.OnKeepAlive keeps calling BeginRead, which keeps throwing the same exception ("Could not send keep alive request: Opc.Ua.ServiceResultException BadConnectionClosed"). Again, AsyncRequestStarted is never called, and m_outstandingRequests is never incremented.
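The failure path can be sketched with a similar hypothetical model. This is again not the real Session code: `ChannelClosed`, `begin_read`, and `keep_alive_tick` are illustrative names. Here the send raises before the request-started callback runs, so the counter never leaves zero and every timer tick reproduces the same error. The channel stays closed in the model because, per the observation above, the client only ever tries to renew it.

```python
class ChannelClosed(Exception):
    """Stands in for ServiceResultException(BadConnectionClosed)."""


class ModelSession:
    def __init__(self, channel_open=True):
        self.channel_open = channel_open
        self.outstanding_requests = 0  # plays the role of m_outstandingRequests
        self.errors = []

    def async_request_started(self):
        self.outstanding_requests += 1

    def begin_read(self):
        # The send fails synchronously, before async_request_started is reached.
        if not self.channel_open:
            raise ChannelClosed("BadConnectionClosed")
        self.async_request_started()

    def keep_alive_tick(self):
        try:
            self.begin_read()
        except ChannelClosed as exc:
            # Mirrors the logged "Could not send keep alive request" message.
            self.errors.append(f"Could not send keep alive request: {exc}")


# Channel never comes back because it is only ever "renewed" (model assumption).
session = ModelSession(channel_open=False)
for _ in range(3):
    session.keep_alive_tick()

print(session.outstanding_requests)  # 0
print(len(session.errors))           # 3
```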

mregen commented 2 years ago

This issue may be related to https://github.com/OPCFoundation/UA-.NETStandard/pull/1802. Please recheck once the fix is available.

mregen commented 2 years ago

Hi @pcameron-ptc, I think the case you describe may not be covered by fix #1802. Thanks for the detailed repro write-up; we will check whether this is fixed.

pc-avatar-7076 commented 2 years ago

Thank you @mregen. I pulled down UA-.NETStandard branch release/1.4.368 with the recent reconnect changes (i.e. #1802) and updated the Aggregation sample to reference my local build of the stack. I did not observe any change in behavior regarding this defect.