Azure / Industrial-IoT

Azure Industrial IoT Platform
MIT License
523 stars 214 forks source link

Item was disposed or moved to another subscription #2317

Closed steffenbeermann closed 2 months ago

steffenbeermann commented 2 months ago

Describe the bug We have a OPC Publisher with about 24 Endpoints with around 2000-3000 Nodes each. We just updates to version 3.9.10 and facing the issue that after a couple of hours the OPC Publisher removed almost all nodes of the endpoints with the logs stating: 24-08-07 15:35:27.9796] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=...' with server id 1 (created): Item was disposed or moved to another subscription

We can see this error message a lot in the logs, since the issue occurred it was written for the same nodes every hour.

This affected 22 of the endpoints where for most of them, all nodes where disposed and for some only about 60% were disposed. We athough noticed this happend in two waves. Yesterday at 17:30 6 Endpoints went down and after about 2 hours the other 16 were affected.

After restarting the publisher, all connections were established again and working fine for now. It seems that this is not an issue with the OPC Servers or Nodes because with a restart everything works fine again

To Reproduce Publisher Config: { "Hostname": "publisher_axh_active", "Cmd": [ "--cl=5", "--cf", "--aa", "--pf=/mount/publishednodes.json", "--PkiRootPath=/mount/pki", "--bs=100", "--di=3600" ], "HostConfig": { "Binds": [ "/home/edge/OPCPublisherMount_Active:/mount" ] } }

Version used: 3.9.10

Expected behavior The nodes are not disposed.

Additional context Part of the logs: ` em.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA700.in.betriebsart.value' with server id 205 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_MELDEKONTAKT.1_3MFB40_EA801_YA03.in.kontakt.value' with server id 206 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA801_XM54.in.kontakt.value' with server id 207 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_MELDEKONTAKT.1_3MFB40_EA802_YA03.in.kontakt.value' with server id 208 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA802_XM54.in.kontakt.value' with server id 209 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_MELDEKONTAKT.1_3MFB40_EA811_YA03.in.kontakt.value' with server id 210 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA811_XM54.in.kontakt.value' with server id 211 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_MELDEKONTAKT.1_3MFB40_EA812_YA03.in.kontakt.value' with server id 212 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA812_XM54.in.kontakt.value' with server id 213 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA851_XM54.in.kontakt.value' with server id 214 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA852_XM54.in.kontakt.value' with server id 215 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA862_XM54.in.kontakt.value' with server id 216 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA861_XM54.in.kontakt.value' with server id 217 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA871_XM54.in.kontakt.value' with server id 218 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_ALARMKONTAKT.1_3MFB40_EA872_XM54.in.kontakt.value' with server id 219 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA710.in.zustand0.value' with server id 220 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA710.in.zustand1.value' with server id 221 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA710.in.aktive_bedienebene.value' with server id 222 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA710.in.modus.value' with server id 223 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_SOLLWERT.1_3MFB41_AP100_ZE11.out.sollwert.value' with server id 224 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_SOLLWERT.1_3MFB41_AP200_ZE11.out.sollwert.value' with server id 225 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA711.in.zustand0.value' with server id 226 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_AGGREGAT_NL.1_3MFB40_EA711.in.zustand1.value' with server id 227 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CL001_XQ01.in.ausfall.value' with server id 228 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CL001_XQ01.in.minalarm.value' with server id 229 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CL001_XQ01.out.gw_minalarm.value' with server id 230 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CL001_XQ01.in.tief.value' with server id 231 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CL001_XQ01.out.gw_tief.value' with server id 232 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CL001_XQ01.in.wert.value' with server id 233 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s={827D3700-C4FF-41A9-884F-B49DA06097ED}' with server id 1 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CT001_XQ01.in.ausfall.value' with server id 234 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CT001_XQ01.in.meldung1.value' with server id 235 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s={82218900-CF04-457D-89AA-4658952B6D64}' with server id 2 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s={101CE900-4D46-4D4B-A828-84D575C85DEF}' with server id 3 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s={D3E10101-CD29-445D-89EB-3D5089A7F197}' with server id 4 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9835] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaMonitoredItem.DataChange[0] Data Item 'ns=2;s=RO_KW_MESSWERT.1_3MFB41_CT001_XQ01.in.meldung2.value' with server id 236 (created): Item was disposed or moved to another subscription [24-08-07 15:35:27.9847] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] 1fdf3dbee7ab3b7700eb7c690aab5a41af407714_0:3762709867 - Now monitoring 155 nodes:

Good/Bad: 155/0

  # Reporting:    155
  # Sampling:     0
  # Disabled:     0
  # Not applied:  0
  # Removed:      0

[24-08-07 15:35:27.9849] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] 989ca92c8a770ee4b215750a5e242542683c1ca9_0:3762709869 - Now monitoring 52 nodes:

Good/Bad: 52/0

  # Reporting:    52
  # Sampling:     0
  # Disabled:     0
  # Not applied:  0
  # Removed:      0

`

marcschier commented 2 months ago

@steffenbeermann - please could you try mcr.microsoft.com/iotedge/opc-publisher:2.9.11-preview1, which should address this issue.

marcschier commented 2 months ago

Background - this happens during subscription reset where the old monitored items are not properly removed from the subscription and then remain in a disposed state. Data should flow, but heartbeats and other timers are off and not firing anymore.

steffenbeermann commented 2 months ago

Thank you for the quick response! To understand the issue: Why are the subscriptions reset in the first place? I noticed that every hour the subscriptions were recreated, eg: `[24-08-07 15:05:45.2384] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] 4e7e56282acc83bbaae502cca831be66e4365a23_3:2124724734 - Now monitoring 1000 nodes:

Good/Bad: 969/31

  # Reporting:    1000
  # Sampling:     0
  # Disabled:     0
  # Not applied:  0
  # Removed:      0`

In our case, we sadly hadn't any dataflow anymore for the affected subscriptions?

marcschier commented 2 months ago

The message you are seeing there happens as a result of a) when a session recovery/reconnect sequence was run and produced either a new session or the old one. Then the subscription state in the session is synchronized which if it is new creates the monitored items - or - b) the subscription itself ran into an issue e.g. the subscription lifetime expired on the server and status change notification returned, or keep alives were missing on our side - then the subscription in a running session is reset (closed and recreated). Then the items should have been recreated, but the bug we had was that the existing (now disposed items) are not even touched anymore. But they are reported now as "bad". We fixed this in 2.9.11.

This should all be prominent in the log. If you can try 2.9.11-preview1 that would help me a lot.

You also see this message you pasted if applying one or more monitored items failed before or on the general management timer. Typically nothing is done to the subscription or items at this point though if all is working fine.

steffenbeermann commented 2 months ago

Thanks for the clarification!
I upgraded to 2.9.11-preview1 and so far everything looks good. I will monitor the logs and will give an update tomorrow!

steffenbeermann commented 2 months ago

It seems like 3.9.11-preview11 helped with the issue of Item was disposed or moved to another subscription. But we had one endpoint with issues where I am not to sure what happend and if it is related to the new publisher version. Here are the findings:

marcschier commented 2 months ago

Thank you for the details @steffenbeermann, when the subscription is STOPPED, do you see session keep alive timeout logs as well (this is the way in OPC UA connection drops are discovered). If the channel is closed, the session keep alive check should fail and trigger a reconnect. The keep alive error and reconnect should be visible in the logs,

 ... [state:Ready|refs:<n>]: Got Keep Alive error: ...

If not it would be great if you could share (e.g., via github gist or support team the full logs).

Another important diagnostics info is the number of publish requests in progress, it is one of the metrics at the top of the diagnostic output to console (if --di is not disabled) like this:

 # Queued/Minimum request count       :              4 | 2
 # Good/Bad Publish request count     :              2 | 2

What server is this connected to?

steffenbeermann commented 2 months ago

We have two Publisher Modules that connect to the same endpoints. In one of them I could not find the Got Keep Alive error, but this could happen because the save the logs with an custom module and with this there is a chance that when a lot of logs are written some logs get lost....

In the second module log I found some, here are the logs related to the failed endpoint and I maybe found some usefull logs: ` [24-08-08 17:36:58.3624] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Ready|refs:3]: Got Keep Alive error: BadNoCommunication 'Server not responding to keep alive requests.' (08/08/2024 17:36:58:BadNoCommunication 'Server not responding to keep alive requests.' [24-08-08 17:36:58.3645] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Ready|refs:3]: Got Keep Alive error: BadNoCommunication 'Server not responding to keep alive requests.' (08/08/2024 17:36:58:BadNoCommunication 'Server not responding to keep alive requests.' ... 24-08-08 17:36:58.3651] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Ready|refs:3]: Reconnecting session opc.tcp://0.0.0.0:4840_116A0CAD_KWT due to error BadNoCommunication 'Server not responding to keep alive requests.'... [24-08-08 17:36:58.3651] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Connecting|refs:3]: Session opc.tcp://0.0.0.0:4840_116A0CAD_KWT with opc.tcp://0.0.0.0:4840 changed from Ready to Connecting ... [24-08-08 17:36:59.2681] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Ready|refs:3]: Session opc.tcp://0.0.0.0:4840_116A0CAD_KWT with opc.tcp://0.0.0.0:4840 changed from Connecting to Ready [24-08-08 17:36:59.2682] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Ready|refs:3]: Client RECOVERED! [24-08-08 17:36:59.2769] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #93680, Reconnecting=False, Error: BadNoSubscription [24-08-08 17:36:59.2857] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #93681, Reconnecting=False, Error: BadNoSubscription [24-08-08 17:36:59.2915] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] e8586836593ba92624a1f0b6516d1bcf11d4b67c_0:16 - Now monitoring 1000 nodes:

Good/Bad: 1000/0

  # Reporting:     1000
  # Sampling:      0
  # Heartbeat/ing: 0/0
  # Condition/ing: 0/0
  # Disabled:      0
  # Not applied:   0
  # Removed:       0

[24-08-08 17:36:59.2952] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] e8586836593ba92624a1f0b6516d1bcf11d4b67c_1:17 - Now monitoring 1000 nodes:

Good/Bad: 1000/0

  # Reporting:     1000
  # Sampling:      0
  # Heartbeat/ing: 0/0
  # Condition/ing: 0/0
  # Disabled:      0
  # Not applied:   0
  # Removed:       0

[24-08-08 17:36:59.2982] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:18 - Now monitoring 757 nodes:

Good/Bad: 757/0

  # Reporting:     757
  # Sampling:      0
  # Heartbeat/ing: 0/0
  # Condition/ing: 0/0
  # Disabled:      0
  # Not applied:   0
  # Removed:       0

[24-08-08 17:36:59.2984] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaClient[0] opc.tcp://0.0.0.0:4840_116A0CAD_KWT [state:Ready|refs:3]: Applying changes to 3 subscription(s) took 00:00:00.0301485. [24-08-08 17:36:59.3074] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #93682, Reconnecting=False, Error: BadNoSubscription [24-08-08 17:36:59.3157] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #93683, Reconnecting=False, Error: BadNoSubscription [24-08-08 17:36:59.3630] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #27560, Reconnecting=True, Error: BadSecureChannelClosed ... [24-08-08 17:37:59.2879] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0]

5/30: Subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_0:16 is missing keep alive.

[24-08-08 17:37:59.2920] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0]

5/30: Subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_1:17 is missing keep alive.

[24-08-08 17:37:59.2961] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0]

5/30: Subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:18 is missing keep alive.

... [24-08-08 17:43:11.2889] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Trigger management of subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_0:16... [24-08-08 17:43:11.2923] crit: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0]

31/30: Keep alive count exceeded. Perform Reset for e8586836593ba92624a1f0b6516d1bcf11d4b67c_1:17...

[24-08-08 17:43:11.2923] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Trigger management of subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_1:17... [24-08-08 17:43:11.2964] crit: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0]

31/30: Keep alive count exceeded. Perform Reset for e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:18...

[24-08-08 17:43:11.2964] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Trigger management of subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:18... ... 24-08-08 17:43:12.7730] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Enabled Subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_1:20 in session opc.tcp://0.0.0.0:4840_116A0CAD_KWT. [24-08-08 17:43:12.7828] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] ======== Closing subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:18 and re-creating ========= [24-08-08 17:43:12.8242] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Subscription 'e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:0' closed. [24-08-08 17:43:12.8242] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Creating new disabled subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:0 in session opc.tcp://0.0.0.0:4840_116A0CAD_KWT. [24-08-08 17:43:12.8329] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Successfully created subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:21'. Actual (revised) state/desired state:

PublishingEnabled False/False

  # PublishingInterval 1000/1000
  # KeepAliveCount 10/0
  # LifetimeCount 30/0

[24-08-08 17:43:13.3566] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:21 - Now monitoring 757 nodes:

Good/Bad: 757/0

  # Reporting:     757
  # Sampling:      0
  # Heartbeat/ing: 0/0
  # Condition/ing: 0/0
  # Disabled:      0
  # Not applied:   0
  # Removed:       0

[24-08-08 17:43:13.3651] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0] Enabled Subscription e8586836593ba92624a1f0b6516d1bcf11d4b67c_2:21 in session opc.tcp://10.166.73.224:4840_116A0CAD_KWT. `

Because there are some additional informations like node names and so on, I need to check if I can upload the full logs. I will give an update on monday.

I could not find the diagnostic info, even the start up is with '--dii==3600'? But we collect all metrics via the metric collector. For iiot_edge_publisher_good_nodes / iiot_edge_publisher_bad_nodes there were no changes in the timespan of the issue. But I am not sure if this is the same? I could not find the corresponding metrics to Queued/Minimum request count?

It connects to an OPC UA Server that is provided by a power plant provider. Sadly I don't know which kind of software they are using to provide the OPC UA Server.

marcschier commented 2 months ago

You can set --di=60 and there will be metrics printed to the console every minute. But it is ok what you are doing. I could not see anything concerning above. Don't share - it is ok. What is likely the key piece is

zure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #93682, Reconnecting=False, Error: **BadNoSubscription** [24-08-08 17:36:59.3157] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #93683, Reconnecting=False, Error: **BadNoSubscription** [24-08-08 17:36:59.3630] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0] Publish #27560, Reconnecting=True, Error: **BadSecureChannelClosed** ... [24-08-08 17:37:59.2879] info: 

You could test 2.9.11-preview2 which I just released which will re-enable the quirk that resets the session when the publish requests return BadNoSession or BadInvalidSessionId. Because of a reported interop issue, I went through a lot of reconnect testing using a network shaper yesterday and today and validated that with TCP RST or tampered packets the publisher always recovered, so it is possibly a server vendor issue, too, just like in the reported case.

steffenbeermann commented 2 months ago

okay, I will test 2.9.11-preview2. But over the weekend it ran without problems with 2.9.11-preview1 as well.

marcschier commented 2 months ago

Thanks for testing and confirming, I will close, please let me know if you encounter issues by either commenting here or open a new issue.