eclipse / milo

Eclipse Milo™ - an open source implementation of OPC UA (IEC 62541).
http://www.eclipse.org/milo
Eclipse Public License 2.0
1.16k stars 431 forks source link

Client stack - Reconnection does not restart subscriptions properly #1179

Closed persicsb closed 10 months ago

persicsb commented 11 months ago

Reconnection does not restart subscriptions The symptom is:

I think, the cause is twofold:

  1. Missing release of held concurrency resources In SessionFsmFactory.java:205, a new SessionFuture is unconditionally created and replaces the current SessionFuture, without closing the previous one, when the SessionFsm transitions from the Active state to CreatingWait. The previous SessionFuture is is left dangling, and not completed exceptionally via CompletableFuture.completeExceptionally(). Any active Futures shall be completed, before replacing the session future with a new instance, so anybody who is waiting for the previous future can move forward.

  2. Race condition In SessionFsm.java:80, when SessionFsm.getSession() returns with a future, the future in the context might change between the acquire of the future and the returned future, and at that point, the returned future is an old one. SessionFsm.getSession() acquires a future from the context, however, when it returns, the context might have a different future (because SessionFsm.getSession() is not an atomic call regards to the session context). This is an unsafe publication in terms of concurrency. Anybody, who has a reference to an old SessionFuture will wait indefinitely and be stuck (because the returned, old future is left dangling and never completed, see the previous point). For example, a SubscriptionManager will wait for the previous SessionFuture that never will be completed.

Expected behavior Resources, especially for concurrency control are not left dangling and properly completed. The concurrency resources are published in a safe way, so that consumers can properly use them. Since creating a new high-level client does not restart the stack, creating a new client does not solve the problem.

Additional context Affected Milo version: 0.6.8, 0.6.11 (and probably any prior)

kevinherron commented 11 months ago

The onSessionActive callback is not an appropriate place to make a blocking call that requires the Session because it will halt the progress of the FSM.

Subscriptions transfer is automatically attempted for you when the session is re-established. If this transfer fails, one of the onSubscriptionTransferFailed callbacks[1][2] will be invoked, and this is where you can safely start re-creating subscriptions.

[1] https://github.com/eclipse/milo/blob/f1ce134e2ffbf15f4c03269d62235985aa34fbc4/opc-ua-sdk/sdk-client/src/main/java/org/eclipse/milo/opcua/sdk/client/api/subscriptions/UaSubscription.java#L320 [2] https://github.com/eclipse/milo/blob/f1ce134e2ffbf15f4c03269d62235985aa34fbc4/opc-ua-sdk/sdk-client/src/main/java/org/eclipse/milo/opcua/sdk/client/subscriptions/ManagedSubscription.java#L1210

persicsb commented 11 months ago

Thanks for the reply.

I managed to track down a problem further.

        if (future != null) {
            return future;
        } else {
            // "Slow" path... not activated yet.
            Event.GetSession getSession = new Event.GetSession();

            fsm.fireEvent(getSession);

            return getSession.future;
        }

The call stack is the folloing:

So the slow path is not working at all.

I'm not sure, if onSubscriptionTransferFailed shall be the proper place to create ManagedSubscriptions, it seems, the Fsm is stuck, when we do that.

onSessionActive seems more intuitive (it shall signal, that the session is in active state, the client can do whatever it wants to do).

kevinherron commented 11 months ago

You'll have to provide more information or a reproducible test case.

This modification to one of the examples demonstrates re-creating a ManagedSubscription inside the transfer failed callback and it works without issue through multiple restarts of the server: https://gist.github.com/kevinherron/72b7ce12299f822fca0511320d171c31

06:23:34.113 [main] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - item created for nodeId=NodeId{ns=0, id=2258}
06:23:35.121 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:34 PST 2023, instant=2023-11-23T14:23:34.112339Z}}
06:23:36.121 [milo-shared-thread-pool-1] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:35 PST 2023, instant=2023-11-23T14:23:35.113134Z}}
06:23:37.124 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:36 PST 2023, instant=2023-11-23T14:23:36.117310Z}}
06:23:38.123 [milo-shared-thread-pool-1] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:37 PST 2023, instant=2023-11-23T14:23:37.120289Z}}
06:23:39.128 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:38 PST 2023, instant=2023-11-23T14:23:38.121603Z}}
06:23:40.133 [milo-shared-thread-pool-1] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:39 PST 2023, instant=2023-11-23T14:23:39.126652Z}}
06:23:43.898 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - item created for nodeId=NodeId{ns=0, id=2258}
06:23:44.903 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:43 PST 2023, instant=2023-11-23T14:23:43.899671Z}}
06:23:45.905 [milo-shared-thread-pool-1] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:44 PST 2023, instant=2023-11-23T14:23:44.901274Z}}
06:23:46.910 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:45 PST 2023, instant=2023-11-23T14:23:45.902302Z}}
06:23:47.909 [milo-shared-thread-pool-1] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:46 PST 2023, instant=2023-11-23T14:23:46.906482Z}}
06:23:48.910 [milo-shared-thread-pool-2] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:23:47 PST 2023, instant=2023-11-23T14:23:47.908513Z}}
06:24:05.497 [milo-shared-thread-pool-3] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - item created for nodeId=NodeId{ns=0, id=2258}
06:24:06.506 [milo-shared-thread-pool-3] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:24:05 PST 2023, instant=2023-11-23T14:24:05.498543Z}}
06:24:06.506 [milo-shared-thread-pool-3] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:24:06 PST 2023, instant=2023-11-23T14:24:06.499880Z}}
06:24:08.508 [milo-shared-thread-pool-4] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:24:07 PST 2023, instant=2023-11-23T14:24:07.501403Z}}
06:24:09.510 [milo-shared-thread-pool-3] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:24:08 PST 2023, instant=2023-11-23T14:24:08.504878Z}}
06:24:10.515 [milo-shared-thread-pool-4] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:24:09 PST 2023, instant=2023-11-23T14:24:09.509588Z}}
06:24:11.520 [milo-shared-thread-pool-3] INFO  o.e.m.e.c.ManagedSubscriptionDataExample - subscription value received: item=NodeId{ns=0, id=2258}, value=Variant{value=DateTime{date=Thu Nov 23 06:24:10 PST 2023, instant=2023-11-23T14:24:10.513486Z}}
persicsb commented 11 months ago

Thanks for the example.

I think I narrowed the problem, and I found a consistent way to reproduce and remove the problem.

If a single-threaded ExecutorService is set for the ClientConfigBuilder when creating the configuration, the resubscription won't work:

Example:

            builder = builder.setApplicationName(LocalizedText.english(CertConfig.COMMON_NAME))
                    .setApplicationUri(CertConfig.APP_URI)
                    .setIdentityProvider(createIdentityProvider(username, password))
                    .setExecutor(Executors.newFixedThreadPool(1, new CustomizableThreadFactory("opcua-" + opcUaServer.getId() + "-")))
                    .setRequestTimeout(uint(10000));

However, if the thread pool is at least size 2, or use the default (omitting to set the Executor), the resubscription works flawlessly.

We can close the issue at the moment, but I think, either Milo Javadoc shall document, that a single-thread executor is unusable, or throw a warning in runtime, and revert to the default, when a single-threaded ExecutorService is provided.

I think, a concurrency issue is still present here, since a single-threaded client cannot proceed after reconnection, however, nothing prevents users to run the client in single-threaded mode.

persicsb commented 11 months ago

What we observed with a Siemens S7 PLC's OPC UA Server with a multithreaded client: if a network connection is lost for such a long time, that the server timeouts an UA Session, the client won't reconnect. If the network is connected again within the UA Session timeout, the client reconnects properly.

kevinherron commented 10 months ago

I'm not able to reproduce this with an S7-1500.

Perhaps you can turn these loggers to DEBUG level:

as well as provide a Wireshark capture started prior to the disconnect. (make sure to connect without security)