Open cerlestes opened 2 weeks ago
A small update after a week of trying to work on this:
I've implemented the solution I've stated above in an adapter class of our application code. It worked absolutely fine when I tried it on my Windows machine by cutting the SSH forwarding to the PLC network and reconnecting it; the client would pick up on this and re-establish the connection just fine. But when we tried running it from a Linux host, where we'd sporadically blackhole the PLC IP address via nftables to simulate packet loss, the application would sometimes end up hanging ad infinitum when trying to connect the same Client instance again after it had lost its connection. I couldn't find the solution to this problem; I don't want to point fingers at somebody else's code, but it seemed like the bug is either somewhere in the `UaClient`/`UASocketProtocol` or even deeper within Python. The `UaClient` calls `asyncio.wait_for()` when connecting, but this timeout never fires; the function simply hangs forever. Adding another `asyncio.wait_for()` around it in our application code didn't work either, which sounds to me like the whole thread is getting stuck somewhere deep down the rabbit hole.
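As an illustration of the "whole thread is getting stuck" suspicion, and not a diagnosis of the actual bug: `asyncio.wait_for()` can only enforce its timeout while the event loop is running. If something blocks the loop synchronously, the timeout cannot cut the call short. The sketch below uses a hypothetical `blocking_connect()` stand-in (it is not library code) that blocks with `time.sleep()`.

```python
import asyncio
import time
from typing import Tuple

BLOCK_SECONDS = 0.2     # how long the stand-in call blocks the event loop
TIMEOUT_SECONDS = 0.05  # timeout that should fire much earlier


async def blocking_connect() -> str:
    # Hypothetical stand-in for a connect call that blocks synchronously.
    # time.sleep() never yields control back to the event loop.
    time.sleep(BLOCK_SECONDS)
    return "connected"


async def demo() -> Tuple[str, float]:
    """Return the outcome and how long the wait_for() call really took."""
    start = time.monotonic()
    try:
        outcome = await asyncio.wait_for(blocking_connect(), timeout=TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        outcome = "timeout"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Whatever the outcome, `demo()` takes at least `BLOCK_SECONDS`, even though the requested timeout is much shorter. A call that blocks forever would therefore hang forever, no matter how many `wait_for()` wrappers surround it.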
But I've found an alternative solution that isn't a downgrade and works perfectly fine: simply recreate the Client instance when reconnecting. So the workflow is as follows:
await client._monitor_server_task
I think the same workflow could be adapted to the `Client` class itself, so that it would simply replace its internal `UaClient` instance when reconnecting.
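A minimal sketch of the "recreate the Client instance" idea, under stated assumptions: `ClientLike`, `reconnect_with_fresh_client()` and `make_client` are hypothetical names introduced here, and the only thing assumed about the real client is that it offers awaitable `connect()` and `disconnect()` methods.

```python
from __future__ import annotations

import asyncio
from typing import Callable, Optional, Protocol


class ClientLike(Protocol):
    """Hypothetical minimal interface the sketch relies on."""

    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...


async def reconnect_with_fresh_client(
    old: Optional[ClientLike],
    make_client: Callable[[], ClientLike],
    attempts: int = 3,
    delay: float = 1.0,
) -> ClientLike:
    """Discard `old` and return a newly created, connected client."""
    if old is not None:
        try:
            await old.disconnect()
        except Exception:
            pass  # the old connection is already broken; nothing left to clean up
    last_error: Optional[BaseException] = None
    for _ in range(attempts):
        client = make_client()  # a brand-new instance for every attempt
        try:
            await client.connect()
            return client
        except (OSError, asyncio.TimeoutError) as exc:
            last_error = exc
            await asyncio.sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

With this library, `make_client` would presumably be something like `lambda: Client(url)`. The point of the design is that no state from the broken instance is ever reused.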
We'll put this method into a 24/7 test for a week or two now, and if that succeeds, we'll roll it out to two applications in the field. I'll report back if further adjustments are required to achieve a robust reconnect mechanism.
@cerlestes thanks for a good analysis and also some interesting propositions. I see we could recreate the ua_client object in a task on connection loss, but then we would also need to remember all the subscriptions that the client has (or even worse, check with the server whether each subscription still exists and try to reconnect them). That sounds quite hard to make work reliably in all cases.
btw we had a similar issue at work and I just implemented that one: https://github.com/FreeOpcUa/opcua-asyncio/pull/1670 maybe that also helps you
but your last proposition is the async way of doing the same as a callback. We just need to make `monitor_server_task()` public and it does the same
@oroulet Nice commit, we would have needed that a few weeks ago 🤣 It's basically the same solution that I arrived at, only that my solution was external to the library. So +1 to that feature, looking forward to it releasing soon 👍 Maybe make it an async function though and await it?
Re subscriptions: I don't think it's a lot of work to get them working again; maybe I don't know enough about how they work to see the effects it would have on more complex cases though. In my adapter class, I simply try to delete the subscriptions by their ID once I reconnect, and then recreate them. This works absolutely fine for our use case. Maybe there are use cases where it'd be preferred to keep the subscriptions if they still exist on the server, so that solution might not be the best way to go ahead, but it'd work. I'd implement the reconnect mechanism on the `Client`, not on `UaClient`, so the library would simply need to remember the values that subscriptions were created with in the `Client` and then it's as easy as calling `create_subscription()` again with those values after reconnecting.
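The "remember the values and call `create_subscription()` again" idea could be sketched roughly as follows. `SubscriptionSpec` and `SubscriptionBook` are hypothetical names, not part of the library. The sketch assumes the client exposes an awaitable `create_subscription(period, handler)` whose result has an awaitable `subscribe_data_change(nodes)`. Deleting stale server-side subscriptions by ID is left out.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SubscriptionSpec:
    """The values a subscription was created with (hypothetical record)."""

    period: float
    handler: Any
    nodes: List[Any] = field(default_factory=list)


class SubscriptionBook:
    """Hypothetical helper that remembers specs so they can be replayed."""

    def __init__(self) -> None:
        self._specs: List[SubscriptionSpec] = []

    async def create(self, client: Any, period: float, handler: Any, nodes: List[Any]) -> Any:
        """Create a subscription and remember how it was created."""
        spec = SubscriptionSpec(period, handler, list(nodes))
        self._specs.append(spec)
        return await self._apply(client, spec)

    async def restore(self, client: Any) -> List[Any]:
        """Recreate every remembered subscription on a (re)connected client."""
        return [await self._apply(client, spec) for spec in self._specs]

    @staticmethod
    async def _apply(client: Any, spec: SubscriptionSpec) -> Any:
        subscription = await client.create_subscription(spec.period, spec.handler)
        await subscription.subscribe_data_change(spec.nodes)
        return subscription
```

After a reconnect, application code would call `await book.restore(new_client)` once. Whether an existing server-side subscription should be kept instead of recreated is exactly the open question raised above, and this sketch does not answer it.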
PS: the reconnect mechanism I've described above has been running all weekend now, experiencing a controlled connection loss about every 3 minutes and reconnecting successfully afterwards almost 1500 times already. So it seems to be working fine and stable.
I made it async now
Hello everyone!
First off, thanks a lot to every contributor of this repository; it's a great library that has helped us out tremendously in multiple projects. Secondly, I hope it's okay that I'm using an issue to open a discussion. I'd like to gather some insights from people who are more knowledgeable than I am about OPC-UA and this library, hoping that I'll be able to contribute a well-rounded feature out of this discussion in the future.
The topic is handling connection issues and reconnecting properly. Right now, whenever our application loses the connection to the OPC-UA server, for example because the PLC config changed and it's reloading the server, we're reconnecting the client from our application once we try to interact with a node and it fails (we catch the UaError and simply try connecting up to a few times). This was fine until subscriptions came into play. With subscriptions, I'm really having a hard time finding the proper way to detect issues, reconnect and restart the subscriptions.
I've found the `Client._monitor_server_loop()` method, which is started as a task into `Client._monitor_server_task`. Once the connection dies, it'll inform the subscriptions of the `BadShutdown`. This seems to be about the only way to be informed about a connection issue other than emulating that behaviour externally to the client, polling and catching errors when they are raised. Another method of detecting connection issues is the `Client.check_connection()` method. But again, this method must be polled from the application external to the client.

I think ideally the client itself should provide a mechanism to allow applications to react to connection issues and states in general, i.e. callback when the client lost the connection. On top of that, it should then implement an optional reconnect mechanism that, when enabled, automatically attempts to reconnect upon losing connection, including restoring any subscriptions.
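For reference, the external polling described above might look like the following sketch. `watch_connection()`, `check` and `on_lost` are hypothetical names. The assumption is that `check` is an awaitable such as `client.check_connection` that raises when the connection is gone.

```python
import asyncio
from typing import Awaitable, Callable


async def watch_connection(
    check: Callable[[], Awaitable[None]],
    on_lost: Callable[[Exception], Awaitable[None]],
    interval: float = 1.0,
) -> None:
    """Poll `check` until it raises, then report the error once and stop."""
    while True:
        try:
            await check()
        except Exception as exc:  # any failure of the check counts as "connection lost"
            await on_lost(exc)
            return
        await asyncio.sleep(interval)
```

An application would start this with `asyncio.create_task(watch_connection(...))` after connecting and start it again after every successful reconnect. That bookkeeping is what a built-in mechanism would spare application code.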
My current proposal would be the following:
- `asyncio.Event` instances `Client.connected`, `Client.disconnected`, `Client.failed`. These events are `set()` when the respective connection state is reached and `clear()`-ed when the respective state is left. This would allow application code to simply `await client.connected.wait()` before each interaction with the client. It would also allow running error handler tasks once the connection fails with `await client.failed.wait()`.
- `Client.add_connected_callback()`, `Client.add_disconnected_callback()`, `Client.add_failed_callback()` to register callback functions which are called once the respective state is reached.
- A parameter for `Client()`, which could be as simple as `auto_reconnect: bool = False`.
- When `auto_reconnect` is enabled, an additional task `Client._auto_reconnect_task` will be created by the client upon connecting, which continuously calls `Client.check_connection()` similar to how the `Client._monitor_server_loop()` works, and in case of an error automatically tries connecting the client again.
- `AutoReconnectSettings`. The following settings come to mind:
- `ClientReconnectHandler`, which would implement a simple strategy pattern to allow interchangeable reconnection mechanisms, providing an `ExponentialBackoffReconnectHandler` by default. The parameter could then have the signature of `auto_reconnect: bool | ClientReconnectHandler = False`, applying a default handler with default values when simply set to `True`.

I'd love to hear what you guys think about this and how you would approach this. Maybe someone has already implemented a similar reconnect mechanism and would like to share their thoughts; I'd greatly appreciate that.
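To make the proposal a bit more concrete, here is a rough standalone sketch of the three state events with callbacks, plus the delay sequence an exponential backoff handler might use. Everything in it (`ConnectionState`, `enter()`, `add_callback()`, `backoff_delays()` and all default values) is hypothetical and exists only for illustration. None of it is existing library API.

```python
from __future__ import annotations

import asyncio
from typing import Callable, Dict, Iterator, List

STATES = ("connected", "disconnected", "failed")


class ConnectionState:
    """Hypothetical holder for the proposed events and callbacks."""

    def __init__(self) -> None:
        self.connected = asyncio.Event()
        self.disconnected = asyncio.Event()
        self.failed = asyncio.Event()
        self.disconnected.set()  # a new client starts out disconnected
        self._callbacks: Dict[str, List[Callable[[], None]]] = {s: [] for s in STATES}

    def add_callback(self, state: str, callback: Callable[[], None]) -> None:
        """Register a callback that runs each time `state` is reached."""
        self._callbacks[state].append(callback)

    def enter(self, state: str) -> None:
        """Set the event of the new state, clear the others, run callbacks."""
        for name in STATES:
            event = getattr(self, name)
            if name == state:
                event.set()
            else:
                event.clear()
        for callback in self._callbacks[state]:
            callback()


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield an endless sequence of delays that grows by `factor` up to `cap`."""
    delay = min(base, cap)
    while True:
        yield delay
        delay = min(delay * factor, cap)
```

Application code could then `await state.connected.wait()` before touching a node, while a reconnect task sleeps for `next(delays)` between attempts. `ConnectionState` should be created inside a running event loop so that its events work on older Python versions too.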
Thanks a lot!