FreeOpcUa / opcua-asyncio

OPC UA library for python >= 3.7
GNU Lesser General Public License v3.0

How to properly handle connection issues and reconnecting? (request for comments) #1660

Open cerlestes opened 2 weeks ago

cerlestes commented 2 weeks ago

Hello everyone!

First off, thanks a lot to every contributor of this repository; it's a great library that has helped us out tremendously in multiple projects. Secondly, I hope it's okay that I'm using an issue to open a discussion. I'd like to gather some insights from people who are more knowledgeable than I am about OPC-UA and this library, hoping that I'll be able to contribute a well-rounded feature out of this discussion in the future.

The topic is how to handle connection issues and reconnect properly. Right now, whenever our application loses the connection to the OPC-UA server, for example because the PLC config changed and it is reloading the server, we reconnect the client from our application code once an interaction with a node fails (we catch the UaError and simply retry connecting up to a few times). This was fine until subscriptions came into play. With subscriptions, I'm really having a hard time finding the proper way to detect issues, reconnect, and restart the subscriptions.
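For illustration, here is a minimal sketch of that catch-and-retry pattern, assuming the asyncua Client API; the function name and the retry/backoff policy are made up for the example:

```python
import asyncio

from asyncua import Client, ua


async def read_with_retry(client: Client, node_id: str, retries: int = 3):
    """Read a node value, reconnecting the client when the call fails (sketch only)."""
    last_exc: Exception = RuntimeError("no attempt made")
    for attempt in range(retries):
        try:
            node = client.get_node(node_id)
            return await node.read_value()
        except (ua.UaError, ConnectionError, OSError) as exc:
            last_exc = exc
            # Connection presumably lost: tear the client down and reconnect.
            try:
                await client.disconnect()
            except Exception:
                pass  # the transport is probably already gone
            await asyncio.sleep(2 ** attempt)  # crude backoff between attempts
            try:
                await client.connect()
            except Exception as exc:
                last_exc = exc  # server may still be down; retry on the next loop
    raise last_exc
```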

I've found the Client._monitor_server_loop() method, which is started as a task and stored in Client._monitor_server_task. Once the connection dies, it informs the subscriptions with a BadShutdown status. This seems to be about the only way to be told about a connection issue, other than emulating that behaviour externally to the client by polling and catching errors when they are raised. Another way of detecting connection issues is the Client.check_connection() method, but again, it has to be polled by the application from outside the client.
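As a point of reference, a rough sketch of that external polling approach, assuming check_connection() is awaitable and raises when the connection is unhealthy; the on_lost callback is a hypothetical application hook:

```python
import asyncio

from asyncua import Client


async def watch_connection(client: Client, on_lost, interval: float = 2.0) -> None:
    """Poll the client's health from outside the library and report the first failure."""
    while True:
        try:
            await client.check_connection()
        except Exception as exc:
            # Connection considered lost; hand control back to the application.
            await on_lost(exc)
            return
        await asyncio.sleep(interval)
```

Such a task would typically be started with asyncio.create_task() right after connecting and cancelled on a clean shutdown.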

I think the client itself should ideally provide a mechanism that lets applications react to connection issues and connection state in general, e.g. a callback that fires when the client loses the connection. On top of that, it should implement an optional reconnect mechanism that, when enabled, automatically attempts to reconnect after a lost connection, including restoring any subscriptions.
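Purely to illustrate the shape such a mechanism could take (none of these attributes exist in the library today; they are hypothetical), the application-facing side might look roughly like this:

```python
from asyncua import Client

client = Client("opc.tcp://plc.example:4840")


async def on_connection_lost(reason: Exception) -> None:
    print("connection lost:", reason)


# Hypothetical hooks, sketched for discussion only:
client.on_connection_lost = on_connection_lost  # called once the client notices the loss
client.auto_reconnect = True                    # opt-in: reconnect and restore subscriptions
```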

My current proposal would be the following:

I'd love to hear what you guys think about this and how you would approach it. Maybe someone has already implemented a similar reconnect mechanism and would like to share their thoughts; I'd greatly appreciate that.

Thanks a lot!

cerlestes commented 1 week ago

A small update after a week of trying to work on this:

I've implemented the solution stated above in an adapter class in our application code. It worked absolutely fine when I tried it on my Windows machine by cutting the SSH forwarding to the PLC network and reconnecting it; the client would pick up on this and re-establish the connection just fine. But when we tried running it from a Linux host, where we would sporadically blackhole the PLC IP address via nftables to simulate packet loss, the application would sometimes end up hanging ad infinitum when trying to connect the same Client instance again after it had lost its connection. I couldn't find the cause of this problem; I don't want to point fingers at somebody else's code, but it seems the bug is either somewhere in UaClient/UASocketProtocol or even deeper within Python. The UaClient calls asyncio.wait_for() when connecting, but this timeout never fires; the call simply hangs forever. Adding another asyncio.wait_for() around it in our application code didn't help either, which sounds to me like the whole thread is getting stuck somewhere deep down the rabbit hole.

But I've found an alternative solution that isn't a downgrade and works perfectly fine: simply recreate the Client instance when reconnecting. So the workflow is as follows:

I think the same workflow could be adapted to the Client class itself, so that it would simply replace its internal UaClient instance when reconnecting.
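To make that concrete, here is a minimal sketch assuming an application-side wrapper that builds a fresh Client for every reconnect; the class name and the restore_subscriptions hook are invented for the example:

```python
import asyncio
from typing import Optional

from asyncua import Client


class ReconnectingClient:
    """Application-side wrapper that throws the old Client away on every reconnect."""

    def __init__(self, url: str, connect_timeout: float = 10.0):
        self.url = url
        self.connect_timeout = connect_timeout
        self.client: Optional[Client] = None

    async def connect(self) -> None:
        # Always start from a brand-new Client so no stale transport state survives.
        if self.client is not None:
            try:
                await self.client.disconnect()
            except Exception:
                pass  # the old connection is most likely already dead
        self.client = Client(self.url)
        await asyncio.wait_for(self.client.connect(), self.connect_timeout)
        await self.restore_subscriptions()

    async def restore_subscriptions(self) -> None:
        # Hypothetical hook: recreate subscriptions from remembered parameters here.
        pass
```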

We'll put this method into a 24/7 test for a week or two now, and if that succeeds, we'll roll it out to two applications in the field. I'll report back if further adjustments are required to achieve a robust reconnect mechanism.

oroulet commented 4 days ago

@cerlestes thanks for a good analysis and some interesting propositions. I see we could recreate the ua_client object in a task on connection loss, but then we would also need to remember all the subscriptions that client has (or, even worse, check with the server whether the subscriptions still exist and try to reconnect them). That sounds quite hard to make work reliably in all cases.

By the way, we had a similar issue at work and I just implemented this one: https://github.com/FreeOpcUa/opcua-asyncio/pull/1670. Maybe that also helps you.

oroulet commented 4 days ago

But your last proposition is the async way of doing the same thing as a callback. We just need to make that monitor_server_task() public and it does the same.
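In generic asyncio terms the two styles compare roughly like this; the monitor task here is a placeholder, not an existing asyncua API:

```python
import asyncio


# Callback style: the library calls this when it detects the connection loss.
def on_connection_lost(exc: Exception) -> None:
    print("connection lost:", exc)


# Async style: the application awaits a (public) monitor task instead;
# the await returns or raises at exactly the moment the callback would have fired.
async def supervise(monitor_task: asyncio.Task) -> None:
    try:
        await monitor_task
    except Exception as exc:
        print("connection lost:", exc)
        # reconnect / resubscribe logic would go here
```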

cerlestes commented 4 days ago

@oroulet Nice commit, we would have needed that a few weeks ago 🤣 It's basically the same solution I arrived at, except that mine was external to the library. So +1 to that feature; looking forward to it being released soon 👍 Maybe make it an async function though, and await it?

Re subscriptions: I don't think it's a lot of work to get them working again, though maybe I don't know enough about how they work to see the effects this would have on more complex cases. In my adapter class, I simply try to delete the subscriptions by their ID once I reconnect, and then recreate them. This works absolutely fine for our use case. There may be use cases where it would be preferable to keep the subscriptions if they still exist on the server, so this solution might not be the best way forward, but it would work. I'd implement the reconnect mechanism on the Client, not on UaClient: the library would only need to remember the values each subscription was created with, and then it's as easy as calling create_subscription() again with those values after reconnecting.
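For context, a rough sketch of that delete-and-recreate step, assuming the application remembers each subscription's publishing interval, handler and monitored nodes; the SubscriptionSpec container is invented, and the delete_subscriptions call on the low-level client.uaclient is an assumption about the service wrapper:

```python
from dataclasses import dataclass, field
from typing import List

from asyncua import Client


@dataclass
class SubscriptionSpec:
    """Remembered parameters needed to recreate one subscription after a reconnect."""
    period: float                  # requested publishing interval in milliseconds
    handler: object                # datachange handler passed to create_subscription()
    node_ids: List[str] = field(default_factory=list)
    server_side_id: int = 0        # id the old subscription had on the server


async def recreate_subscriptions(client: Client, specs: List[SubscriptionSpec]) -> None:
    for spec in specs:
        # Best effort: drop the stale subscription on the server by its old id
        # (it usually died together with the old session anyway).
        if spec.server_side_id:
            try:
                await client.uaclient.delete_subscriptions([spec.server_side_id])
            except Exception:
                pass
        # Recreate with the remembered parameters and resubscribe the nodes.
        sub = await client.create_subscription(spec.period, spec.handler)
        nodes = [client.get_node(nid) for nid in spec.node_ids]
        await sub.subscribe_data_change(nodes)
        spec.server_side_id = sub.subscription_id
```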

PS: the reconnect mechanism I've described above has been running all weekend now, experiencing a controlled connection loss about every 3 minutes and reconnecting successfully almost 1,500 times already. So it seems to be working fine and stable.

oroulet commented 4 days ago

I made it async now