FreeOpcUa / opcua-asyncio

OPC UA library for python >= 3.7
GNU Lesser General Public License v3.0

Memory balloons after "Subscription Subscription(id:<id>) has expired." #116

Closed dannmartens closed 3 months ago

dannmartens commented 4 years ago

We've noticed that in our production environment, where we run multiple instances (19 right now) of asyncua-based OPC UA servers, some instances enter some strange mode from which recovery is difficult. A symptom of this runaway mode is memory ballooning until the limit is reached.

From the moment the issue manifests itself, the logs are flooded with the following statements, which follow a "The session id is not valid" error:

2019-12-17 15:26:46 [ERROR] asyncua.server.uaprocessor - sending service fault response: The session id is not valid. (BadSessionIdInvalid)
2019-12-17 15:31:25 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:25 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:25 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:25 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:25 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:25 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:26 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(101) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(102) > lifetime count (100)
2019-12-17 15:31:27 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(102) > lifetime count (100)
2019-12-17 15:31:28 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(102) > lifetime count (100)
2019-12-17 15:31:28 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:86) has expired, publish cycle count(102) > lifetime count (100)
2019-12-17 15:31:28 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:93) has expired, publish cycle count(102) > lifetime count (100)
...

Sometimes, when this happens, nothing else gets logged except that statement, repeated thousands of times. The above log goes on to a "publish cycle count" of 2320 and beyond:

2019-12-17 17:01:16 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:101) has expired, publish cycle count(2320) > lifetime count (100)

We've noticed serious connectivity issues on the production network, which seem to trigger this phenomenon. The "The session id is not valid" error indicates an OPC UA client is reconnecting on account of a connectivity dropout.

Any ideas what might happen here at the subscription level from the perspective of the asyncua library?

oroulet commented 4 years ago

Thanks for the report. I have not read that code in a long time, but I think there was a PR a few weeks ago. Can you try master? Or try an older version to see whether it happens there?

oroulet commented 4 years ago

And it is great that you report this kind of thing; it is hard to test since it never happens in a short-lived test environment.

I looked at the spec:

Subscriptions have a lifetime counter that counts the number of consecutive publishing cycles in which there have been no Publish requests available to send a Publish response for the Subscription. Any Service call that uses the SubscriptionId or the processing of a Publish response resets the lifetime counter of this Subscription. When this counter reaches the value calculated for the lifetime of a Subscription based on the MaxKeepAliveCount parameter in the CreateSubscription Service (5.13.2), the Subscription is closed. Closing the Subscription causes its MonitoredItems to be deleted. In addition the Server shall issue a StatusChangeNotification notificationMessage with the status code Bad_Timeout. The StatusChangeNotification notificationMessage type is defined in 7.20.4.

and code where there is huge FIXME

https://github.com/FreeOpcUa/opcua-asyncio/blob/master/asyncua/server/internal_subscription.py#L100

It looks like the server has stopped receiving publish requests from the client and should close the subscription, but we do not... PR welcome.
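
The spec's lifetime-counter rule can be sketched in plain Python. This is an illustrative model only; the `SubscriptionSketch` class and its method names are hypothetical and are not asyncua's actual internals:

```python
# Illustrative sketch of the OPC UA lifetime-counter rule (Part 4, 5.13.1.1).
# Class and method names are hypothetical, not asyncua's implementation.

class SubscriptionSketch:
    def __init__(self, sub_id: int, lifetime_count: int):
        self.sub_id = sub_id
        self.lifetime_count = lifetime_count
        self._cycles_without_request = 0
        self.closed = False
        self.notification_queue: list = []

    def on_publish_request(self) -> None:
        # Any Publish request (or other Service call using the
        # SubscriptionId) resets the lifetime counter.
        self._cycles_without_request = 0

    def publish_cycle(self) -> None:
        if self.closed:
            return
        self._cycles_without_request += 1
        if self._cycles_without_request > self.lifetime_count:
            # Per spec: close the subscription and drop its queued
            # notifications, rather than warning on every cycle forever.
            self.closed = True
            self.notification_queue.clear()


sub = SubscriptionSketch(sub_id=93, lifetime_count=100)
for _ in range(101):
    sub.publish_cycle()
assert sub.closed  # expired after exceeding the lifetime count
```

The key point the FIXME leaves open is the final branch: once the counter exceeds the lifetime count, the subscription must actually be closed and its queue released, not merely logged about.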

oroulet commented 4 years ago

But this is a side effect of the first problem, which is a connection issue. (Anyway, it explains the memory balloons, since we just fill up memory with notifications that should be sent to the client but never are.)

dannmartens commented 4 years ago

I've baselined our code on the version corresponding to the commit "Fix Issue #108," which landed about 18 days ago.

But the version where we see the above issue is much older; it dates back to "Fix #71." We're planning a new deployment soon.

In this particular extreme case, where the network goes haywire, I assume the OPC UA clients and servers are seeing disconnects constantly and are desperately attempting to restore the connection.

oroulet commented 4 years ago

I merged a fix for session renewal. You can also try it; it might solve the fundamental issue in your case.

chrisjbremner commented 3 years ago

I've also been running into this recently. My scenario: I have both an asyncua client and an asyncua server. The connection between the two is occasionally unstable, so when the client disconnects (and closes the session), the request to delete all of the client's subscriptions here never actually makes it to the server. The subscription results still queue up here, which I believe is where the memory is leaking. If I read the spec correctly, the subscription should be deleted after going over the lifetime limit, but as far as I can tell it is not, which means memory will continue to leak until the program crashes.

For now, I can rely on my client auto-reconnecting to the server, so I simply delete all of the old subscriptions upon reconnection, but I think the long-term fix might be deleting the subscriptions from the server side once they expire.
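
The leak mechanism described above, notifications queuing up for a subscription that no one will ever publish against, can be illustrated with a plain `collections.deque`. This is a simplified model, not asyncua's actual queue; a bounded `maxlen` is one generic way to cap the damage, at the cost of dropping old notifications:

```python
from collections import deque

# Unbounded queue: every publish cycle appends a notification that no
# client will ever collect, so memory grows without limit.
unbounded = deque()
for i in range(10_000):
    unbounded.append({"seq": i, "payload": "..."})

# Bounded queue: with maxlen set, the oldest entries are discarded once
# the limit is reached, capping memory but losing history.
bounded = deque(maxlen=1_000)
for i in range(10_000):
    bounded.append({"seq": i, "payload": "..."})

assert len(unbounded) == 10_000
assert len(bounded) == 1_000
```

A cap only bounds the symptom, though; the real fix discussed in this thread is to stop and delete the expired subscription so nothing queues at all.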

dannmartens commented 1 year ago

I found several of our OPC UA servers frantically logging the expiry notification again. I updated the servers to version 1.0.4 only a few weeks back; the previous update was in December 2022, when I bumped them up to 1.0.1.

2023-09-06 14:51:10,207 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055148) > lifetime count (12000)
2023-09-06 14:51:10,307 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055149) > lifetime count (12000)
2023-09-06 14:51:10,407 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055150) > lifetime count (12000)
2023-09-06 14:51:10,506 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055151) > lifetime count (12000)
2023-09-06 14:51:10,607 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055152) > lifetime count (12000)
2023-09-06 14:51:10,707 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055153) > lifetime count (12000)
2023-09-06 14:51:10,807 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055154) > lifetime count (12000)
2023-09-06 14:51:10,906 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055155) > lifetime count (12000)
2023-09-06 14:51:11,007 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055156) > lifetime count (12000)
2023-09-06 14:51:11,107 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055157) > lifetime count (12000)
2023-09-06 14:51:11,207 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055158) > lifetime count (12000)
2023-09-06 14:51:11,307 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055159) > lifetime count (12000)
2023-09-06 14:51:11,406 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055160) > lifetime count (12000)
2023-09-06 14:51:11,508 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055161) > lifetime count (12000)
2023-09-06 14:51:11,606 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055162) > lifetime count (12000)
2023-09-06 14:51:11,706 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055163) > lifetime count (12000)
2023-09-06 14:51:11,807 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055164) > lifetime count (12000)
2023-09-06 14:51:11,907 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055165) > lifetime count (12000)
2023-09-06 14:51:12,007 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055166) > lifetime count (12000)
2023-09-06 14:51:12,106 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055167) > lifetime count (12000)
2023-09-06 14:51:12,206 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055168) > lifetime count (12000)
2023-09-06 14:51:12,306 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055169) > lifetime count (12000)
2023-09-06 14:51:12,406 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055170) > lifetime count (12000)

I will keep the servers under observation; is there anything I can do to get a better understanding of the root cause of this issue?

cwmaxwell83 commented 11 months ago

I am facing the same issue. Did anyone manage to get rid of this yet?

noliveira87 commented 6 months ago

We are encountering the same issue that was reported here. Has anyone managed to resolve it?

chrisjbremner commented 4 months ago

In the case I described above, I think there are two issues:

1. The server never terminates the dangling session when the client's request to close it is lost, so the session lingers indefinitely.
2. The subscription is never actually stopped after exceeding its lifetime count, so its queue keeps growing.

When combined, there is no way to delete the now-dangling subscription, and memory is leaked as the queue grows.

To address point 1, I think there needs to be a server-side session watchdog to kill any dangling sessions that weren't properly closed by the client. Maybe something like this?

This seems to be in compliance with section 5.6.2 of the spec:

Sessions are terminated by the Server automatically if the Client fails to issue a Service request on the Session within the timeout period negotiated by the Server in the CreateSession Service response. This protects the Server against Client failures and against situations where a failed underlying connection cannot be re-established. Clients shall be prepared to submit requests in a timely manner to prevent the Session from closing automatically.

EDIT: The above solution was predicated on my assumption that the session closure was being initiated solely on the client side, which is not correct. That being said, I do still think honoring the timeout portion of the spec makes sense.
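
The session-timeout rule quoted from section 5.6.2 can be sketched as a simple reaper over last-activity timestamps. Everything here is hypothetical (`SessionSketch`, `reap_expired`); asyncua's real session objects differ, and a real watchdog would also delete the session's subscriptions:

```python
from __future__ import annotations
import time

class SessionSketch:
    """Hypothetical session with a negotiated timeout (not asyncua's class)."""

    def __init__(self, session_id: int, timeout_s: float):
        self.session_id = session_id
        self.timeout_s = timeout_s
        self.last_activity = time.monotonic()
        self.closed = False

    def touch(self) -> None:
        # Called on every Service request that uses this session.
        self.last_activity = time.monotonic()


def reap_expired(sessions: list[SessionSketch], now: float | None = None) -> list[int]:
    """Close sessions idle longer than their timeout; return the closed ids."""
    now = time.monotonic() if now is None else now
    closed_ids = []
    for s in sessions:
        if not s.closed and now - s.last_activity > s.timeout_s:
            s.closed = True  # real code would also tear down its subscriptions
            closed_ids.append(s.session_id)
    return closed_ids


s1 = SessionSketch(1, timeout_s=30.0)
s2 = SessionSketch(2, timeout_s=30.0)
s1.last_activity -= 60.0  # simulate a client that went silent a minute ago
assert reap_expired([s1, s2]) == [1]
```

In a running server this check would be driven periodically (e.g. from a background task), with each incoming request calling `touch()` on its session.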

To address point 2, I think the subscription just needs to be stopped like this?

This seems to be in compliance with section 5.13.1.1 part 8 of the spec:

Subscriptions have a lifetime counter that counts the number of consecutive publishing cycles in which there have been no Publish requests available to send a Publish response for the Subscription. Any Service call that uses the SubscriptionId or the processing of a Publish response resets the lifetime counter of this Subscription. When this counter reaches the value calculated for the lifetime of a Subscription based on the MaxKeepAliveCount parameter in the CreateSubscription Service (5.13.2), the Subscription is closed. Closing the Subscription causes its MonitoredItems to be deleted. In addition the Server shall issue a StatusChangeNotification notificationMessage with the status code Bad_Timeout. The StatusChangeNotification notificationMessage type is defined in 7.20.4.

Happy to PR the changes if @oroulet thinks this makes sense.

oroulet commented 4 months ago

Thanks for the thorough analysis. Honestly, I do not know what is happening here. As you wrote above, when we lose the connection we delete the session, which owns the subscriptions, so whether a client disconnects cleanly or not, everything should be deleted. It might be an issue with clients creating many subscriptions and not using them; we can delete such subscriptions, as per the spec. PR welcome. But I have a hard time understanding how it becomes an issue unless you have a completely broken client... That safeguard makes more sense when sessions can be reconnected to, and we do not support that.

chrisjbremner commented 4 months ago

I have no evidence to back this up (and should probably enable more verbose logging to confirm), but it seems possible that the publishing task could silently die if an exception other than the three that are caught here is raised. This matches behavior I've experienced where the subscriptions no longer appear to update but everything else on the client works as expected. Would it make sense to do either of the following: a) make the publishing task more robust to unexpected errors (perhaps by repeating without acknowledgement, as is done for other errors), or b) at least log a critical warning when an unexpected error occurs?
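
Option (b), surfacing unexpected errors instead of letting the background task die silently, can be sketched with plain asyncio. The loop and names here are hypothetical, not asyncua's actual publish loop:

```python
import asyncio
import logging

logger = logging.getLogger("publish_loop_sketch")

async def publish_loop(do_publish, stop: asyncio.Event, retry_delay: float = 0.1):
    """Hypothetical publish loop that logs and survives unexpected errors
    rather than ending silently on the first uncaught exception."""
    while not stop.is_set():
        try:
            await do_publish()
        except asyncio.CancelledError:
            raise  # let task cancellation propagate normally
        except Exception:
            # Without this handler the task would end here and the
            # subscription would silently stop updating.
            logger.critical("unexpected error in publish loop", exc_info=True)
            await asyncio.sleep(retry_delay)


async def demo():
    calls = 0
    stop = asyncio.Event()

    async def flaky_publish():
        nonlocal calls
        calls += 1
        if calls == 1:
            raise RuntimeError("boom")  # simulated unexpected failure
        stop.set()

    await publish_loop(flaky_publish, stop, retry_delay=0)
    return calls

assert asyncio.run(demo()) == 2  # survived the first failure, then stopped
```

Even if retrying turns out to be the wrong policy, the `logger.critical(..., exc_info=True)` line alone would make a silently dying task visible in the server logs.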

chrisjbremner commented 3 months ago

I've now addressed the two points discussed here. While I believe the situation where a subscription exceeds its lifetime count can still occur, I believe my fixes will stop this subscription and mitigate the associated memory leak. Re-open if the issue recurs even after the fixes.