Closed dannmartens closed 3 months ago
thanks for the report. I have not read that code for a long time, but I think there was a PR a few weeks ago; can you try on master? Or try an older version to see if it still happens?
And it's great that you report this kind of thing; it is hard to test since it never happens in short test environments.
I looked at spec:
Subscriptions have a lifetime counter that counts the number of consecutive publishing cycles in which there have been no Publish requests available to send a Publish response for the Subscription. Any Service call that uses the SubscriptionId or the processing of a Publish response resets the lifetime counter of this Subscription. When this counter reaches the value calculated for the lifetime of a Subscription based on the MaxKeepAliveCount parameter in the CreateSubscription Service (5.13.2), the Subscription is closed. Closing the Subscription causes its MonitoredItems to be deleted. In addition the Server shall issue a StatusChangeNotification notificationMessage with the status code Bad_Timeout. The StatusChangeNotification notificationMessage type is defined in 7.20.4.
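To make the quoted rule concrete, here is a minimal sketch of the lifetime-counter behavior it describes. All names here are illustrative (this is not the asyncua internal code), and the 3x multiplier reflects the spec's minimum relation between lifetime count and MaxKeepAliveCount:

```python
# Hypothetical sketch of the spec's lifetime-counter rule; names are
# illustrative, not taken from asyncua's internal_subscription.py.

class SubscriptionLifetime:
    """Counts consecutive publish cycles with no Publish request available."""

    def __init__(self, max_keep_alive_count: int, lifetime_multiplier: int = 3):
        # The spec requires the lifetime count to be at least 3x MaxKeepAliveCount.
        self.lifetime_count = max_keep_alive_count * lifetime_multiplier
        self.current_count = 0
        self.expired = False

    def on_publish_cycle(self, publish_request_available: bool) -> None:
        if publish_request_available:
            # Processing a Publish response resets the counter.
            self.current_count = 0
            return
        self.current_count += 1
        if self.current_count >= self.lifetime_count:
            # Per spec: close the subscription, delete its MonitoredItems,
            # and issue a StatusChangeNotification with Bad_Timeout.
            self.expired = True

    def on_service_call(self) -> None:
        # Any Service call that uses the SubscriptionId also resets the counter.
        self.current_count = 0
```

The warnings later in this thread suggest the counter keeps incrementing past the limit without the `expired` transition ever stopping the subscription.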
and the code, where there is a huge FIXME:
https://github.com/FreeOpcUa/opcua-asyncio/blob/master/asyncua/server/internal_subscription.py#L100
It looks like the server has stopped receiving publish requests from the client and we should close the subscription, but we do not... PR welcome.
But this is a side effect of the first one, which is a connection issue. (In any case, it explains the memory ballooning, since we just fill up memory with events that should be sent to the client but are not.)
I've baselined our code on the version corresponding to the commit "Fix Issue #108," available since about 18 days ago.
But the version where we see the above issue is much older; it dates back to "Fix #71." We're planning a new deployment soon.
In this particular extreme case, where the network goes haywire, I assume the OPC UA clients and servers are seeing disconnects constantly and are desperately attempting to restore the connection.
I merged a fix for session renewal. You can also try it. It might solve the fundamental issue in your case
I've also been running into this recently. My scenario: I have both an asyncua client and server. The connection between the two is occasionally unstable, so when the client disconnects (and closes the session), the request to delete all of the client's subscriptions here never actually makes it to the server. The subscription results still queue up here, which I believe is where the memory is leaking. If I read the spec correctly, the subscription should be deleted after exceeding the lifetime limit, but as far as I can tell it is not, which means memory continues to leak until the program crashes.
For now, I can rely on my client auto-reconnecting to the server, so I simply delete all of the old subscriptions upon reconnection, but I think the long-term fix might be deleting the subscriptions on the server side once they expire.
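The client-side workaround described above could look something like the sketch below. It assumes the client keeps references to its old `Subscription` objects; `reconnect_with_cleanup` and `make_handler` are illustrative names, while `Subscription.delete()` and `Client.create_subscription()` are real asyncua calls:

```python
# Illustrative reconnect cleanup: delete stale server-side subscriptions
# before creating fresh ones, so the server's result queues for the old
# subscriptions cannot keep growing. Function names are assumptions.
import asyncio

async def reconnect_with_cleanup(client, old_subscriptions, make_handler):
    """Delete dangling subscriptions on the server, then resubscribe."""
    for sub in old_subscriptions:
        try:
            # asyncua's Subscription.delete() sends a DeleteSubscriptions request.
            await sub.delete()
        except Exception:
            # The server may already have dropped the subscription/session.
            pass
    # Recreate a fresh subscription (500 ms publishing interval as an example).
    return await client.create_subscription(500, make_handler())
```

This only mitigates the leak when the client does reconnect; a client that never comes back leaves the dangling subscription in place, which is why a server-side fix is still needed.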
I found several of our OPC UA servers frantically logging the expiry notification again. I updated the servers to version 1.0.4 only a few weeks back; the previous update goes back to December 2022, when I bumped them to 1.0.1.
2023-09-06 14:51:10,207 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055148) > lifetime count (12000)
2023-09-06 14:51:10,307 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055149) > lifetime count (12000)
2023-09-06 14:51:10,407 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055150) > lifetime count (12000)
[... the same warning repeats roughly every 100 ms with an incrementing publish cycle count ...]
2023-09-06 14:51:12,406 [WARNING] asyncua.server.internal_subscription - Subscription Subscription(id:85) has expired, publish cycle count(1055170) > lifetime count (12000)
I will keep the servers under observation. Is there anything I can do to better understand the root cause of this issue?
I am facing the same issues. Did anyone manage to get rid of this yet?
We are encountering the same issue that was reported here. Has anyone managed to resolve it?
In the case I described above, I think there are two issues:
When combined, there's no way for you to delete the now-dangling subscription, and memory leaks as the queue grows.
To address point 1, I think there needs to be a server-side session watchdog to kill any dangling sessions that weren't properly closed by the client. Maybe something like this?
This seems to be in compliance with section 5.6.2 of the spec:
Sessions are terminated by the Server automatically if the Client fails to issue a Service request on the Session within the timeout period negotiated by the Server in the CreateSession Service response. This protects the Server against Client failures and against situations where a failed underlying connection cannot be re-established. Clients shall be prepared to submit requests in a timely manner to prevent the Session from closing automatically.
EDIT: The above solution was predicated on my assumption that the session closure was being initiated solely on the client side, which is not correct. That being said, I do still think honoring the timeout portion of the spec makes sense.
To address point 2, I think the subscription just needs to be stopped like this?
This seems to be in compliance with section 5.13.1.1 part 8 of the spec:
Subscriptions have a lifetime counter that counts the number of consecutive publishing cycles in which there have been no Publish requests available to send a Publish response for the Subscription. Any Service call that uses the SubscriptionId or the processing of a Publish response resets the lifetime counter of this Subscription. When this counter reaches the value calculated for the lifetime of a Subscription based on the MaxKeepAliveCount parameter in the CreateSubscription Service (5.13.2), the Subscription is closed. Closing the Subscription causes its MonitoredItems to be deleted. In addition the Server shall issue a StatusChangeNotification notificationMessage with the status code Bad_Timeout. The StatusChangeNotification notificationMessage type is defined in 7.20.4.
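The server-side fix for point 2 amounts to making the expiry check actually stop the subscription (and emit the Bad_Timeout status change once), rather than warning on every cycle forever. The sketch below is illustrative; the names mirror the log output above but are not the exact `internal_subscription.py` code:

```python
# Hedged sketch of the point-2 fix: once the publish cycle count passes the
# lifetime count, stop the subscription instead of warning indefinitely.
# Class and attribute names are illustrative, not asyncua internals.

class InternalSubscription:
    def __init__(self, lifetime_count: int):
        self.lifetime_count = lifetime_count
        self.publish_cycle_count = 0
        self.running = True
        self.notifications = []

    def publish_cycle(self, publish_request_available: bool) -> None:
        if not self.running:
            return  # already stopped; queues no longer grow
        if publish_request_available:
            self.publish_cycle_count = 0
            return
        self.publish_cycle_count += 1
        if self.publish_cycle_count > self.lifetime_count:
            # Stop exactly once and issue the Bad_Timeout status change,
            # per section 5.13.1.1; MonitoredItems would be deleted here too.
            self.running = False
            self.notifications.append("StatusChangeNotification(Bad_Timeout)")
```

With this shape, the "has expired" condition fires once and the subscription's queue stops accumulating results, which is the memory-leak mitigation discussed above.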
Happy to PR the changes if @oroulet thinks this makes sense.
Thanks for the thorough analysis. Honestly, I do not know what is happening here. As you wrote above, when we lose the connection we delete the session, which owns the subscriptions, so when a client disconnects, cleanly or not, everything should be deleted. It might be an issue with clients creating many subscriptions and not using them; we can delete those subscriptions then, as per spec. PR welcome. But I have a hard time understanding how it becomes an issue unless you have a completely broken client... That safeguard makes more sense when sessions can be reconnected to, and we do not support that.
I have no evidence to back this up (and should probably enable more verbose logging to confirm), but it seems possible that the publishing task could silently die if an exception other than the 3 that are caught here is raised. This matches behavior I've experienced where the subscriptions no longer appear to update but everything else on the client works as expected. Would it make sense to do either of: a) make the publishing task more robust to unexpected errors (perhaps by retrying without acknowledgement, as is done with other errors), or b) at least log a critical warning when an unexpected error occurs?
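Option (a) could be sketched roughly like this. It is generic asyncio code, not the asyncua publishing task itself; `publish_once` is a stand-in for the real publish step:

```python
# Sketch of option (a): keep the publishing loop alive through unexpected
# errors, and at minimum (option b) log them loudly so the task never dies
# silently. `publish_once` is an illustrative stand-in callable.
import asyncio
import logging

_logger = logging.getLogger("publish_loop")

async def robust_publish_loop(publish_once, interval_s: float = 0.1):
    while True:
        try:
            await publish_once()
        except asyncio.CancelledError:
            raise  # let normal task cancellation propagate
        except Exception:
            # Make the failure visible instead of letting the task exit:
            # logs the traceback and retries on the next cycle.
            _logger.exception("Unexpected error in publishing task, retrying")
        await asyncio.sleep(interval_s)
```

The key point is the broad `except Exception` with a logged traceback: any uncaught exception type no longer terminates the loop, so subscriptions keep publishing after a transient failure.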
I've now addressed the two points discussed here. While the situation where a subscription exceeds its lifetime count can still occur, I believe my fixes will stop such a subscription and mitigate the associated memory leak. Re-open if the issue recurs even after the fixes.
We've noticed that in our production environment, where we run multiple instances (19 right now) of asyncua-based OPC UA servers, some instances enter some strange mode from which recovery is difficult. A symptom of this runaway mode is memory ballooning until the limit is reached.
From the moment the issue manifests itself, the logs are blasted with the following statements, following a "The session id is not valid" error:
Sometimes, when this happens, nothing else gets logged except that statement, in the thousands. The above log goes on to "publish cycle count" 2320 and beyond:
We've noticed serious connectivity issues on the production network, which seem to trigger this phenomenon. The "session id is not valid" error indicates an OPC UA client is reconnecting on account of a connectivity dropout.
Any ideas what might happen here at the subscription level from the perspective of the asyncua library?