potential memory leak - Githubissues

JoeyFaulkner commented 4 years ago

hey!

I'm trying to have a long running OPC server/client "bridge". Similar to the code written here https://github.com/Green-Fusion/opcua-bridging/blob/master/examples/bridge/cloud/cloud/forwarding-service.py, which is every second read by another client. I find that after a few days the memory taken by the process continuously increases until the process crashes. I saw another person had similar issues in #79 so followed their method of using memtop, and after 24 hours of running I have this:

opcua_bridge_1     | WARNING:asyncua:
opcua_bridge_1     | refs:
opcua_bridge_1     | 66162      <class 'set'> {<weakref at 0x7f1ad5a0c830; to '_asyncio.Task' at 0x7f1ad5a00050>, <weakref at 0x7f1acfe5bef0; to '
opcua_bridge_1     | 66158      <class 'list'> [<Task finished coro=<UaProcessor.close() done, defined at /usr/local/lib/python3.7/site-packages/as
opcua_bridge_1     | 14367      <class 'dict'> {'__module__': 'asyncua.ua.object_ids', 'Null': 0, 'Boolean': 1, 'SByte': 2, 'Byte': 3, 'Int16': 4,
opcua_bridge_1     | 6764       <class 'dict'> {NumericNodeId(i=3062): NodeData(id:NumericNodeId(i=3062), attrs:{<AttributeIds.NodeId: 1>: Attribut
opcua_bridge_1     | 5742       <class 'list'> ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018  Paul T. McGuire\n', '#\n', '# Permiss
opcua_bridge_1     | 2404       <class 'dict'> {139753465755680: <weakref at 0x7f1ae325d770; to 'type' at 0x7f1ae3aeb820 (type)>, 139753465774432:
opcua_bridge_1     | 740        <class 'dict'> {'__module__': 'lxml.etree', '__doc__': 'Libxml2 error types', '__dict__': <attribute '__dict__' of
opcua_bridge_1     | 594        <class 'pytz.lazy.LazyList.__new__.<locals>.LazyList'> ['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/
opcua_bridge_1     | 594        <class 'pytz.lazy.LazySet.__new__.<locals>.LazySet'> LazySet({'America/New_York', 'America/El_Salvador', 'Pacific/Truk', 'Africa/Casablanca', 'America/Mo
opcua_bridge_1     | 558        <class 'dict'> {FourByteNodeId(i=14846): <class 'asyncua.ua.uaprotocol_auto.KeyValuePair'>, FourByteNodeId(i=15671)
opcua_bridge_1     |
opcua_bridge_1     | bytes:
opcua_bridge_1     | 2097384     {<weakref at 0x7f1ad5a0c830; to '_asyncio.Task' at 0x7f1ad5a00050>, <weakref at 0x7f1acfe5bef0; to '
opcua_bridge_1     | 589936      {'__module__': 'asyncua.ua.object_ids', 'Null': 0, 'Boolean': 1, 'SByte': 2, 'Byte': 3, 'Int16': 4,
opcua_bridge_1     | 578944      [<Task finished coro=<UaProcessor.close() done, defined at /usr/local/lib/python3.7/site-packages/as
opcua_bridge_1     | 147568      {NumericNodeId(i=3062): NodeData(id:NumericNodeId(i=3062), attrs:{<AttributeIds.NodeId: 1>: Attribut
opcua_bridge_1     | 51760       ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018  Paul T. McGuire\n', '#\n', '# Permiss
opcua_bridge_1     | 36976       {139753465755680: <weakref at 0x7f1ae325d770; to 'type' at 0x7f1ae3aeb820 (type)>, 139753465774432:
opcua_bridge_1     | 36976       {'__module__': 'lxml.etree', '__doc__': 'Libxml2 error types', '__dict__': <attribute '__dict__' of
opcua_bridge_1     | 33008       LazySet({'America/New_York', 'America/El_Salvador', 'Pacific/Truk', 'Africa/Casablanca', 'America/Mo
opcua_bridge_1     | 33008       LazySet({'America/New_York', 'America/El_Salvador', 'Africa/Casablanca', 'America/Montserrat', 'Euro
opcua_bridge_1     | 18536       {'sys': <module 'sys' (built-in)>, 'builtins': <module 'builtins' (built-in)>, '_frozen_importlib':
opcua_bridge_1     |
opcua_bridge_1     | types:
opcua_bridge_1     | 170880      <class 'dict'>
opcua_bridge_1     | 69333       <class 'weakref'>
opcua_bridge_1     | 66173       <class 'Context'>
opcua_bridge_1     | 66167       <class 'coroutine'>
opcua_bridge_1     | 66162       <class '_asyncio.Task'>
opcua_bridge_1     | 39245       <class 'asyncua.ua.uatypes.StatusCode'>
opcua_bridge_1     | 39243       <class 'asyncua.ua.uatypes.DataValue'>
opcua_bridge_1     | 39243       <class 'asyncua.ua.uatypes.Variant'>
opcua_bridge_1     | 39238       <class 'asyncua.server.address_space.AttributeValue'>
opcua_bridge_1     | 15482       <class 'asyncua.ua.uatypes.NumericNodeId'>

It seems like the problem is here: opcua_bridge_1 | 2097384 {<weakref at 0x7f1ad5a0c830; to '_asyncio.Task' at 0x7f1ad5a00050>, <weakref at 0x7f1acfe5bef0; to ' which looks like just a plain set of "weakref"s to "Task"s.

If anyone already knows what the problem is that would help me a lot, but until then i'll start working on it

swamper123 commented 4 years ago

This is one of those nightmare problems. Something that can just occurs after a long while.

Did you already locate where these Task refering to? Are you disconnecting your clients properly?

Could be interessting to know, if this may happen with a single client as well. This may restrict the source of this evil. Perhaps monitoring such stuff with asyncio.all_tasks() may help as well (so you may not need to wait 24h) and compare which tasks already existed xy minitues/hours ago and not have been garbage collected. In Python3.8 you can give asyncio Tasks even names, perhaps this may help against cryptic names as well?!

JoeyFaulkner commented 4 years ago

really interesting points to think about! the disconnecting clients thing could easily be the issue. Will also take a look at asyncio.all_tasks. Thank you so much!

swamper123 commented 4 years ago

the disconnecting clients thing could easily be the issue.

I haven't found an explicit client.disconnect command on the fly and no async with client statementblock. So this was my first guess. ;)

Perhaps another good tool is tracemalloc. You can make snapshots between two steps and compare them.

JoeyFaulkner commented 4 years ago

hmm, ok, so from playing around with tracemalloc i get

opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/tracemalloc.py:113: size=188 KiB (+188 KiB), count=2000 (+2000), average=96 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/asyncio/base_events.py:404: size=342 KiB (+13.7 KiB), count=2508 (+97), average=140 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/_weakrefset.py:84: size=273 KiB (+6336 B), count=1693 (+72), average=165 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/site-packages/asyncua/server/binary_server_asyncio.py:46: size=112 KiB (+4896 B), count=844 (+36), average=136 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/site-packages/asyncua/server/binary_server_asyncio.py:52: size=112 KiB (+4896 B), count=843 (+36), average=136 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/asyncio/events.py:41: size=69.7 KiB (+3040 B), count=892 (+38), average=80 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/asyncio/selector_events.py:192: size=3456 B (+1728 B), count=6 (+3), average=576 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/site-packages/asyncua/server/binary_server_asyncio.py:53: size=14.3 KiB (+1672 B), count=1 (+0), average=14.3 KiB
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/tracemalloc.py:185: size=1685 KiB (+1104 B), count=26960 (+17), average=64 B
opcua_bridge_1     | WARNING:asyncua:/usr/local/lib/python3.7/tracemalloc.py:127: size=1056 B (+968 B), count=12 (+11), average=88 B

And it seems it is still storing a lot of memory in the weakref for the tasks, but also from /asyncio/base_events.py This is after wrapping the client in an "async with", however the "async with" is not exited until the process is stopped, so I still have a long running client here which probably prevents garbage connection. Do you happen to have any idea whats oging on here? Is it just bad practice to have such a long running client?

oroulet commented 4 years ago

Long running client is for sure not a bad design. Did you check all_tasks list? Does it increase over time? Also check the acknowledgement lists there have been memory leaks there earlier. Also did you try to Google for python memory leaks? Seems to be some tools like objgraph

swamper123 commented 4 years ago

Do you happen to have any idea whats oging on here? Is it just bad practice to have such a long running client?

To be honest: No, not at the moment. The only thing I can say is, that I know the used ressource curve goes up for a while and then stays steady in a normal "Server is running" and "client ocnnects, requests, disconnects" use-case. But with this record the amount of source-problems is reduced ( hopefully).

Could you show us the code you used with these changes you made? Where exactly do you made the snapshots?

EDIT: Like orulet mentioned: If there are async tasks still running, which should be dead, it would explain why they are not garbage collected.

JoeyFaulkner commented 4 years ago

Thanks so much for your help. I think im on my way. So it turns out that these tasks are small, c. 160 bytes, but my client was causing the server to create a lot of them and even when they're finished, they're kept in asyncio as a record of what tasks happened.

However over a few days, so many of these are being created that they start taking up hundreds of mb, which on my tiny server is enough to cause a memory problem. I think right now I'm going to try and find a way of just deleting the old Tasks and see if this is fixed. I guess we'll see in a few days. Will update you :) I think its a bit similar to this https://stackoverflow.com/questions/44605752/delete-destroy-finished-futures-asyncio-python but a global variable.

JoeyFaulkner commented 4 years ago

ok i found the issue! I'll just explain what happened in case someone has a similar problem later down the line.

So this isnt an asyncio problem, basically today i learned that weakref is a set of references to objects, but those references are "weak", and so if the only references to an object are "weak" then the garbage collector will get rid of the object from the memory. Suffice to say, if you see a weakset getting larger, the problem isnt the weakset itself but instead the other objects referencing the objects in this weakset. I used gc.get_referrers on the inidividual tasks on the weakset returned by asyncio.get_tasks() and it showed that there was another list keeping all these async tasks (and eventually getting very large), even after they were completed or cancelled. I ran gc.get_referrers on this list (which was returned by the original gc.get_referrers and it turned out it was being stored as BinaryServer.protocol_tasks (https://github.com/FreeOpcUa/opcua-asyncio/blob/master/asyncua/server/binary_server_asyncio.py#L114). This is defined as just an empty list, but its passed into OPCUAProtocol (https://github.com/FreeOpcUa/opcua-asyncio/blob/master/asyncua/server/binary_server_asyncio.py#L29) and its a mutable object, so when asyncio tasks are added to the list (https://github.com/FreeOpcUa/opcua-asyncio/blob/master/asyncua/server/binary_server_asyncio.py#L58, https://github.com/FreeOpcUa/opcua-asyncio/blob/master/asyncua/server/binary_server_asyncio.py#L53) they're kept on the BinaryServer in perpetuity and the list keeps getting longer and taking up more memory. I added a function to clear this list of finished/cancelled tasks with every new connection and it has fixed the memory leak problem.

I won't close this issue because I'm going to develop a pull request to fix this, but to know how to fix it it would be useful to know the purpose of this protocol_tasks list. Is it meant to be kept by the binary server and added to by the OPCUA connection? and if so is it meant to keep the old completed tasks in there? Or can they be deleted?

@swamper123 @oroulet thanks so much for your help!

JoeyFaulkner commented 4 years ago

https://github.com/JoeyFaulkner/opcua-asyncio/pull/3 this is an example of the kind of PR which fixes the issue

oroulet commented 4 years ago

Can't you just grep for that list and see if it is used somewhere?

Just make sure it is not used to close the tasks at the end... All tasks ahould be closed and awaited.

swamper123 commented 3 years ago

@JoeyFaulkner Is the issue still active in master? Otherwise we can close that one.

JoeyFaulkner commented 3 years ago

Yep, i think it is!

FreeOpcUa / opcua-asyncio

potential memory leak #267