clj-python / libpython-clj

Python bindings for Clojure
Eclipse Public License 2.0

Ways to reduce probability of memory leaks? #260

Open mharju opened 5 months ago

mharju commented 5 months ago

We have a long-running process that uses libpython-clj as a bridge between Java and Python code.

We've noticed that, for some reason, we seem to be leaking a small amount of memory. This is enough to cause issues when the processes run for multiple weeks or months on end.

Are there any good practices for transferring data between the Python and Java sides with minimal probability of leaking? What would be good ways to figure out what is causing the leaks?

Thank you!

cnuernber commented 5 months ago

Hey - sorry for not responding quicker on Slack, but honestly I think many people ask this question, so an issue is definitely the best place to collect our thoughts on this.

We don't know if the leak is on the binding side (meaning libpython is holding onto a Python object), on the pure Python side, or on the pure JVM side, so we have to narrow the issue down until we can figure this out. I don't know the Python API well enough to know if there are ways to check all of the allocated objects in Python, but I do know that Java can give you a few pieces of information in this area. Let's assume for now that the issue is specifically in how libpython-clj is dealing with Python objects.
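For the pure Python side, the standard library does offer a way to look at live objects. The following is a minimal sketch, not part of libpython-clj: it assumes the leak shows up as a growing number of objects tracked by Python's cyclic garbage collector, and the helper names `live_object_counts` and `growth_since` are hypothetical. Note that `gc.get_objects()` only returns container objects the collector tracks, so untracked objects such as plain strings and integers are not counted; the standard `tracemalloc` module may be a better fit if the growth is in raw allocations.

```python
import gc
from collections import Counter


def live_object_counts():
    """Count objects currently tracked by Python's cyclic GC, grouped by type name."""
    gc.collect()  # collect unreachable cycles first so only live objects are counted
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth_since(baseline, top=10):
    """Return up to `top` (type name, delta) pairs whose live count grew since `baseline`."""
    current = live_object_counts()
    grown = [
        (name, count - baseline.get(name, 0))
        for name, count in current.items()
        if count > baseline.get(name, 0)
    ]
    # Largest growth first, so the most suspicious types are at the front.
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]
```

One way to use this would be to take a baseline with `live_object_counts()` once the process has warmed up, then call `growth_since(baseline)` periodically. A type whose count keeps climbing across calls is a candidate for something being held onto.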

The first place to start is the GC topic. My original intention was that people could progressively move to stricter GC environments as they moved their code into production. So initially you operate in a permissive GC environment where we attempt to use the Java GC to track objects. That is fine for a few global objects, but in this case we don't want to rely on just the general GC mechanism - we want to use with-stack-gc-context type operations, where we can guarantee things will be released at a particular time. Using a stack GC context isn't possible all the time, but it does address some percentage of these issues.

If you are already using a stack GC context, then the next step may be to enable addref/release logging, capture the logs to a file or something like that, and check that all objects you expect to be released are in fact released - there could be an issue with the library here, or something could be holding onto a reference.

There is a set of forever references for things like class definitions - one question I would have is whether this concurrent hashmap is growing in size, as growth there guarantees a memory leak.

The next piece of advice is perhaps not as ideal: write as much of your solution as you can in Python in order to limit the amount of JVM->Python traffic, and then use the raw Python interfaces - not require-python but the low-level import-module, call-attr etc. pathways - to call your wrapper. That limits your potential for memory leaks in the first place. Then repeat the steps above with your now greatly reduced set of possible errors.
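As an illustration of the wrapper idea, here is a minimal sketch of a coarse-grained Python entry point. Everything in it is an assumption for the sake of example: the function name `summarize`, the computation it performs, and the choice to pass JSON strings in both directions so that only a single string crosses the bridge per call. It is one possible shape for such a wrapper, not something the library prescribes.

```python
import json
import statistics


def summarize(values_json):
    """Hypothetical coarse-grained entry point: one JSON string in, one JSON string out.

    All intermediate Python objects stay on the Python side and are released
    by Python itself once the function returns.
    """
    values = json.loads(values_json)
    result = {
        "count": len(values),
        # Guard against empty input, where mean() and max() would raise.
        "mean": statistics.mean(values) if values else None,
        "max": max(values) if values else None,
    }
    return json.dumps(result)
```

From the JVM side this would be a single call through the low-level pathways mentioned above, with the returned string parsed in Clojure. The trade-off is serialization overhead on every call in exchange for far fewer long-lived Python object references for the bindings to track.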

These are initial thoughts on the subject. You may even want to fork the library and deal with references in a different way that makes correctness easier to ensure for your specific use case.

Let us know how this goes - I think you are in somewhat unexplored territory so it is great for us to hear about this.

mharju commented 5 months ago

Thanks for your response! We will try to follow the steps and give more info in case we find something.