Swirrl / drafter

A clojure service and a client to it for exposing data management operations to PMD

Fixes #623 memory leak when appending & deleting triples #627

Closed. RickMoynihan closed this 2 years ago

RickMoynihan commented 2 years ago

This PR fixes a proven memory leak, which I believe is likely to be the culprit behind #623.

First I will detail how you can recreate and witness the leak yourself.

Reproducing the leak

First check out master at the commit before this PR, then find a suitably large (but not too large) file of triples. I used a 60MB file of uncompressed n-triples that I obtained from Scotland's site. The specific content of the data shouldn't matter too much.

Next fire up drafter and connect VisualVM to the drafter process so you can view graphs of heap utilization in real time. Once connected:

  1. Create a new draftset via the drafter API.
  2. Upload your triples to the draftset (a scripted sketch of these steps follows this list).
  3. Wait for the upload to finish (you can tail the logs to see when it completes; the import took me about 4-6 seconds for the 60MB file).
  4. Once the upload has finished, push the "force GC" button in VisualVM to ask the JVM to run a GC.
  5. Repeat the process several times by returning to step 2.
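If you prefer to drive steps 1-3 from a REPL, the sketch below shows the general shape using clj-http. The routes, port and credentials here are illustrative assumptions rather than drafter's documented API, so check the drafter API docs for the real ones.

```clojure
;; Rough reproduction sketch driven from a REPL with clj-http.
;; NOTE: the routes, port and credentials below are illustrative assumptions,
;; not drafter's documented API; consult the drafter API docs for the real ones.
(require '[clj-http.client :as http]
         '[clojure.java.io :as io])

(def base-url "http://localhost:3001")
(def auth ["publisher@swirrl.com" "password"])

(defn create-draftset! []
  ;; assumed route; creates an empty draftset and returns its id
  (get-in (http/post (str base-url "/v1/draftsets")
                     {:basic-auth auth :as :json})
          [:body :id]))

(defn append-triples! [draftset-id file]
  ;; assumed route; uploads an n-triples file into the draftset
  (http/put (str base-url "/v1/draftset/" draftset-id "/data")
            {:basic-auth auth
             :headers    {"Content-Type" "application/n-triples"}
             :body       (io/input-stream file)}))

(comment
  ;; upload the same file repeatedly into one draftset, forcing a GC in
  ;; VisualVM between uploads to watch the heap floor rise
  (let [ds (create-draftset!)]
    (dotimes [_ 5]
      (append-triples! ds "scotland-60mb.nt"))))
```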

You do not need to publish, and you do not need to create a new draftset each time. You will clearly witness the leak in VisualVM, as the floor rises after each batch and each GC:

[Screenshot 2022-08-10 at 16:22:18: VisualVM heap graph]

Each spike in the above chart is the heap allocation of a batch; the sharp cliff after each batch occurs when forcing a GC. Note the floor rising each time due to the memory leak. The floor rises by approximately 180MB each time, which I guess is roughly explained by 60MB of UTF-8 * 2 = 120MB, plus approximately 60MB of overhead.

Witnessing that the fix is real

If you repeat the above experiment after having checked out this PR/fix, you will see the floor after each GC return to the baseline (for me approximately 140MB of heap):

[Screenshot 2022-08-10 at 16:18:58: VisualVM heap graph with the fix applied]

Process to discover leak

I thought it might be beneficial to describe to others how you can easily find leaks like this.

First, download the Eclipse Memory Analyzer Tool (MAT); don't let the ugly UI fool you, this tool is awesome.

There was a general suspicion that data loading was the most likely cause of the leak, so I started drafter and ran a modest amount of data through the upload and publish process a few times.

Next I connected MAT. You can generate a heap dump directly from the UI for any VM running locally, or you can point MAT at one created by running the command-line tool `jmap -dump:format=b,file=heap.bin <pid>` (useful if you want to get one off a server).

For large heap dumps you may need to increase the heap available to MAT, e.g. by running `/Applications/mat.app/Contents/MacOS/MemoryAnalyzer -vmargs -Xmx8g`.

Once you have the heap dump opened in MAT, it will ask you to produce a report; I usually ask for a dominator report, though it's not always useful, and in this case it mainly identifies false positives. From here you can click Histogram to see the classes whose objects hold the most heap:

[Screenshot 2022-08-11 at 11:43:00: MAT histogram view]

Here we can clearly see some grafter quads suspiciously lurking in the heap; you'd expect these to be released after processing. Let's try and figure out what is holding onto them. Generally you can figure things out by clicking around and exploring the various options. In this case the route I took (and one I've taken before) is to right-click on the Quads class and list objects with outgoing references, then pick a quad at random in the list, right-click it and select "merge shortest paths to GC root", then walk down the tree from the Thread (the GC root that referenced the initial object) and keep walking until we see something interesting:

[Screenshot 2022-08-11 at 12:12:19: MAT merged shortest paths to GC root]

You will notice we start at the thread that is ultimately referencing the quad. Unsurprisingly this is drafter's write-loop thread, as that is the only thread in drafter responsible for writing quads. From here you can keep expanding references, slowly walking towards our quad.

Initially, at the top of the list, you'll notice the retained heap is small, but as we expand const__3 we see its retained heap is 189MB. I think this is explained by const__3 being a reference from the stack to the heap; and interestingly const__3 is a var. We next see that the var holds a Clojure atom, which holds a map, which eventually holds a drafter.async.jobs.Job, and that job then holds a partially applied function, whose first (or maybe second) argument arg1 is a lazy sequence, which ultimately, if we walk it many thousands more times, will land us at our initial quad.

So from this we can clearly see the problem: there is a var holding an atom which is storing our jobs, and the job has a function in it which is closing over a large sequence of quads. We can see enough hints in the traces here to pin down the Clojure namespace and find the var and atom in question:

https://github.com/Swirrl/drafter/blob/e40fe7f91842098c81cfb66e21addc38bd4b7b47/drafter/src/drafter/async/jobs.clj#L45-L47
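For readers without the code to hand, here is a minimal sketch of the retention chain MAT walked us through. Only the `jobs` atom and the `:function` key correspond to the real code; `make-job` and `append-quads!` are hypothetical stand-ins.

```clojure
;; Simplified sketch of the retention chain; `make-job` and `append-quads!`
;; are hypothetical stand-ins, not the real drafter functions.
(defn append-quads! [quads]
  ;; stand-in for the real write-loop work
  (count quads))

(defonce jobs (atom {:pending {} :complete {}}))   ; var -> atom -> map of jobs

(defn make-job [draftset-id quads]
  {:draftset-id draftset-id
   :start-time  (System/currentTimeMillis)
   ;; the partially applied function closes over the (potentially huge) lazy
   ;; sequence of quads; while this job map sits in the `jobs` atom, every one
   ;; of those quads remains reachable from a GC root
   :function    (partial append-quads! quads)})
```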

As I also had a drafter REPL open, I could then trivially switch to this namespace and inspect the contents of that atom:

```clojure
drafter.async.jobs> (first (:complete @jobs))
{:draftset-id {:id "8e66f77f-c120-414f-b434-e3eb35d496a6"},
 :start-time 1660148532362,
 :user-id "publisher@swirrl.com",
 :function #function[clojure.core/partial/fn--5857],
 ,,,}
```

Explanation of the fix

In the above analysis we can clearly see the :function which is closing over the data. So the problem is that we are keeping an in-memory record of completed jobs, which we retain so the UI and API can ask questions about job status. This change occurred back in 2019 in https://github.com/Swirrl/drafter/pull/338, and one known issue is that we never purge this in-memory record; however it was never thought likely to be a problem in practice, as the intention was that only a small amount of metadata would be held, and drafter would likely be restarted well before this became a problem.

So the above leak is by design*; but the design expects only metadata to be stored in the jobs, and we are now, I think, storing a continuation of the job in the :function field. I don't yet know why we've noticed the leaks recently, but can only guess that another PR tweaked this slightly, most likely making the closures bigger.

Regardless, the fix in this PR simply dissocs the :function when we move a job from :pending to :complete, and this can clearly be demonstrated to resolve a serious memory leak.
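A minimal sketch of that idea follows (not the literal diff; key names other than :function, :pending and :complete are illustrative):

```clojure
;; Sketch of the fix: when moving a finished job from :pending to :complete,
;; drop the closure so only lightweight metadata is retained for the UI/API.
;; Key names other than :function, :pending and :complete are illustrative.
(defn complete-job [state job-id result]
  (let [job (get-in state [:pending job-id])]
    (-> state
        (update :pending dissoc job-id)
        (assoc-in [:complete job-id]
                  (-> job
                      (dissoc :function)   ; <- the fix: release the closure over the quads
                      (assoc :status :complete
                             :result result))))))

;; applied to the jobs atom, e.g. (swap! jobs complete-job some-job-id {:ok true})
```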

Is this enough?

It is clear to me that this PR represents a strict improvement for a serious memory leak. However I have long suspected the batching code is susceptible to another class of leak, where memory accumulates for the duration of an append when ideally it would stream (see issue #633). This PR means that once the job is complete the memory can be shown to be released, but it doesn't attempt to fix this other leak (which is currently really just a suspicion of mine); that one would just set an upper limit on the amount of data we can load at once.

RickMoynihan commented 2 years ago

I'd still like to narrow this down and find where exactly this change occurred; I might invest some time on an archeological dig through git.

RickMoynihan commented 2 years ago

Ok, I can confidently say that this PR https://github.com/Swirrl/drafter/pull/539 introduced the problem. Running the same experiment with commit 09da6630 at the top of that PR yields the following graph:

[Screenshot 2022-08-11 at 15:18:36: VisualVM heap graph at commit 09da6630]

However, repeating the experiment at the commit immediately prior to that PR, 2e5619fa (drafter tag https://github.com/Swirrl/drafter/releases/tag/2.4.8), yields the following graph:

[Screenshot 2022-08-11 at 15:29:18: VisualVM heap graph at commit 2e5619fa]

RickMoynihan commented 2 years ago

I've also just looked at the ONS changelog and I'm pretty sure this PR was deployed on the 4th and 6th of April 2022, so this would, I think, entirely correlate with what we've been seeing and when.

https://docs.google.com/document/d/1rbPnNqi8Zw1eB8wCgGuRXBLT6Jz-7M0ExuQbGGNuAOs/edit?pli=1#

RickMoynihan commented 2 years ago

I've backported this fix and created maintenance builds for Scotland and ONS in PRs #628 and #629.