Swirrl / drafter

A Clojure service, and a client for it, exposing data management operations to PMD

Suspected memory leak in large append/delete operations #633

Open RickMoynihan opened 2 years ago

RickMoynihan commented 2 years ago

I've long suspected there is still a leak during the batching of large updates, i.e. that there is an upper bound on the size of an append/delete drafter can receive, determined by drafter's RAM. These leaks (if they do exist) are almost certainly* mopped up when the batch completes successfully; however, if the update consumes all memory before the batch completes or fails explicitly, my hunch is that an unrecoverable leak can occur.

Beyond just my suspicion, we periodically see issues like this.

There were some previous attempts to improve this, in particular PR #281. I've never truly believed that PR adequately solved the issue, though. IIRC the issue it identified was a known problem in mapcat itself, where it is "too eager" (laziness is really a spectrum); this in turn hinges on implementation details of apply and variadic arguments, see CLJ-1218.

I believe that PR did improve the situation, but it has always felt like it was focusing too narrowly on particular implementation details in Clojure, rather than addressing the root issue: lazy sequences are not really the right tool for this job†, and we should fundamentally switch the I/O loading to an eager process, for example something based upon transducers. Additionally, batching could perhaps be made more explicit by first literally splitting a job up into separate files on disk, and then consuming them one file at a time. This would have the advantage of moving us closer to being able to persist job progress across restarts etc...
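To make the mapcat problem concrete, and to sketch what a transducer-based eager loader might look like (`write-batch!` and `load-statements!` are hypothetical names, not drafter's actual API):

```clojure
;; mapcat is not fully lazy: it expands to (apply concat (map f coll)),
;; and apply must realize the first few elements of the mapped seq just
;; to build concat's argument list (CLJ-1218). With an unchunked source,
;; elements are forced before anyone has consumed the result:
(defn noisy [x]
  (println "realizing" x)
  [x])

(def s (mapcat noisy (iterate inc 0)))
;; the first few "realizing" lines print immediately, before (first s)
;; is ever called — with chunked sources even more is realized at once.

;; An eager alternative: drive the whole load with transduce, so each
;; batch is fully processed and released before the next is read, and
;; no lazy-seq head can pin earlier batches in memory.
(defn load-statements!
  [write-batch! batch-size statements]
  (transduce (partition-all batch-size)
             (completing
              (fn [total batch]
                (write-batch! batch)        ; side effect per batch
                (+ total (count batch))))   ; track statements written
             0
             statements))
```

The reducing loop holds only the current batch, so peak memory is bounded by `batch-size` rather than by the size of the whole update.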

NOTE: this issue is a hunch; we should first reproduce it by trying to append and delete a large quantity of data. A few GB should do it with the current default memory settings, though it may be more practical to lower drafter's heap in dev and try to recreate it with smaller files. We should first evidence the issue with an approach similar to the one I took in #623, i.e. heap monitoring, analysis and dumps; and once confirmed, any fix should also prove it is fixed with the same analytical methods.
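One way to set up the reproduction described above; the jar name, heap size and dump paths here are illustrative assumptions, not drafter's actual configuration:

```shell
# Run drafter with a deliberately small heap so the leak (if real) is
# cheap to trigger, and capture a heap dump automatically on OOM.
java -Xmx256m \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=./drafter-oom.hprof \
     -jar drafter.jar   # jar name is illustrative

# While a large append/delete job runs, sample heap usage periodically:
jcmd "$(pgrep -f drafter.jar)" GC.heap_info

# Or take an explicit dump mid-job, to compare against the post-job baseline:
jcmd "$(pgrep -f drafter.jar)" GC.heap_dump ./drafter-midjob.hprof
```

The resulting `.hprof` files can then be inspected with the same tooling used in #623 to see what is retaining the batches.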

* I make this claim with high certainty, but am not claiming infallibility. During the fix (PR #627) for issue #623 I repeatedly observed GC cleaning up and returning to the baseline once the job was finished. See the graphs in PR #627 for evidence.

† Fundamentally I think problems like this are likely to keep occurring if we retain the lazy-seq abstraction to maintain state across the batches.