cowtowncoder / java-merge-sort

Basic stand-alone disk-based N-way merge sort component for Java
Apache License 2.0

Possible 'overallocation' of memory #5

Closed tallpsmith closed 12 years ago

tallpsmith commented 12 years ago

I just wanted to raise that I think there is a doubling of the expected amount of usage of memory by this library. We recently hit a problem with OOM during the sort because our estimatedSize calculations were wrong (not truly reflecting the cost of the object structure, you can see the fix here: https://github.com/Aconex/scrutineer/commit/47f304267426bdd44fe1b396770449ee1e0271ce )

When debugging what we might have done wrong, we stumbled upon code within Sorter.java that appears to hold two Object[] arrays at once, each sized from the sort memory configuration item. What we think is happening is:
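A minimal sketch of the retention pattern we believe is at play. The names (`readMax`, `firstSortedBatch`, the batch size of 4) are illustrative stand-ins, not the actual Sorter.java code:

```java
import java.util.Arrays;

public class PresortSketch {
    static int remaining = 12;  // pretend the input has 12 entries total

    // Stand-in for reading up to n entries from the input source.
    static Object[] readMax(int n) {
        int count = Math.min(n, remaining);
        remaining -= count;
        if (count == 0) return null;
        Object[] batch = new Object[count];
        Arrays.fill(batch, "entry");
        return batch;
    }

    public static void main(String[] args) {
        // First buffer, sized from the configured sort memory:
        Object[] firstSortedBatch = readMax(4);
        Arrays.sort(firstSortedBatch);
        // ... firstSortedBatch is spilled to disk here ...

        // Problem: firstSortedBatch is still reachable while the loop below
        // allocates a second, equally sized buffer -> ~2x configured memory.
        // Fix: release it once written, e.g. `firstSortedBatch = null;`
        Object[] next;
        while ((next = readMax(4)) != null) {
            Arrays.sort(next);
            // ... spill `next` to disk ...
        }
    }
}
```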

When you combine this 'doubling' of the configured memory with a bad object size estimate, it's an episode of Air Crash Investigations waiting to film.

Is this the correct reading of how the code works? From reading the presort method, only the first two lines need the firstSortedBatch reference; the do/while loop that follows doesn't need it. The firstSortedBatch reference really ought to be released (nulled out, making it eligible for GC), thereby honouring the configured memory utilisation. Some code refactoring in order?

thoughts?

cowtowncoder commented 12 years ago

Sounds plausible, I'll have a look. Thanks!

cowtowncoder commented 12 years ago

Quick comment on memory size calculations: the assumption is that most memory is used not by the Object[] itself, but by the entries it contains. The heuristics are approximate, of course.
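To illustrate how the per-entry estimate drives buffer sizing (a sketch of the arithmetic, not the library's exact code; `maxEntries` and the 256 MB budget are hypothetical):

```java
public class BufferSizing {
    // Hypothetical: how many entries fit in the configured byte budget,
    // given a per-entry size estimate.
    static long maxEntries(long configuredBytes, long perEntryEstimate) {
        return configuredBytes / perEntryEstimate;
    }

    public static void main(String[] args) {
        long configured = 256L * 1024 * 1024;      // 256 MB sort buffer
        long naive = maxEntries(configured, 24);   // optimistic 24-byte guess
        long real  = maxEntries(configured, 106);  // closer to the actual cost
        // The naive estimate buffers ~4x more entries than intended:
        System.out.println(naive / real);
    }
}
```

An estimate that is 4x too small makes the sorter buffer 4x as many entries, so the real heap usage is 4x the configured budget before the doubling even enters the picture.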

tallpsmith commented 12 years ago

You are correct, but the estimated size can be grossly off. We originally estimated 24 bytes/object; after sitting down and really looking at it (this was a good reference: http://www.ibm.com/developerworks/java/library/j-codetoheap/index.html), we realised that for an object wrapping a String and a long, it actually adds up to around 106 bytes, so we were 4x wrong! This then makes java-merge-sort buffer 4x as many entries to fill the configured memory.
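A back-of-the-envelope version of that calculation, assuming a 64-bit JVM without compressed OOPs and pre-Java-9 String layout. The header and padding numbers are rough heuristics that vary by JVM, not measured values:

```java
public class EntrySizeEstimate {
    // Rough footprint of an entry object wrapping a String and a long.
    static long estimate(int stringLength) {
        long header  = 16;                 // object header on 64-bit JVM
        long wrapper = header + 8 + 8;     // long field + reference to String
        long string  = header + 8 + 4 + 4; // char[] ref + hash + padding
        long chars   = header + 4 + 4      // char[] header + length + padding
                     + 2L * stringLength;  // 2 bytes per char (pre-Java 9)
        return wrapper + string + chars;
    }

    public static void main(String[] args) {
        // Well over 100 bytes for a 20-char key -- about 4-5x a naive
        // 24-byte guess:
        System.out.println(estimate(20));
    }
}
```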

If we combine that silly original estimate with the fact that we were running Java 6 u31 64-bit, which used compressed OOPs, and then had to downgrade to a JVM prior to compressed OOPs support (long story on the why, irrelevant here), plus the possible doubling of the Object[] above, our 2 GB heap pretty much vanished! :)

The second copy of the array is still a problem, I believe, because users of the library will allocate memory based on their understanding of the JVM environment they're running in, and if the library is doubling that, it may well be a penalty they can't afford (particularly if their object size calcs are pretty dodgy.. :) )

cowtowncoder commented 12 years ago

Yes, forgot to mention that the estimate obviously has to be close enough -- and Strings are expensive for multiple reasons.

I also think you may well be right that there is unintended reference retention, and that would be good to eliminate. Especially if there are duplicate Strings from previous rounds.

cowtowncoder commented 12 years ago

Ok, yes. Some things were not getting cleared as they should have been, including the one you pointed out. I added two smaller cleanup fixes as well; I hope those help a bit too.

I released a new version (actually two, after finding one more thing to fix) -- see if 0.7.1 behaves any better for you? Further suggestions for improvement are gladly accepted. :-)