frizbog / gedcom4j

Java library for reading/writing genealogy files in GEDCOM format
http://gedcom4j.org
53 stars 36 forks source link

Better/lighter/cheaper ArrayList? #108

Closed frizbog closed 8 years ago

frizbog commented 8 years ago

A huge amount of heap memory is consumed by the model where it has lots of ArrayLists of items, pre-initialized, and usually empty. In #103, this was improved a good bit by setting the default capacity of the ArrayLists to 0, thereby avoiding the memory allocation of the backing array until it is needed - but the overhead of the ArrayList class is still there, and duplicated all over the place - sometimes hundreds of thousands of times, depending on the size of the file.

It ought to be possible to create an implementation of java.util.List that looks like an ArrayList and shares the same instance when the collection is empty, to avoid the duplication in the heap. Then the model can use that instead of ArrayList everywhere, and save a ton of memory for empty lists.

frizbog commented 8 years ago

I couldn't make this work. It just gets compounded with a whole ton of references to the new List type instead of the ArrayList. It's like a snake eating its own tail. The only solution here I can think of is to complete the conversion of the model to use Javabeans with getters and setters and do lazy initialization of the lists and only if needed...but even this is iffy, because as soon as you retrieve a list with the getter, you'll end up creating an empty arraylist that sits and takes up memory. This is going to take more thought.

frizbog commented 8 years ago

I think a better (and easier) choice than all of this noise is to put the user in control, and let them decide whether they want all the collections in the object model pre-initialized (for convenience at the expense of memory) or not (for memory savings but having to do null checks).

ghost commented 8 years ago

You might consider using the Builder pattern to help reduce the static memory footprint. The "Builder" classes could use ArrayLists but the "Product" classes could use straight arrays to save on memory footprint since they are intended to be immutable anyway. At any given time most of the objects in memory for an multi-thousand person family tree are not actually being modified so all of those objects would be tighter in terms of their memory footprint. Having the "Product" class instances be immutable could also allow for some further optimizations when there's something that repeats many times - instead of having many instances of equal objects on the heap we could safely have just one. (Ref Item 13 from Effective Java.) Let me know if you'd like a specific example. When you go modify an object there is of course the slight overhead to get it into a Builder to be mutable again - that's the trade-off.

frizbog commented 8 years ago

Normally I'd agree but the object model is not intended to be immutable. One of the use cases for the library is to provide the ability to make GEDCOM files from scratch, or change the data in a loaded file and re-write it. I looked at using Collection.emptyList() and similar things to share instances when the lists were empty, but these solutions make immutable collections and so developers would get UnsupportedOperationExceptions if they tried to update the object model, so I steered away from that.

I've created an object pool in 2.3.1-SNAPSHOT for strings that's a little looser than using String.intern(), so that way there's a limit on the number of items in the pool, infrequently used strings can be evicted from the pool (since there's no way to know how many times a string will appear), and it doesn't go in permgen if you're using a JRE earlier than Java 8.

I'm beginning on #83 as part of v3.0.0 of the library, and this will make it possible to do a few things, including having an option on whether you want the Lists to be pre-initialized (so you don't have to do if-null checks), or left null so the memory is not allocated (but you have to do the if-null checks).

haralduna commented 8 years ago

Tried 988e61c. Changed to getters and added ~5 missing null checks. Works perfectly. Test results:

2 3 0-clean 2 3 1-snapshot-6d73abd-3 3 0 0-snapshot-988e61c

frizbog commented 8 years ago

Thanks Harald. When you say added a few null checks, do you mean to your own code in your app, or did I miss some in gedcom4j that needed to be touched up?

haralduna commented 8 years ago

In my own code.

frizbog commented 8 years ago

Released in v2.3.1