frizbog / gedcom4j

Java library for reading/writing genealogy files in GEDCOM format
http://gedcom4j.org

GEDCOM model RAM usage #103

Closed. haralduna closed this issue 8 years ago

haralduna commented 8 years ago

I am getting out-of-memory reports for my Android app, and I can reproduce the error myself. It happens when parsing a GEDCOM file with 8,000 individuals. The GEDCOM file itself is 14 MB. Loading the model alone brings the memory usage up from 40 MB to 174 MB. As the fixed maximum RAM for an Android app is typically 192 MB, it is no surprise there are out-of-memory reports.

The GEDCOM file itself contains a lot of citations and notes, which doesn't help.

I have looked a little bit into your code, but I didn't find any easy fix so far.

I have one theory, though: are citations, notes, etc. copied into the families or events where they are referenced, so that multiple references to the same citation cause duplication of the strings in the string tree?

Maybe you have ideas about how the RAM usage can be reduced?

frizbog commented 8 years ago

Here's a rambling response so you can see what my thinking is...hopefully it will be somewhat informative, but I don't really have a solution for you in the near term.

Regarding duplication: The way the data is kept in memory is mostly determined by the way the file is structured...particularly with regard to NOTEs. A note can be given an XREF ID in a NOTE_RECORD:

    0 @N001@ NOTE Here is my note text

and then another record can cross-reference that note in a NOTE_STRUCTURE:

    3 NOTE @N001@

which avoids the duplication. If the GEDCOM file does that, that's how it's loaded in memory by gedcom4j: it keeps a reference to the common NOTE record rather than a copy. If the GEDCOM has each note inline without using XREFs, they're duplicated in the object model in memory.

For sources, GEDCOM supports tracking the sources themselves with SOURCE_RECORD structures, and repeated citations of those sources with SOURCE_CITATION structures, in a manner very similar to NOTEs. And again, the data is loaded into memory the same way it is recorded in the GEDCOM file.

Regarding reducing memory footprint: One of the biggest things that makes memory consumption bigger than it needs to be is the fact that all Lists and Maps in the object model are initialized with empty structures rather than being null; this helps avoid a lot of NullPointerExceptions and tedious null-checking - if it's a collection, you can generally assume it's initialized, but empty. This means most objects are sitting around with several empty collections, and each one takes memory. 8,000 individuals means creating 8,000 Lists of Notes, 8,000 Lists of Submitters, 8,000 Lists of Associations, 8,000 Lists of custom tag data, etc., most of which are going to be empty, even after loading data.
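To make the pattern concrete, here is a simplified sketch of what an eagerly initialized model class looks like (the class and field names are hypothetical, not the actual gedcom4j model classes):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified sketch of the eager-initialization pattern described above.
    // Every instance allocates its collections up front, even when they stay empty,
    // so 8,000 records mean 8,000 copies of each of these (mostly empty) lists.
    public class RecordWithEagerCollections {
        public List<String> notes = new ArrayList<String>();
        public List<String> citations = new ArrayList<String>();
        public List<String> associations = new ArrayList<String>();
        public List<String> customTags = new ArrayList<String>();
    }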

The best way I can think of to really solve this in the long term is to rework the object model to follow the JavaBean spec (see issue #83) and have the getters on collections lazy-initialize them on first access, letting them remain null the rest of the time. This would greatly reduce the memory footprint of the object model, at least right after parsing. That issue is something I'm looking to do in the 3.x release of gedcom4j (no planned date yet), and it is becoming more and more attractive to me...I'm trying to do some JSP/JSTL/EL work with gedcom4j and finding that the lack of getters/setters in the object model is preventing me from using EL and JSP effectively with the data. However, this will by necessity change the API (hence the major release number increment) in a way that will no longer be compatible with current code using the 2.x releases of gedcom4j.
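As a minimal sketch of that lazy-initialization idea (hypothetical names; the eventual 3.x design may differ):

    import java.util.ArrayList;
    import java.util.List;

    public class RecordWithLazyCollections {
        private List<String> notes;

        // JavaBean-style getter that creates the collection only on first access,
        // so callers can still assume a non-null (possibly empty) list.
        public List<String> getNotes() {
            if (notes == null) {
                notes = new ArrayList<String>(0);
            }
            return notes;
        }
    }

Records that never acquire notes then never pay for a backing array, while callers can still rely on getNotes() returning a non-null list.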

I'm not willing to stop initializing the collections in the data model without adding the getters and making the fields private...there is too much code (including gedcom4j itself) that would start breaking and getting NullPointerExceptions all over the place, and having to write all those if (x!=null) checks everywhere is just horrific. It would vastly change the semantics of the API without changing the syntax, which is a great recipe for bugs you only find at runtime and not at compile-time.

It might be possible to add a set of options to the Parser to identify tags and data that are not of interest to the app and to be ignored on parsing. For example, if your app does not deal with anything more than people's names, dates of birth and death, and their relationships to each other, then a lot of data can be ignored (like notes, sources, multimedia, repositories, etc) when reading the file. But I'm not sure how workable that really is.
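Sketching the idea only (the skip option below is hypothetical and does not exist in gedcom4j; the load/gedcom usage is the normal 2.x pattern):

    import org.gedcom4j.model.Gedcom;
    import org.gedcom4j.parser.GedcomParser;

    public class SkipHeavyDataExample {
        public static void main(String[] args) throws Exception {
            GedcomParser gp = new GedcomParser();
            // gp.setSkippedTags("NOTE", "SOUR", "OBJE", "REPO"); // hypothetical option, not a real API
            gp.load("family.ged");
            Gedcom g = gp.gedcom; // public field in the 2.x releases
            System.out.println("Loaded: " + (g != null));
        }
    }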

As long as gedcom4j loads files into an in-memory object model, there will always be a practical size limit, and a fully sourced, highly detailed GEDCOM with 8,000 individuals is a rather big GEDCOM file by most standards. That said, expanding a 14 MB file on disk into a data structure 10x that size is not OK and does not scale well at all for larger files, so this is making Issue #83 more attractive to fix and I will likely accelerate that effort.

The only way to make gedcom4j able to work with unlimited file sizes would be to create an event-based parser and let the consumer create their own object model (like SAX parsers for XML) rather than one that loads the file into a predefined object model for you (like DOM parsing for XML, which gedcom4j is analogous to).
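A rough sketch of what that could look like (entirely hypothetical; gedcom4j does not offer an event-based API, and the charset handling here is simplified):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Hypothetical SAX-style approach: one callback per line, and the caller decides
    // what (if anything) to keep, so memory use is bounded by the caller's own model.
    interface GedcomLineHandler {
        void onLine(int level, String rest);
    }

    class StreamingGedcomReader {
        void read(String fileName, GedcomLineHandler handler) throws IOException {
            // FileReader uses the platform default charset; ANSEL/UTF-8 handling is omitted here.
            try (BufferedReader r = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = r.readLine()) != null) {
                    int space = line.indexOf(' ');
                    if (space < 0) {
                        continue; // skip blank or malformed lines in this simplified sketch
                    }
                    int level = Integer.parseInt(line.substring(0, space));
                    handler.onLine(level, line.substring(space + 1));
                }
            }
        }
    }

A handler like this could, for example, count INDI records or keep only names and dates, discarding everything else as it goes.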

Bottom line, aka TL;DR: I'm afraid I don't have any good suggestions for you, other than to cap in your app how big a file it will allow the user to load, and to stand by for v3.x of gedcom4j (whenever that is) so this can be addressed.

frizbog commented 8 years ago

I've been thinking about this a lot and I have something I'm going to try: setting the initialCapacity of most/all collections in the data model to 0. Most of the time, most lists will stay empty, yet by default they reserve enough space for about 10 items. I'm going to try it and do some profiling.
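For illustration, this is the difference being tested (exactly how the default constructor behaves varies a bit by JVM/Android version, but the idea is the same):

    import java.util.ArrayList;
    import java.util.List;

    public class InitialCapacityDemo {
        // Depending on the runtime, the default constructor reserves room for roughly
        // 10-12 elements, either immediately or on the first add.
        List<String> defaultCapacity = new ArrayList<String>();

        // With an explicit initial capacity of 0, the backing array starts empty and
        // grows only as far as the actual contents require.
        List<String> zeroCapacity = new ArrayList<String>(0);
    }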

haralduna commented 8 years ago

I checked the contents of the heap and noticed that all lists have a default capacity of 12, and most of them are empty or hold fewer than 5 elements. That doesn't mean that 12 elements are allocated, only that the list can hold up to 12 elements before its backing array must be reallocated. I suspect this will not have a great impact.

frizbog commented 8 years ago

I did a pseudo-scientific test and it reduced memory footprint by about 7%, and of course that will vary based on the data. I did not observe a significant change in performance...<1%. I think this is a change I will keep in place, small though the improvement is.

frizbog commented 8 years ago

~7% memory reduction included in 2.2.9-SNAPSHOT, around 2016-06-24T20:35:27-04:00

frizbog commented 8 years ago

I was able to achieve another 13% or so improvement by using String.intern on xrefs and tags (especially helpful on tags, since they are predefined and used repeatedly). That totals a 21% reduction in heap usage so far in my pseudo-scientific tests.
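For illustration, this small demo shows the effect String.intern() has on repeated tag text (it is not the gedcom4j code itself):

    public class InternDemo {
        public static void main(String[] args) {
            String a = new String("INDI"); // a distinct String object, as parsing tends to produce
            String b = new String("INDI"); // another distinct object with identical contents
            System.out.println(a == b);                   // false: two copies on the heap
            System.out.println(a.intern() == b.intern()); // true: both map to one pooled instance
        }
    }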

haralduna commented 8 years ago

Firstly, I would like to say that I really appreciate your effort; this is of great value to me and my project.

I tried your new snapshot. It reduces the heap allocation for my test file from 162 MB to 149 MB, of which more than 24 MB is not related to the GEDCOM model. The reduction is more than 9%.

However, loading the file now takes 140 seconds, while it took only 75 seconds on 2.2.8. When it comes to reducing the processing time, I have a few ideas that I am trying out. In particular, I am looking into AnselHandler and LinePieces, which use StringBuilder with append calls that are very costly. I am also considering a rework of the AbstractEncodingSpecificReader.load function so that it processes one line at a time, without putting temporary results in lists.

frizbog commented 8 years ago

Yes, it is definitely slower...the String.intern()ing is to blame for the increase. I'm also going to look at ways to offset that extra load time.

frizbog commented 8 years ago

Latest v2.2.9-SNAPSHOT of 2016-06-25T10:48:05-04:00 performs considerably better. There were numerous places where calling string.intern() was adding overhead with no benefit, so I did it a lot more judiciously. Also reworked some of how LinePieces did its work.

frizbog commented 8 years ago

I was also looking at your comment about AbstractEncodingSpecificReader and the GedcomFileReader, which loads a whole copy of the file into memory as a List in order to process it and then throws it away. You're right, this is arguably unnecessary. I'll give it some more thought.
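A minimal sketch of the line-at-a-time alternative, assuming a hypothetical per-line consumer named handleLine (this is not the actual gedcom4j reader code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LazyLineProcessing {
        public static void main(String[] args) throws IOException {
            // Lines are read on demand and never collected into an intermediate List.
            try (Stream<String> lines = Files.lines(Paths.get("family.ged"), StandardCharsets.UTF_8)) {
                lines.forEach(LazyLineProcessing::handleLine);
            }
        }

        private static void handleLine(String line) {
            // Hypothetical: hand the line straight to the parser / tree builder here.
        }
    }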

frizbog commented 8 years ago

Did a number of things to try and improve performance while keeping the memory consumption benefits from earlier. It turns out the only String.intern()ing that made any real difference was on the Tags, so I left that in place. I also did a lot of rework on the LinePieces and GedcomParserHelper classes, including breaking several methods of the latter out into a new StringTreeBuilder class. This new class is a little easier to understand and no longer relies on recursion to find the most-recent parent node of a particular level ... it just keeps an index on the fly. This seems to be performing a bit better, but it will again vary on a file-by-file basis.
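A simplified sketch of that index-per-level idea (an assumption of how it could look, not the actual StringTreeBuilder code):

    import java.util.ArrayList;
    import java.util.List;

    public class LevelIndexedTreeBuilder {
        static class Node {
            final int level;
            final String text;
            final List<Node> children = new ArrayList<Node>();

            Node(int level, String text) {
                this.level = level;
                this.text = text;
            }
        }

        private final Node root = new Node(-1, "<root>");
        // Most recently seen node at each level; GEDCOM nesting is shallow in practice.
        private final Node[] lastAtLevel = new Node[100];

        // The parent of a line at level n is simply the last node seen at level n - 1,
        // so no recursive search back up the tree is needed.
        public void addLine(int level, String text) {
            Node node = new Node(level, text);
            Node parent = (level == 0) ? root : lastAtLevel[level - 1];
            parent.children.add(node);
            lastAtLevel[level] = node;
        }
    }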

This was loaded up in Maven Central as a 2.2.9-SNAPSHOT at 2016-06-25T22:58:46-04:00

haralduna commented 8 years ago

Earlier this week I did some profiling and managed to reduce the processing time for LinePieces and AnselReader significantly. Your latest LinePieces matches my new version w.r.t. processing time. With the optimized AnselReader, the processing time for my 8,000-person test file is reduced from 38s to 23s. Please find the Ansel code attached. AnselReader and AnselHandler are based on your latest version (3c1f77e).

AnselReader and AnselHandler

This is my LinePieces version:

    // Splits a GEDCOM line into its pieces: level, optional xref id, tag, and remainder.
    // Assumes the pieces are separated by single spaces.
    LinePieces(String line, int lineNum) throws GedcomParserException {
        String[] words = line.split(" ");
        if (words.length > 0) {
            level = Integer.parseInt(words[0]);
            if (words.length > 1) {
                int c = 0;
                if (words[1].startsWith("@")) {
                    // Line has an xref id, e.g. "0 @I1@ INDI"
                    id = words[1];
                    if (words.length > 2) {
                        tag = words[2];
                        c = words[0].length() + 1 + id.length() + 1 + tag.length() + 1;
                    }
                } else {
                    // No xref id; the tag is the second word, e.g. "1 NAME John /Doe/"
                    tag = words[1];
                    c = words[0].length() + 1 + tag.length() + 1;
                }
                if (tag != null && line.length() > c) {
                    remainder = line.substring(c);
                }
            }
        }
    }
frizbog commented 8 years ago

I had not noticed that ANSEL performed so badly...the files I generally use are UTF-8, so that was not the case I was profiling. I like what you did with the character array and have adopted it.

frizbog commented 8 years ago

v2.2.9-SNAPSHOT of 2016-06-26T14:54:45-04:00 includes latest revisions.

haralduna commented 8 years ago

I discovered a bug in my changes to the AnselReader and AnselHandler. One of my GEDCOM files has line lengths well above the 255-character maximum in the standard, so I got index-out-of-bounds exceptions in both the Reader and the Handler. The actual line length was 1276 characters. I increased the char buffer sizes in three places and then it works. I am not sure what a proper solution would be here.

frizbog commented 8 years ago

The most resilient solution I can think of is to split long lines up as they are read and introduce synthetic CONC tags, as if the lines had already been properly split. I did something similar for issue #100 (introducing CONT tags where they were omitted). Perhaps I can do something like that here.
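A minimal sketch of that splitting idea, assuming a maximum line length and a hypothetical helper class (not the actual gedcom4j implementation):

    import java.util.ArrayList;
    import java.util.List;

    public class LongLineSplitter {
        // Breaks an over-long line into a leading piece plus synthetic CONC continuation
        // lines, as if the producing system had split it properly in the first place.
        static List<String> split(String line, int level, int maxLen) {
            List<String> result = new ArrayList<String>();
            if (line.length() <= maxLen) {
                result.add(line);
                return result;
            }
            result.add(line.substring(0, maxLen));
            String concPrefix = (level + 1) + " CONC "; // continuation lines sit one level deeper
            int chunk = maxLen - concPrefix.length();
            for (int pos = maxLen; pos < line.length(); pos += chunk) {
                int end = Math.min(pos + chunk, line.length());
                result.add(concPrefix + line.substring(pos, end));
            }
            return result;
        }
    }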

frizbog commented 8 years ago

v2.3.0-SNAPSHOT of 2016-06-28T22:06:30-04:00 provides the capability to split long ANSEL lines using CONC lines, as if they had been properly split in the first place. This is mostly unchanged from previous releases, but please note that there have been a few changes to the API (mostly package changes).

The changes to the API are as follows:

frizbog commented 8 years ago

Released in v2.3.0