ShammyLevva / FTAnalyzer

Family Tree Analyzer - Finds hidden details in your family tree. Install at
http://www.ftanalyzer.com/install
Apache License 2.0
54 stars, 21 forks

Improving Memory Consumption #215

Open fire-eggs opened 3 years ago

fire-eggs commented 3 years ago

I happened to see Tamura Jones' tweet about FTA v8 and excessive memory usage.

I know that Tamura is always pushing the limits of genealogy software with his huge family trees. This is a hot button for my own projects, so I was curious and did a little memory analysis of FTA. Take it all with a dash of salt ...

I used a smaller .ged file than Tamura because I couldn't use Visual Studio analysis and a larger GEDCOM at the same time on my machine.

I tested using the following GEDCOM file taken off the internet: 3075781.ged.zip

Based on this file, here are a couple of suggestions to reduce memory consumption. I think these would apply for most people's GEDCOMs, even Tamura's.

  1. NOTE handling. Notes are tracked using an XmlNodeList (FamilyTree.noteNodes), which is populated from the XmlDocument created by the ged2xml code. Unfortunately, every XmlNode keeps a reference to its owning document, so this causes the entire XmlDocument to be kept in memory! The larger the original GEDCOM, the more memory the XmlDocument consumes. If the note data stored in FamilyTree.noteNodes were copied so that it kept no references to the original XmlDocument, the savings could be large.
  2. After the XmlDocuments, the largest count of objects was FTAnalyzer.Fact. For the test .ged, there were 163,967 instances. I have two ideas here, both based on the fact that C# strings are expensive: on the 64-bit platform, a C# string requires 26+(2*length) bytes of storage, rounded up to the next multiple of 8 bytes!
     2a. The Reference property appears to be a string copy of the owning record's ID. In my test ged, the IDs are 10 characters long, which requires 48 bytes to store! So for this ged, that is 7.8M of memory. A reference to the owning object (e.g. FTAnalyzer.Individual) would instead take 8 bytes (1.3M).
     2b. The FactType property is also a string and appears to be the record tag. Most TAGs are four characters long, requiring 40 bytes to store as a string. Here, that is 6.5M of memory. [And there are longer TAGs as well...] It's probably a pain, but using an enum to encode the TAG instead requires 4 bytes (0.6M).
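The arithmetic behind those figures can be reproduced with a few lines of C#. This is only a sketch of the rule of thumb quoted above, not a measurement of the running application; the StringCost class and its method names are made up for illustration. It also assumes every Fact holds its own copy of the string: if several facts share one string instance, the real saving is smaller.

```csharp
using System;

// Hypothetical helper: applies the rule of thumb quoted above, it does not measure anything.
public static class StringCost
{
    // 26 bytes of overhead plus 2 bytes per character, rounded up to a multiple of 8.
    public static long EstimateBytes(int length)
    {
        long raw = 26 + 2L * length;
        return (raw + 7) / 8 * 8;
    }

    // Total bytes if each of `count` objects stores its own item of `bytesEach` bytes.
    public static long Total(long count, long bytesEach) => count * bytesEach;
}
```

With these definitions, StringCost.EstimateBytes(10) is 48 and StringCost.EstimateBytes(4) is 40. For 163,967 facts, StringCost.Total gives 7,870,416 bytes for the 10-character IDs against 1,311,736 for 8-byte references, and 6,558,680 bytes for the 4-character tags against 655,868 for 4-byte enum values.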

I hope this is informative and not a waste of your time.

ShammyLevva commented 3 years ago

Interesting. It's been so very long since I last looked at the notes handling that I'd never noticed that. I really can't recall the rationale behind keeping the nodes. If you'd asked, I'd have said the XML is loaded, parsed and then discarded; clearly that isn't actually the case.

I'll have a look and see if there are some quick fixes. Although, to be honest, the attitude shown by Tamura makes me actively want to leave it as it is. That said, their file is on the small side at just 250 MB. I've got one, supplied by a user for testing purposes, that's 300 MB; I'll test with that.

fire-eggs commented 3 years ago

I understand completely! It's the 90/10 rule: 90% of your possible users will have no problem loading their GED files. Difficult to justify the extra effort [and possible complexity] for the remaining 10%. Hard to beat the convenience of C# as well, setting aside these minor inconveniences.

I've followed Tamura Jones closely in the past but recently he seems to be sounding too negative.

ShammyLevva commented 3 years ago

The doc isn't needed after loading, so disposing of the XML frees up almost 50% of used memory. Good shout on that one. The Reference field isn't so straightforward. It's not a great field anyway, but it's used to store an inbound reference for the thing that created the fact. That can be an individual, a family or a census reference (or a couple of other things). That reference is then used to report on date issues.

I'm not convinced at present there isn't a better way of handling the reference, but at present it's used to highlight where an error appears in the file. Some sort of reflection to get the parent may be an alternative, but I'd rather not go down that rabbit hole. So maybe just an object reference, but I'm loath to use untyped variables: a hangover from 30+ years of procedural coding.
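For the XML point, one way to let the document go is to copy the note text into plain strings before dropping it. The sketch below is an assumption about how that could look, not the change actually made in FTAnalyzer: the NoteExtractor class is hypothetical, and it assumes the ged2xml output uses elements named NOTE. XmlDocument does not implement IDisposable, so "disposing" here means making sure nothing still references the document or any of its nodes.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of FTAnalyzer.
public static class NoteExtractor
{
    // Copies the text of every NOTE element into plain strings.
    // The returned list holds no XmlNode, so it does not keep the document alive.
    public static List<string> CopyNoteText(XmlDocument doc)
    {
        var notes = new List<string>();
        XmlNodeList nodes = doc.SelectNodes("//NOTE"); // element name is an assumption
        if (nodes == null) return notes;
        foreach (XmlNode node in nodes)
            notes.Add(node.InnerText); // InnerText also includes the text of child elements
        return notes;
    }
}
```

Once the strings are copied and the XmlNodeList is no longer referenced, the garbage collector is free to reclaim the whole document.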

ShammyLevva commented 3 years ago

The issue with FactType is that it's a string-to-tag conversion: the string is what it looks for in the raw GEDCOM, so I can't see how an enum would work in those circumstances. Have I misunderstood?

fire-eggs commented 3 years ago

> The doc isn't needed after loading, so disposing of the XML frees up almost 50% of used memory. Good shout on that one.

I'm glad it was useful!

> I'm not convinced at present there isn't a better way of handling the reference

I understand - adding complexity to address this isn't always the "best" thing to do. E.g. I built a string cache to re-use strings for things like tags, surnames, location names, etc, which made the code much harder to work with.

> FactType is that it's a string to tag conversion so the string

This is almost certainly my misunderstanding, merely looking at the class members and not the code. I assumed the full "fact" tag string is being stored with every Fact instance.

My idea of an enum is to store the value TagEnum.BIRT instead of the string "BIRT". This requires a string <> enum translation table and (e.g.) replacing any code that references "BIRT" with TagEnum.BIRT. Definitely a lot of work for little gain.
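A minimal sketch of such a translation table follows. The tag list is a small illustrative subset and the TagTable class is hypothetical; nothing here is taken from the FTAnalyzer code. Note that anything not in the enum, such as a custom tag, collapses to Unknown, so a real version would need a fallback that keeps the original text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical enum; a real one would list every tag the application handles.
public enum TagEnum { Unknown = 0, BIRT, CHR, DEAT, BURI, MARR, CENS }

public static class TagTable
{
    // Built once: maps the raw GEDCOM tag text to its enum value.
    private static readonly Dictionary<string, TagEnum> ByName = BuildTable();

    private static Dictionary<string, TagEnum> BuildTable()
    {
        var table = new Dictionary<string, TagEnum>(StringComparer.OrdinalIgnoreCase);
        foreach (TagEnum tag in Enum.GetValues(typeof(TagEnum)))
            if (tag != TagEnum.Unknown)
                table[tag.ToString()] = tag;
        return table;
    }

    // Raw tag text from the file -> enum; unrecognised tags become Unknown.
    public static TagEnum FromGedcom(string rawTag) =>
        rawTag != null && ByName.TryGetValue(rawTag, out TagEnum tag) ? tag : TagEnum.Unknown;

    // Enum -> tag text, e.g. for reports or for writing GEDCOM back out.
    public static string ToGedcom(TagEnum tag) => tag.ToString();
}
```

The parser would call TagTable.FromGedcom("BIRT") once per record and store the resulting TagEnum.BIRT on the fact, so the string only exists while the line is being read.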

ShammyLevva commented 3 years ago

To be fair, the reference field is a bit of a hack. I needed a way, when the Fact creation errored (e.g. on a bad date), to let the user know which individual or family had the problem. This is typically a rare occurrence, and because a fact can be created to attach to multiple different types of object, I used an ID to report back to the user.

A significant refactor to pass the object and then reflect on the object type could be an alternative, which would have the added benefit of giving access to the object, so the report could give fuller details of the object the fact was being created by rather than just an ID.
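One possible shape for that refactor, avoiding both reflection and an untyped object field, is a small interface implemented by everything that can own a fact. The sketch below is purely illustrative: IFactOwner and the Demo classes are invented stand-ins, not the real FTAnalyzer types, and the real classes carry far more state.

```csharp
using System;

// Hypothetical interface: anything that can own a fact exposes its ID and a description.
public interface IFactOwner
{
    string Id { get; }
    string Describe();
}

// Simplified stand-in for an individual record.
public sealed class DemoIndividual : IFactOwner
{
    public DemoIndividual(string id, string name) { Id = id; Name = name; }
    public string Id { get; }
    public string Name { get; }
    public string Describe() => $"Individual {Id} ({Name})";
}

// Simplified stand-in for a family record.
public sealed class DemoFamily : IFactOwner
{
    public DemoFamily(string id) { Id = id; }
    public string Id { get; }
    public string Describe() => $"Family {Id}";
}

// Simplified stand-in for a fact: it stores a typed reference instead of an ID string.
public sealed class DemoFact
{
    public DemoFact(IFactOwner owner, string factType) { Owner = owner; FactType = factType; }
    public IFactOwner Owner { get; }
    public string FactType { get; }

    // Error text can use the owner's full description, not just its ID.
    public string DescribeError(string problem) =>
        $"{problem} in {FactType} fact of {Owner.Describe()}";
}
```

The fact then holds a single typed reference to its owner, and error reporting can still show the ID through Owner.Id while also having the rest of the owner's details to hand.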