ShammyLevva / FTAnalyzer

Family Tree Analyzer - Finds hidden details in your family tree. Install at
http://www.ftanalyzer.com/install
Apache License 2.0

Improving GEDCOM import speed #220

Open ennoborg opened 3 years ago

ennoborg commented 3 years ago

Is your feature request related to a problem? Please describe.

Now that Tamura Jones and I are trying to improve the Duplicate Search, the speed gap between that and the GEDCOM import is getting quite big, and since I'm always looking for ways to improve, this looks like a challenge.

Describe the solution you'd like

When I look at the code, I see that the XmlDocument class is used for the GEDCOM to XML conversion, and then again to read all sources, persons, etc. from that XmlDocument and store all relevant data in custom lists and dictionaries.

It appears that reading from XmlDocument objects is quite slow, and can be sped up by a factor of 6 to 10 by using XDocument and XElement. Although this covers only the reading part, my tests show that data storage is so fast that I think conversion to XDocument and XElement will still result in an obvious gain.

Describe alternatives you've considered

Reading that 'the XML is loaded and parsed then discarded' confirmed that there may be an even faster way, which is skipping the XML conversion altogether and using a new GEDCOM parser that writes to the internal data structures right away.
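As a rough illustration of that alternative, here is a minimal sketch of a single-pass GEDCOM reader that builds in-memory records directly, with no XML step in between. It is written in Python purely for brevity (FTAnalyzer itself is C#), and the `Record` class and `parse_gedcom` function are hypothetical names, not part of FTAnalyzer or of either parser listed below. It assumes well-formed input where each line's level is at most one deeper than the previous line's.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One GEDCOM line: level, optional @XREF@, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")


@dataclass
class Record:
    """Hypothetical in-memory node for one GEDCOM line and its sub-lines."""
    tag: str
    xref: Optional[str] = None
    value: str = ""
    children: List["Record"] = field(default_factory=list)


def parse_gedcom(lines):
    """Build the list of level-0 records in one pass over the lines."""
    roots = []
    stack = []  # stack[i] is the most recent record seen at level i
    for raw in lines:
        m = LINE_RE.match(raw.rstrip("\r\n"))
        if not m:
            continue  # sketch only: malformed lines are silently skipped
        level = int(m.group(1))
        rec = Record(tag=m.group(3), xref=m.group(2), value=m.group(4) or "")
        del stack[level:]  # drop records at this level or deeper
        if stack:
            stack[-1].children.append(rec)  # attach to the enclosing record
        else:
            roots.append(rec)
        stack.append(rec)
    return roots
```

Calling `parse_gedcom(open(path, encoding="utf-8"))` would return the HEAD record, the individuals, families and so on as a list, ready to be turned into whatever internal objects the application needs.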

Additional context

Some modern suggestions for parsers written in C# are:

  1. https://github.com/jaklithn/GedcomParser
  2. https://github.com/TheGeneGenieProject/GeneGenie.Gedcom

ShammyLevva commented 3 years ago

Sounds interesting. The XmlDocument processing has been in the code forever, as basically that's the core of the app: read the data and parse it into the structures ready for processing.

The thought behind using an XML parser originally was the ability to use XPath to pick out groups of data without having to iterate over the file. So that might be a factor to consider: something that can efficiently pick data from parts of the file as required.
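To make the XPath point concrete, the sketch below selects groups of records from an XML rendering of a GEDCOM file, and then shows one possible non-XML substitute: indexing records by tag once, so later lookups need no further iteration. It is in Python for brevity, and the XML layout shown is an assumption for illustration, not FTAnalyzer's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML layout for a converted GEDCOM file (assumed, not FTAnalyzer's).
XML = """
<GED>
  <INDI ID="I1"><NAME>John /Smith/</NAME><SEX>M</SEX></INDI>
  <INDI ID="I2"><NAME>Mary /Jones/</NAME><SEX>F</SEX></INDI>
  <FAM ID="F1"><HUSB REF="I1"/><WIFE REF="I2"/></FAM>
</GED>
""".strip()

root = ET.fromstring(XML)

# XPath-style selection: every individual, then only those whose SEX child is "F".
individuals = root.findall("./INDI")
females = root.findall("./INDI[SEX='F']")

# The same grouping without XPath: index top-level records by tag once.
by_tag = defaultdict(list)
for child in root:
    by_tag[child.tag].append(child)
```

A parser that writes straight to internal structures could keep a dictionary like `by_tag` (or one keyed by record ID) to retain the "pick out a group on demand" convenience that XPath provides.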

ennoborg commented 3 years ago

Using XDocument in the parser had less effect than I expected, so I am now concentrating on using the GeneGenie.Gedcom component, and that works much better for me. With that in place, I should be able to convert your objects to wrappers for the ones loaded by the parser, which would result in less copying and better performance.

This is a work in progress, with the Individual mostly done. My next focus is facts.
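The wrapper idea can be sketched roughly as follows: instead of copying fields out of the parser's objects, an application-level class holds a reference to the parsed record and reads from it on demand. This is a Python illustration only; `ParsedRecord` and `IndividualView` are hypothetical stand-ins and do not reflect the GeneGenie.Gedcom API or FTAnalyzer's classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedRecord:
    """Stand-in for whatever object a GEDCOM parser hands back (hypothetical)."""
    tag: str
    value: str = ""
    children: List["ParsedRecord"] = field(default_factory=list)


class IndividualView:
    """Wrapper that reads from the parsed record on demand instead of copying."""

    def __init__(self, record: ParsedRecord):
        self._record = record  # keep a reference, not a copy

    def _first(self, tag: str) -> str:
        # Value of the first child with the given tag, or "" if there is none.
        for child in self._record.children:
            if child.tag == tag:
                return child.value
        return ""

    @property
    def name(self) -> str:
        # GEDCOM marks the surname with slashes; drop them for display.
        return self._first("NAME").replace("/", "").strip()

    @property
    def sex(self) -> str:
        return self._first("SEX") or "U"  # "U" = unknown when no SEX line exists
```

Because the wrapper stores only a reference, building one per individual costs almost nothing at load time; the trade-off is that each property access does a small lookup, which could be cached if it turns out to matter.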

ShammyLevva commented 2 years ago

It may also be worth looking at the newly released GEDCOM 7.0 standard (https://gedcom.io/specs/), as that might throw up issues with GEDCOM parsers.
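One cheap safeguard is to read the version from the file header before parsing, so a 7.0 file can be routed to a different code path or rejected with a clear message. The sketch below (Python, for brevity) looks for the `VERS` line under `HEAD` / `GEDC`, which is where both 5.5.1 and 7.0 files declare the GEDCOM version; the function name `gedcom_version` is made up for this example.

```python
def gedcom_version(lines):
    """Return the HEAD.GEDC.VERS value, or None if it is not found."""
    in_head = False
    in_gedc = False
    for raw in lines:
        # Tolerate a UTF-8 byte-order mark on the first line.
        parts = raw.lstrip("\ufeff").strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # left the header without finding a version
            in_head = tag == "HEAD"
        elif level == 1:
            # Only VERS under GEDC counts; SOUR has its own VERS line.
            in_gedc = in_head and tag == "GEDC"
        elif level == 2 and in_gedc and tag == "VERS":
            return value.strip()
    return None
```

A caller could then branch on something like `gedcom_version(lines).startswith("7")`, after checking for `None`.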

ennoborg commented 2 years ago

Any idea how much of that already exists in the wild?

ShammyLevva commented 2 years ago

My guess would be 0% of commercial products, only development software. However, it’s likely that with the focus on including media and linking to URLs and HTML in comments, it will become common enough. It would only take something like FTM 2022 to adopt this as an export format and it’s a game changer in terms of having a backup including media.

Software like Family Historian, whose database is GEDCOM, may be one of the first to make the move, as they already have plugins that do a lot of this sort of thing.

ennoborg commented 2 years ago

Well, guess what, the 1st product is there, see:

https://chronoplexsoftware.com/blog/index.htm

from which I quote

My Family Tree 11.0 now available

Today we are excited to announce the release of My Family Tree 11 available now from the latest updates page.

This major release adds full support for GEDCOM 7.0 which introduces a number of incredibly useful features for users of genealogy applications and services.

My Family Tree is the 1st useful program that shows up in the Microsoft Store when you let Windows search for applications that read GEDCOM files.

ShammyLevva commented 2 years ago

That’s good, they are nice and quick off the mark. I suspect, though, that adoption rates will remain low until some of the majors update, like FTM, RootsMagic, Legacy, Ancestry, Family Historian etc. Although I do wonder if Ancestry will drag their heels, as not being able to download your tree, media and all, from Ancestry is a major factor in their subscriber retention.

ennoborg commented 2 years ago

H'm, yes, it depends. For Gramps, we have our own backup format with and without media, so we don't really need this, yet.

Downloading your tree from Ancestry with media is quite easy when you use RootsMagic, so avoiding GEDCOM 7 won't really help for retention.

ShammyLevva commented 2 years ago

Yes, both FTM and RootsMagic allow sync and download of media. My point was that Ancestry don't currently make it easy to even download a GEDCOM; it's somewhat hidden away in a maze of twisty menus, all alike (bonus points for getting the reference). Since they rely on 3rd parties to add that functionality, and even the text-only download is somewhat hidden, I was thinking it would not be surprising if it took them some years to adopt a new standard.