elliotchance / gedcom

👪 A Go library and CLI tools for encoding, decoding, traversing, merging, comparing, querying and publishing GEDCOM files.
MIT License

Comparing large files from different origins is impossible #319

Open ennoborg opened 2 years ago

ennoborg commented 2 years ago

When I compare a GEDCOM created by Gramps with another one created by RootsMagic, the program runs out of resources and crashes, consuming more than 6 GB of my 8 GB RAM.

The GEDCOM from Gramps has close to 12,000 persons, the one from RootsMagic about 7,000. About 6,000 of these are the same or similar.

elliotchance commented 2 years ago

These aren't outrageously large files. I think 20k persons have been compared before. Can you provide the command you used and how large (in file size) is each file?

ennoborg commented 2 years ago

Here's the command, on the latest Linux Mint:

./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html

File sizes:

-rwxr-xr-x 1 enno enno 2734017 nov 27 15:36 fs.ged
-rwxr-xr-x 1 enno enno 7424781 nov 23 20:35 Untitled_6.ged

And although most of the persons are the same, their data is not, because Untitled_6 has IDs generated by Gramps, and fs has IDs generated by RootsMagic. Most of the persons in fs.ged also have place names that were normalized by FamilySearch. In other words, although most of the persons are the same individuals, most have small differences, primarily because of the normalized place names, or because they have been edited by other FS members.

When I compare two GEDCOMs generated by Gramps, the program runs fine, but in that case the HTML is too big to load, so I can't use the result either.

I'm a bit confused, because you wrote that you tested it with Ancestry, which also changes all IDs.

elliotchance commented 2 years ago

Interesting. I wonder if it's running out of memory because of the sheer size of the HTML generated, and not the diff itself. Try running with the -progress flag to see how far it gets before crashing/consuming all of your RAM:

./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html -progress

You can also try adding -only-vitals as an option, which should drastically reduce the amount of HTML produced: https://github.com/elliotchance/gedcom/blob/master/filter_flags.go#L53-L55

elliotchance commented 2 years ago

... and also -hide-equal to reduce the output size if a lot of information is similar between the files.

ennoborg commented 2 years ago

I tried them all, but none really helped. And on this system, I did not get any out-of-memory errors; the program just slowed to a crawl, so I had to abort it. Here are my results:

enno@desktop-mate:~$ cd bin/
enno@desktop-mate:~/bin$ ./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html -progress
Comparing Documents 3h30m51s 60196100 / 81458073 [==============>----] 1h14m28s
Comparing Documents 3h30m52s 60196200 / 81458073 [==============>----] 3h30m52s
 0 / ? [----------------------------------------------------------------------=]
2021/12/02 18:29:53 aborted
enno@desktop-mate:~/bin$ ./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html -progress -no-places
Comparing Documents 1m16s 2275900 / 81458073 [>------------------------] 44m04s
^C2021/12/02 18:31:41 aborted
enno@desktop-mate:~/bin$ ./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html -progress -no-places -only-vitals
Comparing Documents 30m57s 35076000 / 81458073 [=========>-------------] 40m56s
Comparing Documents 30m59s 35076400 / 81458073 [=========>-------------] 30m59s
 0 / ? [----------------------------------------------------------------------=]
2021/12/02 19:02:52 aborted
enno@desktop-mate:~/bin$ ./gedcom diff -left-gedcom=../MEGAsync/Untitled_6.ged -right-gedcom=../MEGAsync/fs.ged -output=diff.html -progress -no-places -only-vitals -hide-equal
Comparing Documents 15m47s 13680900 / 81458073 [===>-----------------] 1h18m13s
Comparing Documents 15m47s 13681000 / 81458073 [===>-------------------] 15m47s
Comparing Individuals 0s 0 / ? [---------------------------------------------=]
2021/12/02 20:17:40 aborted
enno@desktop-mate:~/bin$

Please note that, as I wrote earlier, almost none of the persons are equal. More than half are the same individuals, but almost every person downloaded from FamilySearch has standardized place names, and none have the same attributes, simply because FamilySearch doesn't store that many. This is quite different from uploading your own GEDCOM to Ancestry and making some modifications online: in that case, downloaded persons have different IDs, but most of their attributes are the same.

In other words, I think that to serve my purpose, the program's matching needs to be much fuzzier than it currently is.

elliotchance commented 2 years ago

The default behavior is to compare every individual with every other individual (12k x 7k = 84m comparisons). Comparisons are made by taking into account the name and the birth and death dates, producing a similarity number (0.0 - 1.0). Individuals with a similarity higher than the threshold (the default is 0.733, but you can override it with options) are considered "equal". If there are many pairs of individuals that match, the highest similarity is chosen.
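For illustration, here is a minimal Go sketch of that kind of scoring scheme; the struct fields and weights are assumptions for demonstration, not the library's actual implementation:

package main

import "fmt"

// Individual holds only the vitals used for scoring. These fields and the
// weights below are illustrative assumptions, not the library's real code.
type Individual struct {
	Name      string
	BirthYear int
	DeathYear int
}

// similarity returns a score between 0.0 and 1.0.
func similarity(a, b Individual) float64 {
	score := 0.0
	if a.Name == b.Name {
		score += 0.5 // assumed weight for a matching name
	}
	if a.BirthYear == b.BirthYear {
		score += 0.25 // assumed weight for a matching birth date
	}
	if a.DeathYear == b.DeathYear {
		score += 0.25 // assumed weight for a matching death date
	}
	return score
}

const threshold = 0.733 // the default mentioned above

func main() {
	a := Individual{"Jane Doe", 1850, 1920}
	b := Individual{"Jane Doe", 1850, 1921}
	s := similarity(a, b)
	fmt.Printf("similarity=%.2f considered equal=%v\n", s, s >= threshold)
}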

The tool will try to use common IDs to reduce the number of comparisons. This is useful when both sides come from the same source, or at least when IDs are maintained. However, if these files share no IDs, it will always fall back to comparing all individuals. You can verify this is the case by checking that the total number of comparisons (84m) does not change; if ID matches are found, this number drops throughout the process. A sketch of that shortcut appears below.
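The sketch below shows the idea with hypothetical types; only individuals whose xref IDs miss on the other side fall through to the expensive all-pairs pass:

package main

import "fmt"

// person stands in for a parsed individual; the real library works on
// GEDCOM node types, so these names are hypothetical.
type person struct{ Name string }

// splitByID counts individuals whose xref IDs (e.g. @I123@) appear on both
// sides and returns the leftovers that still need the all-pairs comparison.
func splitByID(left, right map[string]person) (matched int, restLeft, restRight []person) {
	for id, l := range left {
		if _, ok := right[id]; ok {
			matched++
		} else {
			restLeft = append(restLeft, l)
		}
	}
	for id, r := range right {
		if _, ok := left[id]; !ok {
			restRight = append(restRight, r)
		}
	}
	return
}

func main() {
	left := map[string]person{"@I1@": {"Jane"}, "@I2@": {"John"}}
	right := map[string]person{"@I8@": {"Mary"}, "@I9@": {"Jane"}}
	matched, rl, rr := splitByID(left, right)
	// With no shared IDs, matched is 0 and the fallback costs
	// len(rl) * len(rr) comparisons (12,000 x 7,000 = 84 million above).
	fmt.Println(matched, len(rl)*len(rr))
}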

I don't think this is a problem with how the tool works, but rather just some memory leaking. You could try rebuilding from source and adding explicit runtime.GC() calls in places to make the garbage collector more aggressive, at the expense of speed. I'm not sure that would really make a difference, though. Or, depending on how involved you want to get, you could modify the code to have a smaller memory footprint (I haven't done any real memory optimization on this project).
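A minimal sketch of that kind of change, using the standard library's runtime.GC() and runtime/debug.SetGCPercent (the loop and allocation are stand-ins for the actual comparison work, not code from this project):

package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	// Make the collector more aggressive up front (the default is 100);
	// this trades CPU time for a lower peak heap.
	debug.SetGCPercent(20)

	for i := 0; i < 1_000_000; i++ {
		_ = make([]byte, 1024) // stand-in for per-comparison allocations
		if i%100_000 == 0 {
			runtime.GC() // force a full collection periodically
		}
	}
}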

As you say, you won't get any out-of-memory errors because it will just start consuming swap space, which is ultra slow, but if you're willing to let it run overnight it will eventually finish...

ennoborg commented 2 years ago

I am now running a session comparing my current tree from Gramps with the one that I have on Ancestry, which has the same UIDs, and now I see more progress. It starts slowly, with the expected run time shown as multiple days, but the actual run time is 8 minutes. This only works when I modify the GEDCOM produced by Ancestry, changing each 1 UID line to 1 _UID.
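For anyone wanting to script that edit, a minimal standalone Go filter along these lines would do it (this is not part of the gedcom tooling, and it assumes the tag always appears at level 1 as "1 UID "):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Rewrites lines beginning with "1 UID" to "1 _UID", reading stdin and
// writing stdout, e.g.: ./uidfix < ancestry.ged > fixed.ged
func main() {
	sc := bufio.NewScanner(os.Stdin)
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "1 UID ") {
			line = "1 _UID" + strings.TrimPrefix(line, "1 UID")
		}
		fmt.Fprintln(w, line)
	}
}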