francoisforster / gedcom-cleanup

A simple Kotlin library that compares GEDCOM files, cleans them and performs limited validation
7 stars 0 forks source link

Diff multiple gedcoms and improve comparison algorithm? #5

Open Sternbach-Software opened 1 year ago

Sternbach-Software commented 1 year ago

I am writing the only Geni.com Android app (none are available as of yet), and am working on a feature to compare multiple family trees, based on this phenomenal library written to compare two trees - you may want to check out his amazing comparison algorithm. I started by making a map of diffs between each tree, so the diff of tree x with respect to y can be gotten through diffs[x][y].

I don't even have clear what the edge cases are in a k-diff (based on k-way merge); with a 2-way diff, a given element is either in left, right, or both (in which case a similarity value is given to represent how similar they are).

And how do I represent a k-diff graphically? Some of my thoughts now are along the lines of: find the "added" elements that are only in some lists, and make those green; removed elements that are missing from some lists, and make those red; changed elements, and make those orange. But how do I determine those things? All k elements could be different, and their similarity values could all be different relative to each other.

You may want to share your input on the Kotlin Slack thread I made about this

francoisforster commented 1 year ago

Thanks for your interest. I'm not sure I want to take on this significant change. It's quite more complicated to diff multiple trees rather than 2 at a time.

Also, the use-case I wrote the diff functionality for was a bit different than a generic diffing capability. I tried using Elliot Chance's diffing library for my trees (10k+ individuals) and couldn't get it to perform in any reasonable time (I killed it after 1hr). I just wanted to find the additions I had made to one tree that had been forked from another at some point, along with minor changes on both trees. This is why I start with a known identical individual in both trees being compared, and navigate from there, rather than trying to generically compare all nodes. That said, I'm sure the logic I use to check if 2 individuals in 2 trees are equivalent can be improved.

Sternbach-Software commented 1 year ago

Understandable.