elliotchance / gedcom

👪 A Go library and CLI tools for encoding, decoding, traversing, merging, comparing, querying and publishing GEDCOM files.
MIT License

Diff multiple GEDCOMs #323

Open Sternbach-Software opened 1 year ago

Sternbach-Software commented 1 year ago

Over time, I have accumulated multiple trees across multiple platforms, and they have regrettably become out of sync. I want to consolidate them all into one tree, but that means diffing multiple trees, something gedcomdiff doesn't currently support.

It would be fantastic if I could diff multiple GEDCOMs and display the result in a single HTML page. The coloring would indicate whether a field is missing from one file or added/missing in the rest. It would be cool to see, for a given field, which files are missing it and which have it. Is this feasible?

Sample code (not a Go expert, but I think this works):

func runDiffCommand() {
    ...
    var gedcoms = []gedcom.IndividualNodes{leftIndividuals, rightIndividuals}
    var multiple map[*gedcom.IndividualNodes]map[*gedcom.IndividualNodes]gedcom.IndividualComparisons
    //var comparisons gedcom.IndividualComparisons
    go func() {
        //comparisons = leftIndividuals.Compare(rightIndividuals, compareOptions)
        multiple = compareMultiple(gedcoms, compareOptions)
    }()
    ...
    page := html.NewDiffPageMultiple(multiple, filterFlags, optionGoogleAnalyticsID,
        optionShow, optionSort, diffProgress, compareOptions, html.LivingVisibilityShow)

    go func() {
        // Write the multi-way page in place of the original two-way page.
        _, err = page.WriteHTMLTo(out)
        if err != nil {
            log.Fatal(err)
        }

        close(diffProgress)
    }()
}

func compareMultiple(gedcoms []gedcom.IndividualNodes, compareOptions *gedcom.IndividualNodesCompareOptions) map[*gedcom.IndividualNodes]map[*gedcom.IndividualNodes]gedcom.IndividualComparisons {
    // comparisons[x][y] is the diff of x with respect to y ("x.Compare(y)").
    comparisons := make(map[*gedcom.IndividualNodes]map[*gedcom.IndividualNodes]gedcom.IndividualComparisons)
    for i := range gedcoms {
        // Key by the address of the slice element, not of a loop variable,
        // so that every GEDCOM gets a distinct, stable pointer.
        left := &gedcoms[i]
        rightsToDiffs := make(map[*gedcom.IndividualNodes]gedcom.IndividualComparisons)
        for j := range gedcoms {
            if i == j {
                continue // don't compare a GEDCOM with itself
            }
            rightsToDiffs[&gedcoms[j]] = gedcoms[i].Compare(gedcoms[j], compareOptions)
        }
        comparisons[left] = rightsToDiffs
    }
    // Debug output: print every ordered pair and its diff.
    for left, rightsToDiffs := range comparisons {
        for right, diff := range rightsToDiffs {
            s := fmt.Sprint("Left (", left, ") compared to right (", right, "):", diff)
            fmt.Println(s)
        }
    }
    return comparisons
}
Sternbach-Software commented 1 year ago

I'm wondering if the best you can do is remove definite duplicate fields or individuals.

elliotchance commented 1 year ago

There are two things to explore here:

  1. Cartesian match. This would allow any number of GEDCOMs to be provided (rather than just left and right). However, this would be extremely expensive to process because each additional file multiplies the number of comparisons by the number of individuals in that file. I don't think this is practical when a file contains more than even a few hundred individuals.
  2. Compare a primary against multiple others. I think this is the case you're referring to? That is, where you specify a left (primary) GEDCOM, but may specify multiple right GEDCOMs. This would certainly be more expensive than a two-file diff, but the cost grows linearly with the number of right files rather than exponentially.

Going with option 2, you may be able to test how that might work with something like:

-left-gedcom primary.gedcom -right-gedcom right1.gedcom -right-gedcom right2.gedcom

There might need to be some special options in this case, where the primary is always included and it just shows the closest match (if any) from each respective right GEDCOM.
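To make option 2 concrete, here is a minimal sketch of "primary against several rights, keeping only the closest match per right file". It does not use the gedcom library: `Person`, `similarity`, `closestMatches` and the `minScore` threshold are all hypothetical stand-ins invented for illustration, and the scoring rule is a toy.

```go
package main

import "fmt"

// Person is a hypothetical stand-in for an individual record.
type Person struct {
	Name  string
	Birth int // birth year, 0 if unknown
}

// similarity is a toy score (an assumption for this sketch): 0.5 for an
// identical name plus 0.5 for an identical, known birth year.
func similarity(a, b Person) float64 {
	score := 0.0
	if a.Name == b.Name {
		score += 0.5
	}
	if a.Birth != 0 && a.Birth == b.Birth {
		score += 0.5
	}
	return score
}

// closestMatches returns result[i][r]: the best match for primary[i] in
// rights[r], or nil when no candidate scores at least minScore.
func closestMatches(primary []Person, rights [][]Person, minScore float64) [][]*Person {
	result := make([][]*Person, len(primary))
	for i, p := range primary {
		result[i] = make([]*Person, len(rights))
		for r, tree := range rights {
			bestScore := -1.0
			for j := range tree {
				s := similarity(p, tree[j])
				if s >= minScore && s > bestScore {
					bestScore = s
					result[i][r] = &tree[j]
				}
			}
		}
	}
	return result
}

func main() {
	primary := []Person{{Name: "Ann", Birth: 1900}, {Name: "Bob", Birth: 1920}}
	rights := [][]Person{
		{{Name: "Ann", Birth: 1850}, {Name: "Ann", Birth: 1900}},
		{{Name: "Carl", Birth: 1930}},
	}
	for i, row := range closestMatches(primary, rights, 0.5) {
		for r, match := range row {
			if match == nil {
				fmt.Printf("%s: no match in right file %d\n", primary[i].Name, r+1)
				continue
			}
			fmt.Printf("%s: closest in right file %d is %+v\n", primary[i].Name, r+1, *match)
		}
	}
}
```

The primary is scanned once per right file, so adding another right file adds one more pairwise pass rather than multiplying the work.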

Sternbach-Software commented 1 year ago

Is option 1 not feasible even with goroutines? It may take some time, but the user should know what they are getting into. I have some good hardware to spare and could let it run for a while (even for a few days honestly, but that is probably just me).

Sternbach-Software commented 1 year ago

In my case, I have many trees from Geni.com and Ancestry, and I don't know which one is the most comprehensive. Some have a lot of information (not necessarily regarding overlapping parts of the tree) that the others don't, so I don't have a primary. By comparing each one, I could possibly identify (or create) a primary, but it would be work.

Sternbach-Software commented 1 year ago

Unless, similar to the tune command, there could be a command to determine which tree makes the most sense as the primary, and then diff the others against it. Maybe the one which is missing the least? Though that may not work well: a tree could hold a lot of information that is incorrect, while the revised, more accurate tree that should be the primary holds less.

Sternbach-Software commented 1 year ago

Unless, similar to the tune command, there could be a command to determine which tree makes the most sense as the primary, and then diff the others against it. Or, merge all of the "right" trees and compare the result to the left.

elliotchance commented 1 year ago

The GEDCOM comparison can already make full use of multiple cores to speed it up, see -jobs:

https://github.com/elliotchance/gedcom/blob/master/cmd/gedcom/diff.go#L84-L86

And, perhaps even better, if it knows two individuals are the same (by an identifier) it can avoid the expensive comparison altogether.
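The identifier shortcut can be illustrated with a small sketch. This is not the library's implementation: `Person`, its `ID` field and `matchByID` are hypothetical names, and the sketch assumes identifiers are unique within each file. Individuals that share a non-empty identifier are paired directly, and only the leftovers go on to the expensive similarity comparison.

```go
package main

import "fmt"

// Person is a hypothetical stand-in for an individual record.
type Person struct {
	ID   string // stable identifier shared across files; empty if none
	Name string
}

// matchByID pairs individuals that share a non-empty ID. It returns the
// pairs plus the individuals on each side that still need a full comparison.
func matchByID(left, right []Person) (pairs [][2]Person, restLeft, restRight []Person) {
	byID := make(map[string]Person)
	for _, r := range right {
		if r.ID != "" {
			byID[r.ID] = r
		}
	}
	matched := make(map[string]bool)
	for _, l := range left {
		// byID never contains the empty key, so records without an ID fall through.
		if r, ok := byID[l.ID]; ok {
			pairs = append(pairs, [2]Person{l, r})
			matched[l.ID] = true
		} else {
			restLeft = append(restLeft, l)
		}
	}
	for _, r := range right {
		if r.ID == "" || !matched[r.ID] {
			restRight = append(restRight, r)
		}
	}
	return pairs, restLeft, restRight
}

func main() {
	left := []Person{{ID: "U1", Name: "Ann"}, {Name: "Bob"}, {ID: "U3", Name: "Cy"}}
	right := []Person{{ID: "U1", Name: "Ann Smith"}, {Name: "Robert"}, {ID: "U9", Name: "Dee"}}

	pairs, restLeft, restRight := matchByID(left, right)
	fmt.Println("paired by identifier:", len(pairs))
	fmt.Println("expensive comparisons remaining:", len(restLeft)*len(restRight),
		"of", len(left)*len(right))
}
```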

However, consider the numbers: Comparing two small trees of 1000 individuals takes 1 million comparisons (that's fine), but three trees of 1000 individuals requires 1 billion comparisons to be exhaustive (not fine). Trying to compare many trees (even if they are quite small) will exponentially increase the processing time required.
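The arithmetic can be checked with a few lines of Go. This sketch assumes every tree has the same number of individuals `n`, that "exhaustive" means considering every combination across all `k` trees (n^k), and that the primary approach runs one pairwise diff per right file ((k-1) × n²). The function names are made up for illustration.

```go
package main

import "fmt"

// exhaustiveComparisons returns n^k: every combination of one individual
// from each of k trees that have n individuals each.
func exhaustiveComparisons(n, k uint64) uint64 {
	total := uint64(1)
	for i := uint64(0); i < k; i++ {
		total *= n
	}
	return total
}

// primaryComparisons returns (k-1) * n^2: one pairwise diff between the
// primary tree and each of the other k-1 trees.
func primaryComparisons(n, k uint64) uint64 {
	return (k - 1) * n * n
}

func main() {
	const n = 1000
	for k := uint64(2); k <= 4; k++ {
		fmt.Printf("%d trees: exhaustive=%d primary=%d\n",
			k, exhaustiveComparisons(n, k), primaryComparisons(n, k))
	}
	// Output:
	// 2 trees: exhaustive=1000000 primary=1000000
	// 3 trees: exhaustive=1000000000 primary=2000000
	// 4 trees: exhaustive=1000000000000 primary=3000000
}
```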

Depending on what your goal is, it probably makes more sense to just choose a primary file and have everything work against that.