FamilySearch / GEDCOM


Should we remove implicit order constraints on structures in v8.0? #528

Open dthaler opened 3 months ago

dthaler commented 3 months ago

Currently CHIL and NAME structures have an implied ordering, and other things don't. Implicit ordering causes problems for applications that merge and transform data; those problems would not exist if ordering were explicit.

Norwegian-Sardines commented 3 months ago

First, I think more than just the NAME and CHIL tags are order specific, and GEDCOM says that "like" tags at the same level are order specific.

Some events/attributes like BIRT and DEAT have an "implied" singularity and therefore when more than one is present the first one is considered "primary" or "most likely". Order is important!

On the other hand many (if not all) other event/attribute tags that can have multiple instances like OCCU, EDUC, RESI, CENS, or even BURI can be order specific as well. Order is important!

All tags except for NAME can have a DATE subtag, which should be the first thing to control order. When dates are the same or missing (for example, with two OCCU tags), the first one should be considered of a higher order than the second one. The same is true for the CHIL linkage. Just like multiple OCCU tags, when ordering children in a family for display two things should happen: 1) get the birth date of each child, 2) when a date is not available, consider the order of the CHIL tags.
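
One possible reading of that two-step rule, as a minimal Python sketch. The function name, the use of plain birth years instead of full GEDCOM dates, and the choice to keep an undated child directly after the nearest dated sibling above it are all assumptions made for illustration, not anything the specification prescribes.

```python
from typing import Optional


def order_children(children: list[tuple[str, Optional[int]]]) -> list[str]:
    """Order children by birth year, falling back to CHIL order.

    `children` holds (pointer, birth year or None) in the order the
    CHIL lines appear in the file.
    """
    keyed = []
    last_known = float("-inf")  # undated children before any dated one sort first
    for position, (pointer, year) in enumerate(children):
        if year is not None:
            last_known = year
        # An undated child borrows the year of the nearest dated sibling
        # above it, so it stays directly after that sibling.
        keyed.append((last_known, position, pointer))
    keyed.sort()  # ties on year are broken by original file position
    return [pointer for _, _, pointer in keyed]


# File order is Bob (1885), Jane (1880), Joe (undated, listed after Jane).
print(order_children([("@I3@", 1885), ("@I1@", 1880), ("@I2@", None)]))
```

Other fallbacks are equally defensible (for example, leaving undated children at the end), which is part of why the thread asks whether the order should be made explicit.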

I've run into this issue for some applications where I have entered multiple BIRT or DEAT tags because of conflicting data. The software I use knows that the first instance is "most likely" and uses it when displaying life data. However, some software loads the birth or death dates in the order it finds them, and I always see the last instance (the least likely) date in the life data!
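
The difference between the two behaviours can be shown in a few lines of Python. This is only a sketch: the function names are made up, and events are reduced to (tag, date payload) pairs in file order rather than parsed GEDCOM structures.

```python
def first_wins(events: list[tuple[str, str]]) -> dict[str, str]:
    """Keep the first instance of each tag as the preferred one."""
    preferred: dict[str, str] = {}
    for tag, date in events:
        preferred.setdefault(tag, date)  # later duplicates are ignored
    return preferred


def last_wins(events: list[tuple[str, str]]) -> dict[str, str]:
    """Naive loader: each later instance overwrites the earlier one."""
    return dict(events)


events = [("BIRT", "1 JAN 1880"), ("BIRT", "ABT 1882"), ("DEAT", "1950")]
print(first_wins(events)["BIRT"])  # the "most likely" birth date
print(last_wins(events)["BIRT"])   # the least likely one, as described above
```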

The same could be true for OCCU or other event/attributes when creating a series of sentences describing life events. If the order of the OCCU is not maintained the sentence order could be created incorrectly when dates are not provided!

The order for any event from the GEDCOM must be maintained for all undated facts.

Programs that Merge information must check the dates and order of all data events/attributes. Order is important!

dthaler commented 3 months ago

Of course order is important. The question in this issue is whether the order is implied by the order in which tags appear in the file (which is fragile, for instance when files get merged), or is made explicit in 8.0, such as by a priority value in a substructure, so that it is more robust.

Today in 7.0 one must do:

...
1 CHIL @I1@
1 CHIL @VOID@
1 CHIL @I3@

to say that child I1 was firstborn and I3 was the third child and the second child is unknown.

As opposed to (for example):

1 CHIL @I1@
2 ORDER 1
1 CHIL @I3@
2 ORDER 3

Consider merging the above family with one in another gedcom file that only contains one child in the file (say because it only contains the ancestors of the submitter and not siblings of ancestors) but the child is known to be the second child:

1 CHIL @I100@
2 ORDER 2

Norwegian-Sardines commented 3 months ago

The two GEDCOM snippets acknowledge that each submitter knows one child is missing, so why not create a "placeholder" child (I do this in v5.5.1 now)? Having the option of adding a @VOID@ pointer makes things easier!

The bigger issue with merging is the following case (1):

1 CHIL @I1@  <Jane
1 CHIL @I2@  <Joe
1 CHIL @I3@  <Bob
1 CHIL @I1@ <Jane
1 CHIL @I2@ <Ralph

Where Ralph could be anywhere in the list and no amount of ordering will help because neither submitter knows about the missing children.

So you could get this case (2):

1 CHIL @I1@  <Jane
2 ORDER 1
1 CHIL @I2@  <Joe
2 ORDER 2
1 CHIL @I3@  <Bob
2 ORDER 3
1 CHIL @I1@ <Jane
2 ORDER 1
1 CHIL @I2@ <Ralph
2 ORDER 2

In your example, by contrast, the submitters at least acknowledge the missing child; they just don't think they need to add them into the mix, either because they don't have the info or because they just don't care!

More intervention is needed in case (2). Maybe Ralph is actually Joe's middle name and he went by that in some circles. Or Ralph died at birth and he was only known in the church book before the family moved! Or Ralph was last born!
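
To make the contrast concrete, here is a rough Python sketch of a merge keyed on the proposed ORDER value. ORDER is only a suggestion from this thread, `merge_by_order` is a made-up name, and the sketch assumes the same person already has the same identifier in both files, which a real merge would first have to establish.

```python
def merge_by_order(
    a: dict[int, str], b: dict[int, str]
) -> tuple[list[str], list[tuple[int, str, str]]]:
    """Merge two {ORDER value: child} maps and report collisions."""
    merged = dict(a)
    conflicts = []
    for order, child in b.items():
        if order in merged and merged[order] != child:
            # Two different children claim the same position: a person
            # has to decide, the program cannot.
            conflicts.append((order, merged[order], child))
        else:
            merged[order] = child
    ordered = [merged[key] for key in sorted(merged)]
    return ordered, conflicts


# Both files know about the gap, so the merge is clean.
print(merge_by_order({1: "@I1@", 3: "@I3@"}, {2: "@I100@"}))

# Case (2): neither file knows about the other's children.
print(merge_by_order({1: "Jane", 2: "Joe", 3: "Bob"}, {1: "Jane", 2: "Ralph"}))
```

The first call yields a complete ordering with no conflicts; the second leaves Ralph out of the ordered list and reports him as colliding with Joe at position 2.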

albertemmerich commented 3 months ago

The examples show that explicit ordering will not help when merging data. But explicit ordering will help to keep order when other criteria do not work.

1 CHIL @I1@  <Jane
2 ORDER 1
1 CHIL @I2@  <Joe
2 ORDER 2
1 CHIL @I3@  <Bob
2 ORDER 3

will help if we know that Jane was born in 1880 and Bob in 1885, but we do not know exactly when Joe was born, only that he was born after Jane and before Bob. If the application tries to order these children without having the explicit ORDER tags, it is likely the order will be rearranged.

However we have the possibility to use SDATE to ensure the order we would like to have: Joe gets a birth date

2 SDATE 1882

and the application has the data it needs to correctly order the children. Use of SDATE helps in case of merging data, too. The merged order will be exactly as correct as the SDATEs are.

If Jane and Bob are added to the family when it is not known that other children may have been born to the parents, too, explicit ordering would give them ORDER 1 and ORDER 2. Now we merge with data which have only Joe: this child carries ORDER 1. There is no way for the merging process to find a correct order for all three of them using ORDER! I see this situation more often than the situation where the data say: "I have only one child, but it is the second to its parents".

That said, I prefer ordering criteria which help when data are merged. SDATE helps a lot!
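
A small Python sketch of why this survives a merge. Dates are simplified to plain years and the dictionary layout is invented for illustration; real DATE and SDATE payloads would need a proper date parser.

```python
def sort_key(child: dict) -> tuple[bool, int]:
    """Prefer SDATE, fall back to DATE; children with neither go last."""
    year = child.get("sdate", child.get("date"))
    return (year is None, year or 0)


def merge_and_sort(file_a: list[dict], file_b: list[dict]) -> list[str]:
    """Concatenate both child lists and order them by sort key."""
    combined = file_a + file_b
    return [child["name"] for child in sorted(combined, key=sort_key)]


file_a = [{"name": "Jane", "date": 1880}, {"name": "Bob", "date": 1885}]
file_b = [{"name": "Joe", "sdate": 1882}]  # only "between Jane and Bob" is known
print(merge_and_sort(file_a, file_b))
```

Because the key travels with each child rather than depending on the other children in the same file, the result does not depend on which file is loaded first.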

dthaler commented 3 months ago

For children, I do like the use of SDATE for explicit ordering, but that goes under a specific event. If a source says "Bob was the third child in his family", we should have a way to record that in GEDCOM. My ORDER example above might be a way to do that. Or if the source says "Bob was the second son" it might be nice to be able to record that too, to assist in ordering. So if the meaning were that the record indicated it was the Nth child, then Bob and Ralph may be the same person (e.g., full name might have been "Ralph Robert /Smith/").

For NAME, BIRT, and DEAT it seems that the notion of "primary" is important, though the rest of the ordering is much less important. Just brainstorming... perhaps just a "PRIM Y" substructure, though that would be a bit hard to deal with when merging two gedcom files with different "PRIM Y" superstructures. Another possibility would be to have something like "PRIM" at the same level with cardinality "{0:1}" so it forces a merge to be correct.

1 NAME Ralph /Smith/
1 NAME Bob /Smith/
1 NAME Ralph Robert /Smith/
1 PRIMNAME Ralph Robert /Smith/   <-- primary in this GEDCOM file

(substructures omitted just to make the main point more obvious)

Norwegian-Sardines commented 3 months ago

A lot of the talk recently asks the question: "What happens when a merge occurs?". I don't have a lot of experience with "full on, unassisted merging"; personally I would never allow a program to merge a GEDCOM into my GEDCOM (i.e. a snippet GEDCOM into a master GEDCOM) without my intervention on all additions and removals. The problems as outlined by Dave (two PRIMARY anything: birth, death, name) should not be resolved by any program, but by the owner of the master GEDCOM. It is his/her database that is being changed by the snippet GEDCOM. Not only could the PRIMARY name be incorrect, but any number of other bits of data could be wrong, or not follow the master GEDCOM's well defined data entry standard.

IMHO if we think too hard about how a full on unassisted merge will cause issues, we could reject good concepts and design because we are afraid the outcome will be misinterpreted!

albertemmerich commented 3 months ago

As I am the admin of big databases of a genealogical association, and this association uses my application for team work, I very often see big GEDCOM files (> 100 MB) with tens of thousands of individuals being loaded into an even bigger database. Any structure in GEDCOM which needs more manual support when merging records that describe duplicate individuals will result in many hours of work, and will normally result in the option to skip those data at import. As in your case, the decision to merge is made by a team member; however, the application offers a solution ready to merge, and so far only a few modifications are necessary before releasing the merge. So merging about 100 duplicates per hour will work.

So I think about the issues assisted merging will cause if we get more data which cannot be merged by a program but need manual help.

One example in the existing standard is the number of children, NCHR. There are applications in the wild which create this data by counting the children in the records pointed to by FAMS. But the application cannot see whether there is a source giving this number of children or whether an application has added it without any source. As NCHR is {1:1}, I cannot show different versions found in different sources. So at import NCHR is ignored if there is no source citation in its substructure. If it comes with a source citation, and there already exists an NCHR with another payload and again a source citation, the user has to decide and manually enter his solution. As in most cases there is no source citation under NCHR, this will happen very seldom.

That said, ORDER would be one of the tags I would ignore at import when it comes without its own source citation. SDATE works much better, as it can be ignored when the other record of the duplicate comes with a DATE value.
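
Such an import rule could look roughly like the Python sketch below. The nested-dictionary representation, the names `REQUIRES_CITATION`, `has_citation` and `filter_for_import`, and the sample payloads are all hypothetical; ORDER is only the tag proposed in this thread, and the child-count tag is written as NCHI, its spelling in the specification.

```python
# Tags that are only trusted when they carry their own source citation.
# ORDER is the hypothetical tag from this discussion.
REQUIRES_CITATION = {"NCHI", "ORDER"}


def has_citation(structure: dict) -> bool:
    """True if the structure has a SOUR substructure directly below it."""
    return any(sub["tag"] == "SOUR" for sub in structure.get("children", []))


def filter_for_import(structures: list[dict]) -> list[dict]:
    """Drop citation-dependent structures that arrive without a citation."""
    return [
        s for s in structures
        if s["tag"] not in REQUIRES_CITATION or has_citation(s)
    ]


incoming = [
    {"tag": "NCHI", "payload": "3"},  # probably just counted, so skipped
    {"tag": "NCHI", "payload": "4",
     "children": [{"tag": "SOUR", "payload": "@S1@"}]},
    {"tag": "ORDER", "payload": "2"},  # unsourced, so skipped
    {"tag": "OCCU", "payload": "Farmer"},  # not subject to the rule
]
print([(s["tag"], s["payload"]) for s in filter_for_import(incoming)])
```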

Norwegian-Sardines commented 3 months ago

First, you should be using NCHI not NCHR, probably a typo!

Second, I agree that counting the number of children connected to a family or individual is not a very good use of this tag! If I was to send this tag, it could only be created with my knowing the data is true and thus have a citation. I treat it like any other “fact” not as a calculated value!

A point of interest: your merge program must be very robust, and your user base must prescreen all data collisions before merging. I have seen too many “unattended merges” with lesser software and no user screening create a mess of unreal dates, bad name recognition, and generally unusable data. Most GEDCOMs I’ve seen have either no citations or unusable ones at best (i.e. not enough artifact source information to find the assertion again). So I suspect everyone in your group does a better job of citation building and you have a review process in place as well!

jkr-wrk commented 2 months ago

What do we expect from the ORDER tag? Could it overwrite the DATE order? Or do we only use it when dates are the same or not given? Do we use it to order twins?

If I have 3 children:

Born 1-1-2022
Born unknown but middle child
Born 1-1-2024

I could write Born >1-1-2022 and <1-1-2024? That way I know the order will stay correct when merging. A note mentioning that the dates are based on the fact it is a middle child will help.

When there are no dates present, this is a bit harder. In that case some ordering would help. But probably still better to have a way to tell who was born after who.

So could we think of some system to tell that events happened before or after other events, if we don't know the dates of these events?
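
One way such a system could work is to record only "A happened before B" pairs and derive an order from them with a topological sort. The Python sketch below uses the standard library's `graphlib` (Python 3.9+); the function name and the pair representation are assumptions, and nothing like this exists in GEDCOM today.

```python
from graphlib import CycleError, TopologicalSorter


def order_events(happened_before: list[tuple[str, str]]) -> list[str]:
    """Derive an order from (earlier, later) pairs without any dates."""
    sorter = TopologicalSorter()
    for earlier, later in happened_before:
        sorter.add(later, earlier)  # `later` must come after `earlier`
    return list(sorter.static_order())


# "Jane was born before Joe" and "Joe was born before Bob".
print(order_events([("Jane", "Joe"), ("Joe", "Bob")]))

# Contradictory statements, e.g. after a bad merge, are detected.
try:
    order_events([("Jane", "Joe"), ("Joe", "Jane")])
except CycleError:
    print("contradictory ordering")
```

When the pairs do not fully determine the order (for example, nothing is said about a fourth child), any order consistent with the known pairs is returned, so an application would still need a tie-break such as file order.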