cgeltly / treechecker

Error checking for genealogical data files
GNU General Public License v3.0

Events appearing multiple times in `events` table #17

Open cgeltly opened 9 years ago

cgeltly commented 9 years ago

With some imported GEDCOM files, events are appearing multiple times in the events table. For example, there may be more than one birth event for the same individual.

This is occurring because events may be listed more than once in some GEDCOM files, e.g. note how BIRT appears four times in the following GEDCOM code:

This example is from MyHeritage software (http://www.myheritage.com/):

    0 @I67@ INDI
    1 RIN MH:I69
    1 _UID 53A5778C3E6765FD7240A64747F751AA
    1 _UPD 21 JUL 2014 10:03:03 GMT+1
    1 NAME John /Doe/
    2 GIVN John
    2 SURN Doe
    1 SEX M
    1 BIRT
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF1481
    2 DATE 1832
    2 PLAC Nibbixwoud
    1 DEAT
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF1483
    1 BIRT
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF500091
    2 DATE ABT 1832
    2 PLAC Nibbixwoud, Noord Holland
    1 BIRT
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF500101
    2 DATE ABT 1832
    2 PLAC Nibbixwoud, Noord Holland
    1 BIRT
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF500103
    2 DATE ABT 1832
    2 PLAC Nibbixwoud, Noord Holland
    1 OCCU dienstknecht
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF1482
    1 OCCU Dienstknecht
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF500092
    1 OCCU Dienstknecht
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF500102
    1 OCCU Dienstknecht
    2 _UID 65431635415615165165116516516516
    2 RIN MH:IF500104

Importantly, the details of the birth event differ between occurrences. The first BIRT gives the date as '1832' and the place as 'Nibbixwoud', whereas the last three give 'ABT 1832' and 'Nibbixwoud, Noord Holland'.

This example is from Haza-21 software (http://www.hazadata.com/):

    0 @I6517@ INDI
    1 NAME Jane /Doe/
    1 SEX F
    1 FAMS @F334@
    1 FAMC @F334@
    1 BIRT
    2 RFN 90800
    2 DATE 18 FEB 1826
    1 DEAT
    2 RFN 90801
    2 DATE 21 MAR 1915
    2 PLAC Delft, ZH
    1 BURI
    2 RFN 90802
    2 DATE 24 MAR 1915
    2 PLAC Delft, ZH
    1 BIRT
    2 RFN 96062
    2 PLAC Delft, ZH

In this example, the birth date is under the first occurrence of the BIRT tag, whilst the birth place is under the second occurrence.

Duplicate events in the events table are not in themselves a problem, but they cause problems downstream. In the statistical tables, for example, individuals may appear multiple times because they have more than one birth event. Duplicates may also lead to incorrect details being reported to the user, e.g. 1832 being reported instead of ABT 1832 (see the MyHeritage example above).

A possible fix is to always select the last occurrence of an event from the events table, and to do this separately for each field in question, i.e. date, place, and coordinates (lati and long, treated as a single pair).

The most satisfactory fix may be to do this during parsing and merge the duplicate records into one. The open question is whether this will always result in the correct data being retained. This will need to be tested.
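To make the per-field "last occurrence wins" idea concrete, here is a minimal Python sketch, not the treechecker parser. The function names `parse_events` and `merge_last_wins` are hypothetical, and the sketch only handles level-1 structures with level-2 sub-tags, which is enough for the two examples above.

```python
def parse_events(gedcom_text):
    """Return a list of (tag, {subtag: value}) for the level-1 structures of one record."""
    events = []
    for line in gedcom_text.strip().splitlines():
        level, _, rest = line.strip().partition(" ")
        tag, _, value = rest.partition(" ")
        if level == "1":
            events.append((tag, {}))
        elif level == "2" and events:
            # Attach the sub-tag to the most recent level-1 structure.
            events[-1][1][tag] = value
    return events


def merge_last_wins(events, tag, fields=("DATE", "PLAC")):
    """Merge all occurrences of `tag`; for each field, the last occurrence that has it wins."""
    merged = {}
    for event_tag, details in events:
        if event_tag != tag:
            continue
        for field in fields:
            if field in details:
                merged[field] = details[field]
    return merged


# Shortened version of the Haza-21 record above.
haza_record = """0 @I6517@ INDI
1 NAME Jane /Doe/
1 BIRT
2 RFN 90800
2 DATE 18 FEB 1826
1 DEAT
2 RFN 90801
2 DATE 21 MAR 1915
1 BIRT
2 RFN 96062
2 PLAC Delft, ZH"""

haza_birth = merge_last_wins(parse_events(haza_record), "BIRT")
```

For the Haza-21 record this combines the date from the first BIRT with the place from the second. For the MyHeritage record it would keep 'ABT 1832' and 'Nibbixwoud, Noord Holland' and silently drop '1832', which is exactly the case that still needs testing.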

mhkuu commented 9 years ago

Having taken a look at some files I think we should really consider this a user/program error, and report this as a parsing error. Take for example 5047.ged (anonymized):

    1 NAME John /Doe/
    1 SEX M
    1 BIRT
    2 DATE 12 SEP 1987
    2 PLAC Den Bosch
    2 SOUR @ S1 @
    1 BIRT
    2 DATE ABT 1988
    2 PLAC 's Hertogenbosch

There's no way of telling which of the dates is right; one might presume that the first is right (more complete and a source), but then again the second might have been added later on to correct the first (hence being listed later in the file).

I would advise taking either the first or the last occurrence and adding a parse error for each subsequent occurrence of the same event. I would also advise adding a unique key on the events table for the combination of indi_id and event, so that duplicates are rejected at the database level as well and later processing does not have to account for multiple events.
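A rough sketch of what that could look like, using Python's built-in `sqlite3` module as a stand-in for the real database. Only `events`, `indi_id` and `event` come from the discussion; the `date` and `place` columns, the `import_events` helper and the error message format are assumptions for illustration. This variant keeps the first occurrence.

```python
import sqlite3


def import_events(conn, rows):
    """Insert rows, keeping the first event per (indi_id, event); return parse errors."""
    parse_errors = []
    for indi_id, event, date, place in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO events (indi_id, event, date, place) VALUES (?, ?, ?, ?)",
                    (indi_id, event, date, place),
                )
        except sqlite3.IntegrityError:
            # The unique key rejected a second event of the same type.
            parse_errors.append(f"duplicate {event} for individual {indi_id}")
    return parse_errors


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           indi_id INTEGER NOT NULL,
           event   TEXT NOT NULL,
           date    TEXT,
           place   TEXT,
           UNIQUE (indi_id, event)
       )"""
)

# The two BIRT structures from 5047.ged, in file order.
rows = [
    (1, "BIRT", "12 SEP 1987", "Den Bosch"),
    (1, "BIRT", "ABT 1988", "'s Hertogenbosch"),
]
errors = import_events(conn, rows)
stored = conn.execute(
    "SELECT date, place FROM events WHERE indi_id = 1 AND event = 'BIRT'"
).fetchall()
```

With the unique key in place, only the first birth is stored and the second one ends up in `errors`, so joins on the events table can rely on at most one row per individual and event type.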

coret commented 9 years ago

In genealogical research you collect data from several sources, some with more detail than others, and some even contradicting others. Many genealogical programs only allow a single entry for an event (e.g. one birth); others allow multiple entries, from which a conclusion about the event can be drawn. Multiple individual events (like BIRT records) are valid in GEDCOM!

From: The GEDCOM Standard Release 5.5

Lineage-Linked Form Usage Conventions

The order in which GEDCOM lines are written to a GEDCOM file is controlled by the context and level number. When the lines are of equal level number but have a different tag name then the order is not significant. The occurrence of equal level numbers and equal tags within the same context imply that multiple opinions or multiple values of the data exist. The significance of the order in these cases is interpreted as the submitter's preference. The most preferred value being the first with the least preferred data listed in subsequent lines by order decreasing preference. For example, a researcher who discovers conflicting evidence about a person's birth event would list the most credible information first and the least credible or preferred items last.

Systems that support multiple fields or structures should allow their users to indicate their preference opinion. Systems that only store single value structures should use the preferred information (the first occurrence listed) and store the remaining information as an exception, preferably within an appropriate NOTE field or in some way that the patron has ready access to the less-preferred data when viewing the record.

Conflicting event dates and places should be represented by placing them in separate event structures with appropriate source citations rather than by placing them under the same enclosing event.
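Read literally, the convention quoted above means "first occurrence is preferred, the rest are kept as exceptions". A minimal Python sketch of that rule for a system that stores a single value per event follows. The `split_preferred` helper and the note wording are hypothetical, and occurrences are assumed to be dicts of sub-tag to value in file order.

```python
def split_preferred(occurrences):
    """Return (preferred, notes): the first occurrence plus a note line per alternative."""
    if not occurrences:
        return None, []
    preferred, *alternatives = occurrences
    notes = [
        "Alternative value: " + ", ".join(f"{tag} {value}" for tag, value in alt.items())
        for alt in alternatives
    ]
    return preferred, notes


# DATE and PLAC of the first two BIRT structures from the MyHeritage example.
births = [
    {"DATE": "1832", "PLAC": "Nibbixwoud"},
    {"DATE": "ABT 1832", "PLAC": "Nibbixwoud, Noord Holland"},
]
preferred, notes = split_preferred(births)
```

Note that this is the opposite of the "always select the last occurrence" proposal earlier in the thread: under the 5.5 convention, '1832' would be the preferred value and 'ABT 1832' would be kept as a note.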

cgeltly commented 9 years ago

Okay, I think we should recognise that it is actually a good thing to have information from different sources, even if it is contradictory. It may indicate good research rather than an error.

As I mentioned before, having an event appear multiple times in the events table is not a problem. The question is how we deal with this information downstream, particularly when joining onto the events table.

I suggest that we tag all duplicate events as 'ambiguous', using a boolean field in the events table. This would be similar to what we are doing with estimated dates, and would allow the duplicate records to be identified and dealt with as necessary.

I think we should also record the duplicates in the error table, not as an error but as a 'notification', i.e.:

    type_broad: 'ambiguous'
    type_specific: 'ambiguous_birth' / 'ambiguous_death' / etc.
    eval_broad: 'notification'
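As a rough illustration of the flagging step, here is a Python sketch that works on plain dicts rather than the real tables. The `ambiguous` key, the `flag_ambiguous` function and the tag-to-name mapping in `EVENT_NAMES` are assumptions; only `type_broad`, `type_specific` and `eval_broad` and their values come from the proposal above.

```python
from collections import Counter

# Hypothetical mapping from GEDCOM tags to the names used in type_specific.
EVENT_NAMES = {"BIRT": "birth", "DEAT": "death", "BURI": "burial"}


def flag_ambiguous(events):
    """Set event['ambiguous'] in place and return one notification per duplicated event type."""
    counts = Counter((e["indi_id"], e["event"]) for e in events)
    for e in events:
        e["ambiguous"] = counts[(e["indi_id"], e["event"])] > 1
    notifications = []
    for (indi_id, event), n in counts.items():
        if n > 1:
            name = EVENT_NAMES.get(event, event.lower())
            notifications.append({
                "indi_id": indi_id,
                "type_broad": "ambiguous",
                "type_specific": f"ambiguous_{name}",
                "eval_broad": "notification",
            })
    return notifications


# Events of the Haza-21 individual above, in file order.
events = [
    {"indi_id": 6517, "event": "BIRT"},
    {"indi_id": 6517, "event": "DEAT"},
    {"indi_id": 6517, "event": "BURI"},
    {"indi_id": 6517, "event": "BIRT"},
]
notifications = flag_ambiguous(events)
```

Downstream queries could then filter on the flag, or join on it explicitly, instead of silently counting an individual once per birth event.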