fisharebest / webtrees

Online genealogy
https://webtrees.net
GNU General Public License v3.0
452 stars 298 forks source link

2.0.15 - Sorting issue in Fact and events: BIRT on the wrong place #3841

Open TheDutchJewel opened 3 years ago

TheDutchJewel commented 3 years ago

Sometimes (very rarely) I have an issue with the sorting in the Facts and events pane.

Normal presentation is BIRT/OCCU/MARR/OCCU/DEAT.

But sometimes it happens that OCCU is sorted at the top (OCCU/BIRT/MARR/DEAT).

Here a minimal GEDCOM file with presentation OCCU/BIRT/MARR:

0 HEAD 1 SOUR webtrees 2 NAME webtrees 2 VERS 2.0.15 1 DEST DISKETTE 1 DATE 09 APR 2021 2 TIME 08:38:32 1 GEDC 2 VERS 5.5.1 2 FORM Lineage-Linked 1 CHAR UTF-8 1 FILE tree1.ged 0 @I2981@ INDI 1 NAME Wife /Name/ 1 SEX F 1 FAMS @F2197@ 0 @I5530@ INDI 1 NAME Mother /Name/ 1 SEX F 1 FAMS @F1890@ 0 @I6122@ INDI 1 NAME Father /Name/ 1 SEX M 1 FAMS @F1890@ 1 DEAT 2 DATE 01 FEB 1818 0 @I6125@ INDI 1 NAME Occu Sorting Issue /Name/ 1 SEX M 1 FAMS @F2197@ 1 FAMC @F1890@ 1 OCCU occupation 1 BIRT 2 DATE 01 JUL 1818 0 @F1890@ FAM 1 HUSB @I6122@ 1 WIFE @I5530@ 1 CHIL @I6125@ 0 @F2197@ FAM 1 HUSB @I6125@ 1 WIFE @I2981@ 1 MARR 2 DATE 01 JAN 1860 0 TRLR

Here same minimal GEDCOM without MARR date, with presentation MARR/OCCU/BIRT:

0 HEAD 1 SOUR webtrees 2 NAME webtrees 2 VERS 2.0.15 1 DEST DISKETTE 1 DATE 09 APR 2021 2 TIME 08:37:16 1 GEDC 2 VERS 5.5.1 2 FORM Lineage-Linked 1 CHAR UTF-8 1 FILE tree1.ged 0 @I2981@ INDI 1 NAME Wife /Name/ 1 SEX F 1 FAMS @F2197@ 0 @I5530@ INDI 1 NAME Mother /Name/ 1 SEX F 1 FAMS @F1890@ 0 @I6122@ INDI 1 NAME Father /Name/ 1 SEX M 1 FAMS @F1890@ 1 DEAT 2 DATE 01 FEB 1818 0 @I6125@ INDI 1 NAME Occu Sorting Issue /Name/ 1 SEX M 1 FAMS @F2197@ 1 FAMC @F1890@ 1 OCCU occupation 1 BIRT 2 DATE 01 JUL 1818 0 @F1890@ FAM 1 HUSB @I6122@ 1 WIFE @I5530@ 1 CHIL @I6125@ 0 @F2197@ FAM 1 HUSB @I6125@ 1 WIFE @I2981@ 1 MARR Y 0 TRLR

If I change the father's DEAT to 1819 or later, or delete it, the son's facts will be sorted correctly!

So there seems to be a relation between the DATE of the father's DEAT and the son's BIRT.

And I suppose it's related to this issue (which I couldn't reproduce in the past).

TheDutchJewel commented 3 years ago

In other words: each time a child with OCCU is born after the DEAT of a parent, the child's OCCU is shown before the BIRT in the Facts and Events tab of the child.

TheDutchJewel commented 3 years ago

Tested on stable Demo tree:

As you can see, the occupation of George I of Greece is now shown above his birth.

afbeelding

afbeelding

jon48 commented 3 years ago

I can reproduce the issue, and actually, that can be reproduced with a range of events, as long as they are not dated.

I have done some troubleshooting on the main branch (so 2.1.0-dev), but the analysis is probably the same for the stable version (2.0.16). The problem comes from the part in the fact sorting algorithm (in Fact::sortFacts()) trying to figure out the order for non-dated facts within the dated facts: to do so, it basically checks the fact tag against a pre-defined order of the tags (constant FACT_ORDER). This is working fine with the individual's facts because BIRT is the first element of that list, so all non-dated facts should come after. But, in addition to the individual's facts, relatives' facts are also added (in particular, the death of parents is explicitly added, even if it takes place before the child's birth). In the list of facts to sort, in the use case above, there is therefore a dated fact generated with the EVEN tag to hold the father's death, which is sorted before the birth (based on date sorting). The problem is that the tag OCCU, when not dated, is listed before EVEN in the pre-defined order, so the occupation is sorted before the father's death by the algorithm, and by transition, before the individual's birth. That is more obvious when the "Events of close relatives" is expanded.

image

The same logic applies to any non-dated fact which tag is listed before EVEN (for instance, christening, graduation, immigration...), when the father dies before his child (you can probably find some other combinations).

The fact sorting logic has not actually changed for a long time (and was already essentially the same in 1.7 versions), but I believe that this behaviour started with commit e4cf93e367ec5647c295019975cd39de03da8c74, which added deaths of parents even if they happened before the birth (before that, any event before the birth was not included, and therefore the case above could not happen).

I have not investigated solutions, and my feeling is that it may not be that simple to address. Maybe events of close family should not be type-compared to other events, but that may have more impact than I think.

jon48 commented 3 years ago

I have played a bit around the logic of the facts sorting, and could "create" some other oddities in fact sorting - some of them subtle. Here is the GEDCOM to produce them, and I have uploaded it on a sandbox instance for quick reference (this is a custom 2.1.0-dev version, based on 77c72ee5 at the moment, with some custom modules, but the facts logic does not have any difference with the standard one)

Gedcom Facts Order ```text 0 HEAD 1 SOUR webtrees 2 NAME webtrees 2 VERS 2.1.0-dev 1 DEST DISKETTE 1 DATE 09 AUG 2021 2 TIME 15:02:14 1 GEDC 2 VERS 5.5.1 2 FORM LINEAGE-LINKED 1 CHAR UTF-8 1 FILE factsorder.ged 0 @X1@ INDI 1 NAME John /SMITH/ 1 SEX M 1 BIRT 2 DATE 01 JAN 1850 1 FAMC @X113@ 1 OCCU Farmer 1 FAMS @X115@ 1 DEAT 2 DATE 05 MAY 1904 1 PROP Farm with cows 1 WILL 2 AGE 45y 0 @X112@ INDI 1 SEX M 1 NAME Peter /SMITH/ 1 FAMS @X113@ 1 DEAT Y 1 BIRT 2 DATE 12 SEP 1827 1 FAMC @X118@ 1 BAPM Y 2 AGE 5y 1 WILL Donation to future born 2 DATE 01 NOV 1849 2 _ASSO @X1@ 1 OCCU Merchant 0 @X114@ INDI 1 SEX F 1 NAME Mary /TAYLOR/ 1 FAMS @X115@ 1 BIRT Y 1 OCCU Maid 0 @X116@ INDI 1 FAMC @X115@ 1 SEX M 1 NAME Mark /SMITH/ 1 BIRT Y 1 DEAT Y 0 @X117@ INDI 1 SEX M 1 NAME Robert /SMITH/ 1 BIRT Y 1 DEAT 2 DATE 10 AUG 1827 1 FAMS @X118@ 0 @X119@ INDI 1 SEX F 1 NAME Catherine /BROWN/ 1 BIRT Y 1 FAMS @X121@ 1 WILL 2 DATE 10 JUL 1832 0 @X120@ INDI 1 FAMC @X118@ 1 SEX M 1 NAME William /SMITH/ 1 FAMS @X121@ 0 @X122@ INDI 1 FAMC @X121@ 1 SEX M 1 NAME James /SMITH/ 1 BIRT Y 0 @X113@ FAM 1 CHIL @X1@ 1 HUSB @X112@ 0 @X115@ FAM 1 MARR 2 DATE 01 JUN 1875 1 HUSB @X1@ 1 WIFE @X114@ 1 CHIL @X116@ 0 @X118@ FAM 1 CHIL @X120@ 1 CHIL @X112@ 1 HUSB @X117@ 0 @X121@ FAM 1 HUSB @X120@ 1 WIFE @X119@ 1 MARR 2 DATE 05 MAR 1838 1 CHIL @X122@ 0 TRLR ```

The oddities in the non-dated facts order I could create are:

  1. the same use case as the one you reported, with Peter Smith - X112, where Baptism and Occupation are sorted before the Birth.

    • This is due to the explanation I gave, i.e. the algorithm determines that BAPM [non-dated] < OCCU [non-dated] < EVEN (father's death) [dated], and because EVEN (death of father) [dated] < BIRT [dated], then BAPM [non-dated] < OCCU [non-dated] < EVEN (father's death) [dated] < BIRT [dated]
  2. A similar use case, but with associated events rather than events of close relative can be seen with John Smith - X1, with an additional twist, because in that case, even the son's non-dated birth appears before the individual's own birth.

    • This is a rather constructed use case, as it requires an individual to be associated to an event that occurs before their birth. Quite unusual, but arguably it could be used to record a will which references a posthumous heir.
    • The reasoning is the same as the close relative one, i.e. the algorithm determines that OCCU [non-dated] < PROP [non-dated] < EVEN (son's birth) [non-dated] < WILL (asso) [dated], and because WILL (asso) [dated] < BIRT [dated], then OCCU [non-dated] < PROP [non-dated] < EVEN (son's birth) [non-dated] < WILL (asso) [dated] < BIRT [dated]
  3. A slightly less obvious case can be seen with Catherine Brown - X119. Here the issue is that the non-dated son's birth is appearing before the marriage, whereas tradition would usually expect the contrary.

    • The logic is roughly the same as above, there is a dated event (the will - this is usually done closer to death, but it could be written a lot earlier in life, especially when valuable assets are involved), which occurs before the marriage, and so all non-dated events that are determined as happening before a non-dated will are sorted before
    • So the algorithm determines that EVEN (birth's son) [non-dated] < WILL [dated], and because WILL [dated] < MARR [dated], then EVEN (birth's son) [non-dated] < WILL [dated] < MARR [dated].
  4. A more subtle use case is with Mary Taylor - X114. Here nothing stands out, except that a non-dated occupation should be sorted after the marriage. To reproduce this one, the French historical events should be enabled (could work probably with other countries' historical events, as long as there is an event between the birth estimated date and the marriage date)

    • Here the culprit is, as suggested, the historical events, which are using the EVEN tag, and interfere in the same manner as an EVEN with the BIRT in the top two cases above. When the historical events are disabled, the order becomes Marriage then Occupation.
    • Indeed, the algorithm determines that OCCU [non-dated] < EVEN (historical event) [dated], and because EVEN (historical event) [dated] < MARR [dated], then OCCU [non-dated] < EVEN (historical event) [dated] < MARR [dated]

It is easy for me to highlight issues with the algorithm, but having had a play with it, I have not found yet a magic fix that would address those edge cases altogether. I have tweaks for use cases 1, 2 and 4, but they may introduce new use cases that will contradict them (commit jon48/webtrees@d8d8cade8c0c10433695ceabd74d9100533ec4d2). Globally, I reckon that most of those issues comes from dated events not occurring in the order they are "expected to happen", and then the algorithm can produce oddities trying to sort the non-dated ones, as it compares pairs, without considering the other events. I think as well that using EVEN for close relatives and historical events slightly increases the likelihood of the expected order to be broken (as they can be quite random events). There are a couple of other tweaks I want to test (for instance, trying to order starting at the end rather than the beginning), and I am trying to think if a different algorithm could be used, but it is possible that the current one may already be the best to solve the difficult problem of sorting non-dated events.

fisharebest commented 3 years ago

it requires an individual to be associated to an event that occurs before their birth.

The only undated events will be the births of the children. (All other events must have a date, and must occur between the individual's birth and death.) This might create an opportunity for some special logic:

if (event type is BIRT and date is not known) then {sort as day after individual's own birth}

However, I'd prefer to avoid this type of special logic, as it can be hard to maintain.


The function Fact::typeComparator() has some logic for family records. https://github.com/fisharebest/webtrees/blob/f74962deb25314243e931b4a254615968e2e5deb/app/Fact.php#L694-L699

Perhaps removing the check for instanceof Family will help? Presumably any events from different individuals will remain in their original order - which is the order we added them (individual, then relatives...)

fisharebest commented 3 years ago

The code is showing the father's death before the individual's birth.

This isn't intended. Perhaps this line is at fault:

https://github.com/fisharebest/webtrees/blob/f74962deb25314243e931b4a254615968e2e5deb/app/Module/IndividualFactsTabModule.php#L604-L605

fisharebest commented 3 years ago

This logic is including the parent's death.

https://github.com/fisharebest/webtrees/blob/f74962deb25314243e931b4a254615968e2e5deb/app/Module/IndividualFactsTabModule.php#L843-L845

jon48 commented 3 years ago

First, let me reiterate that my analysis is in no way a criticism of the current algorithm. I have a soft spot for algorithmics, so this issue was the occasion for me to dissect the sorting algorithm, and I just wanted to share - more for reference than action - the few problematic use cases I could spot as I was digging through.

I'd prefer to avoid this type of special logic, as it can be hard to maintain

I would tend to agree with you.

The algorithm is actually fairly well balanced between its relative low complexity (which translates in fast execution), and results accuracy, and tweaking it further may increase the accuracy, but at the expense of increasing its complexity (whether performance or maintainability). Using the current code, 100% accuracy is not possible anyway, because fundamentally no formal order (in mathematical terms) is defined between the elements, so "sorting" can only be a best effort, but pragmatically, it is able to cope with probably 95+% of real genealogy cases in a fast and accurate manner, and considering how generic it is, this is quite remarkable. Adding more genealogical rules can help in increasing the accuracy, but the readability will clearly suffer, and there is a risk of an endless race for more rules.

if (event type is BIRT and date is not known) then {sort as day after individual's own birth}

This was my initial thought, but I realised that within the comparator function, you do not have the individual's own birth date (and more generally, no more context than just the two facts to compare). You can access the individual of the fact, but when comparing facts not relating to the same person (which is the case for the parent's death vs the individual's events), you cannot decide which one is the primary. That was however an idea not dissimilar to one of the tweaks I tested: I isolate the dated events happening before the birth (and sort them easily because they are dated), then sort the remaining ones independently (non-dated events, and post-birth dated event), so that no event can be sorted before the birth.

Perhaps removing the check for instanceof Family will help [in typeComparator]

That is another idea I tested, but whereas it seems to produce slightly better result on the initial specific issue, it also results in clustering non-dated close relatives' events (birth of children, and death of parents), and pushing them at the end of the ordered facts, so you end up with a bunch of close relative events after the main individual's events, which can be as awkward as the initial issue.

image

WGroleau commented 2 years ago

How much complexity would be added by asserting BIRT is before all other events, BURI or CREM is after all other, and DEAT is after all except those two? And where there is still doubt about the order of two events, use the order in the GEDCOM. (Since we have the ability to edit that order.)

WGroleau commented 1 year ago

Instead of complicating the sorting algorithm, what if it were to use a calculated column? (Does mySQL have those?) In g7, SDATE would override.) For example with birth: BIRT-SORT_DATE = BIRT.SDATE if present else BIRT-SORT_DATE = BIRT-DATE if present else BIRT-SORT_DATE later than any present of parents’ BIRT-DATE and
earlier than any known BURI-DATE or DEAT:DATE else ? Perhaps other heuristics, but even this gets so complicated if one codes all the possible considerations a human would use. Like, comparing to siblings, eight months after death of father, prior to other events for the individual, etc.