FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Shared events #118

Closed EssyGreen closed 12 years ago

EssyGreen commented 12 years ago

If I understand the Record Model correctly, facts are either defined/embedded within (a) a Persona (b) a Relationship or (c) within the Record (if neither a nor b apply).

This means that the event/fact details will be duplicated (in the physical model) if the people participating are not a couple. For example, in order to record witnesses to a marriage, the fact would need to be replicated for each witness (as well as being included for the couple); the details of a death would need to be duplicated for the informant (as well as the deceased), etc.

Would it not be better to have all the facts at the Record level and references to them in the Persona and Relationship objects? This way the fact details could be shared without the need to duplicate the data.
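
A minimal sketch of that idea, in Python purely for illustration. The class and field names (`Fact`, `Persona`, `Record`, `fact_refs`, `facts_for`) and the sample marriage data are assumptions made up for this example; they are not taken from the GEDCOM X Record Model.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    id: str
    type: str
    date: str
    place: str


@dataclass
class Persona:
    name: str
    # Ids of record-level facts this persona takes part in (hypothetical field).
    fact_refs: list[str] = field(default_factory=list)


@dataclass
class Record:
    # Facts are stored once, at the record level, keyed by id.
    facts: dict[str, Fact] = field(default_factory=dict)
    personas: list[Persona] = field(default_factory=list)

    def facts_for(self, persona: Persona) -> list[Fact]:
        """Resolve a persona's references to the shared fact objects."""
        return [self.facts[ref] for ref in persona.fact_refs]


# Invented sample data: one marriage fact shared by three personas.
marriage = Fact("f1", "Marriage", "1852-06-01", "Leeds")
record = Record(facts={"f1": marriage})
for name in ("Bride", "Groom", "Witness"):
    record.personas.append(Persona(name, ["f1"]))

# A correction made once is seen through every persona's reference.
marriage.place = "Leeds, Yorkshire"
```

Because every persona holds a reference rather than a copy, there is only one place where the date and place of the marriage can be edited.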

Is it also worth considering a "Role" field in addition to the "Principal" flag?

stoicflame commented 12 years ago

Hey, @EssyGreen, I just want to say how impressed I am with how quickly you're able to pick up on all these tricky concepts and join the conversation with some very intelligent comments and contributions.

Anyway, my 2 cents:

I think shared events is a good idea for solving a problem that we still don't fully understand yet. A shared event model was actually originally what we started with in the record model, but it eventually got overturned because an analysis of our own records showed that sharing events turned out to be more hassle than it was worth, even considering the cost of redundant event data. Record events just weren't being shared very often. And when they were, they were hard to work with. And most of the time, shared events were just getting denormalized in the back-end (for feeding into the search cluster). And sharing these events made it harder to map data to conclusions.

See #99 and all the issues mentioned from that issue for more details.

I do think that there is room for a shared event model, though. But I also think that we need to understand the problem space better so we can be sure to do it right. So let's keep this issue open to track that work.

> Is it also worth considering a "Role" field in addition to the "Principal" flag?

The fact value was intended to contain a role, if needed.

EssyGreen commented 12 years ago

Hiya @stoicflame :) Many thanks for the compliments! Actually I have been struggling with the same concepts myself for the past 3 years (albeit on my own) so I know the problems, recognise the circles and am really thankful to be able to share thoughts with others doing the same thing!

Back to the subject of shared events ... my view is:

Record Model ... records usually record events and as a consequence they mention people e.g. Birth, Marriage, Death certificates, Baptism, Burial entries - the focus is the event ("A birth/death/marriage/burial/baptism etc occurred on [date] at [place]") and the people involved are secondary ("The child/mother/father/bride/groom etc was [person]"). Early censuses and taxation records are just head-counts with no names or just heads of households for identity. Of course in genealogy our interest is reversed and we focus on the people but if we are modelling the real world then I would recommend we model it as it is rather than as we would prefer it to be.

Hence I believe that in the record model a Record should contain Events which contain Roles which reference Persons (or rather Personas in GEDCOMX terminology). If a record exists which has no events (maybe someone remembering attributes about a person they knew) then an event type of "Other" or "Personal Information" or some-such could be used to model the information.

The benefits of modelling this way are that multiple people can be joined together in a single contextual situation without the need for a specific "Relationship" object. A birth/baptism might record the roles of child, mother and father; a death might record the roles of deceased and informant; a marriage the roles of bride, groom, bride's father, groom's father and two witnesses. There is no need to replicate the date, place and descriptive information about the event since the event is the object (i.e. context) in which they are embedded [recorded]. Furthermore, it makes it very easy to record sources which relate to places (e.g. changes in county boundaries; descriptions of how places used to be), occupational data (e.g. mining accidents, experiences of being in service) and historic events (Battle of Trafalgar, WWI, Hiroshima) without the need for a separate structure or the need to detail any roles/people if these are not relevant.

The downside is that in order to find the people you need to search through all the events but this is a performance issue which can be resolved by indexing at the application level.
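
To make the indexing point concrete, here is a rough Python sketch of events that contain roles, plus an application-level index from persona to events. All names (`Event`, `Role`, `index_by_persona`), the persona ids and the sample dates are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Role:
    role: str        # e.g. "bride", "witness", "informant"
    persona_id: str  # reference to a persona held elsewhere in the record


@dataclass
class Event:
    type: str
    date: str
    place: str
    roles: list[Role] = field(default_factory=list)


def index_by_persona(events):
    """Build a lookup of persona id -> [(event, role name), ...] in one pass."""
    index = defaultdict(list)
    for event in events:
        for role in event.roles:
            index[role.persona_id].append((event, role.role))
    return index


# Invented sample data.
events = [
    Event("Marriage", "1852-06-01", "Leeds",
          [Role("bride", "p1"), Role("groom", "p2"), Role("witness", "p3")]),
    Event("Death", "1880-02-11", "Leeds",
          [Role("deceased", "p2"), Role("informant", "p1")]),
    # An event with no roles at all, such as a historic event.
    Event("Battle of Trafalgar", "1805-10-21", "Cape Trafalgar"),
]

index = index_by_persona(events)
```

The index is derived data: it can be rebuilt from the events at any time, so it does not need to be part of the exchange format.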

Conclusion model ... I'm less concerned about shared events at the Conclusion level since an application can implement a feature to "Show others at the same place and date" without needing to have the records directly linked. Indeed I believe this can actually help discovery ("I hadn't realised person X lived in the same street as person Y!").
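
As an illustration of such a feature, the following Python sketch groups conclusions by date and place without any stored link between them. The tuple layout, the function name `others_at_same_place_and_date` and the sample data are assumptions for this example only, and it matches on exact strings, which a real application would probably need to relax.

```python
from collections import defaultdict

# Hypothetical conclusions as (person, fact type, date, place) tuples.
conclusions = [
    ("Person X", "Residence", "1851-03-30", "Mill Street, Leeds"),
    ("Person Y", "Residence", "1851-03-30", "Mill Street, Leeds"),
    ("Person Z", "Residence", "1851-03-30", "High Street, York"),
]


def others_at_same_place_and_date(person, conclusions):
    """Return the other people who share a (date, place) with `person`."""
    by_key = defaultdict(set)
    for name, _type, date, place in conclusions:
        by_key[(date, place)].add(name)

    found = set()
    for name, _type, date, place in conclusions:
        if name == person:
            found |= by_key[(date, place)] - {person}
    return sorted(found)
```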

So I'm happy to back down on the Conclusion Model but really keen to thrash this one out for the Record Model .... any takers?

stoicflame commented 12 years ago

I'm still trying to decide what you're suggesting. If you're saying that we need to add the notion of shared events in order to support a solid record model, then I'm probably fairly easy to persuade. Can you take a stab at quantifying the cost of not sharing events within the record? I understand the duplicate data argument, but I'm just not seeing a heavy cost associated with it.

If you're saying that all events in the record should be shared, even the ones that apply to just a single persona, then I'd have pretty serious concerns that the added complexity and awkwardness just isn't worth the cost. From our (quite substantial) experience, there just aren't very many cases in the majority of records we're indexing today where event data is really shared across more than two people.

EssyGreen commented 12 years ago

What I'm suggesting is:

  1. A Fact is an entity in its own right (not just a Field)
  2. A Role is a link between a specific Persona/Fact (and/or Person/Fact) which describes the relationship between the two (and the personal attributes relevant at the event e.g. Age)
  3. The Relationship is redundant since it can be represented by a Fact with multiple Roles (which provides greater flexibility than just linking 2 people together)
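
The three points above could be sketched roughly as follows, in Python for illustration. The names `Fact`, `Role` and `implied_relationships`, and the baptism data, are invented for this sketch and do not come from the GEDCOM X model. The function shows how pairwise links can be derived from a fact's roles on demand instead of being stored as separate Relationship objects.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Optional


@dataclass
class Role:
    persona_id: str
    role: str
    age: Optional[int] = None  # personal attribute relevant at the event


@dataclass
class Fact:
    # A fact is an entity in its own right, carrying its own roles.
    type: str
    date: str
    roles: list[Role] = field(default_factory=list)


def implied_relationships(fact):
    """Derive every pair of participants from the roles of a single fact."""
    return [
        ((a.persona_id, a.role), (b.persona_id, b.role))
        for a, b in combinations(fact.roles, 2)
    ]


# Invented sample data: one baptism with three participants.
baptism = Fact("Baptism", "1840-05-17", [
    Role("p1", "child", age=0),
    Role("p2", "mother"),
    Role("p3", "father"),
])
pairs = implied_relationships(baptism)
```

Three roles yield three derived pairs here, while the date is recorded only once on the fact.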

The "cost" of not doing this is that the model can only accurately reflect a real record if there are 1 or 2 people involved. If there are more than two people then either the extras (or their facts) must be discarded (rendering the record inaccurate and incomplete) or the Fact must be replicated (leading to loss of data integrity since it is then implied that the facts are different even if they appear to be the same, and if one is accidentally edited and the other not then they will actually become different).

I would argue that many (if not most) genealogical sources refer to more than 2 people: witnesses, informants, censuses etc. The inability to model these is a serious limitation in a genealogical model and encourages deviation from the standard.

With regard to the problem of a Fact which concerns only a single Person/Persona ... I understand that there is an added level of complexity in having a Role. Personally I would not find it a problem since a default ("self") Role seems to fit quite neatly but I would be happy to have a different type of Fact which relates solely to one person and is embedded within it, i.e. an "Attribute" (I'm sure we can come up with a better name!).

EssyGreen commented 12 years ago

An additional cost is that the Record (as defined at the moment) cannot represent multiple Facts for a single Persona unless they have the same Date since the Age (which necessarily relates to the Date) is defined in the Persona. So for example, a probate record which included administration papers as well as the original will would need to be split into multiple Records which loses (or at least weakens) the context of both.

ttwetmore commented 12 years ago

I should have read this thread before posting my comment about Relationships being N-squared.

What I really believe: essentially every raw genealogical record that we wish to extract information from can be thought of as describing an event, many of them multi-roled. Thus the object called a "Record" in the Record model I would simply call an Event object. The key facts about an Event object are its date and place and its role-players, so I wouldn't try to be fancy and think of these as examples of facts, but would treat them as what they are: dates and places. The personas are constructed more or less as in the current Record model. The only issue becomes how facts that are true WITH RESPECT TO THE INTRINSIC PERSON, versus facts that are true WITHIN THE CONTEXT OF THE EVENT, are to be treated differently.

The obvious answer to me has always been facts intrinsic to the person are put in the persona record, and facts true about the person only within the context of the event are placed in the role-references (that also contain the actual persona pointers) within the event objects.
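
A small Python sketch of that split, applied to the probate example raised earlier in the thread (a will and later administration papers in one record). The class names `Persona`, `RoleRef`, `Event`, the helper `ages_of`, and all names, dates and ages are hypothetical; this is not the DeadEnds model itself, only an illustration of the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Persona:
    id: str
    name: str
    sex: Optional[str] = None  # treated here as intrinsic to the person


@dataclass
class RoleRef:
    persona: Persona           # the actual persona pointer
    role: str
    age: Optional[int] = None  # true only within the context of the event


@dataclass
class Event:
    type: str
    date: str
    roles: list[RoleRef] = field(default_factory=list)


def ages_of(persona, events):
    """Collect the persona's age as stated in each event it has a role in."""
    return {
        event.type: ref.age
        for event in events
        for ref in event.roles
        if ref.persona.id == persona.id
    }


# Invented sample data: one persona, two dated events, two different ages.
mary = Persona("p1", "Mary Smith", sex="F")
will = Event("Will", "1850-01-10", [RoleRef(mary, "executrix named", age=35)])
admin = Event("Administration", "1855-04-02", [RoleRef(mary, "executrix", age=40)])
```

With the age on the role reference rather than on the persona, both events can sit side by side without splitting the persona in two.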

I will readily concede that there may be important genealogical records (I really wish the GEDCOMX vocabulary did not overload the term "record" in so many ways) that might not be best interpreted as an event (though I haven't been able to come up with any good examples!!), so the idea of converting every raw record to a single event object with personas hanging off might not be universal, but it surely handles the vast, vast majority of the cases, and also contributes to making the model easy to understand and process.

The DeadEnds model covers these ideas in detail. There is never any redundancy about facts in a DeadEnds model. Every fact occurs once and in the most appropriate object.

EssyGreen commented 12 years ago

I totally agree :)

stoicflame commented 12 years ago

I'm listening :-)

I'm curious about the assumption that "events have roles" which reference personas. Why wouldn't it be that "personas have roles" that reference events? The latter seems significantly more natural to me. Am I backwards?

ttwetmore commented 12 years ago

Ryan, for an archival/transport format it could be either (or both). Once loaded into a database or imported into memory for algorithms, it would presumably be placed in whatever format is deemed best by the engineer.

A role from an event to a persona (or vice versa) is more than just a pointer/reference/index. The role has other information, for example the age of the person with respect to the event. In the Record Model you have chosen to put the age inside the persona record. To me this is incorrect, as I believe the only facts/properties/attributes inside a person record should be intrinsic to the person. Of course this breaks down almost immediately when you think of a name as a non-intrinsic property of a person. And I don't like the age in there because it introduces an artificial (in my opinion) difference between the persona record and the person record, when, in my opinion, there shouldn't be.

EssyGreen commented 12 years ago

@ttwetmore +1

Also, if you have thing->Persons->Roles->Events you must also have thing->Events to cater for things where there are no Persons (or none of interest) e.g. a description of WW1 or of a particular place or details about what a particular occupation was like in the 1840s etc.

Hence both the Event and the Person would need a reference to the thing they were contained in ... this then raises the question (in a Relational Model) about data integrity .... Is it OK to have a Role with a Person from one thing and an Event from another thing? If so, what does this mean? If not, then it adds extra complexity to the business logic when validating the objects.
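
If the answer were "not OK", the extra validation might look something like this Python sketch. The `container_id` field, the class names and the function `cross_container_roles` are assumptions invented for this example; the thread does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    id: str
    container_id: str  # the "thing" this persona is contained in


@dataclass(frozen=True)
class Event:
    id: str
    container_id: str  # the "thing" this event is contained in


@dataclass(frozen=True)
class Role:
    persona: Persona
    event: Event
    name: str


def cross_container_roles(roles):
    """Return the roles whose persona and event live in different containers."""
    return [r for r in roles if r.persona.container_id != r.event.container_id]


# Invented sample data: the second role mixes two containers.
roles = [
    Role(Persona("p1", "rec-A"), Event("e1", "rec-A"), "bride"),
    Role(Persona("p2", "rec-B"), Event("e1", "rec-A"), "witness"),
]
invalid = cross_container_roles(roles)
```

A model in which roles are nested inside the event, inside a single record, would not need this check at all, since the containment would make the mismatch impossible to express.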

stoicflame commented 12 years ago

I'm just doing some issue scrubbing here.

Does anyone disagree that this issue is covered by the thread at #134? Because if there are no objections, I'd like to close this one and consolidate discussion on this topic at #134...

EssyGreen commented 12 years ago

Fair enough - since I started it I'll close it down :)