FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

justify date and place on characteristic #45

Closed gechols closed 13 years ago

gechols commented 13 years ago

There are two options for modeling a genealogical characteristic:

  1. Keep date and place on characteristic to reduce the cost of disruption in the current genealogical development paradigm.
  2. Remove date and place on characteristic and move any known types that need date and place that have historically been characteristic types to become known event types.

Neither option is wrong nor right; it's mostly a matter of preference. Currently, option 1 has been selected to move forward. We need to gather majority opinion across industry on the matter.

ianstiles commented 13 years ago

It only makes sense to me to have date and place on Event. If a Characteristic is associated with a date and/or place, it should be a Characteristic of an Event, and NOT the other way around.

I would propose we change that.

carpentermp commented 13 years ago

There are a couple of reasons for date and place on characteristic: 1. Gedcom had it. 2. There are legitimate characteristics that use it. For example, "Occupation"--He was an Ohio farmer from 1880 to 1890. Yes, you could have a "became a farmer" event in 1880, and a "stopped being a farmer" event in 1890, or a single "was a farmer" event where the date is a range, but all of these seem a little odd. Given that Gedcom provided for characteristics like these and that they are in use, it seemed expedient to leave that part of the model unchanged.
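To make the two modelings of the farmer example concrete, here is a minimal sketch. The class names, field names, and event type strings are hypothetical and chosen only for illustration; they are not the GEDCOM X classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Characteristic:
    """Hypothetical characteristic with optional date and place."""
    type: str
    value: str
    date: Optional[str] = None   # optional: most characteristics leave this unset
    place: Optional[str] = None  # optional


@dataclass
class Event:
    """Hypothetical event; date and place are its natural home."""
    type: str
    date: Optional[str] = None
    place: Optional[str] = None


# Characteristic carries the date range and place directly.
occupation = Characteristic("occupation", "farmer", date="1880 to 1890", place="Ohio")

# Alternative A: two point-in-time events (made-up type names).
became = Event("became_farmer", date="1880", place="Ohio")
stopped = Event("stopped_being_farmer", date="1890", place="Ohio")

# Alternative B: one event whose date is a range (made-up type name).
was_farmer = Event("was_farmer", date="1880 to 1890", place="Ohio")
```

All three carry the same information; the difference is only in which object type a consumer has to look at to find it.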

stoicflame commented 13 years ago

We have discussed this issue a lot. We can come up with cases (like the one Merlin cites above) where date and place fit on a characteristic, but the cases never seem to satisfy those who don't want to see date and place on characteristic; they can always say that those cases could be modeled as events.

The thing is, the notion of a "genealogical characteristic" has consistently been modeled for the past 30 years as having a date and a place. The industry knows what a genealogical characteristic is and knows when (and when not) to use the date and place on a characteristic. This isn't a new concept.

So until it can be shown that having a date and place on characteristic is inherently a wrong model (as opposed to an inconvenience to those who are new to the concept) then date and place on characteristic are going to stay. Date and place will always be optional on characteristic.

jeffph commented 13 years ago

I realize we've discussed this before, so I hesitate to beat a dead horse here, but here goes. :)

I believe most of the disagreement on this issue is based on how similar one believes the record model should be to the conclusion model. By using a "genealogical characteristic" across both models, we benefit from a similar look and feel, greater model compatibility, shared vocabulary, etc. For those of us who work exclusively in the record world, and are not as familiar with the conclusion model, we would probably use a "typed-field" construct (i.e. subclass field, extend with a "type" enum/qname, call it something other than Characteristic) if we were to author a model independent of the conclusion model.

To help those on the conclusion side understand our perspective, here's why we would use "typed-fields" instead of "genealogical characteristics":

1) While the gedcomx conclusion model nods to the legacy gedcom model, the record model doesn't need to be backwards compatible with gedcom. It's a different domain. There's no legacy gedcom data to migrate to the record model.

2) It is not always clear to a user or developer where to put date/place information. A classic example is "Residence" for census records. Currently, we create a "Residence" event for a census persona. Conversely, based on the Occupation example mentioned above, it would also seem perfectly acceptable to create a Residence characteristic with a date and place. What criteria does one use to decide where "Residence" should go? Can this be documented somewhere?

3) Creating an "Occupation" or "Career" EventType seems odd to those in the conclusion world. Treating "Occupation" as a characteristic with a date range seems odd to those in the record world. Both are right! :) This is because they are two different domains. Of course, we all want to share data-structures as much as possible, but this is a case where the models should slightly diverge, in my opinion.

4) We currently have a "military_service" EventType in gedcomx. Why is Occupation different than this? We believe, at least in the record world, it shouldn't be. If it truly does need to be different, can we clearly document the reasons for this difference?

Hopefully, this doesn't come across as being disagreeable. We're simply having a hard time (still) understanding how to handle some of these use cases. Perhaps your answers will help us understand, accept, and clearly codify the current proposal.

stoicflame commented 13 years ago

I'll reopen the issue in an effort to make sure everybody feels free to comment on it.

stoicflame commented 13 years ago

So I don't feel really strongly about removing date and place from characteristic. What I do feel strongly about is that a characteristic in the record world be the same as a characteristic in the conclusion world. I haven't heard anything that would convince me otherwise. A characteristic is a characteristic, whether in the record world or in the conclusion world. Can we at least agree to that?

Jeff seems to be saying that a conclusion characteristic can have a date and a place, but a record characteristic shouldn't. I think that's a very bad idea because of how important it is to be clear on how to bring data from the record world to the conclusion world. Sure, you can remove the notion of a characteristic from the record world (rename it to be some kind of specialized field) and keep it in the conclusion world, but that just adds unnecessary complexity to the conclusion-to-record bridge.

stoicflame commented 13 years ago

Based off my previous assertion that a characteristic should be modeled consistently, I would really like to keep the discussion on this thread to the merits of the two following options:

  1. Keep date and place on characteristic to reduce the cost of disruption in the current genealogical development paradigm.
  2. Remove date and place on characteristic and move any known types that need date and place that have historically been characteristic types to become known event types.

There are other options. (Some have suggested moving date and place up in the inheritance model so that they apply to everything--even names, for example.) I suggest we refrain from considering those options in this thread and open new issues if we want to seriously consider them.

As Jeff said, neither option is wrong nor right. It's primarily a matter of preference. My perception is that we're pretty evenly split within FamilySearch. I think we're going to need to gather further opinions from other genealogical developers and industry leaders in order to determine where the majority preference lies.

We'll keep this issue open as we gather more opinions on the matter. It goes without saying that this issue will need to be closed before 1.0.0 final is cut.

gechols commented 13 years ago

It feels like the reason we are split is that we cannot decide whether Gedcomx is record focused or conclusion focused. I think Jeff identified the effects of each decision - so what is the decision? Is Gedcomx record or conclusion focused? It's unhealthy to straddle this fence.

stoicflame commented 13 years ago

> So what is the decision? Is Gedcomx record or conclusion focused?

Neither. Why would it be one or the other? The purpose of GEDCOM X is to create a standard GEnealogical Data COMmunications model for all profiles of genealogical work.

gechols commented 13 years ago

Because one is 'evidentiary' while the other is a set of 'conclusions'. If you are saying there is no difference between the two then I think we have a problem.

stoicflame commented 13 years ago

> Because one is 'evidentiary' while the other is a set of 'conclusions'. If you are saying there is no difference between the two then I think we have a problem.

I think you must be confused. GEDCOM X defines two separate and distinct models. One for record data and one for conclusion data.

When I asserted above that "a characteristic in the record world be the same as a characteristic in the conclusion world", I wasn't saying that we have to use the same object but that the two separate and distinct characteristic objects should be consistent between the record world and the conclusion world.

carpentermp commented 13 years ago

> For those of us who work exclusively in the record world, and are not as familiar with the conclusion model, we would probably use a "typed-field" construct (i.e. subclass field, extend with a "type" enum/qname, call it something other than Characteristic) if we were to author a model independent of the conclusion model.

So we actually have that in the Gedcomx-record model. RecordField has a "type". Remember, there is no "characteristic" on record anymore, only on persona and relationship.

> 1) While the gedcomx conclusion model nods to the legacy gedcom model, the record model doesn't need to be backwards compatible with gedcom. It's a different domain. There's no legacy gedcom data to migrate to the record model.

The domains are not as distinct as they might seem. Sure, it is unlikely that a fielded record will have an occupation that has a date and/or place, but what about an obituary or a journal? These "non-fielded records" could easily contain this type of information.

> 2) It is not always clear to a user or developer where to put date/place information. A classic example is "Residence" for census records. Currently, we create a "Residence" event for a census persona. Conversely, based on the Occupation example mentioned above, it would also seem perfectly acceptable to create a Residence characteristic with a date and place. What criteria does one use to decide where "Residence" should go? Can this be documented somewhere?

It is documented. There is a "residence" event type, but no "residence" characteristic type.

> 3) Creating an "Occupation" or "Career" EventType seems odd to those in the conclusion world. Treating "Occupation" as a characteristic with a date range seems odd to those in the record world. Both are right! :) This is because they are two different domains. Of course, we all want to share data-structures as much as possible, but this is a case where the models should slightly diverge, in my opinion.

Again, non-fielded records blur the line between domains. Perhaps the two domains seem so different to you because you have been dealing only with fielded records and because you have had little exposure to how record data is used "downstream"? In my mind, we are attempting to model "genealogically useful information about people." This is true in both the "record" and "conclusion" models. The distinction between the two domains is just this: in the record model, we are attempting to model "recorded information about a person". In the conclusion model, we are attempting to model "what was true about a person." I don't see how this distinction would lead to a different model for the actual "information about a person," particularly since anything "true about a person" may have been "recorded" somewhere.

There are strong reasons to have the "person information model" between the two domains be as similar as possible and little reason to have them diverge. Wherever they diverge it will produce an impedance mismatch that will be a thorn in the side of every consumer of the data.

> 4) We currently have a "military_service" EventType in gedcomx. Why is Occupation different than this? We believe, at least in the record world, it shouldn't be. If it truly does need to be different, can we clearly document the reasons for this difference?

Now to the question of why "residence" and "military_service" are events, and "occupation" is a characteristic. I believe our primary reason is simply compatibility with long-standing common usage. Model-wise, I admit that it is easy to make a case that they are all the same kind of thing. However, after a careful review of the list of event types and the list of characteristic types I believe I may have discovered the underlying rationale for what went where. Whether the inventors of GEDCOM were conscious of the rationale or chose instinctively, I cannot say.

In general, it seems to me that the event types are things that you might expect to find a record of--that is to say, you might find a record where an event of that type is the "primary event". The characteristic types appear to be for information that would generally be thought of as ancillary in the record. So what about "military service"? This appears to be a catch-all for many different kinds of military records, such as draft cards, service records, pension records, etc. In many of these records, there will be a specific date for the beginning or end of service. In others it will be the period of service. How about "residence"? "Residence" is an event produced by many records, but may be thought of as a "co-primary event" with "census" records since the date and place of residence is so central. In records, "residence" is usually a point in time, but it is easy to see how it could be a range, particularly in non-fielded records. How about "occupation"? In most records, "occupation" is ancillary, captured as part of a census or death record, or whatever. It is easy to see how records could exist where it would be "primary", but generally this would not be the case.

I'm sure many counter-examples could be found. All models are an imperfect reflection of reality. The trick is to make a good balance between simplicity, accuracy, and usefulness. We all see mostly the same things, but put different weights on their relative importance. This, in turn, leads each of us to a different place when striking that balance.

You have posited what you believe to be a more ideal model for fielded records--characteristics with no date or place. I have countered with legitimate use cases, in the record domain, where date and place are useful on characteristics. You have cited the confusion that exists between what is logically an "event" and what is a "characteristic". That confusion, while unfortunate, is somewhat unavoidable because there is real-life overlap in the definitions. If "event" is "something that happens to a person" and "characteristic" is "some quality, attribute, or trait" of a person, then what do you do when the "quality, attribute, or trait" is not inherent, but true only for a given time and/or in a given place? Must it then be classified as an event, even when this is non-intuitive?

If I felt that the model was clearly "wrong", then I would be in favor of changing it, let the data migration chips fall where they may. But since there seem to be tradeoffs both ways with no clear winner, why not let "ease of data migration and consistency with long-standing usage" decide the issue?

stoicflame commented 13 years ago

> If I felt that the model was clearly "wrong", then I would be in favor of changing it, let the data migration chips fall where they may. But since there seem to be tradeoffs both ways with no clear winner, why not let "ease of data migration and consistency with long-standing usage" decide the issue?

FWIW, Merlin has very proficiently articulated my own personal feelings on the matter.

ianstiles commented 13 years ago

When I get an invitation to an event, it has a place, start time, and end time. Most people think of events as being one day or less in duration; there are multi-day events, but not multi-year events. Genealogical events are mostly specified as a day. Perhaps "things" that have a start and end date (like occupation: Truck Driver 1990-1995) should be a different object than either Characteristic or Event. How about a "DateRangeCharacteristic" or "DurationCharacteristic" that explicitly has a start and end date, so that you could remove the date option from Characteristic?

People who previously used Gedcom Characteristics can now understand when to map these to DateRangeCharacteristics or to Events when a single date is involved.

carpentermp commented 13 years ago

If not either/or, then why not both? Here are some downsides to having both kinds of characteristics:

  1. It complicates the model with another kind of object
  2. I suppose you will want to evaluate events and characteristics to decide which ought to be "DurationCharacteristics". Any characteristic that is not inherent to an individual could potentially have a temporal and/or spatial context and so, to be safe, we would probably want to put them in the DurationCharacteristic pile. For example, "physical_description". Many physical descriptions are permanent, like "eye color", where others change over time, like "height". Do we make "physical_description" a DurationCharacteristic even though most usages will be better served by it being a plain old Characteristic? What about "military_service" (which is now an event), do we make it a DurationCharacteristic with the reasoning that military service is not a single-day or few-days thing, but a few-years thing? What about military records that have only begin date? Do we create an event type for "military induction"? The record model has a "primary" flag for events to indicate events that are the cause for the creation of the record. If we move "military_service" to DurationCharacteristic, do we have to add a "primary" flag to DurationCharacteristic for military service records? If so, consumers of the data would have to look in the list of Events and the list of DurationCharacteristics for this flag, complicating usage.
  3. It complicates data migration.
  4. It doesn't really solve the "problem". The date and place will still be optional in DurationCharacteristics, and will not be known most of the time. "residence", "military_service" and "occupation" still make more sense in some cases as events, and in others as duration characteristics.

For all of these reasons, I would not be in favor of adding DurationCharacteristic.

stoicflame commented 13 years ago

The "Date" object defines its dates as strings. What's wrong with putting the string "1990 to 1995" as the value of the date? I know that our standards libraries understand "1990 to 1995" as a valid date and they can parse it to identify the "min" value and the "max" value. Maybe someday we can define a standard syntax for genealogical date strings.

But the nature of the "Date" object is kinda separate from this thread, which is about whether date and place belong on characteristic. If we need to, let's define the requirements and open another issue about the nature of the Date object.
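As a rough illustration of the "min" and "max" idea for a string such as "1990 to 1995", here is a toy parser. It is not the standards library mentioned above and does not define any standard syntax; the function name and the two accepted forms (a single four-digit year, or "YYYY to YYYY") are assumptions made for this sketch only.

```python
import re
from typing import Optional, Tuple

# Assumed toy grammar: "YYYY" or "YYYY to YYYY". Real genealogical dates are far richer.
_RANGE = re.compile(r"^\s*(\d{4})\s+to\s+(\d{4})\s*$")
_YEAR = re.compile(r"^\s*(\d{4})\s*$")


def year_bounds(text: str) -> Optional[Tuple[int, int]]:
    """Return (min_year, max_year) for the toy grammar, or None if not understood."""
    match = _RANGE.match(text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        # Reject reversed ranges rather than silently swapping them.
        return (low, high) if low <= high else None
    match = _YEAR.match(text)
    if match:
        year = int(match.group(1))
        return (year, year)
    return None


print(year_bounds("1990 to 1995"))
```

A single year yields identical bounds, so a consumer can treat point dates and ranges uniformly while the original string stays intact as the stored value.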

stoicflame commented 13 years ago

By way of an update on this issue, we're doing some analysis of the characteristics we have in FT and evaluating the feasibility and practicality of option 2.

stoicflame commented 13 years ago

I've finished my analysis of date and place on characteristic.

The data was taken from person characteristics in the conclusion data that we have in FT. Analysis was done according to the following algorithm:

| type | total | count (with date/place) | percentage | note |
|---|---|---|---|---|
| age | 211 | 18 | 8% | age doesn't make sense without a relative date and place |
| caste_name | 0 | 0 | 0 | |
| citizenship | 19 | 17 | 89% | doesn't make sense as an event |
| clan_name | 1283 | 1 | .07% | |
| count_of_children | 102665 | 102 | .1% | doesn't make sense as an event |
| count_of_marriages | 94239 | 72 | .07% | doesn't make sense as an event |
| died_before_eight | 19279334 | 32 | .0001% | do any date or place make sense on this? |
| dwelling | 4 | 4 | 100% | why not residence event? |
| ethnicity | 6 | 3 | 50% | date, place don't make sense |
| gedcom_uuid | 197 | 0 | 0% | |
| household | 5 | 2 | 40% | |
| marital_status | 531 | 143 | 27% | doesn't make sense as an event |
| military_rank | 17 | 11 | 64% | doesn't make sense as an event |
| namesake | 58 | 7 | 12% | |
| national_id | 303433 | 11 | .003% | |
| national_origin | 26927 | 300 | 1.1% | date, place don't make sense |
| not_accountable | 0 | 0 | 0 | |
| occupation | 1237019 | 497089 | 40% | |
| never_had_children | 10 | 0 | 0% | |
| never_married | 460 | 6 | 1.3% | date, place don't make sense |
| physical_description | 86202 | 842 | .9% | doesn't make sense as an event |
| possessions | 6896 | 6742 | 98% | doesn't make sense as an event |
| race | 2454111 | 40 | .001% | date, place don't make sense |
| religious_affiliation | 267354 | 32453 | 12% | doesn't make sense as an event |
| scholastic_achievement | 68470 | 15265 | 22% | |
| social_security_number | 526 | 177 | 33% | doesn't make sense as an event |
| stillborn | 174251 | 25214 | 14% | |
| title_of_nobility | 967506 | 2741 | .28% | doesn't make sense as an event |
| tribe_name | 3929 | 22 | .55% | doesn't make sense as an event |
| twin | 1794 | 194 | 11% | date, place don't make sense |

Interestingly, this analysis didn't significantly shift my position, although I did gain more sympathy for the proponents of diverging the model between the record world and conclusion world.

There are clearly some characteristics where having a date and place doesn't make sense. But there are others where date and place do make sense. The 'age' characteristic is a special case because we've already diverged between the models there. Here are the other characteristics that I think date/place make sense on:

For some of those, it makes sense to move them to become events. But these ones don't make sense as an event:

Now, in the record world, you can make the argument that the date and place of these characteristics are already accounted for by the primary event of the record. For example, "citizenship" is data that is found on a census record, so adding date and place to a citizenship characteristic is redundant. Likewise for "possessions" which is often associated with a probate record and "military rank" associated with a military record. I think you could make the same argument for all of the characteristics where date and place make sense.

ianstiles commented 13 years ago

Very good to have this data. It is interesting to note that every case where a Characteristic has a date but could also be represented by an Event illustrates the confusion. We really need to nail down when to use which.

The "count of children" example (assuming it means the record doesn't have all the children as Personas) is being used as information valid at the time the record was created, so the two need not be combined in a Characteristic. If that is the case for the others as well then that may indicate that the Characteristic doesn't need the date, it is elsewhere in the record.

carpentermp commented 13 years ago

It is interesting to note that most of the dates and places on characteristics are there to preserve explicitly in the conclusion world what was implicitly understood in the record world. For example, take "count of children." In a record, it means "count of children at the time of the record". It is not meant to indicate the total count of children that a couple ever had--which is what you might suppose if you saw a "count_of_children" characteristic with no date on a conclusion person. By putting a date on the "count_of_children" characteristic on a conclusion person you preserve the meaning that this is the count of children at a particular point in time. The same goes for "marital_status". It is a little weird as a characteristic on a conclusion person if no date is given, but perfectly reasonable in records because the date is implicit.

Having considered all of that, we still need to remember that not all records are the same and some of them behave a lot like "conclusion data" (like obituaries and other unfielded records). You may very well get information in these records where a date would be helpful e.g. "...he served in the army from 1940-1944." If we took it off of record characteristic, but allowed it in person/relationship characteristic, then you don't have this capability for records where it makes sense.

Having a spot for a date is more flexible and sometimes useful, though I agree it can be confusing. Perhaps we could document that, in records where a primary event exists, characteristics of the following types...yada, yada, yada...will assume the date and place of the primary event to be the date and/or place of the characteristic without the need for this information to be explicitly stated. Then, when harvesting characteristics from records for use in a conclusion person, the system would know to automatically insert the appropriate date and/or place. (I was noticing that for some characteristic types, like marital_status or count_of_children, place is meaningless.)
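A minimal sketch of that harvesting rule might look like the following. Everything here is hypothetical: the class names, the `harvest` function, and especially the `INHERITED_FIELDS` mapping, which stands in for the list of types that would still need to be documented. It includes only the two types named above and inherits only the date for them, since place was noted as meaningless for those.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    type: str
    date: Optional[str] = None
    place: Optional[str] = None
    primary: bool = False  # marks the event that caused the record to be created


@dataclass(frozen=True)
class Characteristic:
    type: str
    value: str
    date: Optional[str] = None
    place: Optional[str] = None


# Assumed, illustrative mapping: which fields each type takes from the primary event.
INHERITED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "marital_status": ("date",),
    "count_of_children": ("date",),
}


def harvest(characteristics: List[Characteristic], events: List[Event]) -> List[Characteristic]:
    """Copy record characteristics, making the implicit date/place explicit."""
    primary = next((e for e in events if e.primary), None)
    if primary is None:
        return list(characteristics)
    harvested = []
    for c in characteristics:
        fields = INHERITED_FIELDS.get(c.type, ())
        # Only fill fields the record left empty; explicit values are never overwritten.
        updates = {f: getattr(primary, f) for f in fields if getattr(c, f) is None}
        harvested.append(replace(c, **updates))
    return harvested


census = Event("census", date="1 June 1900", place="Ohio", primary=True)
record = [Characteristic("marital_status", "married"), Characteristic("occupation", "farmer")]
for c in harvest(record, [census]):
    print(c)
```

Types absent from the mapping pass through unchanged, so the rule stays opt-in per characteristic type.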

jeffph commented 13 years ago

If I understand the explanations above, it sounds like putting a date/place on a characteristic in the record world might only be necessary if all four of these conditions are met:

1) We are working with an unfielded record (e.g. obituary, article, etc.)

2) The characteristic on the unfielded-record has a date/place

3) The date/place of the characteristic is different than what the unfielded-record implies

4) The characteristic seemingly doesn't make sense as an event. Specifically, one of the characteristics below, from Ryan's list:

Then, from this list, there exist characteristics that are either:

a) Rarely used - Citizenship and military_rank fall into this category. Each has fewer than 20 usages within our data.

b) Rarely have Date/Place - Count_of_children, count_of_marriages, physical_description, and tribe_name. Each has date/place less than 1% of the time.
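The two filters can be checked against the numbers in Ryan's table with a short sketch. It assumes the table's second column is the total number of usages and the third is the number of usages carrying a date or place; the function name and category labels are made up for illustration.

```python
from typing import List, Tuple

# (type, total usages, usages with date/place), copied from Ryan's table.
ROWS: List[Tuple[str, int, int]] = [
    ("citizenship", 19, 17),
    ("military_rank", 17, 11),
    ("count_of_children", 102665, 102),
    ("count_of_marriages", 94239, 72),
    ("physical_description", 86202, 842),
    ("tribe_name", 3929, 22),
]


def classify(total: int, dated: int, min_usages: int = 20, min_share: float = 0.01) -> str:
    """Apply the two thresholds: fewer than 20 usages, or date/place under 1% of the time."""
    if total < min_usages:
        return "rarely used"
    if dated / total < min_share:
        return "rarely has date/place"
    return "possibly significant"


for name, total, dated in ROWS:
    print(f"{name}: {classify(total, dated)}")
```

The "rarely used" check runs first, so a type with zero usages never reaches the division.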

This leaves the following characteristics with possibly-significant date/place usage:

Of these three characteristics, we must consider which, in the cases where a date/place value exists, would have a different date/place than what the unfielded-record implies. This is more difficult to answer definitively since we don't yet have unfielded record data to compare to, but we can probably assume it's less than 100%.

Lastly, and perhaps most interestingly, I'm noticing the following patterns when looking at the actual data for these three remaining characteristic types:

Please note, the above data analysis is based on a sampling of our legacy gedcom data and is not exhaustive.

stoicflame commented 13 years ago

Based on the analysis of the current set of use cases and scenarios, we're going to remove date and place from record characteristic until the point when a well-defined use case arises that requires them.

stoicflame commented 13 years ago

applied at 480f7c7