FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

clarify support for "n-tiered" implementation, a.k.a. "inner persons" #72

Closed stoicflame closed 13 years ago

stoicflame commented 13 years ago

The following commentary was submitted by user @ttwetmore regarding the "two-tiered" definition of the model:

The GEDCOM X model is what I call a “two-tier” model because it has two levels, the records level and the conclusion level. These two levels exist with different names in other contexts. For example, sometimes the records level is called the evidence level (e.g., Better GEDCOM uses the term “Evidence and Conclusion Model”), and sometimes the conclusion level is called the person level (as in discussions about “record-based” versus “person-based” genealogy).

Because many objects used in both levels (e.g. especially, Person and Event) are essentially identical, it is simple to convert a two-tiered model of records and conclusions into a single but “N-tiered” model. In this model Person records can be quite naturally organized into trees, with the Persons at the leaves of the trees coming from evidence, the Person at a root being the current state of a genealogist’s conclusions about a real person, and Persons in the interior representing the state of all intermediate conclusions a genealogist has made about constructing the current root Person from the evidence/record Persons.

Using an N-tiered system has the curious effect of simplifying the model, while at the same time making it more powerful, but with an accompanying problem of making it harder to explain and understand.

Where is an N-tiered model needed? It’s easy to find places where it is not. For example, web services that provide original genealogical data from historical records clearly need only the record/evidence model (e.g., FamilySearch and Ancestry). And most on-line pedigree services clearly need only the conclusion/person model. And the vast plethora of desktop systems use only the conclusion model, causing much angst among their users who want to better encode their evidence in a more useful form. Currently the NewFamilySearch model does hide a two-tier model of Personas and Persons, without formally tying the Persona level to strict records or evidence.

But programs intended to provide researchers with good support for the research process could benefit greatly. These systems need to record all the original data the researcher discovers (the “record/evidence-tier”), all the current final conclusions the researcher has made about the persons he/she is researching (the “conclusion/person-tier”), and also all the intermediate decisions and arrangements of earlier conclusions and decisions (handled by the “middle-tiers”) that went into making those final conclusions. However, these capabilities are only really useful for an advanced and experienced genealogist who needs to perform detailed research more akin to historical research in order to make his/her conclusions.

The DeadEnds model has been an N-tiered model nearly from its inception. I think the advantages of simplifying the model to a single level, with the capabilities this provides for supporting computerized genealogy in nearly any context, outweigh the disadvantages of documenting and explaining these capabilities.

ttwetmore commented 13 years ago

I would prefer getting rid of the Persona and letting Person be the evidence level person object as well. Then every Person record can have these "inner person" properties. (I have wondered, looking at the current model, how Persons get linked to the Personas they are based on -- I'm still not sure how that happens.)

stoicflame commented 13 years ago

I've updated the issue to include the more detailed comments submitted by @ttwetmore. My response still to follow.

stoicflame commented 13 years ago

GEDCOM X supports an "n-tiered" model. This has been clarified in the documentation, particularly in the Developers Guide and in the new Source Reference Examples document.

The key to understanding how GEDCOM X supports an n-tiered architecture lies in (1) the way it references sources, (2) the fact that conclusion persons are designed to be cited as evidence, and (3) a clearer understanding of the nature and purpose of the record model.

Using @ttwetmore's description of an n-tiered model, GEDCOM X supports "persons at the leaves of the trees coming from evidence." These "leaf" persons take the form of conclusion persons that cite images, physical artifacts and/or indexed record data as evidence. "Persons in the interior representing the state of all intermediate conclusions a genealogist has made ... from the evidence/record Persons" are represented as conclusion persons that cite these "leaf" conclusion persons as evidence. For an example, see the section on "Citing Conclusions" in the Source Reference Examples.

I'm not yet familiar enough with the DeadEnds model to know how it's done, but I assume that there is a mechanism to support persons that cite evidence that is not other persons or conclusions. Examples of such evidence include online images, web pages, physical artifacts (books, journals), etc. It's important to understand that the GEDCOM X record model is designed to be another type of evidence that is a peer to the other types of "non-conclusionary" evidence. It's designed to model indexed record data. It's not designed to be the "leaves" of the n-tiered model, it's designed to be cited as (another type of) evidence for those leaves.

I think I'm suggesting that the DeadEnds model isn't designed to support indexed record data, so the record model doesn't "fit" in the comparison between GEDCOM X and DeadEnds. I hope what I'm saying is understood; I'm not saying that the DeadEnds model is deficient in any way; I'm just saying it addresses a different set of requirements than does the record model.

ttwetmore commented 13 years ago

There is some misunderstanding, since it is the Personas that are at the leaves of the trees. I don't yet appreciate how the two models are connected. As I said in my original comments, many applications are exclusively record level or exclusively conclusion level. However, applications that wish to support the historical and research process must support both and must be able to seamlessly connect the two levels together. I think there is more here than has been addressed, but I'm putting it on the back burner. Are comments seen on closed items or am I whistling in the dark?

stoicflame commented 13 years ago

Are comments seen on closed items or am I whistling in the dark?

Nope, comments are still seen and the discussion still continues even after the issue is closed. And I'd be happy to reopen it if some things still need to be addressed. I only closed it because I think the documentation I added met the task to "clarify support for n-tiered implementation." If more still needs to be done, I'll reopen.

There is some misunderstanding, since it is the Personas that are at the leaves of the trees.

So maybe some vocabulary normalization and metaphor clarification is needed? I understand that in an n-tiered model, the "leaves of the trees" can be called "personas". But that doesn't mean that the data structures that support an n-tiered model have to be called "personas," right? Last time I looked, there's no type called "persona" in the DeadEnds model, is there? Even though the DeadEnds model also supports an n-tiered model.

Another question: if the "leaves of the trees" are called "personas" what do you call the evidence that supports the leaves of the trees such as the historical records and other artifacts? And couldn't one make an argument for adjusting the metaphor to state that these historical records and artifacts are actually the "leaves of the trees"?

ttwetmore commented 13 years ago

There are no Persona records in DeadEnds because the N-tiers all use the same objects. So evidence persons are just regular ole Person objects.

The evidence that supports the evidence persons at the leaves of the trees is anything that can be referred to by a SourceReference, presumably usually Records (in your sense of the word). You will see that Person objects in DeadEnds can have any number of SourceReferences pointing off to evidence.

Also in the DeadEnds model you will see that each Person object can have any number of PersonReferences. This is how the trees are implemented. The PersonReferences point to the lower level Person objects that are being grouped together into the higher level Person, so the higher level Person represents the conclusion that the lower level Person objects all refer to the same human being. Note that this ability is useful for any software supporting genealogical research. Just imagine New Family Search with the added capability of joining person level objects together into an even higher level person level object and you will see the picture.
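The tree structure described above can be sketched in a few lines. This is a minimal illustration, not the DeadEnds implementation: the `Person` class, its `source_refs` and `person_refs` fields, the helper functions, and all sample names and sources are hypothetical stand-ins for the SourceReferences and PersonReferences mentioned in the comment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """Hypothetical single Person type used at every tier."""
    name: str
    # Pointers to evidence (stand-in for SourceReferences).
    source_refs: List[str] = field(default_factory=list)
    # Lower-level Persons grouped into this one (stand-in for PersonReferences).
    person_refs: List["Person"] = field(default_factory=list)


def leaves(person: Person) -> List[Person]:
    """Return the evidence-level Persons at the bottom of the tree."""
    if not person.person_refs:
        return [person]
    found: List[Person] = []
    for child in person.person_refs:
        found.extend(leaves(child))
    return found


def depth(person: Person) -> int:
    """Number of tiers from this Person down to its deepest leaf."""
    if not person.person_refs:
        return 1
    return 1 + max(depth(child) for child in person.person_refs)


# Made-up evidence-level persons, each citing one made-up source.
census = Person("John Smyth", source_refs=["census page"])
baptism = Person("Jno. Smith", source_refs=["parish register"])
will = Person("John Smith Sr.", source_refs=["probate file"])

# An intermediate conclusion, then the current root conclusion.
intermediate = Person("John Smith", person_refs=[census, baptism])
root = Person("John Smith", person_refs=[intermediate, will])

print([p.name for p in leaves(root)])
print(depth(root))
```

Each higher-level `Person` records the conclusion that its referenced lower-level persons are the same human being, so the intermediate tiers survive alongside the root.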

The only question in my mind is whether the N-tier trees will be useful enough to be used. I think they will be and my DeadEnds research software will use them. I believe I am the only one advocating the N-tier approach, but then again, I am probably the only person who has written complex person grouping algorithms recently. When I worked for Zoom Info I designed all their combining algorithms. These were the algorithms that took 100s of millions of "mentions" of persons (extracted by natural language processing while crawling the web), and had to combine them down to 10s of thousands of conclusion persons representing real persons living and working in the English-speaking world. The 100s of millions of mentions are what you would call Personas, and the 10s of thousands of conclusion persons are what you would call Persons. To do the combining I had to work through many different phases, each of which would build another layer in the trees by using specific criteria for combining.
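As a toy illustration of the phased combining described above, the sketch below groups "mentions" in two passes, each pass adding one layer to the trees. It is not the algorithm used at Zoom Info; the `Node` class, the `combine` function, both key functions, and the sample data are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical person node; leaves are raw mentions."""
    name: str
    employer: str
    children: List["Node"] = field(default_factory=list)


def combine(nodes: List[Node], key: Callable[[Node], str]) -> List[Node]:
    """One phase: group nodes that share a key under a new parent node."""
    groups: Dict[str, List[Node]] = {}
    for node in nodes:
        groups.setdefault(key(node), []).append(node)
    result: List[Node] = []
    for members in groups.values():
        if len(members) == 1:
            result.append(members[0])  # nothing to merge in this phase
        else:
            first = members[0]
            result.append(Node(first.name, first.employer, children=members))
    return result


def exact_key(node: Node) -> str:
    """Phase 1 criterion: identical name and employer."""
    return f"{node.name}|{node.employer}".lower()


def initial_key(node: Node) -> str:
    """Phase 2 criterion: same first initial, last name, and employer."""
    return f"{node.name[0]}|{node.name.split()[-1]}|{node.employer}".lower()


mentions = [
    Node("Ann Lee", "Acme"),
    Node("Ann Lee", "Acme"),
    Node("A. Lee", "Acme"),
    Node("Bob Ray", "Initech"),
]

layer1 = combine(mentions, exact_key)   # merges the two exact duplicates
people = combine(layer1, initial_key)   # merges "Ann Lee" with "A. Lee"
print(len(people))
```

Keeping the `children` lists is what preserves the intermediate tiers; an application that discards them ends up with only the original mentions and the final persons.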

Of course, this completely leaves open the question of whether the N-tiered necessities I had should be simply considered the problems of the application, and not of concern to the data model. Certainly in my application, we didn't have the facilities or the interest to keep around the trees once I had built them all. Just like New Family Search, at the end we were left with the original mentions and the final persons who all referred to the mentions that made them up.

However, if we want to be able to represent the "state of research," evidence, conclusions, and all, in our external data, then we do have to get that ability into the data model. Thus my belief that the model itself should be able to support an N-tiered system.

But I hope you can see how truly trivial it is to provide support for the N-tiered system. All you have to do is take any "normal" model and allow the Person objects to have a list of PersonReferences to the lower level Persons. Applications that don't support the ability simply never use these structures, so they never appear in the data. Basically a 1-tier system transforms into an N-tier system by this one change, which would be invisible and take up no space for applications that choose not to use it. And of course if you decide to keep the GEDCOM X model a 2-tiered system, I think you couldn't do any better than combine your Persona and Person objects into the same object type and insert PersonReferences in them. Then your GEDCOM X applications can be just as 2-tiered as you are anticipating them to be, but they are also capable of shifting into N-tiered mode for any application that chooses to take advantage of the power.
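The claim that the extra list is "invisible" to applications that ignore it can be sketched as follows. This is an assumption-laden illustration: the `Person` class, the `personRefs` key, and the `to_dict` method are hypothetical and belong to neither DeadEnds nor GEDCOM X.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Person:
    """Hypothetical Person with an optional list of lower-level person ids."""
    id: str
    name: str
    person_refs: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Serialize, omitting the reference list when it is empty."""
        data: Dict[str, Any] = {"id": self.id, "name": self.name}
        if self.person_refs:
            data["personRefs"] = list(self.person_refs)
        return data


# A 1-tier application never sets person_refs, so nothing extra is written.
flat = Person("p1", "Mary Jones")
# An N-tier application records which lower-level persons were grouped.
grouped = Person("p3", "Mary Jones", person_refs=["p1", "p2"])

print(json.dumps(flat.to_dict()))
print(json.dumps(grouped.to_dict()))
```

Because the field defaults to an empty list and is dropped on output, data written by a 1-tier application looks the same as it would under a model without the field.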

Note that I think the best way to let the tiers connect to one another is through these PersonReferences. This is why I asked you before how your Record and Conclusion tiers get connected. I wanted to know how the Persons at the Conclusion level know which Personas at the Record level they are constructed from. For me that is a basic need, and I don't yet see how the current GEDCOM X model accommodates it.

stoicflame commented 13 years ago

There are no Persona records in DeadEnds because the N-tiers all use the same objects. So evidence persons are just regular ole Person objects.

Yes, and I'm asserting that's the way it is with GEDCOM X, too.

The evidence that supports the evidence persons at the leaves of the trees are anything that can be referred to by a SourceReference, presumably usually Records (in your sense of the word). You will see that Person objects in DeadEnds can have any number of SourceReferences pointing off to evidence.

Nice. So the two models are aligned here, since that's the way GEDCOM X does it as well.

These were the algorithms that took 100s of millions of "mentions" of persons (extracted by natural language processing while crawling the web), and had to combine them down to 10s of thousands of conclusion persons representing real persons living and working in the English-speaking world.

Wow. Very cool indeed.

This completely leaves open the question of whether the N-tiered necessities I had should be simply considered the problems of the application, and not of concern to the data model.

Well said. I totally agree.

Certainly in my application, we didn't have the facilities or the interest to keep around the trees once I had built them all. Just like New Family Search, at the end we were left with the original mentions and the final persons who all referred to the mentions that made them up.

Yes, for good or for ill, depending on who you ask.

However, if we want to be able to represent the "state of research," evidence, conclusions, and all, in our external data, then we do have to get that ability into the data model. Thus my belief that the model itself should be able to support an N-tiered system.

Agreed.

But I hope you can see how truly trivial it is to provide support for the N-tiered system. All you have to do is take any "normal" model and allow the Person objects to have a list of PersonReferences to the lower level Persons.

Yes, I think I understand. But I can't tell if you understand that those Person references to lower-level Persons exist in the GEDCOM X model. They're there. Here's what they look like in serialized JSON:

{
  "person" : {
    "sources" : [
      {
        "type" : "person",
        "resource" : "person/12345"
      }
    ]
  }
}
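To show how an application might follow such references, here is a minimal sketch that reads documents of the shape shown above and walks the person-typed source references. The in-memory `STORE`, the resource paths, the `"image"` source type, and both helper functions are assumptions made for illustration; they are not part of any GEDCOM X library.

```python
import json
from typing import Dict, List

# Hypothetical store of serialized persons keyed by resource path.
STORE: Dict[str, str] = {
    "person/1": json.dumps({"person": {"sources": [
        {"type": "person", "resource": "person/12345"},
        {"type": "person", "resource": "person/67890"},
    ]}}),
    # A "leaf" person citing non-person evidence (source type is assumed).
    "person/12345": json.dumps({"person": {"sources": [
        {"type": "image", "resource": "image/1"},
    ]}}),
    "person/67890": json.dumps({"person": {"sources": [
        {"type": "person", "resource": "person/555"},
    ]}}),
    "person/555": json.dumps({"person": {}}),
}


def person_refs(doc: str) -> List[str]:
    """Return the resources of all sources whose type is "person"."""
    person = json.loads(doc)["person"]
    return [s["resource"] for s in person.get("sources", [])
            if s.get("type") == "person"]


def supporting_persons(resource: str, store: Dict[str, str]) -> List[str]:
    """Collect every lower-level person reachable from `resource`."""
    seen: List[str] = []
    stack = [resource]
    while stack:
        current = stack.pop()
        for ref in person_refs(store[current]):
            if ref not in seen:
                seen.append(ref)
                stack.append(ref)
    return seen


print(sorted(supporting_persons("person/1", STORE)))
```

Sources of other types are skipped by the walk, which matches the point that person references are one kind of source reference among several.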

Applications that don't support the ability simply never use these structures, so they never appear in the data. Basically a 1-tier system transforms into an N-tier system by this one change, that would be invisible and take up no space for applications that choose not to use it.

I agree completely.

And of course if you decide to keep the GEDCOM X model a 2-tiered system, I think you couldn't do any better than combine your Persona and Person objects into the same object type and insert PersonReferences in them.

What I'm saying is that we don't need any change to "Persona" to support an N-tiered model. Like you said earlier, our "Persona" is the same as what you call "Mention". It's not designed to be more than that. So I suggest we leave the discussion about renaming "Persona" to the other thread; "Persona" is outside the scope of discussion on an N-tiered model.

Then your GEDCOM X applications can be just as 2-tiered as you are anticipating them to be, but they are also capable of shifting into N-tiered mode for any application that chooses to take advantage of the power.

This is where I'm thinking you don't quite understand what I'm asserting: the "Person" object in the conclusion model can already support references to other "Person" objects in support of an N-tiered model.

Note that I think the best way to let the tiers connect to one another is through these PersonReferences.

Agreed.

This is why I asked you before how your Record and Conclusion tiers get connected. I wanted to know how the Persons at the Conclusion level know which Personas at the Record level they are constructed from. For me that is a basic need, and I don't yet see how the current GEDCOM X model accommodates it.

I'm not sure how to better describe it other than what's already in the Developers Guide and in Source Reference Examples. What else do you need?

ttwetmore commented 13 years ago

Thanks for the example here, Ryan. Yes, I now see how the references to persons are done. So GEDCOM X does support trees of Person objects where the final leaves are Personas. Very nice. In the DeadEnds model these references are called out as specific person references, whereas in GEDCOM X the person references are simply a subtype of source references. I think both ways of handling the concept are equivalent.

The only difference between GEDCOM X and DeadEnds, at least in this context, is that DeadEnds doesn't have a separate Persona data object for the leaves of the trees, and uses the same Person object there. That is, at a larger scope, DeadEnds doesn't formally separate the Record Model from the Conclusion Model, and uses the same data objects for both. I do believe that it is not a good idea to be so strict in defining two models in which there is near complete overlap in important concepts (person, event, source, relationship). The only thing unique about the Record Model is that it holds extracted evidence, and that you have chosen to contain the extracted evidence by using a record object to more or less collect together the set of references to other objects that come from the same source. Fine idea, but not big enough to force the other objects into their current dualities.

Here is the DeadEnds Person object in Google protocol buffer specs. Equally suited to both Record and Conclusion models. Note references to event objects, other person objects, and relationship structures. The source references are inside the thing called an InfoStructure. So all concepts of names, gender, events, relations, attributes, sources and notes are also unified in a single model.

message PersonMessage {
  optional UUIDValue recordId = 1;            // Record id of this person.
  repeated NameStructure names = 2;           // Names of this person.
  required GenderValue sex = 3;               // Gender of this person.
  repeated EventStructure vitals = 4;         // Event structures of this person.
  repeated UUIDValue eventIds = 5;            // Event records of this person.
  repeated RelationStructure relations = 6;   // Relationships this person has with other persons.
  repeated UUIDValue personIds = 7;           // References to sub-persons this person is based on.
  repeated AttributeStructure attributes = 9; // All other attributes of this person.
  optional InfoStructure info = 10;           // Sources, media and notes about this person.
}

stoicflame commented 13 years ago

I do believe that it is not a good idea to be so strict in defining two models in which there is near complete overlap in important concepts (person, event, source, relationship).

And believe me, you're not alone.

Those who disagree would take issue with the statement "near complete overlap". The architects and engineers who build the extraction tools for FamilySearch are firmly in this camp, as am I. The differences between the model needed for extraction and the model needed for conclusions are significant enough that merging the two would violate established design principles, notably the Liskov substitution principle (one of the SOLID principles) and separation of concerns. I expect to find the time soon to formally document the rationale for the record model to give some more detail about this.

But, like I said, there are many people (including yourself) that very validly differ in their opinions.

ttwetmore commented 13 years ago

Ryan ... I've had a chance to say my piece and thanks for listening.

Just one point. I'm an outsider, which means I thought about GEDCOM, and think about GEDCOM X from the point of view of a person needing to implement genealogical applications, especially applications that support a full genealogical research process. So from my point of view I have strong ideas about the best form a model would take to fit in with such applications. And it is this perspective that is behind the ideas I have put into the DeadEnds model.

I do understand that the engineers building the FS extraction tools have the best viewpoint on how to extract record data and put it in your internal formats for your tools. My assumption is that this data goes into large relational databases. But even if that is not the case, does that mean the internal format best suited for extracting data is the format to be used as the file format for general genealogical data interchange?

What I am trying to politely ask is whether you can truly think of GEDCOM X as the format designed to best foster data interchange within the world of genealogical applications, when your concerns must be so weighted by the needs of your organization to support all its major FS applications? Must you necessarily equate the data formats you use internally in your applications to the data exchange format that would be the best as a standard for external genealogical applications? Where is the boundary between the internal software and data architectures of FS applications, and an external standard to be used for genealogical data interchange between all types of genealogical applications?

I don't have answers to these questions; I just want to be sure you get an honest opinion from at least one outsider. Wonderful as you guys are, you are the Microsoft in this industry, and you have very big feet. What you say goes for a very long time.

And obviously I am still in the camp of those who believe there is significant overlap between the two models and that the same key object classes should occur in both! Though I will read your references!

ttwetmore commented 13 years ago

Ryan,

Just finished reading your references. As you might predict, there is nothing in there that makes me change my mind. For me a Person object is "what is known or inferred about a human being from information available at some point in a research process."

And I see your view as well. A Persona is "the information about a human being that can be extracted directly from a single item of evidence," and a Person is "information about a human being collected and inferred from a variety of items of evidence."

For me the difference is small, not even needing subtypes, whereas for you there is such a significant difference that different classes/types are needed. So we both seem justified when considered in the light of our differing definitions!!