FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Merge Record and Conclusion Models #138

Closed: joshhansen closed this issue 12 years ago

joshhansen commented 12 years ago

Executive Summary

The record and conclusion models actually model the same domain but have been artificially separated. Thus, GedcomX isn't even interchangeable with itself in spite of trying to be a widely useful interchange format. For almost all classes in the record model there is a corresponding, parallel class in the conclusion model. Merging these parallel classes together into a single model will make GedcomX easier to understand, easier to implement, and more powerful as a representation of genealogical data. In the remainder of this issue, I explain what's wrong with the current situation, why it's so harmful, and how the problem can be resolved.

What's Wrong

First, some quotes for context:

"The GEDCOM X record model provides the data structures that are used for defining the content of an artifact."[link]

and

"The GEDCOM X conclusion model provides the data structures that are used for making genealogical conclusions."[link]

As it stands, the GedcomX record model does not model the "content of an artifact." A record model that actually modeled "the content of an artifact" would do things like specify its dimensions, textual transcription, identifying marks, etc. Instead, the record model represents conclusions drawn from the content of an artifact. For example, the claim "John Smith was born 1 Jan 1930", though supported by the contents of Mr. Smith's birth certificate, is a conclusion a researcher drew based on that certificate. Conclusions such as this that are made on the basis of artifact contents are just another kind of conclusion.

In its current form, GedcomX tries to model one aspect of conclusion metadata (whether or not a fact was concluded based on the contents of a document or artifact) not by allowing for this to be represented in the metadata classes themselves, but rather by duplicating the entire set of data classes and metadata classes and declaring that metadata dealing with this new set represent conclusions drawn directly from a document. As a result, the record and conclusion data models are two separate but almost exactly parallel models of the same domain. The distinction upon which this duplication is justified is essentially arbitrary, treating a special kind of conclusion as if it were so distinctive that it must be modeled as an entirely different domain.

Why It's Harmful

The model duplication that exists in the current GedcomX specification adds to user confusion ("What's the difference between a person and a persona?"), complicates the task of implementing the standard (twice as many entities to represent), and reduces the utility of data represented using GedcomX (a persona transcribed from a record is not necessarily comparable to a corresponding person in a pedigree, even if they actually represent the same individual).

Resolution

Instead of making a complete copy of the data and metadata classes, this distinction can be much more parsimoniously modeled by simply enriching the metadata model. I propose modeling the genealogy domain as a set of core entity types (person, place, event, date/time, document, etc.) and a vocabulary for making statements about such entities (e.g. person X was born in place Y), combined with a metadata vocabulary for justifying these statements, recording the reasoning behind them, and showing who exactly is making the claims (e.g. researcher A claims/asserts/believes that person X was born in place Y because of evidence found in document Z). This lends itself to a two-part model, one for making statements about the core entities (data), another for making statements about those statements (metadata).

Rather than embedding Facts within the entities they are about, a general Fact class should be created that can represent claims of fact about any entity type. For example, in Turtle syntax:

# Prefix declarations so this parses as standalone Turtle; the ":" namespace is just a
# placeholder for the example, not an official GedcomX URI.
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix :     <http://example.org/gedcomx#> .

# Subclassing rdf:Statement gives us subject, predicate, and object
# properties by which any RDF statement can be represented
:Fact rdf:type owl:Class ;
    rdfs:subClassOf rdf:Statement .

:assertedBy rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range :Person .

# This property can point to anything -- a document, or a literal string with the researcher's explanation
:supportedBy rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range rdfs:Resource .

:subFactOf rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range :Fact .
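
For illustration, a hypothetical instance of the classes above (all of the specific URIs and the :birthPlace predicate are made up):

# Researcher A claims that person X was born in place Y, supported by document Z.
:fact1 rdf:type :Fact ;
    rdf:subject :personX ;
    rdf:predicate :birthPlace ;
    rdf:object :placeY ;
    :assertedBy :researcherA ;
    :supportedBy :documentZ .

The same vocabulary covers both a conclusion drawn directly from a record (the role the record model gives to personas) and a higher-level conclusion; only the resource named by :supportedBy differs.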

(A similar approach is described in my RootsTech Family History Technology Workshop paper.)

An appropriate resolution to this issue would involve either 1) merging the record and conclusion models and perhaps refactoring the result into data and metadata models, or 2) giving a convincing argument for why the current state of affairs is necessary, including specific use cases that could not be modeled using a single merged model. The burden of evidence that would justify the duplication of a substantial subset of the GedcomX vocabulary is, in my opinion, fairly high, given the cost in user confusion, implementation difficulty, and data format utility mentioned above.

See Also

Issue #131 "A Persona IS a Person" Conclusion Record Distinction

jralls commented 12 years ago

GEDCOM X's task is therefore to determine the bare minimum that can be considered necessary for a 'good' application to support.

I'd say rather that GedcomX's task is to determine, and provide for, the maximum variation of data to be transferred. Applications don't have to fill in all the blanks, but if there aren't blanks for what they want to transfer they will either leave it out or invent non-standard extensions as they have with GEDCOM.

EssyGreen commented 12 years ago

GedcomX's task is to determine, and provide for, the maximum variation of data to be transferred

I would agree providing this doesn't result in ambiguity. It must be clear what each piece of data, each link represents. Otherwise it can appear flexible but actually be so ambiguous as to be meaningless ... My main example here would be the over-extensive use of NOTEs in old GEDCOM which allow citations within NOTEs within Citations within NOTEs ad infinitum ... at the end of the chain who can work out what the heck is being said about what and in what context?

ttwetmore commented 12 years ago

I agree with @jralls that GEDCOMX should cover a superset, and so disagree with @lkessler that it should cover a subset of the features to be provided by an application program. I don't believe that "over"-standardization limits the differentiation of products.

Of course there is a problem inherent in superset standards -- how does a subset application import data from applications that use different subsets of the standard? The problem of importing N-tier data into an application that doesn't support it is one that could possibly render the application or GEDCOMX defunct. We all know this. It's a problem inherent in standards. It can't be avoided. Should GEDCOMX give up because of it? If adding N-tier ideas to GEDCOMX is deemed too high a risk, we should leave it out.

EssyGreen commented 12 years ago

Problems of importing N-tier data to an application that doesn't support it

Can I just reiterate that it is your particular version of N-tier that I think is the subject here. Applications may well support N-tier but in a different way to that envisaged by you. GEDCOMX should (and does) enable what I'll call "multi-tiers", to differentiate them. N-tier (a la @ttwetmore) is the one which I personally think will cause the additional complexity.

lkessler commented 12 years ago

As Tom says, providing "the maximum variation" requires every software vendor to support the maximum variation - otherwise some data will not transfer properly.

The more complex the standard, the less likely every vendor will be willing to follow it, the more likely they are to leave parts out of it, the more likely they are to make mistakes in implementing it, and the more likely they are to misinterpret it and thus implement it improperly.

Not only that, they must all find some way of correctly translating their data structures to the new standard and back. Every added complexity makes this significantly more difficult.

Is it not two of the primary goals of a new standard that all vendors adopt the standard and that all data transfer properly?

Louis

jralls commented 12 years ago

Is it not two of the primary goals of a new standard that all vendors adopt the standard and that all data transfer properly?

Can't transfer all the data unless there's a place for all of the data.

EssyGreen commented 12 years ago

The more complex the standard, the less likely every vendor will be willing to follow it, the more likely they are to leave parts out of it, the more likely they are to make mistakes in implementing it, and the more likely they are to misinterpret it and thus implement it improperly.

Not only that, they must all find some way of correctly translating their data structures to the new standard and back. Every added complexity makes this significantly more difficult.

Is it not two of the primary goals of a new standard that all vendors adopt the standard and that all data transfer properly?

+1

lkessler commented 12 years ago

John:

Yes. We need the minimally sufficient standard that will handle all the data.

Louis

ttwetmore commented 12 years ago

@lkessler I agree with you. I don't have a glib answer. I hope that GEDCOMX won't be just a GEDCOM tweak, but if it has any significant feature beyond what GEDCOM is today, it would seem to fall prey to the same argument. One could see the argument as justification not to move forward. Wouldn't that be unfortunate?

ttwetmore commented 12 years ago

@EssyGreen

Can I just reiterate that it is your particular version of N-tier that I think is the subject here. Applications may well support N-tier but in a different way to that envisaged by you. GEDCOMX should (and does) enable let's call them "multi-tiers" to differentiate. N-tier (a la @ttwetmore) is the one which I personally think will cause the additional complexity.

I am very interested in approaches that would support N-tiers better than I have been able to imagine it. Would you mind describing these alternative N-tier approaches you envisage? Would you mind pointing out where GEDCOMX currently supports multi-tiers? I haven't yet been able to see how GEDCOMX supports the basic 2-tier linkage between persons and personas.

EssyGreen commented 12 years ago

I am very interested in approaches that would support N-tiers better than I have been able to imagine it. Would you mind describing these alternative N-tier approaches you envisage?

I am not pretending to envisage it better than you - just differently. I already gave an explanation of how this could be achieved in #149:

by creating new trees/files for the different possibilities which can then be used/referenced as sources if/when a conclusion is reached in the original file. Since with GEDCOMX we will now be able to handle recursive sources I don't see why we should add the complexity into the base model.

The only requirement needed for this is that Personas can be linked to Persons - which I'm assuming is coming anyway (although I agree it hasn't emerged in the code yet) or the Personas in the Record Model become somewhat useless.

ttwetmore commented 12 years ago

Concerning the view that my N-tier approach is complex, its entire impact on a model is the addition of a single 1-to-many relationship to the person record. Could there possibly be a simpler approach? Is there a concern about this? Or does the concern relate to the user interface? What am I missing?

ttwetmore commented 12 years ago

@EssyGreen

The only requirement needed for this is that Personas can be linked to Persons - which I'm assuming is coming anyway (although I agree it hasn't emerged in the code yet) or the Personas in the Record Model become somewhat useless.

My solution has always been to link Persons to multiple Persons (I leave Personas out as an extraneous concept). This supports 1-tier, 2-tier, 3-tier, ..., N-tier structures. We seem to be in near agreement. But now I'm wondering what in the world you thought I was proposing!

joshhansen commented 12 years ago

Well, it's both gratifying and frightening to see the issue I filed take on such a life of its own! Great discussion, interesting viewpoints, lots of enthusiasm to get things right.

I'm sure there are a million thoughts I could share, but here are the main ones:

  1. My understanding is that, as a result of this issue and subsequent discussion, @stoicflame now plans to take the record model out of GedcomX, but to continue developing it internally at FamilySearch and with FamilySearch's partners. (Correct me if that's not what you said.) I feel that this would be a serious mistake. All of the arguments I made for why the Record/Conclusion model duplication is a Bad Thing still apply, except that now half of the model is hidden from the public. That means there will be parts of the genealogical research process that the GedcomX model is unable to model effectively (e.g., document transcription). There will still be duplication of effort, the need to maintain two toolsets, the creation of data in two incompatible formats, and the need to convert between them.
  2. If we really want GedcomX to be able to model genealogical research and conclusions, we must have a mechanism for modeling assertions. Though the GDM seems to be much-vilified around here, there was a reason the professional genealogists wanted to model assertions. There must be an entity representing an assertion of fact in order for the reasoning that led to a particular conclusion to be meaningfully modeled. Assertions also provide a unified mechanism for citing sources and giving attribution for any statement. Right now GedcomX lets Researcher A make an assertion and say "I'm fairly confident about this." But can Researcher B chime in and say, "Actually, I disagree with Researcher A's assertion"? Can he make an assertion and indicate that it derives from the assertion of somebody else? Unless I'm misunderstanding the current model, this sort of back-and-forth between researchers isn't possible. That's unfortunate, because I think it could unlock a social dynamic by which assertions are made, then evaluated, then revised, etc., until consensus emerges. I'm certainly not saying GedcomX needs to reproduce the Assertion stuff from the GDM. But we need some sort of assertion, and there is a general and (in my opinion) elegant way to do this if we were to commit to RDF as a data model and use "reification" to make statements about statements (see the sketch after this list). It's also vital that the evaluation of assertions be separate from the assertions themselves, so that more than one person can render judgment on the merits of a particular argument.
  3. @ttwetmore and @jralls both complain of the GDM's "over-normalization". I agree that the GDM is a nasty mess, but let's not make normalization into the villain in our data modeling approach. Normalization isn't just a way of avoiding redundancy or database update anomalies. It also facilitates extensibility. For example, if GedcomX always records names as strings then there's no possibility of richer name representations being introduced without shoehorning them into the string format and getting people to understand your new encoding. But if Name is factored out as its own class (as happily it is in GedcomX), then other types of Name can be introduced as subclasses (or as additional properties of Name) in a way that keeps the usual semantics of Name intact, but also provides additional information. If something isn't modeled as a first-class entity, it becomes much harder to make statements about it specifically.
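
To make the reification idea concrete, here is a rough Turtle sketch that builds on the Fact class from my original proposal above (all of the specific URIs and the :hasStatus / :disputed terms are hypothetical):

# Researcher A's assertion, with its source.
:fact1 rdf:type :Fact ;
    rdf:subject :johnSmith ;
    rdf:predicate :birthPlace ;
    rdf:object :boston ;
    :assertedBy :researcherA ;
    :supportedBy :birthRecord27 .

# Researcher B's evaluation is itself a Fact whose subject is researcher A's fact,
# so the judgment stays separate from the assertion and any number of researchers
# can weigh in on it.
:evaluation1 rdf:type :Fact ;
    rdf:subject :fact1 ;
    rdf:predicate :hasStatus ;
    rdf:object :disputed ;
    :assertedBy :researcherB ;
    :supportedBy "The 1850 census places this family in Salem, not Boston." .

A derived assertion could point back at the assertion it derives from in the same way, for example via :subFactOf or :supportedBy.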

ttwetmore commented 12 years ago

  1. I don’t believe there are two models to separate. I believe they should simply be merged. Some person records, if they hold the evidence taken from a single item in a source, play the role of persona. Other person records, if they hold conclusions made from many items of evidence, are the person records intended by the Conclusion Model. The GEDCOMX model must be able to hold personas. If removing the Record Model implies losing the persona concept, it’s a mistake.
  2. A persona-level person record contains a reference to its source. That reference need not be a pure pointer; it can have attributes for surety, and attributes for location in source, etc. Therefore the triple made up of a persona, its source reference, and its source record completely specifies its assertion and its citation. There is no need for a separate concept object called an assertion. We have it covered. Stuff about RDF and reification is, IMHO, overkill for a genealogical data model. If anyone thinks that the N-tier concept is too esoteric for application developers and users, try to get them to understand and use assertions about assertions. (A rough sketch of the persona / source reference / source triple follows below this list.)
  3. The data model, the databases used for backing stores, the external file formats used for external archives and transport, and the software objects used by running software, are the four major ways that genealogical entities show up in computer representations. None of these require normalization. I don’t believe normalization facilitates extensibility. However, an unnormalized document-based database (e.g., MongoDB) does facilitate extensibility because it is formally schema-less, though it still has all the advantages of a database with schemas. Such a database can support indexing and querying just as effectively and possibly with better performance than a classic normalized RDBMS.
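
In the Turtle notation used earlier in the thread, the persona / source-reference / source triple from point 2 might be sketched like this (every class and property name here is hypothetical, assuming the usual rdf/rdfs prefixes and ":" as a placeholder namespace):

# The persona holds the evidence about one person taken from a single item in a source.
:persona17 rdf:type :Person ;
    :name "Jno. Smith" ;
    :sourcedBy :ref17 .

# The source reference is not a bare pointer; it carries surety, location in source, etc.
:ref17 rdf:type :SourceReference ;
    :refersTo :census1880page12 ;
    :surety "3" ;
    :locationInSource "dwelling 108, line 23" .

:census1880page12 rdf:type :Source .
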
jralls commented 12 years ago

However, an unnormalized document-based database (e.g., MongoDB) does facilitate extensibility

+1

Genealogy is document-based. It does not lend itself to being chopped up into little pieces like accounting data.

Stuff about RDF and reification is, IMHO, overkill for a genealogical data model.

I don't know about overkill, but RDF is an implementation detail. We're still working on what to implement, and bringing in things like RDF now is just confusing.

Therefore the triple made up of a persona, its source reference, and its source record completely specifies its assertion and its citation

This I don't agree with. I think hanging all assertions/conclusions on the person object loses most of the context that drives genealogical analysis. For example, it's not terribly interesting that a person was enumerated in the census (guess what I was doing this afternoon). What's interesting is who else was enumerated in the household and who are the neighbors.

I don’t believe there are two models to separate. I believe they should simply be merged.

+10. I think that's the original proposal, eh?

jralls commented 12 years ago

If we really want GedcomX to be able to model genealogical research and conclusions, we must have a mechanism for modeling assertions. Though the GDM seems to be much-vilified around here, there was a reason the professional genealogists wanted to model assertions. There must be an entity representing an assertion of fact in order for the reasoning that led to a particular conclusion to be meaningfully modeled. Assertions also provide a unified mechanism for citing sources and giving attribution for any statement.

+1

Right now GedcomX lets Researcher A make an assertion and say "I'm fairly confident about this." But can Researcher B chime in and say, "Actually, I disagree with Researcher A's assertion"?

That's really hard, and probably beyond the scope of GedcomX. Even if you were to use a GedcomX file here on Github as your medium (coincidentally, Dick Eastman was motivated by the recent Wired article to comment on using Github for collaborative genealogy), you will still have to deal with the edit war problem.

Github offers a solution, of course, and we're using it now -- but that won't be captured in the GedcomX file itself, it will be in the issue discussion referenced in the Git change message.

Perhaps worthy of a separate issue.

lkessler commented 12 years ago

John said:

I think hanging all assertions/conclusions on the person object loses most of the context that drives genealogical analysis. For example, it's not terribly interesting that a person was enumerated in the census.

+1

EssyGreen commented 12 years ago

@ttwetmore

Concerning the view that my N-tier approach is complex, its entire impact on a model is the addition of a single 1-to-many relationship to the person record. Could there possibly be a simpler approach?

Infinite recursion is easy to model but difficult to make sense of. For example, as a user/researcher I would not want my name index of people I'm researching to be a list of every name in every source document - if I want that view I can simply list all Personas. To prevent this the application needs to reduce your ALIAses down to their roots to separate the key ones from the duplicates and that is virtually impossible to do given the infinite recursion. Similarly, it makes it extremely difficult to validate since your conclusions are spread amongst a multitude of Person fragments.

As a researcher I am trying to model the real world. To that end my Persons represent real people who I am researching and trying to come to conclusions on. They do not/should not represent fragmented bits of source data. The place for that is within the interpretation of each Source.

@joshhansen

@stoicflame now plans to take the record model out of GedcomX, but to continue developing it internally at FamilySearch and with FamilySearch's partners. (Correct me if that's not what you said.) I feel that this would be a serious mistake.

That is my impression too, but I can understand their need, and some things we just have to accept. If the target audience for the Record Model is the web-publishers then we can't really complain but, as consumers of the model, we can work out how it will impact research applications and how it can be used within them.

@ttwetmore

I don’t believe there are two models to separate. I believe they should simply be merged.

We have the Record Model. Signed, sealed, done thing (bar some tweaks). If an application doesn't find it useful then just treat it like any other media file that might be used as a source. (Personally I think that the Record objects are useful because they allow the researcher to interpret a source document into a "mini-tree" solely within the context of that source.)

What we should be fighting for (or not) here, is the retention of the Conclusion Model as a separate entity. Does the Record Model enable researchers to publish and/or exchange their research data? No. OK, since we can't influence the Record Model then we need a Conclusion Model (which may use/reference/include the Record Model).

EssyGreen commented 12 years ago

Right now GedcomX lets Researcher A make an assertion and say "I'm fairly confident about this." But can Researcher B chime in and say, "Actually, I disagree with Researcher A's assertion"?

That's really hard, and probably beyond the scope of GedcomX.

Actually, if you allow for derivative sources (see #136) then it can be done fairly easily, but the "edit war" is a problem (see #151).

ttwetmore commented 12 years ago

Infinite recursion is easy to model but difficult to make sense of.

A real database would be 1-tier and 2-tier 99% of the time and would be 3-tier essentially the rest of the time. Infinite recursion is nowhere near infinite, and is easy to make sense of.

For example, as a user/researcher I would not want my name index of people I'm researching to be a list of every name in every source document - if I want that view I can simply list all Personas.

An index is something you search when you need to find something; why would you not put the things you need to search for in your index? Wouldn’t you want to search for every name form a person might have been documented under so you can go immediately to the evidence with the names in that form?

To prevent this the application needs to reduce your ALIAses down to their roots to separate the key ones from the duplicates and that is virtually impossible to do given the infinite recursion.

There is no need to reduce anything to their roots, and infinite recursion has nothing to do with this. What you call virtually impossible are simple matching algorithms I’ve been writing for a decade. What you may not realize is that it is the recursion that makes this so simple to do. It makes the user interface easy, it makes the algorithms easy, it makes the conception of the model easy.

Similarly, it makes it extremely difficult to validate since your conclusions are spread amongst a multitude of Person fragments.

They are not spread out amongst a multitude of fragments. They are organized into a tight tree structure that exactly matches the decisions and conclusions you made in deciding which of your source records refer to each of your persons. Your conclusions are organized for you in the best possible manner. Writing proof statements in an N-tier system is a dream come true.

As a researcher I am trying to model the real world. To that end my Persons represent real people who I am researching and trying to come to conclusions on. They do not/should not represent fragmented bits of source data. The place for that is within the interpretation of each Source.

This is the argument that every person record in a database should represent a real individual. I’ve called this the conclusion-only argument for twenty years. This is also @lkessler’s anti-persona argument. He too wants to put the evidence in the source records. Conclusion-only desktop programs essentially stopped all advancement in genealogical software for twenty-five years. There are essentially no differences between desktop systems today. They all compete on the slickness of how you enter data, how they claim to support citations, whether or not you can add photos, whether or not you can tweet your relatives and whether or not you can inspect data from on-line services. Whoop-tee-doo. There has been little advancement in the support of the processes of actually doing genealogy during this time. Maybe my N-tier approach is not the answer, but continuing to stick to a conclusion-only model has a decades-old history of not being the answer.

EssyGreen commented 12 years ago

@ttwetmore

OK so if it's so easy then why, in the "20 years" you've been obsessed with your version of N-tier, haven't you yet written that killer app using existing GEDCOM - since it nigh as dammit supports exactly the type of thing you require with its ALIAs links?

ttwetmore commented 12 years ago

@EssyGreen Good question. I have written this application in a non-genealogical domain, where billions of person records are extracted via NLP from the world wide web, and then combined based on many properties, but primarily on their names, the companies they work for, the positions they hold at those companies, and their locations. Since there are billions of records in this application, I wrote matching algorithms that build up the N-tier structures automatically by finding efficient ways of dealing with the O(n-squared) comparison issues by using many different comparison and combination phases. As the billions of original records (personas in the genealogical terminology) are combined down to a few million business professional profiles (called conclusion persons in genealogical terminology), the N-tier structures that build up show the combination history. The properties of the final individuals, called the person's profile in this application, are automatically computed from the personas in the N-tier structure. In this application the N-tier structures can grow to over ten tiers, and celebrity persons (e.g., Bill Gates, who is mentioned throughout the web) may have over 100,000 "personas" in their final N-tier structure.

In the genealogical application one would not use algorithms to automatically combine the persona records into N-tier structures, but the combination algorithms would be converted to make high-likelihood suggestions of which personas match other personas or person-trees (shaking leaf algorithms), leaving it up to the user to accept the suggestions or not.

During the five years I worked on this problem, I implemented the solution three times, each time refining ideas. The first implementation was 2-tiered, written in C++, and used a highly normalized relational database. The 2-tiers lost all history of the combination, which made tuning the combination algorithms nearly impossible. The final implementation was N-tiered, written in Java, and used a document database with full text indexing. I wrote software to visualize the N-tier structures. The main purpose of the visualization was to aid me in tuning the combination phases. In a genealogical application the visualization would be used to help users manipulate their data (i.e., proceed with the genealogical research process).

To see the results of these algorithms see the website ZoomInfo and search for a few names of people you know in industry. Every profile you see is automatically generated on the spot from an N-tier structure of person records that the combination algorithms described above have built. This application is fully automatic. No human being ever creates or modifies these profiles.

I took the job at this company because I had been interested in the genealogical application of these ideas for a long time, and working for this company seemed the best way to get access to a bulk of data sufficient to truly test out the algorithmic ideas, and to experiment and refine those ideas (and get paid). I am now semi-retired and able to spend some time working on the purely genealogical applications of these ideas, which I call DeadEnds.

You can argue whether the ZoomInfo application is sufficiently similar to any problem in the genealogical domain that even talking about it makes any sense. I see that application as analogous to the genealogical research problem. Others may see no resemblance at all. But I would like to counter some of the concerns that an N-tier approach is conceptually or practically difficult to work with. If it can be made to work effectively in a world where there are billions of records, it can certainly be made to work in applications that use orders of magnitude fewer persona records.

lkessler commented 12 years ago

Tom,

Thank you for providing the background that provides the foundation of your thinking behind your N-tier persona-based system.

Let me say I'm very impressed, and I can see many applications for it, especially in artificial intelligence (which is another of my interests).

I can see it being used as an excellent way to get smart matches for people in large online databases, like Ancestry's "shaky leaf".

But in real life genealogy, I don't believe people want to follow chains of conclusions through persona to persona to get back to the source data. Doing that would properly document each step in a conclusion, but to understand the reasoning, every step must be followed and thought through individually.

I think instead, every conclusion needs linkage to all the source data (both supporting and contrary) that is used to come to that conclusion. This way, to interpret the conclusion, one need only do one evaluation of all the source data together that it references (i.e. the source data that is used as evidence to derive the conclusion).

Should a new item of source data come about, it could be simply added to the already linked source data and the conclusion can be revised if needed. If each "snapshot" of the conclusion is kept in a history file, then the history of how the current conclusion came about can be easily accessed.
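
In the Turtle style used earlier in the thread, that kind of linkage might be sketched as follows (the specific names, and the :contradictedBy property, are made up for illustration; the Fact vocabulary is borrowed from the opening post):

# One conclusion, linked to all of the evidence used to reach it, supporting and
# contrary, so it can be re-evaluated in a single pass when new evidence turns up.
:birthDateConclusion rdf:type :Fact ;
    rdf:subject :johnSmith ;
    rdf:predicate :birthDate ;
    rdf:object "1 Jan 1930" ;
    :supportedBy :birthCertificate, :census1940household ;
    :contradictedBy :gravestonePhoto .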

The other part I don't agree with in your model is your making everything N-tier at the persona level. Not all conclusions occur at the persona level. They also occur at the individual event and fact level, at the family event and fact level, and at the relationship level between parents and children, husbands and wives, and events and their witnesses. Your system would work if all we were trying to do was to identify conclusion people, but genealogical research does more than that.

Thank you for telling us about ZoomInfo. It definitely shows the sort of system in which your N-tier persona-based methodology can work, and work well. Maybe FamilySearch might want to implement it for the smart matching in their New Family Tree.

But I don't think the place for it is a new GEDCOM standard.

Louis

ttwetmore commented 12 years ago

@lkessler Thanks for your very kind words. I can certainly understand how the algorithms developed for the automatic business application might seem to have little application to non-automatic, user-driven genealogy, and if I am wrong about all this stuff, then they don't.

Note, however, that the only support required from GEDCOMX to allow the possibility of handling these N-tier person structures is a single person->person* relationship in the person record (sketched below). Small cost for future potential. As @EssyGreen pointed out, the ALIA tag of GEDCOM is sufficient for this, when used in a strict way.
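
In the Turtle notation used earlier in the thread, that single relationship and a 2-tier example might look like this (the :includesPerson name is purely illustrative, assuming the usual rdf/rdfs prefixes and ":" as a placeholder namespace):

# The one addition: a person record may point to the person records it was concluded from.
:includesPerson rdf:type rdf:Property ;
    rdfs:domain :Person ;
    rdfs:range :Person .

# A 2-tier example: the root is the conclusion person; the two records it points to
# are the evidence-level (persona) records.
:johnDoe rdf:type :Person ;
    :includesPerson :johnDoeCensus1880, :johnDoeBirthRecord .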

And if GEDCOMX does not support the idea, it is trivial to add in an updated version, if future brains deem it worthwhile.

I feel honored that other people deeply concerned about genealogical data models have been willing to read my ideas and comment cogently upon them.

lkessler commented 12 years ago

Tom,

I agree. One tag, like ALIA would handle the connections.

But all programs today assume the people being transferred are conclusion people. There would also need to be some indication that the personas are not conclusion people. Otherwise they may all be included in reports or indexes, and showing 40 people with the same name but all with slightly different information would be quite confusing.

If GEDCOMX wants to support this structure, then they'd have to make sure that programs not implementing it could still input data sets containing it, process the rest of the data their way, and then export their modified data along with the non-processed persona data so that the persona linkages are still valid.

I don't know how that can be guaranteed. What if a conclusion person is deleted? They'll lose the linkages to the 1st level personas, and those will all become top level.

And if a new person is added, they'll have no persona linkages, so the data will become incomplete.

And this is a tremendous example of the challenges GEDCOMX has. Any developer who includes some new data structure in his program will be challenged, no matter what standard is developed, to have other programs pass their data through properly.

Louis

EssyGreen commented 12 years ago

I think @lkessler said it all :) An impressive application but not what I personally would want to use as my genealogical research software.

the only support required from GEDCOMX to allow the possibility of handling these N-tier person structures is a single person->person* relationship in the person record. Small cost for future potential.

It might seem small but it is an unnecessary complication which will result in data loss, ambiguity and confusion. I maintain my point that the same could be done with the existing model by traversing the Person-Persona links rather than taking a short cut and omitting the (in my opinion) important Persona records.

ttwetmore commented 12 years ago

@EssyGreen

It might seem small but it is an unnecessary complication which will result in data loss, ambiguity and confusion. I maintain my point that the same could be done with the existing model by traversing the Person-Persona links rather than taking a short cut and omitting the (in my opinion) important Persona records.

You speak of data loss, ambiguity and confusion as if you understand how the N-tier approach causes them. Since major goals of the N-tier approach are specifically to prevent data loss, and to control ambiguity and confusion, all of which occur in a conclusion-only system, we are on different wavelengths. If you could explain how you see the N-tier approach causing these shameful things I would be interested in learning it.

I don't understand your comments about traversing person-persona links, short cuts or omitting important persona records. Can you explain the shortcuts you think I am proposing, and the important persona records I am proposing to ignore? My approach is usually criticized for keeping too many persona records, not for ignoring them!

I welcome criticisms of my proposals, since I learn so much from others' ideas, but it would be helpful if I could understand the criticisms well enough to reply. These comments seem so non-germane that I can't figure out what you are trying to say.

ttwetmore commented 12 years ago

@lkessler

But all programs today assume the people being transferred are conclusion people. There would also need to be some indication that the personas are not conclusion people. Otherwise they may all be included in reports or indexes, and showing 40 people with the same name but all with slightly different information would be quite confusing.

This is absolutely correct! And the criteria to decide is very simple. Any person record that is pointed to by a person record higher up in an N-tier structure is not, by definition, a conclusion person. Every person record that is not pointed to by a person higher up in an N-tier structure is, by definition, a conclusion person. These are fluid definitions that change as the user fiddles with the structures.

There is an interesting implication of this. Every newly added persona record is a conclusion person, even though we hope that it will eventually get placed into a growing structure. But this gives the user interface exactly what it needs to see -- all the structure roots and all the stand-alone person records represent the current “state of your research,” the proper set of persons to be visualizing.

Note that the user interface must also give easy access to seeing the contents of the N-tier structures, since the user must be able to reckon with the information at this level.

If GEDCOMX wants to support this structure, then they'd have to make sure that programs not implementing it could still input data sets containing it, process the rest of the data their way, and then export their modified data along with the non-processed persona data so that the persona linkages are still valid.

Certainly the GEDCOMX standard will have to explain this.

I don't know how that can be guaranteed. What if a conclusion person is deleted? They'll lose the linkages to the 1st level personas, and those will all become top level.

When a conclusion person is deleted, it was a root of an N-tier structure. All the person records one level down in that tier are suddenly transformed into conclusion persons. Isn’t this precisely what it means to remove a conclusion person? It means that you have decided that your earlier decision to bring together the data “below that person” into an individual was wrong. You want those persons below you to now re-enter into the research dance once more, to be combined in other ways that better represent your corrected conclusions.

And if a new person is added, they'll have no persona linkages, so the data will become incomplete.

Exactly! But they are not incomplete. They are simply stand-alone records. If they are legitimate conclusion persons they can remain in that state forever (they are simply 1-tier persons, perfectly legitimate in an N-tier system). If they are personas in the traditional sense they will sooner or later be placed into a structure under a conclusion person.

And this is a tremendous example of what the challenges GEDCOMX has. Any developer who includes some new data structure in his program will be challenged, no matter what standard is developed, to have other programs pass their data through properly.

Exactly. Change forces change. If the change is ultimately good then the pain caused by the change will be worth it. If not, not. But this is how progress progresses.

EssyGreen commented 12 years ago

@ttwetmore

This could go on and on and on endlessly. Can we just agree to disagree?

We have both made our arguments and ultimately it will be up to Ryan to decide.

I suspect you will get your Person->Person links simply because it is similar to GEDCOM ALIAses and because it is easy to see the benefits for the social-networking aspects of genealogy.

If so, you will be able to finalise your dream and actually utilise DeadEnds.

Personally, I will not be using it (either as a developer or as a genealogist).

ttwetmore commented 12 years ago

@EssyGreen I was hoping you would try to explain your latest criticisms since they make no sense to me, but I'm fine with just ending the discussion here.

EssyGreen commented 12 years ago

@ttwetmore

OK, I'd hate you to think I couldn't explain so here goes back into the fray:

Data loss - will occur when importing into a system which does not adhere to your specific implementation and yet needs/wants to ensure data integrity.

Ambiguity - will occur because it is not clear what the link implies - does it mean that Person A is proven to be Person B (in which case, where is the proof/evidence and why are they not condensed into a single Person representing the real person in the real world), or does it mean Person A looks like it might be Person B but needs further research (in which case, do I put the next bit of research against Person A or Person B)? Also, you argued elsewhere that the order of discovery was important, and hence Person A = Person B in your model is not the same thing as Person B = Person A, so does the link really mean "Person A (who was discovered first) is thought to be the same as Person B (who was discovered later)"? If a user then attaches the reverse link then this statement no longer makes any sense. Should the application allow this or not? (Rhetorical question - I'm just trying to explain the problem - you won't be there to give the answers when the developer has to make the decision)

Confusion - most (if not all) genealogists think of people in their tree as representing real-world people whose lives they are trying to re-construct. Your model has no such thing as a real-world person because the things representing that person's life are fragmented. I think most users would desperately miss being able to see their Persons as whole people.

Complexity - this comes from the confusion above since it would be the responsibility of the developer to pull your fragments together into a model which resembles the real world again. This would mean repeatedly iterating through all your Persons to try to establish which ones were the real/base ones whilst avoiding circular relationships. It's tricky but it can be done, and then we still haven't got to the end of it because we need to then merge, say, the Names (after all, a user would want to see that Freda Bloggs' maiden name was Smith). Again it can be done, but the application would be focusing all its energy on re-constructing. The re-construction should be the job of the researcher, not the application.

I firmly believe that genealogists are trying to re-create the real-world. So the primary objects should be modelled on the real world (ie Persons/People). Your model is a model of the interconnectivity of references to people in sources. That is not the same thing.

And this leads nicely back to the subject of this post ... I personally would use the Record Model to show representations of people as they were recorded in particular sources, and I would use the Conclusion Model to model the 'real-world' people that genealogists are trying to re-construct. The Person and Persona are the same objects (they are both representations of people) in different contexts (with different functional needs) but both are needed.

ttwetmore commented 12 years ago

@EssyGreen Thanks for taking the time. It is about time to let this drop, but since all your concerns are misplaced I'll make very quick responses to them:

Data loss -- criticism unfair -- it has nothing to do with the model, only its acceptance.

Ambiguity -- there is never ambiguity -- the sub-person relationship always means "believed to be the same person because of ...", where the "because of ..." is supplied by a conclusion statement or a proof statement.

Confusion -- the top-level person in a cluster always represents the conclusion person. In 99.9% of cases the data will be 1- or 2-tier, so exactly as today. The users of NFS have no trouble with 2-tier, because the UI makes it seamless.

Complexity -- unfair -- your criticism revolves around the assumption that developers are incompetent and some odd misconceptions that the model requires repeated activities and reconstructions and that circular relationships are difficult to prevent. Merging names? Never happens.

The N-tier model merges the record and conclusion models with the best features of both. I am sorry I have made that difficult to see.

EssyGreen commented 12 years ago

@ttwetmore - I have replied in #149 since I think this thread is getting swamped with N-tier when it is actually attempting to address a completely different issue. You already have 3 threads on N-tier so let's try to keep our debate in those rather than letting it bleed so profusely elsewhere.

stoicflame commented 12 years ago

@joshhansen

My understanding is that, as a result of this issue and subsequent discussion, @stoicflame now plans to take the record model out of GedcomX, but to continue developing it internally at FamilySearch and with FamilySearch's partners. (Correct me if that's not what you said.)

Actually, the plan was to put it in a separate--but public--project where its initial scope would be limited to bulk exchange of field-based record image extraction. I don't deny that--at FamilySearch--it might become the primary means of publishing derivative source information, but we don't have the resources to promote it as a broad industry standard right now.

So we'd like to focus first on getting the "core" project right and promoting it as a standard. The goal for this "core" project is to define a model and serialization format for exchanging the components of the proof standard as specified by the genealogical research process (see #141) in a standard way.

A lot of this is based on resource constraints. We've got hard requirements to meet some specific deadlines for the sharing of this field-based record data. And we have a limited amount of resources for getting it done. Because of these limitations, we don't have as much room to accommodate a broad community influence on it. So we'd rather not pretend it's a community standard if we don't have the means to treat it as such. Unfortunate, yes, but those are the realities.

It's different for this "core" project. We're committed to seeing it through as a real community-supported, broadly-adopted standard.

EssyGreen commented 12 years ago

@stoicflame - Many thanks for that clarification. I think that's actually great news :)

stoicflame commented 12 years ago

I'd just like to say thanks to everybody who contributed to this thread to help us understand and articulate the goals, scope, and context of the different models (conclusion, record) we were proposing.

I hope things are much more clear now:

http://familysearch.github.com/gedcomx/2012/03/23/gedcomx-identity.html

With the projects now separated, we're going to close this issue and move on to the (many) other high-priority issues.