FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Merge Record and Conclusion Models #138

Closed joshhansen closed 12 years ago

joshhansen commented 12 years ago

Executive Summary

The record and conclusion models actually model the same domain but have been artificially separated. Thus, GedcomX isn't even interchangeable with itself in spite of trying to be a widely useful interchange format. For almost all classes in the record model there is a corresponding, parallel class in the conclusion model. Merging these parallel classes together into a single model will make GedcomX easier to understand, easier to implement, and more powerful as a representation of genealogical data. In the remainder of this issue, I explain what's wrong with the current situation, why it's so harmful, and how the problem can be resolved.

What's Wrong

First, some quotes for context:

"The GEDCOM X record model provides the data structures that are used for defining the content of an artifact."[link]

and

"The GEDCOM X conclusion model provides the data structures that are used for making genealogical conclusions."[link]

As it stands, the GedcomX record model does not model the "content of an artifact." A record model that actually modeled "the content of an artifact" would do things like specify its dimensions, textual transcription, identifying marks, etc. Instead, the record model represents conclusions drawn from the content of an artifact. For example, the claim "John Smith was born 1 Jan 1930", though supported by the contents of Mr. Smith's birth certificate, is a conclusion a researcher drew based on that certificate. Conclusions such as this that are made on the basis of artifact contents are just another kind of conclusion.

In its current form, GedcomX tries to model one aspect of conclusion metadata (whether a fact was concluded from the contents of a document or artifact) not by allowing this to be represented in the metadata classes themselves, but by duplicating the entire set of data and metadata classes and declaring that metadata attached to the new set represents conclusions drawn directly from a document. As a result, the record and conclusion data models are two separate but almost exactly parallel models of the same domain. The distinction used to justify this duplication is essentially arbitrary: it treats one special kind of conclusion as if it were so distinctive that it must be modeled as an entirely different domain.

Why It's Harmful

The model duplication that exists in the current GedcomX specification adds to user confusion ("What's the difference between a person and a persona?"), complicates the task of implementing the standard (twice as many entities to represent), and reduces the utility of data represented using GedcomX (a persona transcribed from a record is not necessarily comparable to a corresponding person in a pedigree, even if they actually represent the same individual).

Resolution

Instead of making a complete copy of the data and metadata classes, this distinction can be much more parsimoniously modeled by simply enriching the metadata model. I propose modeling the genealogy domain as a set of core entity types (person, place, event, date/time, document, etc.) and a vocabulary for making statements about such entities (e.g. person X was born in place Y), combined with a metadata vocabulary for justifying these statements, recording the reasoning behind them, and showing who exactly is making the claims (e.g. researcher A claims/asserts/believes that person X was born in place Y because of evidence found in document Z). This lends itself to a two-part model, one for making statements about the core entities (data), another for making statements about those statements (metadata).

Rather than embedding Facts within the entities they are about, a general Fact class should be created that can represent claims of fact about any entity type. For example, in Turtle syntax:

#Subclassing rdf:Statement gives us subject, predicate, and object 
#properties by which any RDF statement can be represented
:Fact rdf:type owl:Class ;
    rdfs:subClassOf rdf:Statement .

:assertedBy rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range :Person .

# This property can point to anything -- a document, or a literal string with the researcher's explanation
:supportedBy rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range rdfs:Resource .

:subFactOf rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range :Fact .
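
A hypothetical instance shows how this vocabulary might be used (the individual names -- :fact1, :personX, :birthPlace, :researcherA, :documentZ -- are illustrative only, not proposed vocabulary):

# Researcher A asserts that person X was born in place Y,
# based on evidence found in document Z
:fact1 rdf:type :Fact ;
    rdf:subject :personX ;
    rdf:predicate :birthPlace ;
    rdf:object :placeY ;
    :assertedBy :researcherA ;
    :supportedBy :documentZ .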

(A similar approach is described in my RootsTech Family History Technology Workshop paper.)

An appropriate resolution to this issue would involve either 1) merging the record and conclusion models and perhaps refactoring the result into data and metadata models, or 2) giving a convincing argument for why the current state of affairs is necessary, including specific use cases that could not be modeled using a single merged model. The burden of evidence that would justify the duplication of a substantial subset of the GedcomX vocabulary is, in my opinion, fairly high, given the cost in user confusion, implementation difficulty, and data format utility mentioned above.

See Also

Issue #131 "A Persona IS a Person" Conclusion Record Distinction

jralls commented 12 years ago

+1 No, make that +10. However, #7 rather thoroughly explains an alternate view. I think that the two can be reconciled; there's no reason that carpentermp's dual use (personas and persons, e.g.) can't be accomplished with the same data structures.

EssyGreen commented 12 years ago

++++++1 Although I would reserve judgement on the details of the resolution, which seems (though maybe I'm misinterpreting) to be totally dependent on RDF metadata .... as I've said elsewhere, I believe that RDF/FOAF/DC are simply the data interchange formats, not the object definitions, which should be uncluttered with this level of detail.

stoicflame commented 12 years ago

Great job, @joshhansen. Thanks for writing this up. This will provide a great point of discussion about this issue.

We need a wider audience and the alternate point of view clearly articulated before we can make a clear decision. I'll work on that.

stoicflame commented 12 years ago

Marking this issue as priority 1. We need to resolve this sooner rather than later.

lkessler commented 12 years ago

Personally, I don't think it is bad if the Record Model and Conclusion Model are separate.

The Record Model should be made up of "just the facts" and information that repositories put together about the source material that they have. They can then use this Record Model for posting and transmitting their information.

The Conclusion Model should be where the interpretation goes.

By keeping these separate, and using the facts/interpretation distinction, the parts that go where can be easily defined.

ttwetmore commented 12 years ago

I have found that I generally disagree with Louis, and this case is no exception. Except that Louis is the most anti-persona person I have yet come across in discussions, so if he is now embracing the separation of the record and conclusion models he is tacitly accepting the need for the persona concept. For this I do applaud him.

It is clear that in a multi-tiered system of the type I have advocated for nearly twenty years, the lower tiers hold data from the records (the evidence) and the higher tiers hold the conclusions. Thus a single model encompasses both the record level and the conclusion level in a seamless manner. As I have explained in excruciating detail a number of times, an N-tiered model is both considerably SIMPLER and considerably MORE POWERFUL than a dual record and conclusion model. And the real fact is that 95% or more of the time the N-tier system would be used with 2 tiers, resulting in the basic record/evidence and conclusion level model anyway.
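
In the Turtle idiom of the opening post, a minimal two-tier cluster might be sketched as follows (the :extractedFrom and :basedOn properties and all instance names are illustrative assumptions, not anything specified in the model):

# Tier 1: personas holding evidence taken directly from two records
:persona1 rdf:type :Person ;
    :extractedFrom :record1 .

:persona2 rdf:type :Person ;
    :extractedFrom :record2 .

# Tier 2: a conclusion person linking the personas without merging them
:conclusionPerson rdf:type :Person ;
    :basedOn :persona1 , :persona2 .

Adding a third tier would just mean letting :basedOn point at other conclusion persons as well as personas, which is all the generalization the N-tier model needs.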

It is not too surprising to me that Family Search would begin with a black and white, record and conclusion vision for a model, because it is so much the nature of their environment. They deal in billions of records, and they deal in large constructed pedigrees. At these levels genealogical data does seem starkly divisible into two extremes.

The rest of us deal with a more complex world, in which much of our data about people comes from sources that cannot be cataloged as either low level records or as high level conclusions. And when we have gone far enough back, when we can no longer easily follow clear ancestral lines back in time, we must grow up and deal with a world consisting of a perplexing array of records and evidence that must be fitted together, and refit, and refit again, to find the best matches between that evidence and an assumed, but never to be fully known, world of possible real human beings whose traces we perceive in the records. The N-tier model is the only way to hold our ideas, our pending decisions, our conclusions, in a way that traces our thoughts, and in a way that we can extricate ourselves from later as we either discover errors, or change our minds, or find new evidence that turns old conclusions on their ears.

lkessler commented 12 years ago

Tom and I actually agree on most things. But our disagreement regarding Personas and an N-tiered system is well-documented on the BetterGEDCOM wiki, so there's no need to rehash that over again in the GEDCOM-X community.

I went to RootsTech and expounded on my ideas for source-based data entry and evidence/conclusion modeling. No program today does that properly, and there was a lot of interest in the ideas I had and in my plans to implement this in my program Behold. This will be done by taking the raw source details, which I hope can be taken from repositories that store their data in a "just-the-facts" Record Model, and giving the user tools to find related events, people, places and dates that are relevant to their research. The user will treat the source details that are found as evidence and use them to add their assumptions/conclusions to their own research data in the Conclusion Model.

That's my thinking and why I believe it is fine to leave the two models separate, if desired. I can handle them separately or together, but I think the repositories need that factual Record Model to store their data.

EssyGreen commented 12 years ago

I think what you are saying @lkessler is that we need to be able to distinguish the "original" sources (raw source details) from "interpretations", and I would agree with you on that ... the problem is that the "just-the-facts" records are often (if not usually) just someone else's interpretation and hence are not really "originals" at all but derivations. As these records are built on and interpreted, we get layers and layers of interpretations, and as genealogists we need to peel back the layers as far as we can. If the layers don't exist then we can't.

Like you (I think), when I am downloading/accessing records from Repositories I would want the derivative which is closest to the original (e.g. I want the digital image from Ancestry not their transcription or their interpretation into "Facts" and/or "Roles" etc) but there may be times when a derivative is useful (e.g. a translation) providing that the information about the source it was derived from is also provided and since that source may itself be a derivative it is simplest to model this as a multi-layered structure.

The important thing for me is that the provenance trail is kept intact.

From a commercial point of view I suspect that most on-line record providers will stick to a single layer where the digital image(s), transcription and fragmentation into searchable fields is presented as a single "Record" with a pre-formatted Bibliographic citation as the only link to the original. Similarly, research software suppliers will use a model tailored to their own USP and will tend to do as little as possible to comply with whatever de facto standard is out there (necessarily focusing on import/export).

Where GEDCOMX can/must add value is in providing a "best practice" genealogical standard which will encourage and enable quality research. Best practice depends on reaching plausible conclusions by making and investigating hypotheses based on interpretation(s) of information from a wide range of sources. Since we can never be 100% sure of the past, any "conclusion" is in itself a source for further research. Ergo we must have an N-tier approach to support the recursion.

ttwetmore commented 12 years ago

@lkessler:

but I think the repositories need that factual Record Model for them to store their data.

Combining the Record and Conclusion Models into an N-tiered, seamless model, provides the same lower, persona-based layer that the Record Model provides. It requires no changes to repository data. Why choose a more complex model when a simpler model with more power is available? Parsimony is the best policy.

lkessler commented 12 years ago

Essy:

I feel that sources that are derivations of other sources should not be treated as layers. They can be much more simply handled by having a source link to its source. In GEDCOMish this would be:

0 @S1@ SOUR
1 TITL Derivation from original
1 SOUR @S2@

0 @S2@ SOUR
1 TITL Original

This can be chained as deep as necessary. This will do what you want, and do it simply.

If you want to call that an N-tier approach, then okay.

But if you are referring to the hypothesis/conclusions being N-tier, then I have a different but also simple model for that. In GEDCOMish this might be:

1 BIRT
2 DATE 5 JUL 1910
2 NOTE Birth date from 1st source believed to be true. 2nd source stated August. Believed wrong.
3 SOUR @S10@
3 SOUR @S11@
1 CHAN 22 NOV 2011 15:05:00

Let's say some new information in another source comes along. You find it supports the 2nd source and now you change your conclusion:

1 BIRT
2 DATE 5 AUG 1910
2 NOTE Birth date from first 2 sources believed to be true. 3rd is believed to be wrong.
3 SOUR @S11@
3 SOUR @S20@
3 SOUR @S10@
1 CHAN 17 FEB 2011 17:08:30

The N-tier, if you want to call it that, is simply the Change (or Undo) history of that Event. It documents the complete history of your assumption/conclusion over time, and what additional sources you added to come to the conclusion at each step.

This is what I feel needs to be implemented; it is as simple as you can get, and it handles every case.

Everything here is currently possible in GEDCOM 5.5.1 except the Source of the source.

Louis

EssyGreen commented 12 years ago

@lkessler

sources that are derivations of other sources should not be treated as layers. They can be much more simply handled by having a source link to its source

There needs to be an indicator in the source that it is a derivation so that the user and application understand that e.g.:

0 @S1@ SOUR
1 TITL Derivation from original
1 _DERIVEDFROM @S2@

That way it is also easy to find the "master" by recursively going back through the _DERIVEDFROM pointers until there isn't one. A generic SOUR pointer gives no indication of the context - it could mean that this source is derived from, a component of, supplied by, or referenced in the other source, etc.
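
In the Turtle idiom of the opening post, the same point could be made with a dedicated property rather than a generic reference (a sketch only; :Source and :derivedFrom are assumed names):

:derivedFrom rdf:type rdf:Property ;
    rdfs:domain :Source ;
    rdfs:range :Source .

An application can then walk :derivedFrom links back until none remains, arriving at the most original source available.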

if you are referring to the hypothesis/conclusions being N-tier, then I have a different but also simple model for that

Your simplification is similar to what GEDCOM does now. And the same problem as above occurs in that there is no context for the source reference - is it positive or negative evidence? Do all the sources refer to all the fact fields or just some of them? How does the application know what the NOTE is? Is it a conclusion/proof which references the sources? Or just a working hypothesis? Or just a descriptive narrative of the fact?

Without context neither the application nor the reader can be sure what was intended.

lkessler commented 12 years ago

Essy:

I don't understand what you mean. Whether you use the SOUR tag or a _DERIVEDFROM tag makes no difference. They mean the same thing. They simply mean that the source of this source was that source. If you are going to start pigeonholing where something came from into such fine divisions as "derived from", "component of", "supplied by", "referenced in" and who knows how many more, then you're going to make the task researchers and repositories face in cataloging their materials much more onerous, mainly because it is going to be extremely difficult to define those terms clearly enough that everyone will use them in exactly the same way. You'll only introduce inconsistency and confusion. It is enough simply to have any derived piece of data point to where it was derived from, because the ability to go to that original source is what is needed.

My simplification was simply to indicate how the evidence for the hypothesis/conclusions can be referenced. The NOTE tag could be a _CNCL (conclusion) tag or a _ASMP (assumption) tag if you wish.

For simplicity and to make a point, I left out the detail that goes under a source (which, yes, is currently in GEDCOM as well), e.g. to reference Source Detail (specific information in a source) and to indicate positive/negative evidence, plus anything else you want - which is basically the misnamed SOURCE_CITATION entity in GEDCOM, e.g.:

3 SOUR @S11@
4 PAGE <WHERE_WITHIN_SOURCE>
4 EVEN <EVENT_TYPE_CITED_FROM>
5 ROLE <ROLE_IN_EVENT>
4 <NOTE_STRUCTURE>
4 QUAY <CERTAINTY_ASSESSMENT>

Now I'm not saying that the above has everything needed, but it is an excellent starting point.

Louis

EssyGreen commented 12 years ago

@lkessler

Whether you use the SOUR tag or a _DERIVEDFROM tag makes no difference. They mean the same thing.

No they don't ... one (_DERIVEDFROM) has context which gives it meaning; the other (SOUR) just says the type of object which is being referenced.

you're going to make the task that researchers and repositories will have to catalog their materials much more onerous

And why is that a bad thing?

it is going to be extremely difficult to define those terms clearly enough that everyone will use them in the exactly the same way

We seem to be using DC, which has already done this (although I have some reservations about its wholesale adoption)

You'll only introduce inconsistency and confusion.

Why is it more confusing to have defined a specific context/meaning than not to have defined it (and hence left the meaning as ambiguous)?

the ability to go to that original source is what is needed.

Indeed but the source will just tell me about itself. It cannot possibly know about the context in which it was referenced.

ttwetmore commented 12 years ago

The awkwardness in @lkessler 's solution is based on his rejection of the persona concept, so he is forced to try to make a strict Conclusion Model approach (e.g., GEDCOM) seem to handle evidence, sources and conclusions in a reasonable way. Because he doesn't use personas, all the evidence (the actual facts derived from the sources) that would be in the personas must either be placed in source records or in general notes or not be in the database at all.

The idea of putting content (e.g., actual evidence) inside the source records is the only way that persons who reject the persona idea can get Record Model information into their databases.

We might need another issue here entitled "Where Do We Store Our Evidence?" I asked this question on soc.genealogy.computing last year and it generated a long and interesting thread. In the GEDCOMX model that evidence is stored primarily in persona records which then refer to source records, very proper source records that do what source records are supposed to do, refer to where the evidence can be found. There are many other ways to answer the question. @lkessler 's approach is one of those alternatives, placing the evidence inside the source records and forcing all person records to be conclusions only. Other approaches are simply to leave the evidence out of the database, that is, to only place conclusions in the database, and depend upon anyone using the database to look at the source records and go get the evidence on their own. Others suggested a "dual program" approach, using a commercial genealogy desktop system to store their conclusions, and another more general purpose database to store their evidence, then finding some way to link from their genealogical database to their evidence database.

I believe the perfect solution to the "where do we store our evidence" question is in the persona record. And I am very happy to see that the industry as a whole has also chosen that concept as the necessary core concept.

I have faith that the GEDCOMX model will avoid the folly of removing the Record Model level of data. The fact that seems obvious to nearly everyone, is that the persona concept has become the lingua franca core concept of Family Search, Ancestry.com and all the other major service providers. Personas are the currency of modern genealogy. Personas are what will flow and actually already do flow from genealogical service providers to genealogical clients as the result of queries and searches. Modern genealogical client programs must be able to accept personas if they wish to provide their users with access to the modern service providers. @lkessler 's model requires a client program to accept persona records from a service provider, and then immediately disembowel them, artificially placing some of their information into source records that will be a nightmare to maintain, and placing other parts of the information into conclusion facts in conclusion persons, along with little notes that attempt to explain what was done and are also a nightmare to maintain.

My only desire beyond the current GEDCOMX model is for the model to unify the Record and Conclusion models into a more integrated whole that can handle N-tier structured person clusters made up of evidence personas and conclusion persons.

EssyGreen commented 12 years ago

My preference is that we keep two models but the entities in both should inherit from common objects (i.e. a Person and Persona should derive from a Common Person and have the same properties; a Fact whether in the Record Model or the Conclusion Model should have the same attributes; ditto Relationships, Roles etc) so that when a researcher publishes or uploads or shares with someone their research, it can be used the other end as a (secondary) source.

lkessler commented 12 years ago

Tom is fine with his opinion, but that is all it is, his opinion. I happen to have a different one: that the best place to store our evidence is in the source details attached to the source record. I do not want to have the data ripped apart into multiple derivatives, attached to countless personas that we poor developers will have to attempt to reassemble and present to users in an understandable way.

So let it be said that there are two viewpoints, and please don't let Tom's bitter attacks on my way of thinking sway you to think that multiple levels of persona are the only solution.

Essy: You said: "Indeed but the source will just tell me about itself. It cannot possibly know about the context in which it was referenced."

That is correct. The source should only tell you about itself. All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model.

Louis

jralls commented 12 years ago

I happen to have a different one: that the best place to store our evidence is in the source details attached to the source record.

That is correct. The source should only tell you about itself. All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model.

Louis

Those assertions seem to be contradictory.

The information about how a source was accessed and the context in which it is found (along with its provenance) are important attributes of the source itself. A careful researcher will note that information along with how to find the source again (the reference information), and those notes should be kept together with the reference information as part of the source object.

I agree that the extraction of evidence from the source is conclusional and belongs in the conclusion model -- but I also think that FamilySearch has a different view: They are, after all, in the business of providing source records, and in order to index the sources so that we can find them they have to put at least some of the source's evidence into machine-readable form. None of them (that I've found yet, anyway) has explicitly said so, but I suspect that the Record model is designed for that purpose.

lkessler commented 12 years ago

John:

How a person accesses a source for his/her conclusions is important in how their conclusions came to be. It has no bearing whatsoever on the source itself.

If you have a source derivative, then it should point to the source it comes from (as I gave an example of earlier in this issue), and then yes, it should along with that link to its source, give the information about how the source derivative was derived. But there should be no subjective information in it.

When I access a source, I want to know only how the source was derived - again, just the facts. I don't care about how other people accessed the source. I only care how they accessed it if I am looking at their conclusions in the Conclusion Model so that I can evaluate if they accessed it in a manner that they were able to properly get the data so that I can assess the validity of the conclusion.

So I'm saying that "how a source was accessed" should not be with the source or source details in the Record Model. It should be with the Conclusion Model where the source is used as evidence. This is why I somewhat like the separation of the Record and Conclusion Models. It perfectly delineates the difference: Records are "just the facts" and Conclusions are the conjecture and assemblage of conclusions.

I hope that FamilySearch originally separated these two models because of this idea, and so that the Record Model could be handed to repositories to standardize and make their data globally available. I could see genealogy programs using this Record Model to go forth (with an API or whatever) and access online data from repositories to download the relevant source details as evidence that will be stored locally (in the Record Model format) for inclusion into their database.

Louis

ttwetmore commented 12 years ago

Louis, Your opinion is to not use the persona concept for record data, but to store record data in source records. If an item of evidence includes data on five persons and an event, then your source record for that evidence will have to hold the facts about the five persons and the event. Is that what you are suggesting as an alternative to the persona idea?

However, when searching for data on those persons on genealogical data servers, the data is going to be returned as persona records. That is the type of input that modern client programs are going to have to deal with. If the client programs do not support persona type records, client programs are going to have to immediately patch those personas into source records and conclusion persons already in the client database. Do you think that is the right thing for client programs to do? Is that what Behold is going to do? Do you think it proper for client programs to modify source records, either already in the client's database, or imported along with the persona records, as the result of importing personas? Do you think that will be an easy thing to do? Will the users have to get involved?

Please try to explain your comment that things get ripped apart into multiple derivatives when using personas. That seems meaningless, and I interpret it to be an example of FUD. The data comes in the form of personas, and should stay in the form of personas. What gets ripped apart?

we poor developers will have to attempt to reassemble and present to users in an understandable way.

Would it be unfair of me to interpret this statement to mean that a big reason that you object to personas is that you think personas and the concept of "person record clusters" would be hard for a developer to deal with? If this is a concern I think it can be allayed. Think of a persona record, and a person conclusion record, and a cluster of person records (2-tier, 3-tier, n-tier) as specializations of an abstract Person class. In displaying information about these three specializations, the methods have to be different, but there is no real difficulty in the implementation. Yes the person cluster requires extra software when the user wants to view it in its dynamic research-based context, but don't you think that this is a necessary thing for software that supports the research process to do?

EssyGreen commented 12 years ago

@lkessler

I share your concerns about how the developer will have to make sense of it for a user and I think this is a disadvantage of the N-tier approach - it's easy to model but much less easy to make something useful out of it. However, I think the benefits will outweigh the problems in the long run.

All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model.

In a purist sense I would agree with you but if we allow the source to be broken into fragments in the Record Model then this in itself is subjective interpretation albeit within the scope of the one source. Indeed the source itself may be a secondary one anyway and/or it may reference other sources, so what is the "original" source and what is "subjective interpretation"? My requirement is that each source (however it is represented) has properties to indicate what it was derived from and what it is a component of (this latter being for fragmented interpretations which are a part of something bigger). These could be shoe-horned into a single Source (as was the case with old GEDCOM) but I believe it will be easier and simpler to allow sources within sources within sources ad infinitum. This also allows the user to structure their sources in the same way that the originals are structured in reality (e.g. a transcription of a census entry is a derivative of a real census entry which is a component of the district entries which are components of the whole year census etc).
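
The census example, sketched in the Turtle idiom of the opening post (all instance names, and the :derivedFrom and :componentOf properties, are illustrative assumptions):

:censusTranscription :derivedFrom :censusEntry .
:censusEntry :componentOf :districtEntries .
:districtEntries :componentOf :census1881 .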

An interpretation of a source by breaking it down into Facts, Persons etc (in the Record Model) is just one step further than a transcription.

The Persons and Facts that are re-constructed by a researcher are no different from these - the data model is the same but the context is different. Imagine a shop (=source) with lots of furniture (=facts, personas) .... You buy a selection of furniture from different show-rooms and put them together in your house. Does it stop being furniture just because it's now in your house instead of the shop?

EssyGreen commented 12 years ago

@ttwetmore

If an item of evidence includes data on five persons and an event, then your source record for that evidence will have to hold the facts about the five persons and the event.

I don't have a problem with this ... I now have an extra source (CreatedBy me) which contains references to the other sources (just like a book might do) and which contains my interpretations of the relationships, facts etc between the 5 people ... the only difference is that I call my source a "Family Tree" (aka a GEDCOM file)

when searching for data on those persons on genealogical data servers, the data is going to be returned as persona records

Hmmm methinks my local archives won't be doing this in a hurry. They have their own established formats already. OK so the likes of Ancestry might jump on the band wagon ... but even if they did, I personally would throw away their "interpretation" and do my own based on the image copy they have. The level of errors would soar if I were to trust what the web-servers interpret!!

lkessler commented 12 years ago

Tom:

Yes, the record data is in the source records. It could be transcribed information as text fields. It could be OCR'd info as a PDF document. It could be links to multimedia files. They could even index the names, places, dates and events in the source record so that the information is discoverable. But the raw data in its entirety should be as complete as possible and as close to its raw form as possible.

I envision that when searching for the data on genealogical data servers, you could do it in a Google-like fashion that searches through the text fields and finds the most relevant source data for you, or you could do it in a Steve Morse One Step Search type of fashion to go through the discoverable indexed fields from the source records, encompassing smart searches using Soundex or distance between cities, etc. What would be returned would be the source records most pertinent to your search - not persona records.

Client programs will never modify source records. They are the facts and can stay in their own Record Model structure. All the conclusions go in the Conclusion Model. Now what FamilySearch will want to build is the compendium of everyone's Conclusion Models. What will be discoverable from that will be what conclusions in the combined world family tree were based on a specific record from the Record Model. That will allow you to find other people who used the same record who you may be able to share information with. I think it's all a wonderful idea.

My example of where this works and where the data gets ripped apart when using personas is for something like a ship's record or census data, which I presented to you a few weeks ago on the BetterGEDCOM wiki in my Flintstone example: http://bettergedcom.wikispaces.com/message/view/Data/48419278?o=40

p.s. I have no Fear, Uncertainty, or Doubt about this.

Louis

ttwetmore commented 12 years ago

Louis,

You have said that keeping the Record Model and the Conclusion Model separate makes sense. However, the GEDCOMX Record Model places the record data about persons into persona records. Does this mean that you would envision a substantial change in the GEDCOMX Record Model?

ttwetmore commented 12 years ago

Louis,

Thanks for your clarifications. I have said my piece supporting the persona record as one of the key data constructs for the next generation of genealogical servers and clients. And you are making your points also about how you think record level evidence should be recorded. For now that is enough. I have no concerns over the direction that must and will be taken.

lkessler commented 12 years ago

Tom,

Thanks for being open-minded enough to let me say my piece.

In answer to your question about the GEDCOMX Record Model, I did see the inclusion of persona and relationship entities in the model. I have no problem with that as long as all they are doing is trying to disaggregate the facts into fields to make searching (a la Steve Morse) easier. However no assumptions or conclusions should be included in the Record Model. Reading the model details, I believe their intention is to separate out the assumptions and conclusions into the Conclusion model. So I'm fine with this.

Louis

ttwetmore commented 12 years ago

Louis,

A most interesting response. It implies you will add a GEDCOMX import function for Behold that accepts persona records! Will you also add a GEDCOMX export function that writes persona records? This is encouraging news since you will definitely come to appreciate the value of persona records!

lkessler commented 12 years ago

Tom,

If it comes down to that, then yes. But the personas in the Record Model are not multi-level, so they will not perform all the functions that you say your N-tiered persona model does.

Louis

stoicflame commented 12 years ago

Hi folks.

So after some pretty intense scrutiny, a lot of FamilySearch-internal discussion, and consultation with other interested parties, we'd like to concede the excellent points that have been voiced here. In accordance with the request, we propose that we remove the record model and focus here on addressing the needs of the genealogical community with regards to the models and standards for the core genealogical research process. We'd like to come up with a core model that accommodates the great ideas of everybody who has contributed to this thread.

As additional information, FamilySearch has a real need to have a model that deals with a different problem set, defined by the need to do bulk field-based record indexing. And we don't think we're the only ones that need this, but we recognize that the goals and audiences are separate and distinct from the goals and audiences interested in the standards for the core genealogical research process.

Let me put it this way: FamilySearch is teaming up with Archives.com, FindMyPast.com, and others to index the 1940 census. Somehow, we've got to share that field-based record data among ourselves so we can integrate it into our products and offer meaningful solutions to our users. That's a Good Thing, right? I trust that nobody here objects to industry efforts among a set of companies to define a standard way to share this kind of data? For a variety of reasons, we don't feel like the model for conclusion-based research will fit our requirements in these cases.

So our plans are to continue to support the record model in a separate project that will leverage as much as possible the core concepts, vocabulary, and models that we hash out here in this project.

We welcome any comments on this proposal. We'd really like to see this proposal implemented sooner rather than later so we can move on to the next set of issues.

ttwetmore commented 12 years ago

...our plans are to continue to support the record model in a separate project that will leverage as much as possible the core concepts, vocabulary, and models that we hash out here in this project.

My concern is simple. I believe that software that supports the full genealogical research process must be able to handle many person records (most taken directly from evidence) that refer to the same real human being. That software should allow genealogists to link those person records together, without merging their contents, as they conduct their research and make their conclusions about how the evidence fits together. This implies the need for N-tier person structures. There is software that handles evidence and conclusion records like this, but not yet any firmly in the genealogical domain that can be pointed to for examples.

In a conclusion only model the person records that refer to the same human being are not linked together into structures, but are merged together into single conclusion person records as soon as the genealogist decides they refer to the same human being. In my opinion this is inadequate for a robust genealogical research process. Once a conclusion person has been built up from facts taken from three or more sources, the history of the genealogist’s conclusion making is basically lost, and if the genealogist later decides a mistake was made, the unwinding is manual and labor intensive. This problem is built into every desktop system today, and is so ingrained that I believe it is hard to actually grok the problems it causes.

My unshakable belief is that a genealogical data model that supports the research process must allow multiple person records that represent the same human being to persist in the database, and that they be linkable in ways that show the results of the research process. So I believe that if GEDCOMX is to support the full genealogical research process it will have to find a way to support the N-tier approach.

EssyGreen commented 12 years ago

@stoicflame - I think that's an excellent plan :) I agree that there are different needs and priorities for the two audiences and it will be easier to tackle them separately.

@ttwetmore

software that supports the full genealogical research process must be able to handle many person records (most taken directly from evidence) that refer to the same real human being. That software should allow genealogists to link those person records together, without merging their contents

You have to merge the contents to retain the integrity ... For example, say I get an interpretation of a Person A from someone's published tree on Ancestry. If I just refer to it without "merging" the details then when the owner updates it, how do I know that my conclusions which depend on it are still valid? Unless Ancestry keep timestamped versions of every change made by the owner I will never be able to guarantee I am seeing what I originally saw. The only way I can be sure is to take my own copy as it was at the time I accessed it and treat it like any other source.

Once a conclusion person has been built up from facts taken from three or more sources, the history of the genealogist’s conclusion making is basically lost

We can fix that by defining Evidence and Proof objects. The time sequence is not vital ... A=B is the same as B=A regardless of the order of discovery.

a genealogical data model that supports the research process must allow multiple person records that represent the same human being to persist in the the database

They still can: Person A is recorded in an interpretation of Source1, Person B is recorded in an interpretation of Source2, Person C is being researched by the user and has Evidence that Person C=Person A and Person C=Person B

ttwetmore commented 12 years ago

@EssyGreen

You have to merge the contents to retain the integrity ... For example, say I get an interpretation of a Person A from someone's published tree on Ancestry. If I just refer to it without "merging" the details then when the owner updates it, how do I know that my conclusions which depend on it are still valid? Unless Ancestry keep timestamped versions of every change made by the owner I will never be able to guarantee I am seeing what I originally saw. The only way I can be sure is to take my own copy as it was at the time I accessed it and treat it like any other source.

I believe the opposite -- to maintain integrity you must keep them separate. Your example of "evidence shifting" is handled better if you don't merge the data. If you merge, when someone changes a person record it will be much harder for you to 1) know it happened and 2) be able to do anything about it. It seems to me that you are promoting the "head in the sand approach" where you don't want to know about the change. If you don't merge, then you 1) know instantly when the change occurs (admittedly with software support similar to the "little shaking, green leaf" software), and if the change leads to the conclusion that the person record really belongs to someone else, 2) it is trivial to extricate that record from the wrong human's record cluster and either let it stand alone or link it to another human's record cluster. Merging is the tired old approach of today's generation of conclusion-only systems whose support for the research process is rudimentary at best.

We can fix that by defining Evidence and Proof objects. The time sequence is not vital ... A=B is the same as B=A regardless of the order of discovery.

This is Louis's argument. What are those evidence objects? The very best evidence objects are persistent persona records -- that is exactly what a persona is -- the extracted, digital evidence of a person taken from a single source. Louis's argument is to keep the evidence about persons encoded inside source records. My argument is to liberate that evidence in such a way that software can handle the research process in a powerful way. And I disagree about the importance of time order in complex cases. A=B implies B=A only in the two persona case. Now imagine deciding C is the same human. If you wouldn't have been able to decide C were the same until after you had decided A=B, then order matters. You might believe this situation wouldn't arise, that is, you could just have legitimately first decided that A=C and later that (A and C)=B, but I wouldn't agree. I have found complex cases in my research, where eventually 40 or even more personas are concluded to be the same human, that demonstrate that the order of matching/linking/combining/merging (however you wish to view the bringing together of evidence) was an important element in making conclusions and structuring proof statements.

EssyGreen commented 12 years ago

@ttwetmore

The auto-notify functionality you describe (e.g. "shaking leaf") is not something that I believe will be widespread for a long time (if ever). Certain web publishers may use it but personally I would consider it to be a nice-to-have and beyond the scope of what I actually need.

The most valuable sources are located in local archives and record offices who serve a wider audience than genealogists (e.g. local historians, legal advisors). They have neither the resources nor the need to publish everything in GEDCOM format - in many cases to re-scan into GEDCOMX format could damage the originals.

We seem to have quite different approaches towards on-line data. I use on-line publishers very frequently but I use them as a means to getting to the original (or an image copy). If the web publisher provides the image copy then I don't need or want them to interpret for me. If they don't have the original then I will seek it out and be extremely wary of the record until such time as I can verify it. If it's just someone else's family tree then I just use it as a way of discovering sources I may have overlooked.

In my experience, to trust the interpretation of a web publisher or another family tree is way too error-prone.

The interpreted data in the Record Model (Personas/Facts etc) in my model are something which the user creates and has responsibility for. It is effectively the same as in the Conclusion Model but with the simplification of being within the scope of a single source. To my mind this is something which one does naturally on examining a source anyway and being able to hold that with the Source is a way of preserving my thoughts before I go into the complexities of trying to map it with my other data. It serves as a way of documenting where I got what evidence from which can be referenced back when trying to resolve conflicts further down the line.

If a web-publisher produces a downloadable interpretation then this would be just a convenient starting point which the user could/should check and edit to fit their own interpretation.

What are those evidence objects?

We haven't yet defined them in the model but they need to be a link between the Person/Fact in the researcher's 'tree' and a Person/Fact in one of the researcher's interpreted Sources together with a justification for why the match seems likely (or not) and the confidence level (either negative or positive to indicate both degree and direction).
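
Sketched in the Turtle idiom of the opening post, such an Evidence object might look like this (every name here is an illustrative assumption, not settled vocabulary):

:Evidence rdf:type owl:Class .

# links to the Person/Fact in the researcher's 'tree'
:concludedEntity rdf:type rdf:Property ;
    rdfs:domain :Evidence .

# links to the Person/Fact in one of the researcher's interpreted Sources
:sourceEntity rdf:type rdf:Property ;
    rdfs:domain :Evidence .

# why the match seems likely (or not)
:justification rdf:type rdf:Property ;
    rdfs:domain :Evidence ;
    rdfs:range rdfs:Literal .

# negative or positive, to indicate both degree and direction
:confidence rdf:type rdf:Property ;
    rdfs:domain :Evidence ;
    rdfs:range xsd:decimal .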

My argument is to liberate that evidence in such a way that software can handle the research process in a powerful way

I'm not sure how you can "liberate" evidence. My model also aids the research process by enabling the researcher to track their reasoning and to highlight conflicts and potential problems, but ultimately the responsibility must rest with the researcher. To "liberate" the researcher from the tedious task of checking and interpreting their sources is to demean the researcher and nullify their research.

I disagree about importance of time order in complex cases

Logically, the order in which the evidence is presented must be irrelevant. .... If you were on a jury and the witnesses appeared in a different order would you give a different verdict???

You are confusing this with the problem of having partial evidence at any given time (and for a genealogist this is all we ever have) and having to adapt our assumptions and conclusions when we find new/conflicting evidence.

The complexity you explained is exactly why you need a ProofStatement which covers all the evidence (even if this means breaking it down into sub-sources/trees/hypotheses to clarify the issues).

You are not the only person to have had complex relationships. I too have had to track many complex and conflicting relationships and I simplify these by building separate trees for each conflict then cross-reference these in my 'base' tree/project when/if a resolution is found. I'm not suggesting this is the only way but please don't patronise me by presuming that I only deal with simplistic cases.

stoicflame commented 12 years ago

Can I suggest that we try to limit discussion about specific product implementations and try to focus on which specific features need to be accommodated by a genealogical data standard? I understand there's overlap, but this just isn't the place to discuss why Product A (which behaves one way) is better than Product B (which behaves another). We'll never progress if we drown ourselves in those kinds of debates-- there is no end there. That's what the marketplace is for.

Instead, what we'd like to see here are issues that identify the holes that need to be filled to support both Product A and Product B and allow Product A and Product B to share data. This is best done by:

Describing and understanding how products behave is an important part of that work, but there's a difference between identifying standard product features and endlessly debating why my product is better than yours.

So, if I remember correctly :-), this issue is about what to do with the record model. Can we talk about that?

@ttwetmore, I opened up #149 for you so we can track what still needs to be done to define support for an "n-tiered" implementation of genealogical data. It's a great idea. I'd like to see it supported.

ttwetmore commented 12 years ago

Ryan,

Sorry if I seemed to jerk away from the topic at hand. My point was, very poorly made I admit, that if the Record Model is removed from GEDCOMX, then there is much greater pressure on the Conclusion Model to hold the "early in the research" evidence, what I think of as the persona level. So this forces the Conclusion Model IMHO to be N-tiered, with the bottom tier being the persona level. When there was a Record Model to hold the personas, then, if the Conclusion Model were 1-tiered, we'd still get at least a 2-tiered system. With the Record Model gone I very much worry that GEDCOMX would become a simple 1-tiered model like every Tom, Dick and Harry model out there.

ttwetmore commented 12 years ago

@EssyGreen

I don't want to inflate this discussion as it's getting far afield from the intent. I will just say that your point that we have to worry about evidence that changes over time seems so far-fetched that I would not consider it in designing a model. I may not fully appreciate that point, however.

I agree that no matter in what order the evidence is uncovered, one should eventually reach the same conclusion, and the final proof statement should be the same. But there is an inner structure to the decision making that is represented by the actual structure of the unmerged person cluster -- each node in that cluster represents a real conclusion. There is a time-independent structure to the cluster. Leaving the persons linked but unmerged maintains that conclusion structure; merging obfuscates it at best, obliterates it at worst.

EssyGreen commented 12 years ago

Ryan,

My apologies also for swaying from the topic and adding fuel to the fire.

this issue is about what to do with the record model

Trying to be as concise as possible:

  1. I agree with @joshhansen's initial post especially with regard for the need for a set of common entities (particularly Person/Fact/Role).
  2. I think the ability for the researcher to document their interpretations of a single source/record before they move on to the complexity of how it fits with the rest of their data is a critical one and should not be lost.
  3. Point 2 can be easily achieved if we think of the GEDCOMX file as a source in itself. Hence, every Person/Fact/Role resides within the context of a single source (the default null source being the current file).
  4. Evidence is the mapping of a Person/Fact/Role from a specific source Record (Record Model) to a Person/Fact/Role being researched (Conclusion Model)
  5. Proof is the collection of Evidence (see 4) for any particular Person/Fact/Role being researched (Conclusion Model)
  6. Since 4 & 5 may be overly complex and largely unnecessary when in the context of a single source, you may wish to inherit sub-classes (e.g. ConcludedPerson inherits from RecordedPerson, as sketched below) to deal with them differently - but that's the detail for later. If, instead, you would prefer to retain Record Model and Conclusion Model as a way of distinguishing them then that's OK with me as long as the objects are basically the same shape in each.
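
Point 6, sketched in the Turtle idiom of the opening post (the class names come from the point above; the hierarchy itself is an assumption for discussion):

:RecordedPerson rdf:type owl:Class ;
    rdfs:subClassOf :Person .

:ConcludedPerson rdf:type owl:Class ;
    rdfs:subClassOf :RecordedPerson .
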
EssyGreen commented 12 years ago

@ttwetmore - I think you missed my point. I'll leave it at that.

jralls commented 12 years ago

ISTM the original proposition is that as presently expressed the conclusion classes and the record classes are so similar that they could be the same. I think that everyone who has commented so far agrees that we want to capture a lot more of the reasoning behind the eventual conclusions than is possible with GEDCOM 5.5 or indeed with most (all?) of the genealogy software presently available.

For my part I'd like to have a way of recording all of the 5 requirements of the Genealogical Proof Standard (reasonably exhaustive search, accurate citation, analysis and correlation, resolution of conflicts, soundly reasoned conclusion). While I think it would be cumbersome to have to do that separately for each fact, I recognize that the RDB community needs a way to fit those same elements into their tuples and relations format, so the new exchange specification will require some amount of atomization.

So, having digressed to set the context for what I see as necessary, we need the following classes:

Search: A record of what we looked for, where we found it (or didn't), what other things we might have looked for and where if we'd had more time and money, and why we think that what we've gotten so far is enough.

Source: Must accommodate online, self-documenting sources (meaning that one can query the source and get both its citation fields and those of its sources - perhaps a NARA or FHL film number and the citation for the original records that were filmed).

Analysis: A comparative discussion of the sources collected including their reliability, where they agree and don't, what's legible and what's not, and so on.

Researcher: Not terribly important for the typical genealogist right now who mostly does his or her own work, but as the practice becomes more collaborative everything must be tagged with who did the work.

Conclusions: Here's where I am inclined to go straight to a narrative, but the standard will need to have classes for the RDB folks:

Person: The basic unit of genealogy. ;-)

Event/Relationship/Fact: Descriptive elements that make up the biographical sketch which is the end result of family history research. How exactly to represent these in a way that works well for linking them is discussed in another issue (which I can't find right now).

Dates: Dates are a whole lot harder than most people think. Our British friends have several different ways of writing dates depending on who was doing the writing when and for what purpose. Converting, say, a regnal date to a (proleptic) Gregorian date (or better, a Julian Day) is a conclusion that requires a discussion and a citation.

Place: Another hard subject that's been addressed at length in #79.
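
These classes could be sketched in the Turtle idiom used earlier in this issue (all names are placeholders for discussion, not a proposal):

:Search rdf:type owl:Class .
:Analysis rdf:type owl:Class .
:Researcher rdf:type owl:Class .

# e.g. an analysis compares sources and is performed by a researcher
:compares rdf:type rdf:Property ;
    rdfs:domain :Analysis ;
    rdfs:range :Source .

:performedBy rdf:type rdf:Property ;
    rdfs:domain :Analysis ;
    rdfs:range :Researcher .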

Another aspect of capturing the research process is that it's generally necessary to study a lot of people other than the focus person: We need to work out the whole family, the neighbors' families, the guys who witnessed their deeds and wills (and their families), their minister and fellow parishioners, and so on. All of that fits into the Person/Fact model, but encompasses relationships far broader than the family ones supported by GEDCOM.

EssyGreen commented 12 years ago

I'd like to have a way of recording all of the 5 requirements of the Genealogical Proof Standard

+1 see #141

However, I have to say that I'm beginning to think it might be better for GEDCOMX to have a minimum (but mandatory) spec and let applications fill in all the gaps.

ttwetmore commented 12 years ago

I agree with @EssyGreen. I don't want the model complexified with five new record or structure types and additional problems to solve in order to make "RDB folk" happy.

Most of the 5 steps are procedural and don't require support in a model.

I also question the automatic assumption that we must be able to always and definitively convert all dates, in all notations, to a Gregorian calendar form, and vice versa. Sure, there are excellent reasons for this, but is it mandatory? What's wrong with "spring of 1545 or 1546" or "early in the reign of Charles II" as fully legitimate date values?

EssyGreen commented 12 years ago

I also question the automatic assumption that we must be able to always and definitively convert all dates, in all notations, to a Gregorian calendar form, and vice versa.

I agree. I think different applications may have different needs for how they store dates in different circumstances (e.g. in the Record Model, leave the date as-is for the user to view; in the Conclusion Model, convert it to earliest and latest approximate dates so it can be sorted, etc.). It's up to the application to work out what it needs.
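One way to let each application decide, without ever losing the documented form, is to carry the original text alongside an optional normalized range. A hypothetical Java sketch (the type and field names are mine, not the spec's):

```java
import java.time.LocalDate;
import java.util.Optional;

// Hypothetical date type: the documented text is kept verbatim, and a
// proleptic-Gregorian range is attached only when a conversion has been
// asserted -- the conversion is itself a conclusion, so it carries a rationale.
record GenDate(String originalText,
               Optional<LocalDate> earliest,
               Optional<LocalDate> latest,
               Optional<String> conversionRationale) {

    // A date left exactly as the record states it.
    static GenDate asDocumented(String text) {
        return new GenDate(text, Optional.empty(),
                           Optional.empty(), Optional.empty());
    }
}
```

GenDate.asDocumented("spring of 1545 or 1546") stays a perfectly legitimate value, and a sorting application can fill in earliest/latest with its own approximations without ever overwriting originalText.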

I personally think this also applies to ages, places (tho' I realise there is a lengthy discussion on this elsewhere) and fact types but suspect we will have to comply to keep the status quo.

jralls commented 12 years ago

Most of the 5 steps are procedural and don't require support in a model.

I vehemently disagree. They aren't steps, they are the elements required by the standard to have a valid conclusion. If any are missing or incomplete, the researcher has more work to do, and others would be remiss in accepting the researcher's conclusions -- and by extension, the file transfer.

I also question the automatic assumption that we must be able to always and definitively convert all dates, in all notations, to a Gregorian calendar form, and vice versa.

Strawman alert: I didn't say that we must always and definitively convert all dates. I said that the model must be able to accommodate the conversion in a way that is easy to communicate between systems because it's something that most extant programs do (badly, and often silently without recording the original documented date, I might add).

That said, considering the importance of the date of a recorded event in making a genealogical conclusion and your interest in automating the formation of those conclusions, I'd think that you would want to always convert to a uniform date representation.

ttwetmore commented 12 years ago

@jralls

Here are the five steps taken from the web page that defines them:

1. Reasonably exhaustive search.
2. Complete and accurate citation of sources.
3. Analysis and correlation of the collected information.
4. Resolution of conflicting evidence.
5. Soundly reasoned, coherently written conclusion.

The first is procedural. The second is handled by source records. The third is extracting persona records, doing the brainwork needed to link persona records together into groups, and recording the conclusion statements of why you did the grouping. The fourth is adding notes about the conflicting information and how you resolved it. The fifth is a document you write; if your genealogical program is powerful enough, it can do a lot of the writing for you by bringing together how you have grouped the personas and what you have added to the records to show your conclusions and your resolutions, plus whatever beautiful words you want to link everything together into a professional report.
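Roughly, in code (hypothetical names, just to show the shape I mean):

```java
import java.util.List;

// Hypothetical names throughout. A persona is extracted verbatim from one
// source; a person is the conclusion that groups personas, carrying the
// reasoning for the grouping (step 3) and any resolved conflicts (step 4).
public class GroupingSketch {
    record Persona(String id, String sourceId, List<String> statedFacts) {}

    record ConflictNote(String conflictingEvidence, String resolution) {}

    record Person(String id,
                  List<Persona> groupedPersonas,
                  String groupingRationale,
                  List<ConflictNote> conflicts) {}
}
```

A report generator (step 5) then walks the Person, its rationale, and its conflict notes to draft the write-up.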

Yes, I see that I said something entirely wrong in saying that these steps "don't need support in a model." Of course they do. I apologize for being such a poor wordsmith. The point I was trying to make is that the models we already have include all the elements needed to support the five steps. I was trying to avoid a "reset to zero" or "back to the drawing board" response based on bringing up the five steps. We have 'em covered. We should be able to say we have 'em covered and move on.

jralls commented 12 years ago

We have 'em covered. We should be able to say we have 'em covered and move on.

So aside from adopting N-tier and a few tweaks we're done, eh?

I don't think so. I'm not suggesting a "reset to zero", but I don't think that the requirements of the GPS are close to being covered. I know that you're familiar with the Gentech GDM. That model covered the GPS requirements. GedcomX doesn't, yet.

ttwetmore commented 12 years ago

@jralls Sorry, I didn't mean to imply that the GEDCOMX model was done, just that concerns over how well it supports the GPS are not near the top of the pile of worries.

I don't share your appreciation for the GDM. It is over-normalized to the point that the underlying model is unrecognizable. I consider the GDM's approach to assertions a nightmare. Your statement that GDM covers the GPS has me scratching my head.

I have explained how I see the GEDCOMX model supporting the 5 steps. Maybe you could do the same for the GDM?

EssyGreen commented 12 years ago

Most of the 5 steps are procedural and don't require support in a model.

I vehemently disagree. They aren't steps, they are the elements required by the standard to have a valid conclusion.

++1. Absolutely! And if the data to do this isn't in the model then we're stuffed.

the models we already have include all the elements needed to support the five steps.

I disagree - see #141

jralls commented 12 years ago

I don't share your appreciation for the GDM. It is over-normalized to the point that the underlying model is unrecognizable. I consider the GDM's approach to assertions a nightmare. Your statement that GDM covers the GPS has me scratching my head.

No argument about it being over-normalized, but perhaps you are missing the forest for the trees. The GDM has the following flow:

```
Search -> Source -> Representation -> Assertion
     \       /
    Repository
```

A large part of the model had to do with project administration -- clients, research objectives, and such things as are pertinent to professionals. I've left that off here, as I don't think it's relevant to GedcomX. There was a separate Place entity (like the rest, normalized to a fare-thee-well). Searches were the link between the admin stuff and the parts we're concerned with here, and could be either of repositories (looking for sources) or of sources (looking for relevant evidence).

The "Source" entities (recursive, normalized into parts and collectable into "source groups") are the metadata that makes up a citation. Sources can have one or more "Representations", which can be an image, transcript, abstract, or something else. The source entity has a "comments" element which can be used to document the researcher's analysis of the physical condition (preservation, legibility, etc.).

As is contemplated by this issue, all extraction of what we've been calling evidence (that is, creating personas, facts, and events as database entities from what we observe in the source) and of what we've been calling conclusions (everything else based on those sources) is an assertion. Each assertion refers to exactly two from the set {Persona, Event, Characteristic, Group}. Events and Groups have a place element, and assertions can be combined via an Assertion-Assertion table, thus accommodating your N-tier requirement. Each assertion is also associated with a "surety" entity and includes a Rationale and a Disproved element.
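As I read that description, an assertion row has roughly this shape -- a hypothetical Java sketch of the prose above, not the actual GDM schema:

```java
// Hypothetical sketch of a GDM-style assertion as described above,
// not the actual GDM schema.
public class GdmSketch {
    // An assertion's two participants each come from this set; ASSERTION
    // allows chains built through the Assertion-Assertion table.
    enum Kind { PERSONA, EVENT, CHARACTERISTIC, GROUP, ASSERTION }

    record Ref(Kind kind, String id) {}

    record Assertion(String id,
                     Ref subject,          // exactly two participants
                     Ref object,           //   per assertion
                     String sourceId,      // the evidence it rests on
                     int surety,           // link to the surety entity
                     String rationale,     // the researcher's reasoning
                     boolean disproved) {} // flagged rather than deleted
}
```

Chaining then just means an Assertion whose subject or object is itself a Ref(Kind.ASSERTION, ...), which is how the N-tier requirement gets accommodated.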

I have explained how I see the GEDCOMX model supporting the 5 steps. Maybe you could do the same for the GDM?

Well, you've explained how GedcomX models the requirements that you don't blow off as "procedural". But here goes:

Reasonably exhaustive search is documented by the research_objective, activity, search, repository, and source entities

Citation is documented by the repository, source, and citation entities

The rest is documented by assertions: One asserts the evidence (the personas, characteristics, events, and groups) from the source, asserts the combinations, the veracity, and so on, building a chain of assertions much as you've proposed in #149.

As you note, the assertion model is cumbersome. One winds up with a network of b-trees that would be quite daunting to coalesce into a "coherently written conclusion". The sound reasoning will be there somewhere, but presenting all of the necessary pieces in a way that the researcher can understand so that she can write up her report would be a major challenge for an implementer.

It's also bothersome that while there's a hierarchical place entity, it's not part of the assertion mechanism: Places are just elements of group and event entities, inserted without comment. Dates are also left out of the assertion mechanism; worse, they're represented however the underlying database represents them, no strings attached (sorry), and there's no way to assert that "Michaelmas in the 4th year of the reign of James II" is 29 September 1688 -- or to challenge an entry that gets it wrong.

Some might have trouble with using groups to model all relationships, but I submit that it's a better abstraction than GedcomX's, which forces one to model groups as a network of relationships.

ttwetmore commented 12 years ago

@jralls Thanks for the great reply. The only step I left as procedural is the first. That's the step that requires administrative objects. I personally don't need them in the model I would like to see, so if they remain procedural (so I have to use paper and pencil) I can live with it. If some inspired modeler can add them to GEDCOMX in a way that makes sense to me, I could live with that also. I simply don't think it's worth the effort and complication to add the administrative component to the GEDCOMX model.

As you note, the assertion model is cumbersome. One winds up with a network of b-trees that would be quite daunting to coalesce into a "coherently written conclusion". The sound reasoning will be there somewhere, but presenting all of the necessary pieces in a way that the researcher can understand so that she can write up her report would be a major challenge for an implementer.

Cumbersome puts it mildly; the assertion concept imagined by the GDM is a nightmare. If employed, every fact, every relationship, every attribute requires two extra records and three extra references. Have you imagined what a person record would be like in the GDM model? It would consist of its id, a name (which for some reason in the GDM model does not require any justification), and then a bunch of assertion references. Each assertion then has a reference back to the person, a reference to a naked fact, and a reference to something in the evidence world. In a sane world the persons would simply contain their own facts directly, each fact with its own single, direct reference to its source. Exactly the same information conveyed with one third the number of records and references. I hope GEDCOMX stays far away from that mess.
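For contrast, the denormalized shape I'm arguing for, again with purely hypothetical names:

```java
import java.util.List;

// Hypothetical: facts live directly on the persona, each with its own
// direct source reference -- the same information as the assertion web,
// conveyed with roughly a third of the records and references.
public class DirectSketch {
    record Fact(String type, String value, String sourceId, String rationale) {}

    record Persona(String id, String name, String sourceId, List<Fact> facts) {}
}
```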

The GDM assertion is a consequence of the extreme normalization inherent in the GDM. If one were to denormalize the GDM a few steps, one would end up with my type of persona record outlined above, so it's a real shame; shame on the GDM for mixing modeling with relational implementations thereof. If MongoDB had existed at the time, and if the GDM team had been as enamored with "document" databases as they were with relational databases, I believe they would have come up with something GEDCOMX would have been very comfortable with.

EssyGreen commented 12 years ago

Re including goals/search logs etc ... I think this rather depends on the scope of GEDCOMX and hence on its objectives ... If the intention is to provide a way of publishing genealogical data, then I would agree with @ttwetmore that they are not essential (but then I would suggest that much of the Conclusion Model would also fall into the nice-to-have category). But if GEDCOMX is to be used (as it is now) to transfer data between different genealogical research applications, then GEDCOMX needs to provide for all the elements of the process model to minimise data loss.

I'm becoming less convinced that the second option is even viable (as much as I would love it to be). If GEDCOMX could provide the 'be-all-and-end-all' model, there would be a very limited market for genealogy applications, since there would be little room for differentiation between products. From the discussions here I think the opposite is true: there are very many different viewpoints on what is best, and hence great opportunities for different applications, but also inevitable data loss ... GEDCOMX's task is therefore to determine the bare minimum that a 'good' application can be expected to support.

lkessler commented 12 years ago

EssyGreen said:

"If GEDCOMX could provide the 'be-all-and-end-all' model then there would be a very limited market for genealogy applications since there would be little room for differentiation of the products. From the discussions here I think that the opposite is true - that there are very many different view-points on what is best and hence great opportunities for different applications, but in so doing there is inevitable data loss .... GEDCOM X's task is therefore to determine the bare minimum that can be considered necessary for a 'good' application to support."

+1

Louis