FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

In need of a Source object #144

Closed EssyGreen closed 12 years ago

EssyGreen commented 12 years ago

I realise the infinite flexibility of inheriting everything from a Resource and hence allowing it to be considered a source but this also makes for infinite complexity and infinite nonsense!

To allow, for example, a DatePart to be used as evidence is nonsensical. Theoretically we could cite millions of documents with a DatePart of "Day=1" but it means absolutely nothing without the wider context of the Date, which means nothing without the wider context of the Fact.

I think we need a definite Source object (probably equating, or similar, to the Record) and it is this (not the Resource) which should be referenced in Citations and used in Evidence.

EssyGreen commented 12 years ago

I think @thomast73 was just trying to get a common vocabulary as a starting point for resolving the things we've discussed in this thread. Let's at least hear his next steps.

thomast73 commented 12 years ago

Why should we re-start a 5-month old discussion just because you've finally decided to join in?

I am sorry. I've no intention toward rudeness, nor am I attempting to re-start from the beginning. I am joining (at the request of @stoicflame) in hopes we can distill some of these ideas into something concrete in the GedcomX model.

The current "Description" (paragraph 3.1) isn't a class. It incorporates by reference the RDF specification, which is (unfortunately) used extensively in GedcomX.

In referring to the Description class, I am referring to the org.gedcomx.metadata.rdf.Description class in the GedcomX Source Metadata model.

Also, I think the result of our work here will be that the RDF specification will not be so prominent and that the resulting model will feel more directly applicable to our community and not so abstract and general purpose and free form.

EssyGreen commented 12 years ago

Yup I'm with you :) So where do we go from here?

thomast73 commented 12 years ago

I don't like that you're conflating "conclusion" with "hypothesis".

I see that I have misread what has previously transpired in this regard. Indeed, they are not the same as @jralls has described them. I am not sure that @EssyGreen is making a clear distinction between the two in what she has written?

So, repeating from above:

And adding:

The concepts of "reasoning" and "attribution" are combined into a single Attribution class in the current model (Release 18). In the current model, Attribution instances can be associated with SourceReference instances and with GenealogicalResource instances. It is my belief that our model has a few problems with regards to Attribution and its use within our model. In my mind, the "reason" (currently proofStatement) should be pulled out of Attribution so that Attribution is just about the "who" and the "when". It makes sense to me to associate an instance of this new Attribution' with SourceReferences and with GenealogicalResources, but it does not seem true that all instances of GenealogicalResource need an associated "reason" (e.g., does a Note need a "reason" for being?). It does seem that instances of Conclusion need "reasons" so that I want to say that "reason" ought to be an attribute of Conclusion, but not everything I consider to be a "conclusion" is currently derived from Conclusion (something we will consider modifying). My musings...but off topic somewhat; may need to open a separate issue for this.
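A minimal sketch of that split, written in Python only for brevity. The class and field names here (`Attribution`, `Note`, `Conclusion`, `contributor`, `modified`, `reason`) and the sample values are illustrative assumptions, not the Release 18 classes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Attribution:
    """Only the "who" and the "when"; the proof statement is pulled out."""
    contributor: str
    modified: datetime


@dataclass
class Note:
    """A note carries attribution but has no "reason" for being."""
    text: str
    attribution: Attribution


@dataclass
class Conclusion:
    """A conclusion carries attribution and, optionally, its "reason"."""
    value: str
    attribution: Attribution
    reason: Optional[str] = None


# Hypothetical sample data.
who_when = Attribution("researcher-1", datetime(2012, 6, 1))
note = Note("Check the parish register.", who_when)
birth = Conclusion("Born 11 July 1854", who_when, reason="Family bible entry.")
```

In this arrangement the same `Attribution` instance can be attached to anything, while only `Conclusion` has a slot for the "reason".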

Also, I do not feel satisfied with any of the names proffered here: "reasoning", "synthesis", "hypothesis", "theory", .... [Please do not take my saying so as a personally-directed critique. :-)] None of them really "speak to me" of the intended usage as I currently see it. "Reason" seems the closest fit ... but is so generic! :-) Still pondering...

EssyGreen commented 12 years ago

I don't like that you're conflating "conclusion" with "hypothesis".

I see that I have misread what has previously transpired in this regard. Indeed, they are not the same as @jralls has described them. I am not sure that @EssyGreen is making a clear distinction between the two in what she has written?

Apologies if I wasn't clear enough ... I believe that a "Conclusion" is the reasoning/rationale/explanation for why a specific Hypothesis is favoured over others. We could model this by having a collection of conflicting hypotheses together with a verbatim "conclusion" statement. However, if each Hypothesis contains not just a verbatim but also a weighted/numeric evaluation of the Evidence for and against it then a separate "Conclusion" is not necessary and will dynamically change as the Hypotheses are re-evaluated as/when new Evidence is found. If you don't like this then I would see the need for a "Conclusion" statement alongside the aggregate of conflicting Hypotheses - I think this will be more difficult to model.

I need a use case to get any clarity on this, so here's one for starters.

Say you had 2 conflicting sources (S1 & S2) for a Person's Name (but you have already established/proven that they relate to the same Person). You have a number of hypotheses: (a) S1 is correct and S2 is wrong for some reason (b) S2 is correct and S1 is wrong for some reason (c) both names were used simultaneously (d) S1 and S2 are in fact the same thing with different spellings/languages (e) the real name is something deduced from an amalgamation of S1 and S2

The "Conclusion" is (a), (b), (c), (d) or (e) plus the reason why. But if we explain "why" and give a score to each of (a)...(e) then we don't need a separate Conclusion.
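A rough sketch of the "explain why and give a score" idea, in Python. Everything here is an assumption made for illustration: the `Hypothesis` fields, the 0-to-1 scoring scale, and the sample reasons and scores.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    label: str
    statement: str
    why: str      # narrative explanation for the score
    score: float  # subjective weight; higher means better supported (assumed scale)


def preferred(hypotheses):
    """Return the currently best-scored hypothesis."""
    return max(hypotheses, key=lambda h: h.score)


# Hypothetical evaluations of three of the hypotheses above.
hypotheses = [
    Hypothesis("a", "S1 is correct and S2 is wrong", "S1 is closer to the event", 0.6),
    Hypothesis("b", "S2 is correct and S1 is wrong", "S2 was written much later", 0.1),
    Hypothesis("d", "S1 and S2 are spelling variants", "The names sound alike", 0.3),
]
before = preferred(hypotheses).label

# New evidence arrives: (d) is re-evaluated and the preferred answer changes.
hypotheses[2].score = 0.8
after = preferred(hypotheses).label
```

The "conclusion" is never stored; it is whatever `preferred` returns at the time, which is the dynamic behaviour described above.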

It is my belief that our model has a few problems with regards to Attribution and its use within our model. In my mind, the "reason" (currently proofStatement) should be pulled out of Attribution so that Attribution is just about the "who" and the "when". It makes sense to me to associate an instance of this new Attribution' with SourceReferences and with GenealogicalResources, but it does not seem true that all instances of GenealogicalResource need an associated "reason" (e.g., does a Note need a "reason" for being?).

I totally agree - see #178 :)

I do not feel satisfied with any of the names proffered here: "reasoning", "synthesis", "hypothesis", "theory"

Fair enough - read EE chapter 1 - yes John I did just say that;) - and pick a phrase that describes what you mean :) Then at least we'll all have a common reference point.

EssyGreen commented 12 years ago

PS: Since we are allowing duplicate Births and Deaths etc in the model it cannot be a "Conclusion" model ... it is in fact a "Hypothesis" model with Person, Name, "Fact", Event and Relationship all being types of Hypothesis.

jralls commented 12 years ago

The "Conclusion" is (a), (b), (c), (d) or (e) plus the reason why. But if we explain "why" and give a score to each of (a)...(e) then we don't need a separate Conclusion.

This seems a bit mechanistic, and ISTM that the scoring will be subjective and thus will need a narrative argument of some sort for each score. It seems easier to just write a narrative argument and arrive at a single synthesis/hypothesis/reasoning.

jralls commented 12 years ago

It is my belief that our model has a few problems with regards to Attribution and its use within our model.

We discussed that briefly in #134 starting here.

I don't necessarily agree that the "who" and "when" should be separated from the "proof argument". Certainly a proof argument (or synthesis, hypothesis, reasoning) needs to have one-or-more "who", but that, along with "when" can be provided by version control in a collaborative environment, and is redundant in the single-researcher case.

But if there's no proof argument, what exactly are the "who" and "when" applying to? The mere atomization of the hypothesis into (generalized, including person, relationship, and event) conclusion objects?

jralls commented 12 years ago

Conclusion (aka: Assertion): a statement of what the researcher believes after analyzing the set of evidence in hand

So if that's the "conclusion", what are you going to call its atomization into discrete characteristics, relationships, etc.?

Once one moves into the period before censuses enumerated everyone by name, reconstruction of families becomes much more dependent upon multiple sources and circumstantial evidence. Suppose you have a couple of wills, one of which explicitly lists everyone, while the other doesn't name anyone, just says "To my loving wife ... the house, its contents, and 1/3 of the land for her lifetime ... after which all remaining assets to be divided equally among my 6 children and their heirs or assigns". You find a couple of wills transferring parcels "for $1 and natural love and affection", and a family bible, but the handwriting is the same for all of the children, making it a bit suspect. It doesn't match the detailed will, either. There are property and personal tax records (including tithables appearing to come of age), and of course a couple of census entries with the right name for the HoH and similar ages & sexes for other members. Naturally, not everything agrees...

A proficient genealogist will analyze all of this documentation together and write a single analysis (which might be quite long) separating the families as best s/he can. That's the initial hypothesis. You go and dig some more to try and confirm or refute your hypothesis. You find a few more records in the town clerk's basement, but nothing really conclusive. You make some adjustments to your hypothesis to reflect the new records, and you're now satisfied that you've conducted a "reasonably exhaustive search".

To capture that in GedcomX you'll have:

Since there are no Attribution references provided, each of those objects will have its own Attribution, all of which will be identical, and an identical set of SourceReferences. There's no provision for SourceReferences in the Attribution, so any citations in the ProofArgument will have to be ad-hoc and subject to referential integrity issues. If you were foolish enough to enter your hypothesis and create the "conclusion" objects before doing the second round of research, you will have to individually edit each of the Attributions and SourceReference lists.

Granted, GedcomX isn't intended to have a UI, and there's nothing preventing your program from providing for AttributionReferences to a single proof statement and then copying that into the various objects when it exports the GedcomX file... but consider the logic needed in the receiving program if it's to recognize all of those duplicated statements and resolve them into a single object with references. I'm ignoring the fact that no existing genealogy program is anywhere near sophisticated enough to handle this anyway. GedcomX is supposed to model the Genealogical Proof Standard, not existing software.
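To give a feel for the receiving side, here is a small Python sketch of the simplest possible resolution step: grouping conclusions whose copied proof statements are textually identical. The function name, the `(conclusion_id, proof_statement)` input shape, and the sample ids are hypothetical; a real importer would have to work from the actual GedcomX objects.

```python
from collections import defaultdict


def resolve_shared_statements(conclusions):
    """Group conclusion ids by identical proof-statement text.

    `conclusions` is a list of (conclusion_id, proof_statement) pairs, as an
    importer might see them after the exporter copied one statement into
    every object. Returns {statement: [conclusion_id, ...]}.
    """
    shared = defaultdict(list)
    for conclusion_id, statement in conclusions:
        # Exact match after trimming whitespace; an edited copy would not merge.
        shared[statement.strip()].append(conclusion_id)
    return dict(shared)


# Hypothetical export in which one statement was duplicated three times.
exported = [
    ("person-1/name", "See analysis of the two wills."),
    ("person-1/birth", "See analysis of the two wills."),
    ("person-2/name", "See analysis of the two wills. "),
    ("person-3/death", "Census entry only."),
]
resolved = resolve_shared_statements(exported)
```

Even this naive version only works while every copy stays byte-for-byte the same (apart from surrounding whitespace), which is the referential integrity problem in miniature.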

jralls commented 12 years ago

In referring to the Description class, I am referring to the org.gedcomx.metadata.rdf.Description class in the GedcomX Source Metadata model.

Yup. RDF.DESCRIPTION, or in the RDF XML Syntax spec rdf:Description.

Also, I think the result of our work here will be that the RDF specification will not be so prominent and that the resulting model will feel more directly applicable to our community and not so abstract and general purpose and free form.

Sure hope so. As I've said several times before, RDF should be an implementation detail. It has no place in the conceptual model/specification. But at present it's used many places, not just in SourceReferences.

jralls commented 12 years ago

PS: Since we are allowing duplicate Births and Deaths etc in the model it cannot be a "Conclusion" model ... it is in fact a "Hypothesis" model with Person, Name, "Fact", Event and Relationship all being types of Hypothesis.

That's a misfeature of many genealogy programs, a lame way to handle conflicting evidence. It fits well with RDB architecture but has little else to recommend it. On the other hand, constraining certain "fact" and "event" types to one-per-person rather complicates the model.

EssyGreen commented 12 years ago

The "Conclusion" is (a), (b), (c), (d) or (e) plus the reason why. But if we explain "why" and give a score to each of (a)...(e) then we don't need a separate Conclusion.

This seems a bit mechanistic, and ISTM that the scoring will be subjective and thus will need a narrative argument of some sort for each score. It seems easier to just write a narrative argument and arrive at a single synthesis/hypothesis/reasoning.

I totally agree that a narrative is needed to support the numeric evaluations (which are of course subjective) ... I think where the narrative goes (and whether it applies to an aggregate or single evidence objects) depends on the general approach being taken by the researcher, i.e. conclusion-based (by this I mean the researcher focuses on presenting a cohesive tree of non-conflicting hypotheses which all hang together) vs evidence-based (by this I mean the researcher presents all their hypotheses - whether conflicting or not - directly in the tree and indicates which are "preferred" in some way). Personally I prefer conclusion-based research (and would hence have the verbatim evaluation as part of each hypothesis) but the GEDCOM X model is evidence-based and so there is nowhere for the aggregate evaluation to go because no aggregation is ever done.

EssyGreen commented 12 years ago

if there's no proof argument, what exactly are the "who" and "when" applying to?

In my opinion the who and when are the researcher (assertion) whereas the proof statement is the "what" (e.g. Person A = Person B). The researcher info can be right at top as part of the source; the proof statement needs to be where the assertion/hypothesis/whatever is being made.

EssyGreen commented 12 years ago

A proficient genealogist will analyze all of this documentation together and write a single analysis (which might be quite long) separating the families as best s/he can. That's the initial hypothesis.

I agree - that's what I call conclusion-based research above

EssyGreen commented 12 years ago

PS: Since we are allowing duplicate Births and Deaths etc in the model it cannot be a "Conclusion" model ... it is in fact a "Hypothesis" model with Person, Name, "Fact", Event and Relationship all being types of Hypothesis.

That's a misfeature of many genealogy programs, a lame way to handle conflicting evidence.

I agree (again!) but it's also the way GEDCOM X is going

thomast73 commented 12 years ago

I do not feel satisfied with any of the names proffered here: "reasoning", "synthesis", "hypothesis", "theory"

Fair enough - read EE chapter 1 - yes John I did just say that;) - and pick a phrase that describes what you mean :)

EE 2nd Ed, Section 1.3 is titled "Conclusions: Hypothesis, Theory & Proof" and essentially states that hypotheses, theories and proofs are "conclusions" in various states of proven-ness. So these names are a classification system (of sorts) for conclusions.

The state of proven-ness seems rather subjective...what is proven in one person's view may not be satisfactorily so in another's view. Perhaps this is the reason @EssyGreen wishes to emphasize the "unproven" end of the spectrum with the "hypothesis" name.

For the model's sake, the generic "conclusion" designation seems good enough. We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

The @jralls' "synthesis" name seems to conflate the notion of "conclusion" with the processes required to demonstrate proven-ness -- the reasoning/rationale/explanation. It also seems to assume that all conclusions can be demonstrated with more than one piece of selected evidence. It seems to me that a "conclusion" can exist with or without proven-ness -- that conclusions can exist with no substantiation, based on a single piece of evidence, or based on a synthesis of selected evidence.

{@EssyGreen}a "Conclusion" is the reasoning/rationale/explanation for why a specific Hypothesis is favoured over others

The reasoning/rationale/explanation for a conclusion is not a conclusion, but rather information about how we arrived at said conclusion.

Conclusion (aka: Assertion): a statement of what the researcher believes after analyzing the set of evidence in hand

I think my definition might still be susceptible to misinterpretation. I want to be sure it is narrowly construed to exclude the statement of rationale. So perhaps something like this:

In review, I think the following comparisons might be true.

{@jralls}"synthesis" is not {@thomast73}"conclusion" {@jralls}"synthesis" is not {@thomast73}"reasoning" {@jralls}"synthesis" might be the same as {@thomast73}"conclusion" + {@thomast73}"reasoning"

{@EssyGreen}"hypothesis" might be the same as {@thomast73}"conclusion" {@EssyGreen}"conclusion" might be the same as {@thomast73}"reasoning"

EssyGreen commented 12 years ago

these names are a classification system (of sorts) for conclusions.

Ouch I shot myself in the foot there didn't I ? lol :)

The state of proven-ness seems rather subjective...what is proven in one person's view may not be satisfactorily so in another's view. Perhaps this is the reason @EssyGreen wishes to emphasize the "unproven" end of the spectrum with the "hypothesis" name.

Indeed :) I'm a pessimist and very wary of over-optimism where secondary sources are concerned :)

{@EssyGreen}a "Conclusion" is the reasoning/rationale/explanation for why a specific Hypothesis is favoured over others

The reasoning/rationale/explanation for a conclusion is not a conclusion, but rather information about how we arrived at said conclusion.

OK ... so what/where is the " reasoning/rationale/explanation for why a specific Hypothesis is favoured over others" in the model? Where do I put conflicting/competing hypotheses? How do I distinguish these from non-conflicting hypotheses?

{@EssyGreen}"hypothesis" might be the same as {@thomast73}"conclusion" {@EssyGreen}"conclusion" might be the same as {@thomast73}"reasoning"

I think you might be right :)

thomast73 commented 12 years ago

I think where the narrative goes (and whether it applies to an aggregate or single evidence objects) depends on the general approach being taken by the researcher, i.e. conclusion-based (by this I mean the researcher focuses on presenting a cohesive tree of non-conflicting hypotheses which all hang together) vs evidence-based (by this I mean the researcher presents all their hypotheses - whether conflicting or not - directly in the tree and indicates which are "preferred" in some way). Personally I prefer conclusion-based research (and would hence have the verbatim evaluation as part of each hypothesis) but the GEDCOM X model is evidence-based and so there is nowhere for the aggregate evaluation to go because no aggregation is ever done.

An intriguing statement ... though I'm not sure I completely comprehend what you are saying here.

The object(s) I associate with my "rationale" statement is a function of what question(s) I am attempting to answer via that statement. If the statement is just about a birth date, I'd associate it with the conclusion object most closely associated with that proffered representation -- e.g., the "birth" Fact. If the "rationale" statement was more of a synthesis of analysis involving many conclusions and pieces of evidence -- i.e., {@EssyGreen} "a cohesive tree of non-conflicting [conclusions] which all hang together" -- perhaps we could associate it with each conclusion object relevant to the statement (perhaps even across multiple "person" conclusions and their subordinate conclusions), or perhaps just associate it with the enclosing conclusion (e.g., the "person" conclusion that includes the "birth", "death", etc. conclusions discussed in the "rationale" statement).

But if the model supported associating a "rationale" statement with any conclusion, and with more than one conclusion, wouldn't the model support both research "styles" -- "evidence-based" and "conclusion-based" research?

OK ... so what/where is the " reasoning/rationale/explanation..." in the model?

Good question. Right now it is conflated into Attribution. Right now a single statement cannot be referenced by multiple conclusions (@jralls also discusses this here). Seems like we need some work here.

thomast73 commented 12 years ago

The "rationale" statement, at a base level, could be represented as a Note. Is there a strong reason to distinguish this type of note -- the "rationale" note -- from other notes?

thomast73 commented 12 years ago

So, I guess I am going to give an answer to my own question -- in part, because I wanted to get in on the whole "talking to myself" thing. ;-)

Yes, the "rationale" statement would benefit from being something distinct from Note in that I would like to associate sources with my statement -- something that I would not do with Note.
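As a sketch of what "distinct from Note" could mean, in Python: a rationale object that can cite sources and be shared by several conclusions, next to a plain note that cannot. The names `Rationale`, `source_refs`, and the `source:...` identifiers are made up for illustration and are not part of the current model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Note:
    """Free text only; no source references."""
    text: str


@dataclass
class Rationale:
    """Like a note, but it can cite sources and be shared between conclusions."""
    text: str
    source_refs: List[str] = field(default_factory=list)


@dataclass
class Conclusion:
    value: str
    rationale: Optional[Rationale] = None
    notes: List[Note] = field(default_factory=list)


# One statement, two sources, referenced by two conclusions (hypothetical data).
shared = Rationale(
    "The family bible and the will agree on the birth order.",
    source_refs=["source:family-bible", "source:will"],
)
birth = Conclusion("Born 11 July 1854", rationale=shared)
name = Conclusion(
    "John Smith",
    rationale=shared,
    notes=[Note("Visit the record office next week.")],
)
```

Editing `shared` once updates what both conclusions point to, which also addresses the "single statement referenced by multiple conclusions" gap mentioned earlier.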

jralls commented 12 years ago

The @jralls' "synthesis" name seems to conflate the notion of "conclusion" with the processes required to demonstrate proven-ness -- the reasoning/rationale/explanation.

That's because you're using "conclusion" in its broadest sense, while I'm using it to mean the atomized classes derived from "Conclusion" in GedcomX plus the Relationship class and (prospectively) the Event class.

It also seems to assume that all conclusions can be demonstrated with more than one piece of selected evidence.

No, that conclusions must be demonstrated with as much evidence as can be collected from a "reasonably exhaustive search", as specified by the Genealogical Proof Standard. No "selection" of evidence is permitted: All the evidence must be considered and all contradictions explained in the proof argument.

It seems to me that a "conclusion" can exist with or without proven-ness -- that conclusions can exist with no substantiation, based on a single piece of evidence, or based on a synthesis of selected evidence.

Perhaps, but that's not genealogy and it is certainly not consistent with the GPS.

Are you telling us that the GedcomX development team is repudiating Ryan's statement of purpose in #154, and that the official position of FamilySearch is that the GPS is no longer important to GedcomX?

thomast73 commented 12 years ago

Are you telling us that the GedcomX development team is repudiating Ryan's statement of purpose in #154, and that the official position of FamilySearch is that the GPS is no longer important to GedcomX?

No.

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

It seems to me that a "conclusion" can exist with or without proven-ness -- that conclusions can exist with no substantiation, based on a single piece of evidence, or based on a synthesis of selected evidence.

Perhaps, but that's not genealogy and it is certainly not consistent with the GPS.

Genealogical data evolves. I often start with someone's word, then find one source, then maybe another, and so on, until I reach my preferred level of proven-ness. At any given time, conclusions in my tree are in various states of proven-ness. Not only that, my preferred level of proven-ness may not meet your own standard for proven-ness. So, does a lack of proof, or a sub-standard proof, mean it is no longer a conclusion? I would say "no". In a data exchange, it would still be exchanged as a conclusion. In accepting it, you might require that more work be done, or that additional statements be made, or you may even choose to skip/reject it, but a lack of proven-ness ought not prevent an exchange.

The @jralls' "synthesis" name seems to conflate the notion of "conclusion" with the processes required to demonstrate proven-ness -- the reasoning/rationale/explanation.

That's because you're using "conclusion" in its broadest sense, while I'm using it to mean the atomized classes derived from "Conclusion" in GedcomX plus the Relationship class and (prospectively) the Event class.

I guess I am missing your point?

Whether I am talking about the abstract "conclusion" concept, or a specialization (atomization?) of it (e.g., an instance of Name), wouldn't we still model the "rationale" statement separate from the "conclusion"? How does talking about a specialization of conclusion change how we model the "conclusion" and the "rationale" that led to it?

EssyGreen commented 12 years ago

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? Surely the credibility of GEDCOM X lies in it supporting the GPS! How can you possibly produce a "standard" which doesn't support another established standard you already support on the same subject? The GPS isn't exactly controversial - in fact it's pretty much Motherhood & Apple Pie

EssyGreen commented 12 years ago

does a lack of proof, or a sub-standard proof, mean it is no longer a conclusion? I would say "no". In a data exchange, it would still be exchanged as a conclusion. In accepting it, you might require that more work be done, or that additional statements be made, or you may even choose to skip/reject it, but a lack of proven-ness ought not prevent an exchange.

That is true but the "level of proven-ness" should also be able to be transported. If I had a work in progress and was investigating a theory that A=B I would want that indicated so that I couldn't be mis-cited as saying "Sarah has proven A=B"
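Something along these lines, sketched in Python. The three levels are borrowed from the EE Section 1.3 title quoted earlier; the `Provenness` enum, the `status` field, and the `cite` helper are assumptions for illustration, not anything in the model today.

```python
from dataclasses import dataclass
from enum import Enum


class Provenness(Enum):
    HYPOTHESIS = "hypothesis"
    THEORY = "theory"
    PROOF = "proof"


@dataclass
class Conclusion:
    statement: str
    status: Provenness = Provenness.HYPOTHESIS  # unproven unless stated otherwise


def cite(conclusion, researcher):
    """Render a citation that cannot overstate the level of proof."""
    if conclusion.status is Provenness.PROOF:
        verb = "has proven"
    else:
        verb = "is investigating whether"
    return f"{researcher} {verb} {conclusion.statement}"


work_in_progress = Conclusion("A=B")
line = cite(work_in_progress, "Sarah")
```

Because the status travels with the conclusion, a receiving application can refuse to say "has proven" unless the sender marked it so.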

EssyGreen commented 12 years ago

The "rationale" statement, at a base level, could be represented as a Note. Is there a strong reason to distinguish this type of note -- the "rationale" note -- from other notes?

YES!!! One of the problems of old GEDCOM was that you could stuff NOTEs in anywhere and they meant nothing so a receiving program can't tell the difference between something like "I must remember to go to the record office next week" vs "This source is really unreliable" vs "I believe that Person A = Person B because ..." vs "This is a picture of my grandmother" vs "Although the marriage cert says this, it can't possibly relate to this person because ..." vs ... ad infinitum.

If we are going the "everything can be verbatim text" route then we don't need much of a model - let's just use wiki markup and have done with it. If not, then we need to understand and model the key statements which are important for an application to recognise/understand and make these distinctive objects which the applications can then utilise. A conclusion/rationale/hypothesis/whatever is absolutely vital in being able to "understand" genealogical data.

Personally I think generic Note objects are superfluous, ambiguous and confusing and I would rather not have them at all. Instead allow a single generic CDATA narrative field wherever appropriate.

jralls commented 12 years ago

Whether I am talking about the abstract "conclusion" concept, or a specialization (atomization?) of it (e.g., an instance of Name), wouldn't we still model the "rationale" statement separate from the "conclusion"? How does talking about a specialization of conclusion change how we model the "conclusion" and the "rationale" that led to it?

At the abstract level, everything from extraction of evidence on is a conclusion: Even the date of an event recorded in a document needs to be analyzed to determine the calendar in use at that time and place. Modelling that was the approach taken by the Gentech GDM, and it was so cumbersome that no-one implemented it. You'll find some choice comments from Tom Wetmore about it in some of the other issues.

"Atomization" means isolating each conclusion element (the name, the date, the place, the participants, etc.) into its own object (or RDB field) for machine representation. That's necessary to some extent for genealogy software to work -- particular relationships, for example, need to be recognized by the program in order to form a tree. When one is actually working on reconstructing a family it's rare to be able to isolate a particular "atom" because a source which provides no context to an atom isn't useful: Consider a bible you find at a garage sale with a single note, "John Smith was born July 11, 1854". No provenance, no other names, nothing to tell you which John Smith among the thousands born in that period it's talking about. Not very useful. At the other end of the spectrum, a bible in the possession of a member of the Smith family you're working on, with 3 generations of Smith BMD entries all written in different inks and hands. It also has an entry, "John Smith was born July 11, 1854", but the provenance of the bible and the other entries tell you exactly which John Smith it means. Other documents with overlapping information allow you to relate the evidence in each to the others and to build a complete picture of the family. You can't get there by taking John Smith's birth date out of the 3 or 4 which mention it and working only on that -- especially if the dates don't agree.

How does that change how we model conclusion and rationale? In the "standard model", used by almost every genealogy database I've seen, there are only molecules: A birth event, for example, with a date, a place, and a list of participants as atoms. There's a list of sources, and a "note" field to collect some sort of rationale if the user is motivated to do so -- but it's a pain to discuss the sources because there's no way to link them to the note. There's also no way to tie the note to other molecules. (GRAMPS is the exception: We support linking sources inside notes, and notes are separate objects that can be linked to as many other objects as the researcher wants.)

GedcomX as presently designed goes further: There aren't any molecules, just compounds (Persons, Relationships) and a glass-like blob of atoms for each. Ryan has proposed adding Events, which are almost molecular -- but he hasn't committed the change in spite of general approval from the 3 or 4 people who bothered to comment.

In the "synthesis" model I've been trying to explain, the "rationale" collects all of the relevant evidence and explains a body of the "abstract" conclusions -- however much in the researcher's judgement is needed to deal with a manageable part of the puzzle. The "atomic" or "molecular" conclusions can then link to the rationale.

EssyGreen commented 12 years ago

In the "standard model", used by almost every genealogy database I've seen, there are only molecules: A birth event, for example, with a date, a place, and a list of participants as atoms. There's a list of sources, and a "note" field to collect some sort of rationale if the user is motivated to do so -- but it's a pain to discuss the sources because there's no way to link them to the note. There's also no way to tie the note to other molecules. (GRAMPS is the exception: We support linking sources inside notes, and notes are separate objects that can be linked to as many other objects as the researcher wants.)

Unless I misunderstand you, your experience differs from mine ... old GEDCOM spec has NOTEs exactly as you specify here (they can be linked to virtually anything and can reference any number of sources via citations). I've seen this implemented many times (e.g. Family Historian, Family Tree Builder, TNG). But this "flexibility" comes at a price ... it is impossible to deduce when importing from elsewhere what the "Note" was used for (whether that be evidence, proof, rationale, to-do lists, captions on pictures, narrative descriptions of places etc etc etc). So all the importing program can do is keep the structure and render it as "text". What I believe is missing is the ability to distinguish the type of "Note" being made (without having to hazard a guess by looking at the types of objects it links together).
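For illustration, a Python sketch of a note that declares its own type, so an importer does not have to guess from what it links to. The `NoteKind` values are invented for this example (nothing like them exists in old GEDCOM or the current model); the sample texts are the ones from my earlier comment.

```python
from dataclasses import dataclass
from enum import Enum


class NoteKind(Enum):
    TODO = "todo"
    SOURCE_EVALUATION = "source-evaluation"
    RATIONALE = "rationale"
    CAPTION = "caption"
    UNKNOWN = "unknown"  # e.g. an untyped NOTE imported from old GEDCOM


@dataclass
class Note:
    text: str
    kind: NoteKind = NoteKind.UNKNOWN


def rationales(notes):
    """Select the notes an application should treat as proof reasoning."""
    return [n for n in notes if n.kind is NoteKind.RATIONALE]


notes = [
    Note("I must remember to go to the record office next week", NoteKind.TODO),
    Note("This source is really unreliable", NoteKind.SOURCE_EVALUATION),
    Note("I believe that Person A = Person B because ...", NoteKind.RATIONALE),
    Note("This is a picture of my grandmother", NoteKind.CAPTION),
    Note("Text imported from a legacy file"),
]
found = rationales(notes)
```

Anything imported without a type stays `UNKNOWN` and is rendered as plain text, exactly as importers have to do today.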

jralls commented 12 years ago

Personally I think generic Note objects are superfluous, ambiguous and confusing and I would rather not have them at all. Instead allow a single generic CDATA narrative field wherever appropriate.

That's XML-specific: The "conceptual model" needs something to hold the string for serialization to CDATA for XML and to whatever else for other implementations. It need not be a top-level object, though; in many cases a simple string parameter will do. Other cases, though, and "rationale" is one, should have a top-level object that can be referenced by one-to-many other objects.

jralls commented 12 years ago

Genealogical data evolves. I often start with someone's word, then find one source, then maybe another, and so on, until I reach my preferred level of proven-ness. At any given time, conclusions in my tree are in various states of proven-ness. Not only that, my preferred level of proven-ness may not meet your own standard of proven-ness. So, does a lack of proof, or a sub-standard proof, mean it is no longer a conclusion? I would say "no".

Yes, agreed, that's the nature of the research process. As Sarah says, it is vital that the "proven-ness" is firmly attached to the conclusions. A proficient genealogist will insist that the "proven-ness" consists of:

In other words, a demonstration of how far along the research is in complying with the GPS. As I noted earlier, this level of detail is not supported by most extant programs.

In a data exchange, it would still be exchanged as a conclusion. In accepting it, you might require that more work be done, or that additional statements be made, or you may even choose to skip/reject it, but a lack of proven-ness ought not prevent an exchange.

Agreed, but the lack of "proven-ness" must be clearly stated and firmly attached to the conclusions.

jralls commented 12 years ago

perhaps we could associate it with each conclusion object relevant to the statement (perhaps even across multiple "person" conclusions and their subordinate conclusions), or perhaps just associate it with the enclosing conclusion (e.g., the "person" conclusion that includes the "birth", "death", etc. conclusions discussed in the "rationale" statement).

But if the model supported associating a "rationale" statement with any conclusion, and with more than one conclusion, wouldn't the model support both research "styles" -- "evidence-based" and "conclusion-based" research?

Now you're getting it. There's a discussion of whether the proof statement should be a giant one for each person or more focused on subsets of conclusions between Tom, Sarah, maybe Louis, and me a couple of weeks ago, but now I can't find it. If the model can be structured to support both that would be fine with me, though it would complicate parsing for receiving programs.

BTW, I don't agree that those are research styles. I'd say that they're presentation styles, and I find the one labelled "evidence-based" to be rather lacking because it doesn't allow for the evidence to be treated in a single unit. Unfortunately it's the one used by most programs, so it would be counter-productive to not support it.

EssyGreen commented 12 years ago

@jralls - I think your example of John Smith is excellent but I'm not convinced that it supports your need for a rationale as a top-level object ... The way I see it you have a subject you are researching and a number of "conclusions" (generic sense of the word) you are investigating. The source doesn't come out of the ether - you happen to notice it in the car boot sale because you are researching a John Smith and think "aha! I wonder if this will provide any evidence to support my theories about my John Smith" ... you analyse it in this context, firstly ascertaining the likelihood that this "John Smith" is your John Smith and then moving on to extracting further information to support, challenge and/or supplement your existing hypotheses about John Smith and his relationships etc. In each case the rationale is related to only one "Conclusion" (object) .... which can then be used as evidence for further conclusions. I think if you try to prove many Conclusions in one lump it gets very confusing and very hard to unravel. If you chain them together then if a link breaks you can trek back up the chain unpicking each dependent conclusion en route.

If you push the rationale to the top level then you are putting the answer before the question.

EssyGreen commented 12 years ago

I don't agree that those are research styles. I'd say that they're presentation styles, and I find the one labelled "evidence-based" to be rather lacking because it doesn't allow for the evidence to be treated in a single unit. Unfortunately it's the one used by most programs, so it would be counter-productive to not support it.

Sadly I believe that it is increasingly used as a "research" style - tho' it might be more appropriately called a "search style" since all the effort goes into the searching and none into the analysis and evaluation of the findings. Personally I would label it as "Junk Genealogy" but I thought that would be impolite.

jralls commented 12 years ago

We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

We should ask, and make it easy to comply. In the interest of supporting existing programs, we can't require it.

EssyGreen commented 12 years ago

We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

We should ask, and make it easy to comply. In the interest of supporting existing programs, we can't require it.

I'm actually split on this one ... yes it sounds sensible but as a genealogist I know I won't believe it anyway ... I have to take every secondary source and trace back to validate it so I don't care if they say "Definitely" ... I'm not going to believe them. I'm more interested in what their sources were. If they supply some rationale then I'd be interested to see their view-point in case it enlightened mine but that will come from the text and not from a "scale of proven-ness".

jralls commented 12 years ago

Personally I would label it as "Junk Genealogy"

+1!

but I thought that would be impolite.

Oh well. ;-)

Yes, it is a common style among novices, and is I think driven by the way that most programs work. Have you seen Ancestry Insider's Genealogical Maturity Model? He also took a shot at Evidence Management that's useful, though I don't think that he sufficiently treats circumstantial evidence.

EssyGreen commented 12 years ago

As an aside, I'm aware that we're debating from two different angles here ... what we want researchers to do and have available for themselves vs what we want researchers to do when making their data available to others, which comes back to #141 (as does the whole debate about fitting with the process model). I didn't get a clear view from that discussion about what is most important to GEDCOM.

EssyGreen commented 12 years ago

it is a common style among novices, and is I think driven by the way that most programs work.

Indeed - I was sort of hoping GEDCOM X would "encourage" the apps to move towards a better way. Thx for the links - reading now :)

thomast73 commented 12 years ago

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? Surely the credibility of GEDCOM X lies in it supporting the GPS! How can you possibly produce a "standard" which doesn't support another established standard you already support on the same subject? The GPS isn't exactly controversial - in fact it's pretty much Motherhood & Apple Pie

It's not that it wouldn't be good. It's just that almost all existing data does not exist in that form. And as @jralls says, almost all tools do not lend themselves to the documentation needs of GPS. To get this existing data into GPS requires each conclusion to be individually examined, documented, etc. in the GPS way.

EssyGreen commented 12 years ago

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? [...]

It's not that it wouldn't be good. It's just that almost all existing data does not exist in that form.

But if we provide a model that does conform to the standard but also makes it recommended but not mandatory then surely that would be better than just ignoring it? For example (top of my head):

There's many a genealogist out there just dying for this (in fact I'd put my hand up and say that that's exactly why I'm here) so they get a decent genealogy application and can fill in the blanks. If we just follow the status quo then let's just mock up GEDCOM 6 in XML and go home.

thomast73 commented 12 years ago

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? [...]

It's not that it wouldn't be good. It's just that almost all existing data does not exist in that form.

But if we provide a model that does conform [and make] it recommended but not mandatory [...]

As I understand it, that is our goal... ...so we are on the same page -- from a goal point of view.

EssyGreen commented 12 years ago

Wonderful :) So where are we on the detail?

thomast73 commented 12 years ago

We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

We should ask, and make it easy to comply. In the interest of supporting existing programs, we can't require it.

I'm actually split on this one ... yes it sounds sensible but as a genealogist I know I won't believe it anyway ... I have to take every secondary source and trace back to validate it so I don't care if they say "Definitely" ... I'm not going to believe them. I'm more interested in what their sources were. If they supply some rationale then I'd be interested to see their view-point in case it enlightened mine but that will come from the text and not from a "scale of proven-ness".

So this gets to the core of things, in my mind.

We need an object that is (perhaps) a specialization of Note (maybe called EvidenceAnalysis?), that can be associated with multiple conclusions, and that can have associated sources. [I have proposed some model changes that include such a thing to @stoicflame and his initial feedback is positive.]

The presence of such an object would indicate research documented in a GPS fashion. As @EssyGreen states, most researchers will re-evaluate all the sources and the statement of analysis anyway, so the analysis narrative and all of the subjective material ought to stay a part of the statement -- not be called out by the model.

At an application level, it seems that it might be useful to allow an individual user to further decorate conclusions or analysis with indicators to revisit them because they are not satisfied (perhaps even a proven-ness scale), but these indicators/scales break down when the data becomes shared or exchanged because it is opinion and cannot be reliably interpreted from one researcher to another -- unless (perhaps) it can be when the analysis itself has reached the "proven" state according to the strictest interpretation of the GPS standard.

{@jralls} A proficient genealogist will insist that the "proven-ness" consists of:

Sources and repositories searched so far

  • connected as sources to the EvidenceAnalysis, summarized in the analysis narrative

the results of each search (including searches which turned up no sources).

  • described in the analysis narrative

Complete citations for each source used

  • connected as sources to the EvidenceAnalysis

The evidence found, clearly traceable to which source and clearly identified as direct or inferred.

  • ...needs work...

An analysis of each source, including legibility, context, provenance, informant(s), etc.

  • described in the analysis narrative and/or the source description

An analysis of all of the evidence taken together, explaining any inconsistencies and leading to the conclusions made from the evidence.

  • described in the analysis narrative

A plan for further research if the "concluder" is not satisfied with the level of "proven-ness".

  • described in the analysis narrative

A clear statement of the conclusions.

  • described in the analysis narrative, represented as conclusions which reference this analysis

@EssyGreen gives a similar list here. If I am understanding things well enough, the differences between her list and @jralls' list can be found in their modeling of the extracted evidence -- the part above marked as needing work.

So in summary, if we could get an EvidenceAnalysis object introduced into the model as described, it takes us a fair distance on the path toward being able to represent data documented via the GPS standard...

...which brings us back to our original purpose?

jralls commented 12 years ago

...which brings us back to our original purpose?

Which was that the RDF.Description (incorrectly) specified in para 3.1 is inadequate, and that we need in addition to your EvidenceAnalysis class * a proper Source class, the definition of which Sarah and I settled upon before you arrived.

* How about a description, in the same sort of rough spirit as we used for the Source class, of what you have in mind?

thomast73 commented 12 years ago

How about a description, in the same sort of rough spirit as we used for the Source class, of what you have in mind?

SourceDescription (replacing Description)

The big hole here is the "needs work" aspects of extractedEvidence. There is the business of "extracts" (i.e., full or partial transcriptions), possibly translations, and the business about evidence objects that represent the genealogically significant evidence (e.g., persons, relationships, etc.). I do not want to discuss all that might or should be involved here. However, given that all of the data in this extractedEvidence category are derivatives of the Source, I would like to think about whether this information is part of (contained by) the SourceDescription for the Source (with attribution on each derivative), or whether the extracted evidence ought to be more loosely coupled (by reference) to the Source (via a SourceDescription indicating the person making the derivations and a componentOf reference pointing to the SourceDescription describing the Source)? Is SourceDescription better with or without an extractedEvidence member?

jralls commented 12 years ago

OK, that adds some more fields to the Source class Sarah and I had already worked out. But what I asked for was the EvidenceAnalysis structure.

Notation question: List<Note> is obviously java, a List of Note objects. But what is ResourceReference<foaf::Agent>? From context it seems to mean typedef foaf::Agent ResourceReference; (C; I don't know how to say that in Java). Is that what you meant?

ISTM "displayName" and "alternateNames" aren't really necessary for exchange.

Yes, a general purpose Note could be used to hold the source analysis, but that has the same problems as using a general Note object for the "rationale statement", which you already dismissed.

I started to say that I could go either way on extractedEvidence, but as I wrote it up I realized that it is likely to get too heavyweight. The argument in favor is that for XML I think it's better practice to keep hierarchies intact rather than to have a bunch of links all over the place. The problem is that extractedEvidence is conclusional: It's the result of a researcher's analysis, so it needs to be attributed in a collaborative environment, may need a note explaining inferences, and ideally it would be under version control of some sort. That argues for a separate class holding a SourceReference rather than a List inside the SourceDescription. Another point is that one is likely to want to have references to extractedEvidence objects in dependent conclusions, or at least in proof arguments ("reasonings"), which is more complicated with embedded elements. Yes, I realize that that's a change from my earlier position.
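To make the "separate class" option concrete, here is a rough Java sketch. It is an assumption-laden illustration, not a proposed GedcomX definition: the field names (`contributorId`, `extract`, `inferenceNote`) and the string-based `SourceReference` are hypothetical, and version control is left out entirely.

```java
import java.util.Objects;

// Hypothetical pointer to a SourceDescription, identified here by a plain id string.
final class SourceReference {
    private final String sourceDescriptionId;

    SourceReference(String sourceDescriptionId) {
        this.sourceDescriptionId = Objects.requireNonNull(sourceDescriptionId);
    }

    String getSourceDescriptionId() { return sourceDescriptionId; }
}

// Extracted evidence as its own object rather than a member of SourceDescription:
// it has an id that other objects can reference, an attribution, and an optional note.
final class ExtractedEvidence {
    private final String id;
    private final SourceReference source;
    private final String contributorId; // who made the extraction (attribution)
    private final String extract;       // transcription, abstract, or translation
    private String inferenceNote;       // optional explanation of inferences

    ExtractedEvidence(String id, SourceReference source, String contributorId, String extract) {
        this.id = Objects.requireNonNull(id);
        this.source = Objects.requireNonNull(source);
        this.contributorId = Objects.requireNonNull(contributorId);
        this.extract = Objects.requireNonNull(extract);
    }

    String getId() { return id; }
    SourceReference getSource() { return source; }
    String getContributorId() { return contributorId; }
    String getExtract() { return extract; }
    String getInferenceNote() { return inferenceNote; }
    void setInferenceNote(String inferenceNote) { this.inferenceNote = inferenceNote; }
}
```

Because each `ExtractedEvidence` has its own id, a dependent conclusion or proof argument can point at it directly, which is the part that gets awkward with embedded elements.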

thomast73 commented 12 years ago

... what I asked for was the EvidenceAnalysis structure.

In my mind, EvidenceAnalysis extends Note and adds a list of SourceReferences.
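Roughly, and only as a sketch under assumptions, that could look like the Java below. `Note`, `SourceReference`, and `Conclusion` here are simplified stand-ins invented for illustration, not the actual GedcomX classes, and the accessor names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the model's Note: just a text holder.
class Note {
    private final String text;
    Note(String text) { this.text = text; }
    String getText() { return text; }
}

// Simplified stand-in: a reference to a source description by id.
final class SourceReference {
    private final String sourceDescriptionId;
    SourceReference(String sourceDescriptionId) { this.sourceDescriptionId = sourceDescriptionId; }
    String getSourceDescriptionId() { return sourceDescriptionId; }
}

// EvidenceAnalysis is a Note (the analysis narrative) plus the sources it discusses.
final class EvidenceAnalysis extends Note {
    private final List<SourceReference> sources = new ArrayList<>();

    EvidenceAnalysis(String narrative) { super(narrative); }

    void addSource(SourceReference source) { sources.add(source); }
    List<SourceReference> getSources() { return Collections.unmodifiableList(sources); }
}

// Simplified stand-in for any conclusion (person, birth, relationship, ...).
// The analysis is optional: its presence signals GPS-style documentation.
final class Conclusion {
    private final String id;
    private EvidenceAnalysis analysis;

    Conclusion(String id) { this.id = id; }

    String getId() { return id; }
    EvidenceAnalysis getAnalysis() { return analysis; }
    void setAnalysis(EvidenceAnalysis analysis) { this.analysis = analysis; }
}
```

Several `Conclusion` objects can share one `EvidenceAnalysis` instance, so a single narrative can cover, say, a birth and a parent-child relationship, while a conclusion with no analysis still exchanges normally.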

thomast73 commented 12 years ago

Notation question: List<Note> is obviously java, a List of Note objects. But what is ResourceReference<foaf::Agent>? From context it seems to mean typedef foaf::Agent ResourceReference; (C; I don't know how to say that in Java). Is that what you meant?

Sorry. This is foaf::Agent. There are two specializations of Agent in the model: Organization and Person. If the owner/holder of the Source was an archive, this ResourceReference would probably point to an instance of Organization; if, instead, the source was held by an individual (e.g., a family bible?), it would reference a Person.
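In Java terms, the notation can be read as a generic reference whose target type is Agent. The sketch below is illustrative only: the class shapes, the `resolve` method, and the map-based registry are assumptions made for the example, not the model's actual API.

```java
import java.util.Map;

// Stand-in for foaf::Agent and its two specializations in the model.
abstract class Agent {
    private final String name;
    Agent(String name) { this.name = name; }
    String getName() { return name; }
}

final class Organization extends Agent {
    Organization(String name) { super(name); }
}

final class Person extends Agent {
    Person(String name) { super(name); }
}

// A typed pointer to some resource of type T, held by id rather than by value.
final class ResourceReference<T> {
    private final String resourceId;

    ResourceReference(String resourceId) { this.resourceId = resourceId; }

    String getResourceId() { return resourceId; }

    // Hypothetical lookup: returns the referenced resource, or null if it is unknown.
    T resolve(Map<String, ? extends T> registry) {
        return registry.get(resourceId);
    }
}
```

A `ResourceReference<Agent>` for a source held by an archive would resolve to an `Organization`; for a family bible held by an individual it would resolve to a `Person`.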

thomast73 commented 12 years ago

Yes, a general purpose Note could be used to hold the source analysis, but that has the same problems as using a general Note object for the "rationale statement", which you already dismissed.

Why would it be important to call attention to Source "analysis" from among the other possible notes one might associate with a given Source? Are you thinking every note needs to be classified and typed? If we make a special type for this, what other special types are needed? And why?

I see the purpose in calling attention to the presence of a "rationale statement" -- the existence of a "proof argument" would be important to someone reviewing the research being transmitted. But it seems a difficult task to assign meaningful categories to other types of notes.

thomast73 commented 12 years ago

OK, that adds some more fields to the Source class Sarah and I had already worked out.

Just to be clear, what I posted is our attempt to consolidate all the discussion (including the discussion here) into model changes. Making these changes would close this issue.

jralls commented 12 years ago

Just to be clear, what I posted is our attempt to consolidate all the discussion (including the discussion here) into model changes. Making these changes would close this issue.

OK, why don't you make a branch and do a pull request for review... or turn this issue into a pull request. Don't forget to update gedcom.zargo. Please separate the changes to specifications/ and gedcomx-common/ into separate commits; it makes seeing the differences easier.